Transcript
This transcript was autogenerated. To make changes, submit a PR.
Thank you for taking the time to join this presentation on enriching the data versus filtering the data in Apache Spark. I'm Gokul Prabagaren, an engineering manager in the card loyalty organization at Capital One. Before we dive into today's topic, enriching versus filtering, I would like to share some details about Capital One. Capital One is the first US bank to exit its legacy on-premise data centers and go all in on the cloud. You can imagine the kind of tech transformation a public company has to go through to achieve such a feat. That is why we consider ourselves a tech company that happens to be in the banking business. We have invested heavily in our tech capabilities and we now operate very much like a tech company.
All of this is possible mainly because we are a founder-led company to this day, staying true to its mission of changing banking for good. There are many ways we stay focused on that mission. One of them is giving back to our community, and we do that in multiple ways. Since this is a tech conference and we are a tech-focused organization, I would like to start with open source. We not only operate as an open-source-first company, we also contribute back to open source projects from the implementations we build within our enterprise. As a financial services company, there are many things we do in a regulated industry that can benefit others, so we give a lot of those back as open source projects. Several open source projects have come out of our organization; I have called out a few, such as Critical Stack, Rubicon, Data Profiler, DataComPy, and Cloud Custodian. They play in various spaces like DevOps, Kubernetes, data profiling, data cleansing, and a lot of other things. So that's open source. The next one is Coders.
Coders is a program we run in middle schools across the United States, where we work with middle school students, give them opportunities to envision a tech career in their future, and also give them real hands-on experience while they are still in middle school. The next one is CODA. This is a program which paves the way for non-tech folks to move into tech and make tech their career, and we empower them with the opportunities to be really successful in tech.
Now for our agenda. First we are going to look at the loyalty use case at Capital One. When we get into the details, you will understand why this is our starting point: it is where this whole design pattern evolved. That design pattern starts with the filtering-the-data approach in Apache Spark. There were some challenges we faced with that approach, and that is what led us to a different approach which we tried out and are now sharing, which is the enriching-the-data approach, and we will see how those challenges were fixed using it. First we will go through the theory of what this really means using our use case. Then we will go through a data-based walkthrough of the same, which lets us compare the two approaches and explains why we pivoted towards one over the other. We believe that is something which will help a lot of you. Finally, we will conclude and leave some time for Q&A.
First we are going to start with the loyalty use case at Capital One. If you have used any Capital One product, you know Capital One is a pioneer in credit cards, and this platform is the one through which pretty much all of our credit card products, be it Venture X or Venture, focused on travel, or SavorOne, focused on dining and entertainment, have their transactions processed. This platform is one of the core credit card rewards systems built on Spark. I have abstracted the details for brevity, but essentially we receive all those credit card transactions and process them by reading them, then we apply a number of account-specific eligibility rules, then transaction-specific eligibility rules. Finally we compute the earn and persist it in the database. You can imagine the account-specific business rules in the simplest terms as, "is this account really a valid account to get rewards at this point in time?" The same goes for transactions: there are various things we do, but the simplest one is, "is this transaction really eligible to earn rewards?" Those are the business rules we apply, then we compute the earn, and that is what gets persisted in the data store.
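To make the flow concrete, here is a minimal PySpark sketch of those stages. The paths, table names, and column names are illustrative assumptions, not the actual production schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rewards-earn").getOrCreate()

# Incoming credit card transactions and the account reference data the
# eligibility rules join against (paths and schemas are hypothetical).
transactions = spark.read.parquet("s3://rewards/input/transactions/")
accounts = spark.read.parquet("s3://rewards/reference/accounts/")

# The pipeline then runs four stages:
#   1. account-specific eligibility rules
#   2. transaction-specific eligibility rules
#   3. earn computation
#   4. persist the earned rewards to the data store
# The two design patterns in this talk differ only in how stages 1 and 2
# combine the transactions with the reference data: inner joins that drop
# rows, or left outer joins that add columns.
```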
Now you can imagine why we are starting with this: it is a platform built on top of Apache Spark which processes millions of transactions, and it is our proving ground for comparing these two design patterns in Apache Spark. So let's start.
First we are going to start with the filtering-the-data approach. Keep the previous picture in mind; we are going to go a little deeper into that same use case through the filtering-the-data lens. What we are really doing with the filtering approach is this: we receive some transactions, we apply some account-specific business rules, then transaction-specific business rules. How do we do this? The key thing to remember about the filtering-the-data approach is that it is done using Spark's inner join. Spark, being a big data framework, specializes in in-memory processing, so the filtering approach does the same: inner joins executed in memory.

With that context, let's see this. Even though we operate on millions of transactions, just to establish the idea, let's say ten transactions is what we are dealing with in this example. We read those transactions and apply some account-specific business rules using Spark's inner join. What happens naturally is that the ones which do not match get filtered out. In this example, the five transactions which do not match are gone, so five go to the next stage. We do the same with the transaction-specific business rules, applying Spark's inner join in memory, which filters out three more transactions that do not match. Finally only two make it to the earn computation; let's assume those two are really eligible transactions and we compute earn for them. The key takeaway is that we are using Spark's inner join in memory at each stage of the data pipeline. The non-matching rows naturally get filtered out at each stage, and only the ones which are eligible, which match, make it to the end. Looking at this, you may think, "this sounds like exactly what Spark does," right? Let's stretch this same example and establish the idea using data.
Now we have three transactions and we are doing a Spark inner join to find the matching ones. What matches? We naturally get only two transactions matching when we apply the condition that the account status is good, which indirectly means they are good accounts. We do the same for transaction eligibility; assume the category is what we check, because we do not want to deal with payments. In that case we filter out the payment and only process the accounts which made a purchase, which leaves us with one transaction in this example. That is what goes to our earn computation, and they get all the rewards they are entitled to. If you look at this flow, it is the same Spark inner join in memory which leaves us with one transaction in this case.
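As a minimal PySpark sketch of that flow, using made-up data in the spirit of the example (the schemas, IDs, and values here are illustrative, not the production ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filtering-approach").getOrCreate()

# Three incoming transactions, mirroring the example.
transactions = spark.createDataFrame(
    [("t1", "a1", "PURCHASE", 100.0),
     ("t2", "a2", "PAYMENT", 250.0),
     ("t3", "a3", "PURCHASE", 40.0)],
    ["txn_id", "account_id", "category", "amount"],
)

# Account reference data with a status column; only GOOD accounts earn rewards.
accounts = spark.createDataFrame(
    [("a1", "GOOD"), ("a2", "GOOD"), ("a3", "CLOSED")],
    ["account_id", "status"],
)

# Transaction-rule reference data: only these categories are eligible to earn.
eligible_categories = spark.createDataFrame([("PURCHASE",)], ["category"])

# Stage 1: account rules via inner join -- t3 silently disappears (not a GOOD account).
stage1 = transactions.join(accounts.filter("status = 'GOOD'"), "account_id", "inner")

# Stage 2: transaction rules via inner join -- t2 silently disappears (it is a payment).
stage2 = stage1.join(eligible_categories, "category", "inner")

stage2.show()  # only t1 survives; nothing records why t2 and t3 vanished
```

This reproduces the same 3 to 2 to 1 narrowing described above, entirely through in-memory inner joins.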
So what is the problem with this approach? When we put it in production, there were some challenges we faced. The first is that after deploying the application to production, it was really hard to debug. The main question is: what happened to those transactions which got filtered out in memory? It happens at that moment, and everything is handled in memory for the speed of the Apache Spark framework, but if we have to backtrace all those transactions that were dropped in memory, that is where the real challenge starts. Being in a regulated industry, we really need to know what happened to the transactions we did not provide earn for, and really anyone running this in production would: you should be in a position to know what happened to each one of your records when something gets filtered out in memory. In some use cases that may be fine; in our use case it was not. People who are familiar with Apache Spark may argue, "hey, you can do counts at each stage." Yes, that is possible, but there are two issues with that approach too. First, count is a costly operation in Apache Spark. Some data pipelines can live with doing counts, but if your processing is large enough, you probably cannot afford to do counts at each stage when you are dealing with millions and billions of rows. The second problem with the count approach is that you only get to know how many records got filtered out or made it to the next stage; you never get to know why, unless you already know the context.
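Continuing the filtering sketch from earlier, the count-at-each-stage workaround would look roughly like this; each count() is a Spark action that triggers its own job over the lineage, and even then it only tells you how many rows were dropped, not which ones or why:

```python
# Costly workaround: force a count after every stage (each one is a separate action).
in_count = transactions.count()
s1_count = stage1.count()   # recomputes the account join unless the data is cached
s2_count = stage2.count()   # recomputes both joins unless the data is cached
print(f"read={in_count}, after account rules={s1_count}, after txn rules={s2_count}")
# The output tells you 3 -> 2 -> 1, but not which rows vanished or which rule removed them.
```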
So those are the challenges we faced with the filtering-the-data approach. How did we overcome them? That is where we pivoted.
After doing some research, we pivoted towards our next design pattern, which is the enriching-the-data approach. Take the same example of dealing with ten transactions. The key difference is that instead of Spark's inner join, we use Spark's left outer join. The main thing is that we are not filtering out any data at any stage. In the previous case you started with ten and filtered information out, which means your number of rows decreases. In this case we are not filtering anything out; we are enriching the data: the left data set, the transactions, keeps getting enhanced with all the contextual information from the right data set, so you have everything you need. The rows do not change; instead, your columns grow. So, the same example: ten transactions. We apply the account-specific rules and nothing changes, but we have picked up some columns which we will need later to make the determination. Ten rows again make it to the transaction stage, where we apply the transaction rules. The same ten transactions stay, and we pick up the transaction-specific attributes. Now we apply all the business logic, and we arrive at the same result: good accounts which made a purchase, two transactions. That is what gets pushed to the next stage, and they get whatever they are entitled to. The key difference is that the rows are not changing, because we are enriching the columns using a left outer join.
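In PySpark terms, the structural change from the filtering sketch is just the join type and what the right-hand side carries. Continuing with the same illustrative transactions and accounts, and a hypothetical category_rules table that carries the transaction-rule outcome as a column:

```python
# The transaction-rule outcome expressed as data rather than as a row filter.
category_rules = spark.createDataFrame(
    [("PURCHASE", True), ("PAYMENT", False)],
    ["category", "earn_eligible_category"],
)

# Left outer joins: every transaction survives, and the reference columns are
# appended (they would be null where nothing matched).
enriched = (transactions
            .join(accounts, "account_id", "left_outer")        # adds `status`
            .join(category_rules, "category", "left_outer"))   # adds `earn_eligible_category`

# The row count never changes; only the columns grow.
assert enriched.count() == transactions.count()
```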
Let's drive home the point with a data example for enriching as well, similar to what we did for filtering, using the same data. Here are the same accounts and transactions. With the left outer join the number of rows does not change; instead, we enrich the data set with the status, which is what we need to determine whether it was a good account. We do the same for the transaction with a left outer join, so our columns grow again as we pick up the category into our data set. Now we have the same three transactions with a few more columns. We use this information to apply the business logic, and we can enhance the data with a few more columns to make the computation easier: "is this row really eligible for the next stage?" We simply mark those rows as true, and then pick whichever rows have that column set to true for the next stage of processing. In this example, that is the only one which has a good account and also made a purchase, so we get the same result as in our previous example.
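Continuing that enriched sketch, the flag-and-select step and the kind of "why" query it enables might look like this (the flag name and logic are assumptions for illustration):

```python
from pyspark.sql import functions as F

# Derive an explicit eligibility flag from the enriched columns.
flagged = enriched.withColumn(
    "eligible_for_earn",
    (F.col("status") == "GOOD") & F.col("earn_eligible_category"),
)

# Earn computation only consumes the rows flagged true...
earn_input = flagged.filter("eligible_for_earn")

# ...while every original row is still present, with the context explaining its outcome.
flagged.filter(~F.col("eligible_for_earn")).select(
    "txn_id", "status", "category", "earn_eligible_category"
).show()
# t2 shows a GOOD account but a non-earning category; t3 shows a non-GOOD account.
```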
So what is the advantage of enriching over filtering? If you compare it with the problems we faced with filtering: here the data is not shrinking; instead we are enriching the original data set so that it captures the state information, which makes it easy for us to debug and analyze later. We also have the data columns and flags captured at each stage, which gives us more granular detail for debugging as well as backtracing: what happened to the other two transactions in our example, and why did they not make it to the next stage? This also means there is no need to do counts at any stage, because we already have all the required information.
Cool. You have now seen a comparison between two Apache Spark based design patterns, and you should be able to make an informed decision about which one fits your use case. We made the switch to the enriching-the-data approach in production after going live with the first version using filtering in the initial days. That enriching approach is what is successfully running in production today, processing millions of credit card transactions daily, and it is what provides all our customers with their millions in cash back and miles. A few details about me: I'm an engineering manager at Capital One. I have been building software applications since the early versions of Java, and likewise with Apache Spark. I regularly give presentations on big data and NoSQL, and I contribute to Capital One's tech blog. I have shared my social handles. Thank you for the opportunity and for watching; I hope you all have a great conference. Thank you.