Transcript
Hi everyone, thanks for joining. We'll be talking about building a data platform to process over 50 billion in card transactions. I'm Sandeep, engineering lead for data platforms at Dojo.
If you don't know about us, we are the fastest growing fintech in Europe by net revenue. We power face to face payments for around 15% of card transactions in the UK every day. We are mainly focused on the experience economy of bars, pubs, restaurants, even your local corner shop or farmers market. We also have a consumer facing app that allows you to join the queue for high street restaurants within around a 2 mile radius. We have an amazing line of payment products: tap to pay, pocket pay. We also provide small business funding, and we are building many, many products to change how the payment industry works. We are going international soon; hopefully you will see us in every corner of Europe first and then in the world.
What makes us different, I guess, and also makes all of this possible, is the fact that we are highly, highly data driven and value insights in our decision making. When you go to a shop and tap your card, your transaction happens within the blink of a second: you get the goods, and the merchant, who for us is the customer, the one selling you the goods or services, gets the money. All of this happens within the blink of a second.
As we just talked about, when you tap a card, the first thing that happens is the card machine takes the details from the card and securely sends them to the authorization gateway. At Dojo, we use point to point encryption between the card machine and our authorization gateway, with hardware security modules in multiple facilities and direct pairing to multiple cloud providers. It's highly secure and highly scalable. We can't let our authorization gateway go down, because that's the main point where we take the transactions. The authorization gateway then contacts the card network, such as Visa, Mastercard, Amex, which forwards the request to the issuing bank for approval. This will be someone like your bank in the UK, Barclaycard, Monzo, Tesco, who will then freeze the funds on the cardholder's account. It's very important to note that the actual money has not been exchanged yet; it is just a promise of a payment that can also be reversed. The card network then sends this approval back to the authorization gateway, which is then displayed to the customer as approved or declined.
And this is how your card transaction actually works. This whole process generates a lot of data. From a payment point of view, because we are a payments company, we have a couple of regulatory challenges, since we own the end to end payments experience. We operate under an e-money license from the FCA, the Financial Conduct Authority. With this come strict regulatory requirements, most notably the fact that we have to ensure at all times that customer funds are safeguarded in case the business becomes insolvent.
Then you have PCI DSS Level 1 compliance, which is around safeguarding the storage of full card numbers. Since we own the whole payment stack with full card numbers, we also have to comply with PCI DSS Level 1, and we are independently audited every year to ensure that we actually follow all the rules and guidelines provided by PCI. Then we have other complexities
around schemas. We process more than 1000 data contracts, or schemas. And this talk, sorry, I completely forgot to mention, will be around file processing. So I'm going to talk about how we built the processing of these files coming from many different sources and all these schemes, Visa, Mastercard, and all the payment providers. So we have 1000 schemas, as I said. And then you have multiple file sizes and multiple formats. It's not like you have a standard file size, okay, files will range from five MB to ten MB; it doesn't work like that. We have files which are two KB, ten KB, 50 KB, and then it goes up to gigabytes, and we also get zipped files as well. Then you
have scalability: you can have unpredictable demand. Sometimes we have, I don't know, 500,000 files coming in right now, because SLAs between multiple sources overlap and all of the files come at the same time. And sometimes all of those files are business critical, so we have to process all of them at the same time. So for example, if you see, this is a snapshot of an internal reconciliation process performed by our payments analytical engineering team, just to ensure that a merchant's net settlement tallies with the card transactions that have actually been authorized on the tills in their shops. As you can see, there are CSVs, XLS, XML; that many files are required just to get that done.
And another bit: the files coming in don't just have XLS or XML; they come in every file format possible. We have JSON files, we even have Avro files, and now we have Parquet files as well. We have fixed-width files. We have proprietary formats created by the scheme companies like Visa and Mastercard, for which you have to write very heavy custom parsers to actually make sense of those files and put them into your transactional kind of data warehouse.
And the goal we came up with, or the goal which we thought would be good for us, was to take all of these file formats and process them into a single final format which can then be utilized by downstream processes to stream into the warehouse, into Kafka, into Snowflake, or anywhere else we want. And we chose Avro mainly because of its powers around schema evolution. There are now multiple different formats you can choose from, but we chose Avro at that point.
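To make that schema evolution point a bit more concrete, here is a minimal sketch using the fastavro library: records written under an older version of a contract stay readable under a newer reader schema, as long as new fields carry defaults. The schema and field names are invented for illustration, not our actual contracts.

```python
# Minimal sketch of Avro schema evolution (illustrative schemas, not real Dojo contracts).
# Records written with an old schema can still be read with a newer reader schema,
# as long as the new fields have defaults.
import io
from fastavro import writer, reader, parse_schema

writer_schema = parse_schema({
    "type": "record", "name": "Transaction",
    "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount_pence", "type": "long"},
    ],
})

# A newer version of the contract adds a field with a default,
# so files written under the old version remain readable.
reader_schema = parse_schema({
    "type": "record", "name": "Transaction",
    "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount_pence", "type": "long"},
        {"name": "currency", "type": "string", "default": "GBP"},
    ],
})

buf = io.BytesIO()
writer(buf, writer_schema, [{"transaction_id": "t-1", "amount_pence": 1250}])
buf.seek(0)

for record in reader(buf, reader_schema=reader_schema):
    print(record)  # {'transaction_id': 't-1', 'amount_pence': 1250, 'currency': 'GBP'}
```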
And it's not just payments. There are a lot of other business areas in the company which are important. They produce a lot of event data and also generate a lot of file data, which can come from APIs, from external tools, or be generated by their own microservices. So to support the processing of files, with all the schema evolution, the scale, and all these challenges in mind, it was very important for us to design a data platform which is scalable and self serve. And now, before deep diving into the modern data infrastructure and processing and how we have done it, I would like to take us back into the history of data processing.
There have been three generations of data processing. The first generation was the generation of enterprise data warehouse solutions. Big giants like Oracle, IBM, and Microsoft were building those big, big warehouses which only a few people knew how to use. Then came the era of the big data ecosystem, the era of Hadoop, when Pig, Hive, and Spark came into the picture. We built those big, monolithic, centralized, beefy Hadoop big data platforms, which again were only utilized, monitored, and operated by a few people, with huge bottlenecks and a huge skills shortage to actually make use of them, I would say.
Then we come to the current generation: mostly centralized data platforms, a mix of batch and real time processing based on tools like Kafka. Apache Beam also gives you that mix of real time and batch processing, then Pub/Sub, then Redpanda, then all sorts of cloud managed services from AWS, GCP, and other providers like Confluent and Aiven offering their own managed services on top. Well, this centralized model can work for organizations that have a simpler domain with a smaller number of diverse consumption cases. It fails for us because we have rich domains, a large number of sources, and a large number of consumers. Most companies going through this kind of growth have the same problem, and this centralized data platform really doesn't work. And there are other challenges
with it. The challenge is that the centralized data platform is mainly owned by a centralized data team which is focused on building and maintaining data processing pipelines, then building data contracts working hand in hand with stakeholders, but at the end of the day there's no clear ownership of any of that. Then there's supporting issues for all the domains without actually having the domain knowledge. Then you have silos, and a specialized data team which will only keep adding features if they get some time away from support issues or everyday change requests. Then another issue is data quality, accountability, and democratization of data. It's very difficult to enhance or put measures in place for data quality if there is no clear ownership of the data. The fact that the data team manages data access makes it a bottleneck when it comes to access requests. Then scalability: adding new data sources, increased data volumes, data contract changes, et cetera, can be delayed due to a huge data team backlog, because they are the ones actually managing and maintaining the data pipelines. And the byproduct of all this is a not-so-great relationship between the data team and the other teams, and a lot of blame games when things go wrong. And over the last decade or
so, we have successfully applied domain driven design on our engineering and operational side, but as a whole engineering community we completely forgot to bring it to the data side. Now, what should we do in this kind of scenario? How should we go on? And what is the right way of building the file processing platform which we built at Dojo, or any kind of data platform which you, or anybody, might want to build? Now imagine this,
right? What if the centralized data team only focused on creating generic data infrastructure, building self-serve data platforms by abstracting away all the technical complexities and enabling other teams to easily process their own data. Apart from that, they would also provide a global governance model and policies for better access management, better naming conventions, better CI, better security policies, and at the same time data ownership moves from the centralized team to the domains. I have said domains many times in this talk so far; by domains I mean teams that are working on a certain business area, for example payments, marketing, customer, et cetera.
Now, this whole approach brings a set of benefits. Finally, the data is owned by the people in the company who understand it best. You reach better scalability by distributing data processing across the domains. Domains are now independent: they are able to add new data sources, create new data pipelines, and fix issues by themselves. Domains can put better data quality and reliability measures in place than the centralized data team, because they understand the data better. Domains have the flexibility of using only what they need from the generic data infrastructure and self serve data platform features, which reduces the complexity blast radius when things go bad, and things always go bad. Having said that, the centralized data team should make sure there is good monitoring and alerting in place to monitor all the flavors and features of the data platform across the domains. This approach also enforces well documented data contracts and data APIs, which allow data to flow seamlessly between one system and another, internal or external. And having domains own their data infrastructure brings more visibility on resource allocation and utilization, which can lead to cost efficiency.
Now this whole concept, my friends, is data mesh. I know I'm talking a lot of theory right now, but I will come back to how we built this at Dojo; this part is very important. So data mesh: first, domain ownership, one of the important pillars. It's your data, you own it, you manage it, and if there is a problem with it, you fix it. Then data as a product: data is not a byproduct. Treat your data as a product, whether it's a file, an event, a warehouse, or a beefy table; treat all of it as a product. Then federated computational governance: each domain can set their own policies or rules over data. For example, the payments domain sets encryption standards on payment data, or the marketing domain sets retention on customer data to 28 days to adhere to GDPR. Then the final and my favorite bit is the self serve data platform. This is the key to all of this. If you successfully build this, the self serve data platform, you've got 80% of things right. It will enable teams to own
and manage their data and data processing. But the big question is how you're going to build it. It looks easy now, but when you have so many choices, and so many people telling you at conferences that their data platform is the best, that they have the one stop solution, it can be quite overwhelming. I guess based on your use case you will do certain PoCs; you would probably look at the open source solutions available before committing thousands and thousands of pounds or dollars to some managed provider claiming they can solve all your problems. When we were building this file processing platform, we were quite sure that a platform-as-a-service or self serve platform offering was the only way to move forward. Now, at
a high level, four main features of our platform provide this end to end solution for file processing, which we just talked about: we have 20 plus different types of files coming in, around 500k files every day, which vary in size, and they come in at certain hours, so we have to scale the system accordingly as well. Now, the four components.
The first component is source connectors, which ingest data from external providers in a consistent way; we built those connectors to bring the data in. Then you have the PCI processing platform, which makes sure that clear credit card information is stored, masked, and processed securely, and then sent on to the non-PCI platform. The non-PCI platform takes all the non-PCI data, plus the data coming from all the other domains, performs schema evolution (I always struggle with this word) and data validation, and generates outputs in the final format we agreed on before, Avro, and specifically chunked Avro, because the source file sizes are so different. We make sure the final Avro files are chunked to roughly the same size, so we don't end up with one Avro file which is two gigabytes and another which is a few kilobytes.
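A rough sketch of what that chunking can look like: keep writing parsed records and roll over to a new Avro file once the current chunk passes a target size. The schema handling, the size estimate, and the 64 MB target below are illustrative, not our actual settings.

```python
# Rough sketch: write parsed records into chunked Avro files of roughly equal size,
# rolling to a new chunk once the current one passes a target size.
# Schema, paths, and the 64 MB target are illustrative only.
import os
from fastavro import writer, parse_schema

TARGET_CHUNK_BYTES = 64 * 1024 * 1024  # roll over at ~64 MB


def write_chunked_avro(records, schema_dict, out_dir, base_name):
    schema = parse_schema(schema_dict)

    def flush(buffer, chunk_index):
        path = os.path.join(out_dir, f"{base_name}-{chunk_index:05d}.avro")
        with open(path, "wb") as fo:
            writer(fo, schema, buffer)
        return path

    written, buffer = [], []
    chunk_index, approx_bytes = 0, 0
    for rec in records:
        buffer.append(rec)
        approx_bytes += len(str(rec))  # cheap per-record size estimate
        if approx_bytes >= TARGET_CHUNK_BYTES:
            written.append(flush(buffer, chunk_index))
            chunk_index += 1
            buffer, approx_bytes = [], 0
    if buffer:
        written.append(flush(buffer, chunk_index))
    return written
```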
Then target connectors: streaming the generated Avro files from this lakehouse, or distributed data lake, into the data warehouse, into any other streaming system, into an external warehouse like Snowflake, or loading them into an MLOps platform, and things like that.
Now let's deep dive into all of these components.
Connectors, source connectors. We have a number of different data sources: storage buckets, external APIs, webhooks, Gmail attachments. Trust me, we still have processes where we have to get the data from email attachments. Then we have external SFTP servers. We mainly use Rclone to move most of the data which comes in files. It's an amazing open source utility which comes in very handy when you want to move files between two storage systems. You can move from S3 to GCS, or from SFTP to any kind of storage bucket, and it works like a charm, but you have to spend some time on the configuration part of it.
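In practice a connector of this kind is little more than a scheduled wrapper around the rclone CLI. A minimal sketch, where the remote names, paths, and config location are hypothetical:

```python
# Thin wrapper around the rclone CLI, roughly what a connector cron job does.
# "provider-sftp" and "landing-gcs" are hypothetical remotes defined in rclone.conf.
import subprocess


def pull_provider_files():
    subprocess.run(
        [
            "rclone", "copy",
            "provider-sftp:outbound/settlement",  # source remote:path
            "landing-gcs:file-landing/provider",  # destination bucket/prefix
            "--config", "/etc/rclone/rclone.conf",
        ],
        check=True,
    )


if __name__ == "__main__":
    pull_provider_files()
```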
Then you have webhooks. If the data does not come in files, for example in the case of webhooks, we batch those events into files to keep things consistent. We did have some strict SLAs, but not SLAs that required processing this information in real time. So in those cases where we don't have a hard "we need to process this straight away" SLA, we use webhook connectors which batch those events and write them into files. I guess we are moving that completely over to the streaming side now; that was a historic decision we took.
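The batching idea itself is simple: buffer incoming webhook events and flush them to a file once a count or age threshold is reached. A toy sketch; the class name, thresholds, and paths are all made up for illustration.

```python
# Toy sketch of the webhook connector idea: buffer events, flush to a file
# when the batch is big enough or old enough. Thresholds and paths are illustrative.
import json
import time
import uuid


class WebhookBatcher:
    def __init__(self, out_dir, max_events=1000, max_age_seconds=60):
        self.out_dir = out_dir
        self.max_events = max_events
        self.max_age_seconds = max_age_seconds
        self.events = []
        self.first_event_at = None

    def handle(self, event: dict):
        if not self.events:
            self.first_event_at = time.time()
        self.events.append(event)
        if (len(self.events) >= self.max_events
                or time.time() - self.first_event_at >= self.max_age_seconds):
            self.flush()

    def flush(self):
        if not self.events:
            return
        path = f"{self.out_dir}/webhook-batch-{uuid.uuid4()}.json"
        with open(path, "w") as fo:
            json.dump(self.events, fo)
        self.events, self.first_event_at = [], None
```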
Then we have serverless functions which allow data to be received automatically by email, including the email body. I just talked about email attachments; we have one provider that does not attach a file but provides a system configuration message in the body of the email, so we had to write a parser for that as well. There are always use cases like that. We then use these connectors to land the data in the PCI or non-PCI platform, depending on the sensitivity of the data.
If we know that files are encrypted and are supposed to be processed by the PCI platform to get the credit card information masked, they go directly to the PCI platform; if not, they bypass that process and go directly to the non-PCI platform. Before jumping into the PCI platform, just a few lines on what PCI compliance is, so all the listeners understand how much complexity there is, or how many things you have to consider, when building a PCI data platform kind of environment.
Adhering to the PCI standard is one of our prime concerns given we own the end to end payment stack. These standards are a set of security requirements established by the PCI SSC to ensure that credit card information is processed securely within any organization. Some of the key points from this compliance, at a high level: all credit card data has to be transmitted through secure encrypted channels; clear card numbers cannot be stored anywhere unless they are anonymized or encrypted; and any platform which deals with PCI data has to be audited for security concerns every year.
So this is our PCI management process within the PCI platform. We have those files coming in; we don't know the size of the files, they are zipped, they are encrypted. So we decrypt the source file in memory using confidential computing nodes, then encrypt it with our own key and archive the files for future purposes and for safeguarding. Then, if it's a zip file, we unzip it on disk. If the file contains PANs, we open the file, mask the PANs, then send the masked version of the file to the non-PCI platform, because once it's masked it doesn't fall into the PCI category.
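A simplified sketch of the masking step: find candidate PANs and keep only the first six and last four digits. The regex and masking policy here are illustrative only; the production parsers are format-aware and run inside the PCI environment.

```python
# Simplified sketch of PAN masking: keep the first six and last four digits,
# mask the rest. The regex and policy are illustrative, not the production parser.
import re

PAN_PATTERN = re.compile(r"\b\d{13,19}\b")


def mask_pan(pan: str) -> str:
    return pan[:6] + "*" * (len(pan) - 10) + pan[-4:]


def mask_line(line: str) -> str:
    return PAN_PATTERN.sub(lambda m: mask_pan(m.group()), line)


print(mask_line("settlement,4444333322221111,GBP,12.50"))
# settlement,444433******1111,GBP,12.50
```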
And this is how the overall processing looks.
So as you can see, the files start arriving. (It's a bit small for me to read, yeah.) The files arrive into the PCI storage bucket, and this is all running in GKE. In Kubernetes, Rclone runs as a cron job. For object storage we are using GCP's GCS, and for queues we are using Pub/Sub at the moment. So just so you know, we are completely GCP based, but most of our workloads and most of our processing power run on GKE, so it's a good mixture of being cloud native as well as cloud agnostic at the same time. These connectors are running, pulling files from SFTP, storage buckets, or S3, and the files arrive into object storage. The moment a file arrives in object storage, a file-created event is generated for each file, and that goes into a Pub/Sub queue. Then, based on the number of messages in the queue, the HPA will scale the workload; we will talk about scaling in detail in a later part of the presentation. We typically have around 300 pods running at peak hours across 40 nodes to process around 200k to 300k files within about 30 minutes. Pods then fetch the file information from these events in the Pub/Sub topic, process the files, and mask the content if required.
We have very strict SLAs, and scaling the platform to meet those SLAs is a critical part of our architecture. We are using horizontal pod autoscaling and cluster autoscaling together to scale the platform. A bit more about autoscaling now. Autoscaling is a crucial part of our platform. We have to process these files as soon as they arrive so we can perform the settlement and billing operations, pay our merchants, and do the reconciliation of the money, which is crucial to our business. On the other hand, we also wanted to make sure that our infrastructure is cost effective and we are not running workloads when they are not needed, to be aligned a bit more with a FinOps culture. There are
a few challenges we faced when we implemented horizontal pod autoscaling. Setting resource requests and limits on the pods was a bit difficult; it was hard to decide what the starting request should be and what the limit should be. I know there have been a lot of talks in the Kubernetes world saying we don't need to set limits and things like that, but in our case we had to, because we are scaling so many pods and pods can consume a lot of memory and resources. So we tried a lot of different options when we were testing this in production. We started hating PagerDuty, but eventually it worked out for us and it's now scaling quite nicely. And there are two types of scaling
available, at a high level. We have horizontal autoscaling, which updates a workload with the aim of scaling it to match demand: the response to increased load is to deploy more pods, and if the load decreases, the deployment is scaled back down by removing the extra pods. Then you have vertical autoscaling, which means assigning more resources, for example memory or CPU, to the pods that are already running in the deployment.
There are multiple ways to trigger this autoscaling. You can trigger on resource usage: for example, when a pod's memory or CPU exceeds a threshold, you can add more pods if you want to. Then there are metrics within Kubernetes, any metric reported by a Kubernetes object in the cluster, such as, I don't know, I/O rate or things like that. Then there are metrics coming from external sources like Pub/Sub. For example, you can create an external metric based on the size of the queue and configure the horizontal pod autoscaler to automatically increase the number of pods when the queue size reaches a given threshold and reduce the number of pods when the queue size shrinks. This is exactly what we did and used to scale the platform we are talking about. So far we have talked about HPA and adding more pods to scale the deployment.
But we also need to remember that the Kubernetes cluster itself needs to increase its capacity. How do we do that? How do we fit all these scaling pods into Kubernetes? For that we need to add more nodes, and we use the Kubernetes cluster autoscaler.
The Kubernetes cluster autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster by adding or removing nodes when one of the following conditions is true: there are pods in a pending state in the cluster due to insufficient resources, or there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing nodes.
One very important thing to remember: if your pods requested too few resources when they first started, and at some point a pod wants more CPU or memory but the node it is running on is experiencing resource shortages, the cluster autoscaler won't do anything for you. We actually did not read the documentation properly, believed the opposite, and lost a good couple of days trying to figure out why the processing was so slow. You have to go back and revisit the resources for the pods all the time.
Now we know that the HPA and the cluster autoscaler work together, hand in hand, to scale the platform, and our file processing became very scalable all of a sudden because we adopted Kubernetes and all these flavors of scalability with it.
And this is how it actually looks right now. I'm going to take you back to the processing. You have object storage, all the files are landing, and the queue is filling up with those file creation events, getting bigger and bigger. Now how do we scale it? We said unacked messages, which is Pub/Sub terminology for events that have not been processed yet, divided by four equals the number of pods, or workers, required. But we still have a maximum limit as well. For example, if I have 500k files still to be processed, I can't be running 125k pods. So we can still say okay, a maximum of 300 or 400 pods can be running at any point in time on this particular cluster. And that actually works, because file processing is very fast: a pod processes one file within a few seconds, or within a second or so.
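Written out as code, that rule is roughly the following. In practice it lives in the HPA itself, driven by an external metric on the Pub/Sub subscription's undelivered message count with a target of about four messages per pod, rather than in application code; the cap of 300 below just mirrors the 300 to 400 pods mentioned above.

```python
# The scaling rule described above: roughly one worker pod per four unacked
# messages in the Pub/Sub subscription, capped at a cluster-level maximum.
import math

MESSAGES_PER_POD = 4
MAX_PODS = 300  # upper bound per cluster, as mentioned above


def desired_pods(unacked_messages: int) -> int:
    if unacked_messages <= 0:
        return 0
    return min(math.ceil(unacked_messages / MESSAGES_PER_POD), MAX_PODS)


print(desired_pods(120))      # 30
print(desired_pods(500_000))  # 300 (capped)
```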
And this is how the non-PCI platform looks. This is the platform where all the validation, all the schema contracts, all the monitoring events and everything are encapsulated into a group of microservices, or a collection of tools and infrastructure, as you could say. It works exactly the same as the PCI platform, just with some extra flavors of schema history and file stores and things like that. So all the connectors send files into the source bucket, which is the non-PCI bucket. Then a file creation event is generated into the topic. Then the HPA comes into the picture and scales the deployment of these translate pods; we call them translate because they do the translation based on the config provided. Then the cluster autoscaler kicks in to scale the cluster, and the translate pods parse and translate every single source file into chunked Avro files. Now, the translate process works completely based on what's inside the schema of the file. And this is where it becomes more like a self
serve data platform. Every single pipeline belongs to a single domain: payments has its own pipeline here, marketing has its own pipeline here. And there is a single schema history at the end of the day, with a UI on top of it where users can log in and change the schema. They can say, okay, this file now contains an extra column and I want to process this extra column, but I don't want to go to the data team, raise a request, and ask them to process it; I would like to do it myself. And this is how it happens: they go to this web interface, and there's the schema version, schema name, source, blah blah blah, a lot of information there.
We are actually trying to make it a bit nicer now. We are taking all the configuration out; now we have Argo CD workloads and things like that, so we're moving all of that out and just leaving the schema bit there. But the gist here is that the users, or the domain owners, those teams, can manage their own pipelines by themselves. We provide a data catalog using DataHub so they can discover everything they need. That is also an ongoing project at the moment, but it's very interesting; maybe someday we'll talk about this more.
Then we provide monitoring on top of it. The schema registry also has a component where you can say, you know what, I want to know when my file arrives and when my file lands in BigQuery, or my data warehouse, and if that doesn't happen by 09:00 in the morning I want to get alerted, because my processes are going to fail and I need to notify downstream stakeholders.
Or for any other reason. So that's how the schema registry plays a very key role, and at the end of the day it's a data contract between the source, the processing, and the target.
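To make that contract idea concrete, here is a made-up sketch of the kind of feed configuration a domain team might own: the schema, how the file should be routed, and the delivery deadline that drives alerting. The field names and defaults are illustrative, not our actual registry format.

```python
# Made-up sketch of a feed configuration acting as a data contract between
# source, processing, and target. Field names are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldSpec:
    name: str
    type: str            # e.g. "string", "long", "decimal"
    default: object = None


@dataclass
class FeedConfig:
    feed_name: str
    domain: str                      # owning domain, e.g. "payments"
    source_format: str               # "csv", "xml", "fixed_width", ...
    fields: List[FieldSpec] = field(default_factory=list)
    contains_pans: bool = False      # route through the PCI platform first?
    deliver_to: str = "bigquery"     # downstream target
    expected_by: str = "09:00"       # alert if not landed by this time


settlement_feed = FeedConfig(
    feed_name="acquirer_settlement_daily",
    domain="payments",
    source_format="fixed_width",
    fields=[
        FieldSpec("merchant_id", "string"),
        FieldSpec("settlement_amount_pence", "long"),
        FieldSpec("currency", "string", default="GBP"),
    ],
    contains_pans=True,
)
```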
When we process files from the PCI to the non-PCI environment with the help of the schema registry, the files move through many different stages throughout the whole processing journey, and the state management of each file becomes very important, because you
want the file processing to be fault tolerant. You want to handle errors when they happen. You also want to support live monitoring. You want to prevent duplicate processing, because queues can have duplicate events, and you want to make sure that once a file is processed, it stays processed.
Now let's see how the flow works. An object, a file, is created and an event goes into the file event subscription topic, so the file is in a to-do state. The HPA is listening to that topic and says, okay, let's scale everything, and all the pods start consuming from the topic. They first go to the file store, or you can call it a state store, to check whether the file is already being processed by some other pod. If that's the case, they skip it. If not, they set the status to in progress and start processing it. If everything succeeds, they record that translate started and then translate completed without any error, and everybody's happy. If that doesn't happen, if there's an error, they record that there was an error. And for some transient errors we can resend the files to be processed again, so they set the status of the file back to to-do so that some other pod can pick it up.
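A condensed sketch of that state handling; the state store interface here (get_status / set_status) is hypothetical, standing in for whatever shared datastore backs it.

```python
# Condensed sketch of the file state handling described above. The state store
# interface is hypothetical; in reality it is backed by a shared datastore so
# duplicate Pub/Sub events don't cause double processing.
TODO, IN_PROGRESS, COMPLETED, ERROR = "to_do", "in_progress", "completed", "error"


def handle_file_event(file_id, state_store, translate, is_transient):
    status = state_store.get_status(file_id)
    if status in (IN_PROGRESS, COMPLETED):
        return  # another pod already has it, or it is done: skip the duplicate event

    state_store.set_status(file_id, IN_PROGRESS)
    try:
        translate(file_id)                          # parse + translate into chunked Avro
        state_store.set_status(file_id, COMPLETED)
    except Exception as exc:
        if is_transient(exc):
            state_store.set_status(file_id, TODO)   # let another pod retry it
        else:
            state_store.set_status(file_id, ERROR)  # surface for investigation
        raise
```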
This is also very useful for monitoring.
All these events are sent into a metric store, and then we have a file monitoring service which talks to the schema registry and, based on the config provided in those schemas, or feeds as we call them, aggregates this information and starts generating metrics, reports, or alerts, or sends more aggregated information to Grafana. As for infrastructure observability: as I said before, almost everything we run is on Kubernetes, so we are running Prometheus integrations at the moment, sending all the important metrics, such as resource usage, pod health, node health, Pub/Sub queue metrics, et cetera, to Grafana. And then we have live dashboards running which reflect the status of the platform.
And the users can go and see their own feeds, their own pipelines, and see if anything is down. They also get alerted. We get alerted too, because the centralized team owns the infrastructure, so they come in as second line support to fix issues happening there.
I'm not going to touch much on the analytics platform, but the analytics platform basically exists to analyze all the raw data coming into BigQuery, run some dbt models, create the derived tables, and then the insights out of them. It is also based on Kubernetes, and this whole deployment of the analytics platform is owned by every single domain: payments has its own, marketing has its own, customer has its own, everybody has their own kind of analytics platform. This is my
kind of showcase of end to end file monitoring. This is what the Slack message actually looks like: you see stage one, stage two, stage three, stage four, stage five, and you can see stage four is failing. The user can just click on it to see which file is failing, and from there we have playbooks and ways to identify the errors and fix them as well. This is how the
overall ecosystem of the data platform looks. Today I only talked about the PCI platform and the file processing platform, and only touched on the analytics platform, but we have done a lot of work on the streaming side, and a lot of work in discovery, observability, quality, governance, and developer experience, and we are still doing a lot more. We still have a long way to go to completely embrace this self serve data platform, or data mesh. And if anybody is interested, please join. Go to the Dojo careers page, and not just for the data team; we are hiring across, I guess, all the functions. And of course, thank you for your time. If you want to DM me directly or connect with me on LinkedIn, this is my profile. Thank you so much, have a good day.