Transcript
This transcript was autogenerated. To make changes, submit a PR.
My name is Nikita Melnikov.
I am VP of engineering at Atlantic money.
And today we will talk about affordable currency, about financial
systems, about its challenges.
And how to use Acro model to solve everything we need
in this kind of systems.
So about me as I've said before, I'm VP of engineering at Atlantic money.
Also I am Ex Syncop Bank and Ex Syncop Investments.
I have more than 10 years in FinTech.
I was working on high load systems.
We had more than 300, 000 requests per second.
And also I love Scala, Golang, Postgres and Kafka.
Let's begin.
Let's take a look on the agenda.
We will cover few topics.
We will go from the problem to the solution that I propose.
The problem is about financial transactions and
what problems could we have.
After that we will shift to traditional approach and their limitations.
And we, of course, speak about asynchronous processing, about
Kafka, how to implement asynchronous processing using Kafka, why do
we need it, how it could help us.
And the main topic is AccuModel, of course.
Let's start.
So what is financial transaction?
We will use our company as an example, and we will review the standard
process for our cases customer can create a transfer in our app.
So for example, you want to send 100 to 90 euros to someone.
You're creating the transfer in our app.
You send 100 payment to our app and we can send euros to your recipient.
Let's simplify it a bit.
So we are waiting for dollars.
We run some checks.
We send euros.
Sounds simple, right?
But in fact, what happens in the real world?
First of all.
As we discussed before, customer creates USD to Euro transfer system, weights
for USD, the same as before, right?
System writes the payment details.
So we should know when we've got your payment amount sender, recipient,
your bank information and any other information that we need.
After that, system runs on checks.
What checks?
For example, we must check sanction lists.
We should check your payment in our anti fraud system.
We should check your payment limits.
We should calculate fees and many more.
As many as you can imagine.
Of course, we have to exchange currencies to send another
currency to your recipient.
And finally, we can send EUR to the recipient.
Where is the problem?
Of course, there can be many problems.
In terms of compliance, operations, finance and so on.
But we will review only one technical problem today.
The problem is lost update.
Let's take a look on the diagram.
What happens here?
When transfer system receive your payment, we should find
proper transfer in the database.
After finding, we should check the sender and recipient, for example,
in our compliance service, we are getting, okay, everything is fine.
So we can go further.
We are processing the request, doing some relations, applying
fees, exchanging currencies.
And finally, we should update the transfers.
So just to set the status, let everything's fine.
So we've paid out the money for your recipient.
Where the problem?
The problem is when compliance system decided to cancel the transfer
at the same time while we are processing the original transfer.
So the process looks mostly the same, but after successful checks in Compliance
service can decide to cancel the transfer.
So it hits in point transfer slash cancel.
And we should select transfer for restriction and update transfer as well.
Let's take a look on two transactions here.
On the left, we have compliance transaction.
On the right, we have our payment to our system.
So we are starting the transaction.
We are selecting transfer We are getting result.
So result is the same.
So ID 1 status created.
Everything should be fine in both transactions After that, we
are setting different statuses.
For compliance, we should set status cancelled.
For our payment, we should set status payment received Okay, cool.
We're committing the changes Where is the problem?
The problem is here If we will select transfer right after this
changes, we can get undefined status.
So because we have two concurrent transactions and we don't know
what exactly we want to have at the end, and we cannot guarantee
the order of the statuses.
So the status can be undefined, it can be canceled, it can be payment
received or any other, of course.
Let's try to solve it and we'll talk about traditional approaches.
Of course, I don't have the time to provide all possible
options, how to solve it.
And also we are limited by a technology that we use now.
Let's imagine we have Postgres database, right?
And we can solve it like a, in, in that way.
We have database transaction, how it looks like the same as before.
We're starting the transaction.
We are selecting transfer for update now, and we can commit the changes.
Everything is okay.
Let's try to implement it on the diagram.
So this is the same diagram as before.
Except just one small thing.
Instead of just selecting the transfer, we are selecting it for update.
We are starting from payment receiving process, so we are selecting for
update, we are checking sender and recipient, everything is fine.
At that moment, compliance system decided to cancel and restrict transfer.
It does select for update as well.
But it should wait until we have one open transaction for update for
one single row or for many rows.
Postgres knows that everything should be blocked until the
first process will be completed.
So the one, the first process when we are receiving the payment is Pending
now, so we are selected transfer.
We are doing some calculations.
We are working on the processing.
And finally, we do update on the step six.
Once we committed changes, we can continue with the second process.
We're selecting we already sent select for update request from
compliance service the, our database, and we were waiting for the result.
And now the result is finally here and we can continue.
With everything solved, actually.
So everything is super simple here.
So there are only two Database transactions.
We don't have any new abstractions or any new tools, but where is the limitation?
Let's make a quick calculation.
So let's imagine we have processing time, average processing time, five seconds
And we have 100 operations per second.
That means that we should have 500 active transactions.
Doesn't seem complicated, right?
Postgres definitely could handle this.
And yes and no.
There are two drawbacks.
First of all is resources.
We have many active transactions per second.
that do nothing, actually, because we are waiting, we are going to external
services, we are writing some audit data, we are getting data from another service.
So many things.
So while we do this transaction is active and we cannot work with this locked
row, for example, with this transfer.
And the second point is about your connection pools.
So I believe most of you have a connection pooling system in your apps.
So for example it can be limited up to 16 connections to your Postgres
instance and everything should be fine actually, because we have different
transfers, but the problem will be if you have multiple concurrent.
Transactions in the database.
Let's imagine we have 16 concurrent process at the same time So that means
that the first connection will be acquired by actual transaction So we
are we got the result from the database.
We start doing something We are processing it to five second and finally we can
release during this time The second operation will be waiting at the third
and the fifth and the sixth and so on so The contention of the on the lock
and on the data database in fact if we Want to work with the same transfer?
it will be quite a problem because at the end you can end up with just
a Empty connection pool because every connection is busy And you cannot run
anything else on these connection pools.
You cannot serve customers requests, you cannot serve in other
transfers, and here's the problem.
Okay, let's try to solve it in another way.
Let's try locks.
Of course, locks can be implemented in different systems in different ways.
But in fact, today we'll talk about local locks.
And distributed locks.
Let's start with local locks.
In Golang it can be done via just a mutex object.
How it works?
We should run something on the transfer.
We are trying to acquire the mutex and defer a function to unlock it.
If mutex was already locked the goroutine will be unlocked.
We'll be painting actually.
We will wait until the lock will be resolved.
Everything is simple, right?
But the problem is obvious, actually.
So if you have multiple nodes.
You have multiple endpoints or processes or whatever, or maybe your
consistent rotor on NGINX or any another balancer was broken or after
start you can end up with this scheme.
So one node.
Holds a lock, local lock for one transfer and the node 2 got a request for the
same transfer with different endpoint.
That means that locks doesn't work at all, so you don't need it.
Let's try another approach.
So this is about distributed locks.
Of course, it's become a bit A bit complex because you shouldn't produce
new infrastructure things here.
And also you should understand how it works and your engineers as well.
So let's start with diagram.
So the key difference actually is that instead of just that.
Mutex in a single node application, you have distributed
lock manager, how it works.
Node 1 can request for lock in a lock manager, and if it was not acquired by
another process, we can run the lock.
If node 2 will try to request the same lock, it will wait.
So it works the same.
So if node one has the access, it can safely access some resources.
And if everything is done, so we can just release the lock.
And right after that, lock manager will run the lock to the second node.
Everything is super simple here.
About storages.
Actually, there are a lot of options.
We will just just name a few.
Hazelcast, Zookeeper, UTCD, Console, Redis, whatever you need,
whatever you know, whatever you like, so you can use everything.
What about limitations here?
Limitations are quite Quite complex, actually, because under load,
it can be quite challenging to understand how the system will work.
Because of problem of ordering.
So let's imagine you have multiple nodes, for example, a cluster with 16 nodes.
They can be placed in a different zones, in a different data centers, in
a different networks, because of many things, just have different timing to
access your lock manager, for example, Hazelcast, or for example, Zookeeper.
This is the first problem.
So we cannot understand why we've got a lock to the node that, for example,
call it later than the first one.
So it is quite challenging to understand how it works, just because
of the problem of the ordering.
Timeouts.
There are actually multiple timeouts.
So first of all, it is lock acquisition timeout.
So we As engineers should understand, so I'm trying to get a lock in our lock
manager we should be able to set some time out because we cannot wait infinite
time for this lock because it just, it will just hang in the process, right?
And we should understand how to set this time out, how to guarantee this time out.
And the second point is how to manage this timeout when we already acquired the lock.
Let's imagine we've spent three seconds to wait for a lock.
And now we should limit the time when we process, when we run actual
logic, back end logic, right?
And we already have two points here, so we should respect timeouts.
We should respect holding timeouts and we should understand how to
work with them What should we do if we didn't get the lock?
Probably everything should be fine, right?
So we just didn't do anything What if we got the lock?
but we lost the lock or The process was crashed When we hold the lock
this is challenging, actually.
Because you should respect it in your code.
So every process should understand that it can be it can be dropped.
The server can fail and how to restore the process.
And you should understand this partial state that was in, that was
implemented during the lock holding.
And of course, potential deadlocks.
That means that, for example, we want to hold a lock for transfer, for
check, for customer, for recipient.
And the, Modify the state, we need to touch every object there.
And potentially it can be challenging to understand and to guarantee that
there is no deadlock installed.
Deadlock is when, for example, we've got a lock for transfer, we are
trying to get lock for recipient, but other process started with the
recipient and after the transfer.
The second.
The second transaction got a lock for cpn and after that it's trying
to get a lock for transfer, but it holds this lock is for this, the
first node for the first process.
And it is quite challenging and I don't know how to guarantee that
there is no deadlocks at all instead of just testing, but testing is
super complex in distributed systems.
So I don't have so many times to test distributed system on my own.
Okay.
Let's write the switch to asynchronous processing.
Let's first define what the transfer model is.
Actually, it is finite state machine, right?
Because of transfer has multiple statuses, state transitions occur
via commands, each state defines allowed commands, and commands
trigger actions and state changes.
That's all.
So finite state machine is the same, right?
So it's just a switching before states because of something, because of
some events or commands or whatever.
So we cannot go from payment received to payment waiting just because
it's impossible in our domain.
Let's take a look on the code.
First of all, let's define transfer.
For simplicity, it could have only id and status.
What is VSAM here?
Vini State Machine.
So we have a transfer created.
We have command.
Received payment and we can move our transfer to payment received, right?
So let's define statuses.
We have created, payment received, checks sent, checks pending, and so on.
So all our tree of choices and transitions from status to status.
And of course, we need to define a command.
So a command is something that can happen in your in your application.
It can be done via HTTP endpoints, your PC endpoints
broker messages time or whatever.
So in fact, we need just something to tell to the transfer that it
should move from one step to another.
And let's take a look on the code, how could we handle this comment.
Of course, we we won't review everything here.
We will start with something super simple.
So payment receive payment comment.
First of all, we should check the status.
The code is simplified.
We ignore error handling here error cases and so on.
So just the pure logic.
We should check the status.
So we've got a receive payment command.
We know that status should be only status created.
If not, we should throw some error.
We need to write payment details.
We have to set status.
We have to save the state.
And we should tell to the transfer.
that we need to run some checks to move the transfer further.
Okay.
Cool.
Requirements for asynchronous processing.
So now we know the model.
Now we understand the transfer is actual FSM model and we need comments.
We need messages.
Now we need states.
Let's make some requirements for our system.
First of all, we need communication through messages.
Messages are comments or events or whatever.
This is something that we can tell from one service to another service
that It needs to run something.
Also, we need one at a time message handling.
That means that if we got a command for transfer A, only one command
can be executed for this transfer.
If we have the second, it should be queued somehow.
Also, we need durable message store.
That means that we should keep our messages for some time because of
node failures, location failures, data centers failures and many more
because of many reasons, actually.
Because we we would need to replay messages somehow.
Maybe we need analytics.
Maybe we deployed something broken.