Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Alparslan Avci, and today I will talk about distributed transactions in service meshes. Let me introduce myself first. As I mentioned, I'm Alparslan Avci, and I'm working as a software developer at Zapata Computing, currently with the platform team. We deal with cloud-native issues and build the backend applications of the quantum engine for the quantum computing team. You can always reach me by email at alpersonalgy@gmail.com.
Let's start our talk by imagining a large ecommerce business. Imagine that this business handles thousands of transactions per day and sells a wide variety of goods. People use our websites, maybe multiple websites, from multiple locations and time zones. Also imagine that you are a developer in this company, with hundreds of technical colleagues working on the applications and on DevOps, and with multiple specialized teams working in different locations and remotely. We can say that this is the modern enterprise company structure. So one day your CTO or your executives come to you and ask this question: if it were in your hands, how would you set up our software architecture from scratch? Which approach would you use, a monolith or microservices? When we talk about monolith versus microservices, this is not just a simple choice; there are multiple different approaches hiding behind these labels. But we know that a monolith is really difficult for this kind of enterprise, so we will most likely end up using the microservices approach. So eventually you choose a microservices architecture, but the problems only start there.
Firstly, you will have multiple different technologies running in different environments, and you need to make them run stably and reliably. You might have front-end services and back-end services talking to each other, external clients accessing your front-end services, and most likely some message brokers used for communication between these services. There will also most likely be multiple different systems of record, third-party libraries, and so on. So you need to deal with lots of problems. First we need to understand where we run our microservices. As you know, in modern application ecosystems they most likely live in a containerized environment, where we will most likely use Docker as the container engine and Kubernetes as the container orchestration tool. And as the number of microservices increases, you will need new tools. You know that the number of microservices will grow over time, because you will need new features and you will run into scalability issues, so the number will increase.
And you will need these new tools because you will need easier maintenance of your application environment. These new tools will include some kind of service mesh. So what is a service mesh? A service mesh is an application that runs in the infrastructure layer and enables secure communication among services. We can say that it takes over the east-west communication among services, and by taking over this communication you can control the whole interaction between them. For example, you can manage the traffic, you can manage security, and you can also provide observability to your developers and users. Currently there are multiple service mesh products, but the top three are Istio, Consul, and Linkerd. The most popular one is Istio, which we will also use in our demonstrations, but Linkerd and Consul are also used, and there are other implementations of the service mesh idea as well. Okay, we set up our service mesh and we tried to solve some issues. But since you have multiple services and multiple systems of record, you will most likely end up with the dual-write problem, which we will explain in a moment. It means that you will need to access multiple systems of record at the same time, and you will need transactional operations. And since you are running in a distributed environment, prepare yourself for distributed transactions.
Let's talk about the dual-write problem first. This represents a really small part of our large ecommerce business. Imagine a front-end service that accepts requests from users and starts a transaction. This transaction can be something like purchasing a good from your website, and most likely you will be accessing multiple services at the same time and will need to run them within the same transaction boundary. For instance, you might need to access the user service, the product service, and the banking service at the same time, and these might be using separate databases, such as the user DB, the banking DB, and the product DB. If one of our services crashes, imagine we cannot access the banking service during this transaction, then we need to roll back our updates on the product service and the user service. So let's demonstrate this problem to understand it better. We will have a similar environment in our demonstration. Let's first see our application. We will have multiple services: a deployment for the user service, another deployment for the product service, and another one for the banking service. The one that will serve as the front end will be the ecommerce service, and as you can see here, this ecommerce role knows about the other three services, so it can communicate with those microservices. So let's apply these.
Our deployments are being applied and the services are created. Meanwhile, while these resources are being created, I also need to mention that we have already installed Istio in our environment. Let's list the services in the istio-system namespace; you can see that there are multiple services running for Istio. When we check the pods in the default namespace, you will see that our pods are now running and that there are two ready containers in each pod: one of them is the actual application container, and the second one is the sidecar, the proxy sidecar for the Istio service mesh.
Our services and pods are running. Let's check the services to get the external IP for the front-end service; this is the external IP that we will be using. We also have our Kiali front end to see the running services; this is one of the tools that come out of the box with the Istio service mesh. You can see that the default Kubernetes service and the other services that we are running are all here. So let's try to access our front end, which is running on port 881.
Yes, this is the simple front end for our ecommerce site. You can see that there is only a single product, with various information about the product, the user, and the store itself. We get the product information, such as the price and the stock, from the product service, and we have the available user credits here, which are 10,000 pounds. You can think of this as credits available to the user, which might have been applied through a voucher or something like that. We will be using these credits, and this part also accesses the user service. So let's add this product to the basket and try to buy it.
In this area we specify how many products we are buying; let's say it will be three. You can see that this part will be deducted from your bank account, which in our application means the amount will be deducted through the banking service. And this part will be deducted from your available user credits. So let's say that we will take 3,000 from our available user credits and the remaining 10,500 pounds from the bank account. When I click on buy now, this application intentionally throws an error in the banking service, so we will see an error here. But in the dual-write problem that we are trying to demonstrate, what we expect is this: since we have an issue in the banking service, the quantity, the stock of the product, should not change, since we haven't actually purchased it, and we also shouldn't see any change in the available user credits. But since we don't have any transactional operation right now, we will have some problems.
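The failure mode we are about to demonstrate can be sketched in a few lines of Python; the in-memory dicts standing in for the three databases, and all the numbers, are purely illustrative:

```python
# Minimal sketch of the dual-write problem: three in-memory dicts stand in
# for the user, product, and banking databases (all names are illustrative).

user_db = {"credits": 10_000}
product_db = {"stock": 10}
banking_db = {"balance": 50_000}

def purchase(quantity, credits_used, bank_amount):
    # Each write commits independently -- there is no shared transaction.
    user_db["credits"] -= credits_used
    product_db["stock"] -= quantity
    # The third write fails, but the first two are already persisted.
    raise RuntimeError("banking service unavailable")

try:
    purchase(3, 3_000, 10_500)
except RuntimeError:
    pass

# The failed purchase still mutated the other two stores:
print(user_db["credits"])   # 7000
print(product_db["stock"])  # 7
```

Because each write commits on its own, the failure of the third write leaves the first two applied, which is exactly what the demo is about to show.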
Let's click on buy now. Yes, we see the internal server error here. Let's go back to our main page. Now you can see that the product stock is seven, which was ten in the beginning; three units were lost during the failed transaction. And we can also see that the current available user credits are 7,000, which should still be 10,000, because we haven't completed the transaction; we had issues in the banking service. So this is the demonstration of the dual-write problem. What do we need to do to fix this problem? There are multiple approaches that you can use, but the most popular ones are the two-phase commit and distributed sagas. We will start with the two-phase commit.
The two-phase commit is actually very similar to local transactions; it tries to bring the local transaction approach to the distributed environment. You will have a coordinator, and there will be two phases, as the name suggests. The first phase is the prepare phase. When the user tries to start a transaction, the front-end service communicates with the coordinator, and the coordinator starts the prepare phase: it contacts all of the services and asks them to get their changes ready. They make the changes ready, then report back to the coordinator saying, yes, we are ready for the second phase. If any of the services fails in the first phase, the coordinator rolls back the transaction by contacting the remaining services. But if every service says it is ready for the commit phase, then the coordinator commits the transaction. As we mentioned, if for instance the banking service says it is not ready, I cannot commit, there is a problem on my side, then the coordinator calls the user service and the product service to roll back their changes.
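As a rough sketch, not an actual protocol implementation, the coordinator logic could look like this in Python; `Participant` is a hypothetical stand-in for a service's local transaction:

```python
# Hedged sketch of a two-phase-commit coordinator. A Participant stands in
# for a service that can prepare, commit, or roll back its local changes.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: make the change durable but not yet visible.
        self.state = "prepared" if self.can_commit else "failed"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def two_phase_commit(participants):
    # Phase 1: ask every participant to vote.
    if all(p.prepare() for p in participants):
        # Phase 2: everyone voted yes, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # At least one vote was no: roll back the prepared participants.
    for p in participants:
        if p.state == "prepared":
            p.rollback()
    return False

services = [Participant("user"), Participant("product"),
            Participant("banking", can_commit=False)]
print(two_phase_commit(services))   # False
print([p.state for p in services])  # ['rolled_back', 'rolled_back', 'failed']
```

Note how the coordinator is in the critical path of every decision, which is where the blocking behavior and the single point of failure mentioned below come from.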
The advantage of the two-phase commit is strong data consistency: you will have consistent data at all times. You will also have read-write isolation, because you will not have any stale reads or anything like that. But the disadvantage of the two-phase commit is that it is a blocking approach: while a transaction is running, you cannot access the same records at the same time. So it blocks, which is a really big negative in a distributed environment. Another thing is that your data sources must be XA-compatible, and you might have scalability issues because of the coordinator. Also, since the coordinator is a kind of single point of failure, you need to handle failures of the coordinator itself and figure out a solution.
The other approach is distributed sagas, which we will also use in our demonstration. This is a more suitable approach for distributed environments, because in distributed environments the applications are responsible for their own execution without having to know about the others; they are isolated, in a sense. In distributed sagas, every service is responsible for its own transaction, and if it fails to perform its operation, it lets the others know about the failure, and the other microservices take care of the rest by themselves.
Imagine that we have the same scenario. The user accesses the front-end service, and the application starts a transaction by simply calling the services. The user service starts a transaction for itself, the product service starts a transaction for itself, and the banking service does so too. Now let's imagine that the banking service fails again. When it fails, it simply calls the compensation actions of the other services: it calls the product service and the user service through their compensation calls. What is a compensation action? We can say that it semantically undoes the action. You need to develop extra logic to undo the action; it might be deleting a record from the database, changing some fields back, or something like that. So it is not a proper rollback, but a semantic undoing of the updates.
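Here is a minimal Python sketch of this idea, with hypothetical action/compensation pairs loosely mirroring our three services (the function names and amounts are illustrative, not a real saga framework):

```python
# Saga-style compensation: each step pairs a local action with a semantic undo.

def reserve_credits(state): state["credits"] -= 3_000
def release_credits(state): state["credits"] += 3_000  # compensation
def reduce_stock(state):    state["stock"] -= 3
def restore_stock(state):   state["stock"] += 3         # compensation
def charge_bank(state):     raise RuntimeError("banking service failed")

def run_saga(state, steps):
    done = []  # compensations for the steps that already succeeded
    try:
        for action, compensate in steps:
            action(state)
            done.append(compensate)
    except RuntimeError:
        # Undo the completed steps in reverse order.
        for compensate in reversed(done):
            compensate(state)

state = {"credits": 10_000, "stock": 10}
run_saga(state, [(reserve_credits, release_credits),
                 (reduce_stock, restore_stock),
                 (charge_bank, lambda s: None)])
print(state)  # {'credits': 10000, 'stock': 10}
```

Unlike the two-phase commit sketch, nothing here holds a lock across services: each action commits locally, and only failures trigger the undo chain.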
This figure actually represents the saga better, because it is just a serial run of multiple services. Imagine that you are booking a trip: you first plan the meeting, then book the flights, then book the hotel, and then book the transport. Now imagine that, running these operations, you plan the meeting successfully and book the flight successfully, but you have an issue when booking the hotel. The hotel booking then cancels itself and calls the cancellation action of the flight booking; the flight booking is canceled and in turn cancels the meeting plan by calling its compensation action. So this is a simple representation of distributed sagas. We also need to mention two basic approaches to distributed sagas: the first one is orchestration, and the second one is choreography. In the orchestration approach, there is an orchestrator that manages the transaction itself. This is not the same thing as the coordinator in the two-phase commit; the orchestrator just helps the distributed applications communicate with each other. In the choreography approach, all of the services are responsible for themselves, and they most likely use some kind of event sourcing or messaging to communicate with each other.
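The choreography variant can be sketched with a toy in-process event bus; all event names and handlers here are illustrative stand-ins for real services reacting to a message broker:

```python
# Toy choreography: no central orchestrator -- each service subscribes to
# events and reacts on its own, including to failure events.

from collections import defaultdict

subscribers = defaultdict(list)

def subscribe(event, handler):
    subscribers[event].append(handler)

def publish(event, payload=None):
    for handler in subscribers[event]:
        handler(payload)

log = []
subscribe("order.created",  lambda _: log.append("user: credits reserved"))
subscribe("order.created",  lambda _: log.append("product: stock reduced"))
subscribe("payment.failed", lambda _: log.append("user: credits released"))
subscribe("payment.failed", lambda _: log.append("product: stock restored"))

publish("order.created")
publish("payment.failed")  # banking emits this; the others compensate
print(log)
```

The point of the sketch is that compensation happens because each service listens for the failure event, not because anyone tells it what to do.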
The advantages of sagas: as you know, they run asynchronously, and they don't need a homogeneous environment; they can run with heterogeneous distributed components. We also don't need XA transactions or anything XA-specific. And it is scalable, because all of the services run and are responsible for their own transactions. The disadvantage of distributed sagas is eventual consistency; there is no strong consistency, because at some point you can have stale reads. For instance, imagine that when the banking service has its failure, another application might be reading the same related records in the user database and will see the not-yet-compensated data at that exact time. So there is also no write-read isolation. Another issue with distributed sagas is that you need to develop a separate compensation function; you need to provide separate logic for it, there is no simple rollback, and you also need to handle compensation failures.
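One common way to handle compensation failures is to retry the compensation until it succeeds, which is safe as long as compensations are idempotent; a hedged sketch (the retry policy and the flaky endpoint are made up for illustration):

```python
# Retrying a compensation action; compensations should be idempotent so
# that repeating them is safe.

import time

def retry(compensation, attempts=3, delay=0.0):
    for _ in range(attempts):
        try:
            compensation()
            return True
        except RuntimeError:
            time.sleep(delay)
    return False  # escalate, e.g. park in a dead-letter queue for manual repair

calls = {"n": 0}
def flaky_compensation():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("compensation endpoint unavailable")

print(retry(flaky_compensation))  # True (succeeds on the third attempt)
```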
So let's talk about our solution. First we need to talk about the architecture of Istio. As we mentioned in the beginning, Istio is a service mesh that handles service-to-service communication. You can see that there are multiple pods in this system, and there are two planes we can mention. The first one is the control plane, which runs Istio's own applications. The second is the data plane, where your applications run; in the data plane there is a proxy on each pod. Istio uses the Envoy proxy to take over the communication. Here you can see that the ingress traffic comes into the applications and the proxy takes it over, and you can also see that there is no direct communication from service A to service B: it happens only through the proxies. So the east-west communication always happens between the proxies. Another feature that we will be using is Envoy proxy filter chains. This is a really useful feature of Envoy with which you can enhance or block the data communication. For instance, you can have HTTP filters that add information, update the headers, or change the data itself, or you can even block the traffic or cut the circuit using these filters at any time.
Our solution can be illustrated like this. Again, we have the user, product, and banking services, and in these pods we have the Envoy proxies provided by the Istio service mesh. We attach a filter to these Envoy proxies, and these filters communicate with our own application, which we call the propeller app. The filters provide information about the transaction, and the propeller app simply stores the transaction information in a Hazelcast cluster. So let's see it in action. Our application is still running, and we can also see the Kiali console. You can see that this is the previous run of our demonstration: we access the ecommerce service, and the ecommerce service accesses the user, product, and banking services accordingly.
Here we will first create a Hazelcast cluster. This Hazelcast cluster will have three replicas, and we will access it through a Hazelcast service. Our propeller application will also have three replicas, and it will simply store the transaction data in the Hazelcast cluster. So let's first apply the deployment configuration for the Hazelcast cluster. Okay, our Hazelcast cluster is being created; now let's apply the propeller app configuration. Yes, they are all being created.
Let's check the pods first. They are all running; yes, nearly all of them are up, and the last instance of the propeller app is also ready. Let's check one last time: everything is running, and you can see two containers in each pod, one being the actual application container and the second the Envoy proxy. What we will also do, as mentioned in the diagram, is apply a filter to the Envoy proxy. Let's first apply it and then discuss the details.
Let's apply the propeller filter YAML. Yes, it's created. So let's discuss the propeller filter. This is the part where the transaction, the saga, runs. It is an EnvoyFilter, a custom resource provided by Istio, and it is an HTTP filter: it runs at the HTTP level and is inserted before the operation, so it runs before each operation on the pods. It's a Lua filter; you have multiple options here, but I will be using Lua to implement this filter. There are two main functions: one runs on each request, and the second runs on each response to the pod. What happens on the request is that we get the transaction ID from the headers and send it to the propeller app; you can see here that there is a put with a key and a value, just an entry, and here is the internal address of the propeller app. The second function is the on-response function. On each response we check the HTTP status of the response. If it is a failure, then we call the compensation actions. How do we do that? We get the transaction ID again from the header, we ask the propeller app for the information about the current transaction, and for each endpoint in that transaction we call the compensation action. And so the compensation actions are called.
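The filter's request/response logic, translated loosely from the Lua just described into a Python sketch; the dict stands in for the Hazelcast map kept by the propeller app, and the header names and `compensate` callback are illustrative assumptions:

```python
# Python mirror of the Envoy Lua filter's logic (names are illustrative;
# a dict stands in for the Hazelcast map behind the propeller app).

transactions = {}  # transaction id -> list of participant endpoints

def on_request(headers):
    # On each request, record which service this transaction touched.
    txn_id = headers["x-transaction-id"]
    endpoint = headers[":authority"]
    transactions.setdefault(txn_id, []).append(endpoint)

def on_response(headers, status, compensate):
    # On a 5xx response, look the transaction up and undo every participant.
    if status >= 500:
        txn_id = headers["x-transaction-id"]
        for endpoint in transactions.get(txn_id, []):
            compensate(endpoint)  # e.g. a call to that service's compensation endpoint

undone = []
on_request({"x-transaction-id": "t1", ":authority": "user-service"})
on_request({"x-transaction-id": "t1", ":authority": "product-service"})
on_response({"x-transaction-id": "t1"}, 500, undone.append)
print(undone)  # ['user-service', 'product-service']
```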
So again we will go and check our EnvoyFilters. Yes, the filter is applied. Now let's run our demo again. We have the same application; as you most likely noticed, we haven't redeployed our application, we have just applied the filter. Note that our current product stock is seven and the available user credits are 7,000. Let's try to purchase this product again; this time I will take four of these products. Let's continue with this and talk about it later. And let's take 5,000 from our user credits.
When I click on buy now, if the distributed saga works, then the product stock and the available user credits shouldn't change: the product stock should stay at seven, and the available user credits should stay at 7,000. Let's click on buy now. Yes, we get the internal server error, and when we go back, we can see that the distributed saga worked, because the updates on the available user credits and on the product stock have been rolled back on the user and product services. We can also check here; there is a detailed representation of the operation. You can see that the propeller app is communicating with the Hazelcast cluster. The banking service failed; the graph is not fully updated yet, because in a few seconds we will see that the banking service communicates with the propeller app and then also with the product service and the user service.
But let's go through what happens. The banking service returned an HTTP 500 response, and at the Envoy proxy we intercepted that response. Since it is a 500, we access the propeller app, get the details of the transaction from the Hazelcast cluster, and then, with the information taken from the cluster, we call the compensation actions on each service, which here are the product and user services. So the compensation actions worked, and we rolled back the transaction. Here is the updated version of the graph; it's a pretty complicated one, but you can see that the ecommerce service contacted the propeller app... sorry, the banking service contacted the propeller app, and then the banking service called the compensation actions of the user service and the product service.
Yeah, this is the end of my talk, and I hope you enjoyed it. Thank you.