Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, my name is Alpersanavji, and today I will talk about distributed transactions in a service mesh.
Let me introduce myself first. As I mentioned, I'm Mark Personavji and I work as a software developer at Zapata Computing, currently on the platform team, where we deal with cloud native issues. We build the backend applications for the quantum engine for the quantum computing team, and you can always reach me by email at alpersonalgy@gmail.com.
Let's start our talk by imagining a large ecommerce business. Imagine that this business has thousands of transactions per day and sells a wide variety of goods. There are multiple applications, and people use our websites across different time zones; maybe we even have multiple websites. Also imagine that you are a developer in this company, with hundreds of technical colleagues working on the applications and on DevOps, and multiple specialized teams. These specialized teams work in different locations, often remotely. We can say that this is the modern enterprise company structure.
So one day your CTO or your executives come to you and ask this question: it's in your hands, how do you set up our software architecture from scratch? Which approach will we use, a monolith or microservices? When we talk about monolith versus microservices, this is not just a simple choice; there are many different approaches you can consider under these labels. But we know that a monolith is really difficult for this kind of enterprise, and we will most likely end up using a microservices approach.
So eventually you choose a microservices architecture, but the problems just start there. Firstly, you will have multiple different technologies running in different environments, and you need to make them run stably and reliably without any issues. You might have front end services and back end services talking to each other. There will be external clients accessing your front end services, most likely some message brokers used to connect these services, and most likely multiple different systems of record, third party libraries and so on. So you need to deal with lots of problems. First, we need to understand where we set up our microservices.
As you know, in modern application ecosystems, microservices most likely run in a containerized environment: we will most likely use Docker as the container engine and Kubernetes as the container orchestration tool. And as the number of microservices increases, you will need new tools. You know that the number of microservices will grow over time, because you will need new features and you will run into scalability issues, so the number will increase. And you will need these new tools because you will need more ease of maintenance in your application environment. These new tools will include some kind of service mesh. So what's a service mesh?
A service mesh is an application that runs in the infrastructure layer and enables secure communication among services. We can say that it takes over the east-west communication among services. And by taking over this communication, you can control the whole interaction between these services: for example, you can manage the traffic, you can manage security, and you can provide observability to your developers and users.
Currently there are multiple service mesh products, but the top three are Istio, Consul and Linkerd. The most popular one is Istio, which we will also use in our demonstration, but Linkerd and Consul are also widely used, and there are other implementations of the service mesh idea as well. Okay, we set up our service mesh and we tried to solve some issues,
but since you have multiple services and multiple systems of record, you will most likely end up with the dual write problem, which we will cover in a few seconds. This means that you will need to access multiple systems of record at the same time, and you will need transactional operations. So since you are running in a distributed environment, prepare yourself for distributed transactions. Let's talk about the dual write problem first.
This represents a really small part of our large ecommerce business. Imagine that there's a front end service that accepts a request from the user and starts a transaction. This transaction can be something like purchasing a good from your website, and most likely you will access multiple services at the same time and need to run them inside the same transaction boundary. For instance, you might need to access the user service, the product service and the banking service at the same time, and these might use separate databases, such as a user DB, a banking DB or a product DB. And if one of our services crashes, for instance if we cannot access the banking service in this transaction, we need to roll back our updates on the product service and the user service.
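The scenario above can be sketched in a few lines. This is an illustrative example only: the service names, stores and amounts are placeholders, not the talk's actual code, and plain dicts stand in for the separate databases.

```python
# Dual-write problem sketch: three independent stores are updated in
# sequence with no shared transaction boundary.

class BankingError(Exception):
    pass

user_db = {"credits": 10_000}     # user service's database
product_db = {"stock": 10}        # product service's database
banking_db = {"balance": 50_000}  # banking service's database

def purchase(quantity, credits_used, bank_amount, banking_up=True):
    # Each update commits immediately against its own store.
    user_db["credits"] -= credits_used   # user service commits
    product_db["stock"] -= quantity      # product service commits
    if not banking_up:                   # banking service crashes...
        raise BankingError("banking service unavailable")
    banking_db["balance"] -= bank_amount # ...so this never runs

try:
    purchase(quantity=3, credits_used=3_000, bank_amount=10_500,
             banking_up=False)
except BankingError:
    pass

# The first two writes were never rolled back, so the data is now
# inconsistent: credits dropped to 7000 and stock to 7 although the
# purchase failed.
print(user_db["credits"], product_db["stock"])
```

This is exactly the inconsistency the demo shows later: the stock and the credits change even though the overall purchase failed.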
So let's demonstrate this problem and try to understand it better. We will have a similar environment in our demonstration. Let's first see our application. We will have multiple services: a deployment for the user service, another deployment for the product service, another deployment for the banking service, and the one that will serve as the front end will be the ecommerce service. As you can see here, this ecommerce service knows about the other three services, so it can communicate with them.
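The deployment for such a front end could look roughly like the sketch below. This is only an illustrative shape, not the talk's actual manifest: the image name, port and environment variable names are assumptions.

```yaml
# Hypothetical Deployment for the ecommerce front end; the env entries
# show one common way for it to "know about" the other three services.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ecommerce-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ecommerce-service
  template:
    metadata:
      labels:
        app: ecommerce-service
    spec:
      containers:
        - name: ecommerce-service
          image: example/ecommerce-service:latest
          env:
            - name: USER_SERVICE_URL
              value: http://user-service:8080
            - name: PRODUCT_SERVICE_URL
              value: http://product-service:8080
            - name: BANKING_SERVICE_URL
              value: http://banking-service:8080
```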
So let's apply these. Our deployments are being applied and our resources are being created. Meanwhile, while these resources are created, I also need to mention that we have already installed Istio in our environment. So let's run kubectl get services for the istio-system namespace.
You can see that there are multiple services running for Istio. When we check the pods in the default namespace, you will see that our pods are now running and that there are two ready containers in each pod: one of them is the actual application container, and the second one is the sidecar, the proxy sidecar for the Istio service mesh. So our services are running, our pods are running. Let's check the services to get the external IP for the front end service. This is the external IP that we'll be using, and we also have our Kiali front end to see the running services; this is one of the tools that comes out of the box with the Istio service mesh. You can see that the default Kubernetes service and the other services we are running are all here. So let's try to access our front end, which will be running on port 881.
Yes, this is the simple front end for our ecommerce. You can see that there's only a single product, with various information about the product, the user and the store itself. We get this product information, the price, the stock and so on, from the product service. And we have available user credits here, which is 10,000 pounds. You can think of this as credits available to the user, which might have been applied by a voucher or something like that. We will be using these credits, and this will also access the user service. So let's add this product to the basket and try to buy it.
In this area we specify how many products we are buying. Let's imagine it will be three. You can see that this part will be deducted from your bank account, meaning that in our application this amount will be deducted via the banking service, so we are trying to access the banking service, and this part will be deducted from your available user credits. So let's say that you will take 3,000 from the available user credits and the remaining 10,500 pounds from the bank account. When I click on buy now, this application intentionally throws an error in the banking service, so we will see an error here.
But what we expect, in the dual write problem we are trying to demonstrate, is this: since we have an issue in the banking service, the quantity, the stock of the products, shouldn't change, because we haven't actually purchased them, and we also shouldn't see any change in the available user credits. But since we don't have any transactional operation right now, we will have some problems. Let's click on buy now. Yes, we see the internal server error here. Let's go back to our main page. Now you can see that the product stock is seven, which was ten in the beginning: we have lost three items for lack of a transaction. We can also see that the current available user credits are 7,000, which should be 10,000, because we haven't completed the transaction; we had issues in the banking service. So this is the demonstration of the dual write problem.
So what do we need to do to solve this problem? There are multiple approaches you can use, but the most popular ones are two phase commit and distributed sagas. We will start with two phase commit. Two phase commit is actually very similar to local transactions; it tries to bring the local transaction approach to a distributed environment. You will have a coordinator, and there will be two phases, as the name suggests. The first phase is the prepare phase. When the user tries to start a transaction, the front end service communicates with the coordinator, and the coordinator starts the first phase, the prepare phase: it communicates with all of the services and asks them to get their changes ready. They make the changes ready and then report back to the coordinator, saying that they are ready for the second phase. If any of the services fail in the first phase, the coordinator rolls back the transaction by communicating again with the remaining services. But if every service says it is ready for the commit phase, then the coordinator commits the transaction.
Yeah, as we mentioned, for instance, when the banking service says it is not ready, that it cannot commit because there is a problem on its side, the coordinator calls the user service and the product service to roll back their changes. The advantage of two phase commit is strong data consistency: you will have consistent data at all times, and you will also have read-write isolation, because you will not have any stale reads or anything like that. But the disadvantage of two phase commit is that it is a blocking approach: while a transaction is running, you won't be able to access the same records at the same time. It runs in a blocking manner, which is a really big negative in a distributed environment. Another thing is that your data sources should be XA compatible, and you might have some scalability issues because of the coordinator. And since the coordinator is some kind of single point of failure, you also need to handle failures of the coordinator and figure out a solution.
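The prepare/commit/rollback flow just described can be sketched as below. This is a minimal illustration under stated assumptions: the `Participant` class and its methods are invented for the example and do not model real XA resources, timeouts or coordinator failures.

```python
# Two-phase-commit sketch: a coordinator asks every participant to
# prepare, then either commits all or rolls back the prepared ones.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):               # phase 1: make the change ready
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit       # the "vote" sent to the coordinator

    def commit(self):                # phase 2, happy path
        self.state = "committed"

    def rollback(self):              # phase 2, failure path
        self.state = "rolled_back"

def coordinator(participants):
    # Phase 1: collect votes from every participant.
    if all(p.prepare() for p in participants):
        for p in participants:       # every vote was "yes" -> commit all
            p.commit()
        return "committed"
    for p in participants:           # any "no" vote -> undo the prepared ones
        if p.state == "prepared":
            p.rollback()
    return "rolled_back"

# The banking service votes "no", as in the talk's example.
services = [Participant("user"), Participant("product"),
            Participant("banking", can_commit=False)]
outcome = coordinator(services)
print(outcome, [p.state for p in services])
```

Note how the whole protocol hinges on the single coordinator, which is exactly the blocking and single-point-of-failure concern mentioned above.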
The other approach is distributed sagas, which we will also use in our demonstration. This is a more suitable approach for distributed environments, because in a distributed environment the applications are responsible for their own work without having to know about the others; they are isolated, in some sense. In distributed sagas, we can say that every service is responsible for its own transaction, and if it fails to perform the operation, it lets the others know about its failure, and the other services take care of the rest by themselves. Imagine we have the same scenario: the user accesses the front end service, and the front end service starts a transaction by just calling the services. The user service starts a transaction for itself, the product service starts a transaction for itself, and the banking service does so as well. Let's imagine that the banking service fails again. When it fails, it just calls the compensation actions of the other services: it calls the product service and the user service using their compensation calls. What's a compensation action, actually? We can say that it semantically undoes an action. You need to develop separate logic to undo the action; it might be deleting a record from the database, changing some parameters or something like that. So it is not a proper rollback, but a semantic undo of the updates.
This figure actually represents the saga better, because it is just a serial run of multiple services. Imagine that you are booking a trip: you first need to plan the meeting, then book the flights, then book the hotel, and then book the transport. Now imagine that while doing these operations you plan the meeting successfully, then book the flight successfully, but you have an issue when booking the hotel. The hotel booking will cancel itself and then call the cancellation action of the flight booking, so the flights will be canceled, and then the flight booking will cancel the meeting plan by calling its compensation action. This is a simple representation of distributed sagas.
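The trip example above can be sketched as a chain of steps, each paired with a compensation that semantically undoes it. This is an illustrative toy, not a real saga framework: the step names and the failure point mirror the example, and a list stands in for the services' state.

```python
# Saga sketch: run steps in order; on failure, run the compensations
# of the already-completed steps in reverse order.

completed = []   # stands in for the state changes made by each service

def step(name, fail=False):
    def action():
        if fail:
            raise RuntimeError(f"{name} failed")
        completed.append(name)
    def compensate():
        completed.remove(name)   # semantically undo the step
    return action, compensate

def run_saga(steps):
    done = []                    # compensations for completed steps
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except RuntimeError:
            for comp in reversed(done):  # unwind in reverse order
                comp()
            return False
    return True

saga = [step("plan_meeting"), step("book_flight"),
        step("book_hotel", fail=True), step("book_transport")]
ok = run_saga(saga)
print(ok, completed)   # the hotel failure unwinds flight and meeting
```

After the run, `completed` is empty again: the meeting and flight steps were compensated, and the transport step was never reached.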
We also need to mention two basic approaches to distributed sagas. The first one is orchestration and the second one is choreography. In the orchestration approach, there is an orchestrator that actually manages the transaction itself. This is not the same as the coordinator in two phase commit; here it just helps the distributed applications communicate with each other. In the choreography approach, all of the services are responsible for themselves, and they most likely use some kind of event sourcing or a similar mechanism to communicate with each other.
The advantages of sagas are, as you know, that they run asynchronously and don't need a homogeneous environment; they can run with heterogeneous distributed components. We also don't need XA transactions or anything XA specific. And sagas are scalable, because each service runs, and is responsible for, its own transaction. The disadvantage of distributed sagas is eventual consistency. There is no strong consistency in sagas, because at some point you can have stale reads: for instance, imagine that while the banking service is failing, a different application reads the related records in the user database at exactly that time and sees the not-yet-compensated updates. So there is also no write-read isolation. Another issue with distributed sagas is that you need to develop a separate compensation function, a separate piece of logic; there's no simple rollback, and you also need to handle compensation failures.
So let's talk about our solution. First we need to talk about Istio, the architecture of Istio. As we mentioned in the beginning, Istio is a service mesh that handles service-to-service communication. You can see that there are multiple pods in this system, and there are two planes we can mention. The first one is the control plane, which runs Istio's own applications; in the data plane your applications run, with a proxy on each pod. Istio uses the Envoy proxy to take over the communication. Here you can see that ingress traffic comes into the applications and the proxy intercepts it, and you can also see that there is no direct communication between service A and service B: it only happens through the proxies. So the east-west communication always happens between the proxies. Another feature we'll be using is Envoy proxy filter chains. This is a really useful feature of Envoy with which you can enhance or block the data communication. For instance, you can have HTTP filters that add more information, update the headers or change the data itself, or you can even block the traffic or cut the circuit using these filters at any time.
Our solution can be illustrated like this. Again, we will have the user, product and banking microservices in these pods, with the Envoy proxies provided by the Istio service mesh. We will attach a filter to these Envoy proxies, and these filters will communicate with our own application, which we call the propeller app. The filters will provide information about the transaction, and the propeller app will just store the transaction information in a Hazelcast cluster. So let's see it in action. Our application is still running, and we can also see the Kiali console.
Yeah, you can see that this is the previous run of our demonstration: we accessed the ecommerce service, and the ecommerce service accessed the user, product and banking services accordingly.
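Before the demo continues, it may help to sketch what the propeller app's role amounts to. This is a hypothetical sketch, not the talk's code: the class and method names are invented, and a plain dict stands in for the Hazelcast map the real app uses.

```python
# Sketch of the propeller app's bookkeeping: per transaction id, it
# remembers which endpoints took part, so that on a failure the filter
# can call each participant's compensation action.

from collections import defaultdict

class TransactionStore:
    def __init__(self):
        # transaction id -> endpoints that joined the transaction
        self._entries = defaultdict(list)

    def put(self, txn_id, endpoint):
        """Called from the request-path filter on every service call."""
        self._entries[txn_id].append(endpoint)

    def participants(self, txn_id):
        """Called from the response-path filter when a call fails."""
        return list(self._entries[txn_id])

store = TransactionStore()
store.put("txn-42", "http://user-service/compensate")
store.put("txn-42", "http://product-service/compensate")

# On an HTTP 500 from the banking service, the filter would look up
# the participants and call each compensation endpoint.
print(store.participants("txn-42"))
```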
Here we will first create a Hazelcast cluster. This Hazelcast cluster will have three replicas, and we will access it through a Hazelcast cluster service. Our propeller application will also have three replicas, and it will just store the transaction data in the Hazelcast cluster. So first let's apply the deployment configuration for the Hazelcast cluster. Okay, our Hazelcast cluster is being created; let's apply the propeller app configuration. Yeah, they are all being created. Let's check the services, or rather the pods first. They're all running; yes, nearly all of them are running, and the last instance of the propeller app is also ready. Let's check one last time. Everything is running, and you can see two containers in each pod: one of them is the actual application container, the second one is the Envoy proxy.
What we will also do, as we mentioned in the diagram, is apply a filter to the Envoy proxy. Let's first apply it and then discuss the details: it's the propeller filter yaml. Yeah, it's created. So let's discuss the propeller filter. This is the part where the transaction, the saga, will run. It's an EnvoyFilter, a custom resource provided by Istio, and it's an HTTP filter: it runs at the HTTP level and is inserted before the operation, so it runs before each operation on the pods.
It's a Lua filter; you have multiple options here, but I will use Lua to implement this filter. There are two main functions: one runs on each request, and the second runs on each response to the pod. What we do on the request is get the transaction id from the headers and send it to the propeller app, which you can see here: there's a put with a key and a value, just an entry, and here is the internal address, the internal domain of the propeller app. The second function is the on-response function. On each response we check the HTTP status of the response. If it is a failure, then we call the compensation actions. How do we do that? We get the transaction id again from the header, and we communicate with the propeller app to get the information about the current transaction. We get the transaction information, and for each endpoint we call the compensation actions for this transaction. And so the compensation actions are called.
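The overall shape of such a filter is sketched below. This is an assumption-laden illustration of an Istio EnvoyFilter with an inline Lua filter, not the talk's actual manifest: the resource name, match context and the Lua bodies are placeholders, with the HTTP calls to the propeller app left as comments.

```yaml
# Hypothetical EnvoyFilter inserting a Lua HTTP filter before the
# router, mirroring the request/response hooks described above.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: propeller-filter
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inlineCode: |
              function envoy_on_request(request_handle)
                -- register this call under its transaction id
                local txn = request_handle:headers():get("x-transaction-id")
                -- ...send txn id + this endpoint to the propeller app here
              end
              function envoy_on_response(response_handle)
                local status = response_handle:headers():get(":status")
                if status == "500" then
                  -- ...fetch the transaction's participants from the
                  -- propeller app and call each compensation endpoint
                end
              end
```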
So again let's go and check our Envoy filters. Yeah, the filter is now applied. So let's run our demo again. We have the same application; as you most likely noticed, we haven't redeployed our application, we have just applied the filter. So let's try again. You can see that our current product stock is seven and the available user credits are 7,000. Let's try to purchase this product again; this time I will buy four of these products.
Anyway, let's continue with this one and talk about it later. And let's also take 5,000 from our user credits. So when I click on buy now, if the distributed saga works, then the product stock and the available user credits shouldn't change: the product stock should stay at seven and the available user credits should stay at 7,000. Let's click on buy now.
Yeah, we have the internal server error, and when we go back, yes, we can see that the distributed saga worked, because we have rolled back the updates on the available user credits and we have rolled back the product stock on the product service. We can also check here; there's a detailed representation of the operation. You can see that the propeller app is communicating with the Hazelcast cluster. The banking service failed; the view is still not fully updated, because in a few seconds we will see the banking service communicating with the propeller app, and then the banking service communicating with the product service and the user service. But let's discuss what happens.
The banking service returned an HTTP 500 response. On the Envoy proxy we intercepted that response, and since it was a 500, we accessed the propeller app, got the details of the transaction from the Hazelcast cluster, and then, with the information taken from the cluster, we called the compensation actions on each service, which are the product and user services. So the compensation actions worked, and we rolled back the transaction. We can check again: yes, here is the updated version of the graph. It's pretty complicated, but you can see that the ecommerce service contacted the propeller app and then the banking service; sorry, the banking service contacted the propeller app, and then the banking service called the compensation actions of the user service and the product service.
This is the end of my talk, and I hope you enjoyed it. I just need to say that this solution is not production ready; there are still some problems. But I tried to show you the available features of service meshes and what kind of solutions we can provide for distributed transactions. I hope you enjoyed it, and thank you for listening.