Conf42 Chaos Engineering 2022 - Online

Distributed Transactions in Service Mesh

Abstract

As we go deeper into cloud-native applications, microservices are becoming a part of any developer’s life. Together with Kubernetes and service meshes, they became the de facto standard in the industry. However, one question arises with microservices: How to implement distributed transactions in such an environment?

In this talk, we will discuss distributed transaction methodologies, talk about real-life scenarios, and provide a hands-on resolution in the Istio service mesh using the Hazelcast application platform. The attendees will easily understand the distributed saga pattern, backend architecture, and the topology of the solutions with live demonstrations.

Summary

  • Real-time feedback into the behavior of your distributed systems, and observing exceptions and errors in real time, allows you to not only experiment with confidence but respond instantly to get things working again. Alparslan Avci talks about distributed transactions in service meshes.
  • In modern application ecosystems, microservices most likely run in a containerized environment, and as their number grows the new tools you need will include some kind of service mesh. Since you have multiple services and multiple systems of record, you will most likely end up with dual write problems.
  • Two-phase commit and distributed sagas are two different approaches. The advantages of two-phase commit are strong consistency and read-write isolation; the disadvantage is that it is a blocking approach. Distributed sagas are the more suitable approach for distributed environments.
  • The solution uses a user service, a product service, and a banking service. In their pods run the Envoy proxies provided by the Istio service mesh. Filters attached to these proxies communicate with a separate application, referred to as the propeller app, which stores transaction information in a Hazelcast cluster. The talk then shows it in action.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Real-time feedback into the behavior of your distributed systems, and observing changes, exceptions, and errors in real time, allows you to not only experiment with confidence but respond instantly to get things working again.

Hello, my name is Alparslan Avci, and today I will talk about distributed transactions in service meshes. Let me introduce myself first. I'm working as a software developer at Zapata Computing, currently with the platform team. We deal with cloud-native issues and build the backend applications for the quantum engine, for the quantum computing team. You can always reach me by email at alpersonalgy@gmail.com.

Let's start our talk by imagining a large e-commerce business. Imagine that this business has thousands of transactions per day and a wide variety of goods. People use our websites, and maybe we have several of them, from multiple locations and time zones. Also imagine that you are a developer in this company, with hundreds of technical colleagues working on the applications and on DevOps. You have multiple specialized teams, and these teams also work in different locations and remotely. We can say that this is the modern enterprise company structure.

So one day your CTO or your executives come to you and ask this question: if it were in your hands, how would you set up our software architecture from scratch? Which approach would you use: a monolith or microservices? Monolith versus microservices is not a simple choice, and there are many different variations you can think of under each of these names.
But we know that a monolith is really difficult for this kind of enterprise, and we will most likely come to the point where we use a microservices approach. So eventually you choose a microservices architecture, but the problems just start there. First, you will have multiple different technologies running in different environments, and you need to make them run stably and reliably. You might have frontend services and backend services talking to each other. There will most likely be external clients accessing your frontend services, there might be message brokers that the services use to communicate, and there will most likely be multiple different systems of record, third-party libraries, and so on. So you need to deal with lots of problems.

First we need to understand where we set up our microservices. As you know, in modern application ecosystems they most likely run in a containerized environment, where we will most likely use Docker as the container engine and Kubernetes as the container orchestration tool. And as the number of microservices increases, you will need new tools. You know that the number of microservices will increase over time, because you will need new features and you will run into scalability issues. You will need new tools to make your application environment easier to maintain, and these new tools will include some kind of service mesh.

So what is a service mesh? A service mesh runs in the infrastructure layer and enables secure communication among services.
So we can say that it takes over the east-west communication among services, and by taking over this communication you can control the whole interaction between these services. For example, you can manage the traffic, you can manage the security, and you can provide observability to your developers and users. Currently there are multiple service mesh products, but the top three are Istio, Consul, and Linkerd. The most popular one is Istio, which we will also use in our demonstrations, but Linkerd and Consul are also used, and there are other implementations of service mesh as well.

Okay, we set up our service mesh and solved some issues. But since you have multiple services and multiple systems of record, you will most likely end up with dual write problems, which we will get to in a few seconds. This means that you will need to access multiple systems of record at the same time, and you will need transactional operations. And since you are running in a distributed environment, prepare yourself for distributed transactions.

Let's talk about the dual write problem first. This represents a really small part of our large e-commerce business. Imagine that there's a frontend service that accepts a request from the user and starts a transaction. This transaction can be something like purchasing a good from your website, and most likely you will access multiple services at the same time and need to run them within the same transaction boundary. For instance, you might need to access the user service, the product service, and the banking service at the same time, and these might access separate databases, such as a user DB, a banking DB, and a product DB. Now imagine that one of our services, for instance the banking service, crashes, so we cannot access it in this transaction.
We need to roll back our updates on the product service and the user service. So let's demonstrate this problem and try to understand it better.

We will have a similar environment in our demonstration, so let's first look at our application. We will have multiple services: a deployment for the user service, another deployment for the product service, and another for the banking service. The one that will serve as the frontend is the e-commerce service. As you can see here, the e-commerce service knows about the other three services, so it can communicate with these microservices. So let's apply this one. Our deployments are being applied, and our services too. While these are being created, I also need to mention that we have already installed Istio in our environment. So let's get the services in the istio-system namespace. You can see that there are multiple services running for Istio.

When we check the pods in the default namespace, you will see that our pods are now running and that there are two ready containers in each pod. One of them is the actual application container, and the second one is the sidecar, the proxy sidecar for the Istio service mesh. So our services are running and our pods are running. Let's check the services to get the external IP for the frontend service. This is the external IP that we'll be using. We also have our Kiali frontend to see the running services; this is one of the tools that comes out of the box with the Istio service mesh. You can see the default Kubernetes service and the other services that we are running are all here. So let's try to access our frontend, which is running on port 881. Yes, this is the simple frontend for our e-commerce application.
You can see that there's only a single product, along with some information about the product, the user, and the store itself. We get the product information, such as the price and the stock, from the product service, and we have available user credits here, which are 10,000 pounds. You can think of this as credit available to the user, which might have been applied by a voucher or something like that. We will be using these credits, and this will access the user service.

So let's add this to the basket and try to buy this product. In this area we specify how many products we are buying. Let's imagine that it will be three. You can see that this part will be deducted from your bank account, which means that in our application this amount will be deducted through the banking service, and this part will be deducted from your available user credits. So let's say that we take 3,000 from our available user credits and the remaining 10,500 pounds from the bank account.

When I click on Buy Now, this application intentionally throws an error on the banking service, so we will see an error here. Since there is an issue on the banking service, what we expect is that the stock of the product does not change, because we haven't purchased anything, and that the available user credits do not change either. But since we don't have any transactional operation right now, we will have some problems. Let's click on Buy Now. Yes, we see the internal server error here. So let's go back to our main page. Now you can see that the product stock is seven, where it was ten in the beginning. We have lost three units during the transaction.
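The failed purchase in the demo can be reproduced in miniature. The following is only a sketch with in-memory stand-ins for the three services; the class and function names (`Shop`, `buy_without_transaction`, `BankingUnavailable`) are hypothetical and are not taken from the demo application. It uses the numbers from the talk: ten units in stock, 10,000 in user credits, three units bought, 3,000 paid from credits, and 10,500 from the bank.

```python
class BankingUnavailable(Exception):
    """Raised by the stand-in banking service to simulate the demo's failure."""


class Shop:
    """In-memory stand-ins for the product, user, and banking services."""

    def __init__(self, stock: int, credits: int) -> None:
        self.stock = stock      # state owned by the product service
        self.credits = credits  # state owned by the user service

    def reserve_stock(self, quantity: int) -> None:
        self.stock -= quantity

    def spend_credits(self, amount: int) -> None:
        self.credits -= amount

    def charge_bank(self, amount: int) -> None:
        # The banking call always fails here, as it does on purpose in the demo.
        raise BankingUnavailable(f"cannot charge {amount}")


def buy_without_transaction(shop: Shop, quantity: int,
                            from_credits: int, from_bank: int) -> None:
    """Three independent writes with no shared transaction boundary."""
    shop.reserve_stock(quantity)
    shop.spend_credits(from_credits)
    shop.charge_bank(from_bank)  # fails after the first two writes succeeded


shop = Shop(stock=10, credits=10_000)
try:
    buy_without_transaction(shop, quantity=3, from_credits=3_000, from_bank=10_500)
except BankingUnavailable:
    pass

# The purchase failed, but the first two writes were never undone.
print(shop.stock, shop.credits)  # 7 7000
```

Nothing in this flow knows that the third write failed, so the first two stay in place. That is the dual write problem.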
We can also see that the current available user credit is 7,000, when it should be 10,000, because we haven't completed the transaction; we had issues on the banking service. So this is a demonstration of the dual write problem.

So what do we need to do to solve this problem? There are multiple approaches that you can use, but the most popular ones are two-phase commit and distributed sagas. We will start with two-phase commit. Two-phase commit is very similar to local transactions; it tries to apply the local transaction approach to the distributed environment. You have a coordinator, and there are two phases, as the name says. The first phase is the prepare phase. When the user starts a transaction, the frontend service communicates with the coordinator, and the coordinator starts the prepare phase. It communicates with all of the services and asks them to get their changes ready. They prepare the changes and then report back to the coordinator that they are ready for the second phase. If any of the services fails in the first phase, the coordinator rolls back the transaction by communicating again with the remaining services. But if every service says that it is ready for the commit phase, the coordinator commits the transaction. For instance, when the banking service says that it is not ready, that it cannot commit because there is a problem on its side, the coordinator calls the user service and the product service to roll back their changes.

The advantage of two-phase commit is strong data consistency, so you will have consistent data at all times. You will also have read-write isolation, because you will not have any stale reads or anything like that.
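The prepare and commit phases described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not a production protocol: the `Participant` class and `two_phase_commit` function are hypothetical names, everything runs in one process, and coordinator failure, timeouts, and locking are left out.

```python
from typing import List


class Participant:
    """A service taking part in the transaction (in-memory stand-in)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: stage the change and vote yes or no.
        if self.can_commit:
            self.state = "prepared"
            return True
        self.state = "aborted"
        return False

    def commit(self) -> None:
        # Phase 2: make the staged change permanent.
        self.state = "committed"

    def rollback(self) -> None:
        # Discard the staged change.
        self.state = "rolled_back"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Coordinator: commit only if every participant votes yes in phase 1."""
    prepared: List[Participant] = []
    for participant in participants:
        if participant.prepare():
            prepared.append(participant)
        else:
            # One "no" vote: roll back everyone who had already prepared.
            for other in prepared:
                other.rollback()
            return False
    for participant in participants:
        participant.commit()
    return True


user = Participant("user")
product = Participant("product")
banking = Participant("banking", can_commit=False)  # the failing service

committed = two_phase_commit([user, product, banking])
print(committed, user.state, product.state, banking.state)
# False rolled_back rolled_back aborted
```

In a real deployment the prepared state would hold locks on the affected records until the coordinator decides.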
The disadvantage of two-phase commit is that it is a blocking approach. While a transaction is running, you cannot access the same records at the same time, which is a really big negative in a distributed environment. Another thing is that your data sources should be XA compatible, and you might have scalability issues because of the coordinator. And since the coordinator is a single point of failure, you need to handle failures of the coordinator and figure out a solution.

The other approach is distributed sagas, which we will also use in our demonstration. This is a more suitable approach for distributed environments, because there the applications are responsible for their own work without having to know about the others, so they are isolated from each other in some sense. In distributed sagas, we can say that every service is responsible for its own transaction, and if it fails to perform the operation, it lets the others know about its failure, and the other microservices take care of the rest by themselves.

Imagine that we have the same scenario. The user accesses the frontend service, and the frontend service starts a transaction by just calling the services. The user service starts a transaction for itself, the product service starts a transaction for itself, and the banking service does the same. Let's imagine that the banking service fails again. When it fails, it just calls the compensation actions of the other services, so it calls the product service and the user service using their compensation calls. What is a compensation action? We can say that it semantically undoes the action.
So you need to develop separate logic to undo the action. It might be deleting a record from the database, changing some parameters, or something like that. So it is not a proper rollback, but a semantic reversal of the updates.

This figure actually represents the saga better, because it is just a serial run of multiple services. Imagine that you are booking a trip: you first need to plan the meeting, then book the flights, then book the hotel, and then book the transport. Imagine that you plan the meeting successfully and book the flight successfully, but you have an issue when booking the hotel. The hotel booking cancels itself and then calls the cancellation of the flight booking. The flight booking is canceled, and it in turn cancels the planned meeting by calling its compensation action. So this is a simple representation of distributed sagas.

We also need to mention the two basic approaches to distributed sagas: orchestration and choreography. In the orchestration approach, there is an orchestrator that manages the transaction. It is not the same thing as the coordinator in two-phase commit; it just helps the distributed applications communicate with each other. In the choreography approach, all of the services are responsible for their own part, and they most likely use something like event sourcing to communicate with each other.

The advantages of sagas are that they run asynchronously and don't need a homogeneous environment; they can run with heterogeneous distributed components. We also don't need any XA transactions or anything XA-specific.
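The trip-booking saga described above could be sketched as follows. This is a minimal illustration, not the talk's implementation: `run_saga` and the step names are hypothetical, and for brevity a single loop triggers the compensations in reverse order, whereas in the talk's description each step calls the compensation of the step before it.

```python
from typing import Callable, List, Set, Tuple

# A saga step: (name, action, compensation that semantically undoes the action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append(step)
    return True


bookings: Set[str] = set()


def fail_hotel() -> None:
    # Simulates the hotel booking problem from the example.
    raise RuntimeError("no rooms available")


steps: List[Step] = [
    ("plan_meeting", lambda: bookings.add("meeting"), lambda: bookings.discard("meeting")),
    ("book_flight", lambda: bookings.add("flight"), lambda: bookings.discard("flight")),
    ("book_hotel", fail_hotel, lambda: None),
    ("book_transport", lambda: bookings.add("transport"), lambda: bookings.discard("transport")),
]

log: List[str] = []
ok = run_saga(steps, log)
print(ok, sorted(bookings))
print(log)
```

Each compensation here is ordinary application code that removes what the action added, which is the point made above: there is no database-level rollback to rely on.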
Sagas are also scalable, because all of the services run, and are responsible for, their own transactions. The disadvantage of distributed sagas is eventual consistency. There is no strong consistency in distributed sagas, because at some point you can have stale reads. For instance, when the banking service fails, a different application might be reading the related records in the user database, and it will see the updated information at that exact moment, before it is compensated. So there is also no read-write isolation. Another issue with distributed sagas is that you need to develop a separate compensation function, with its own logic. There is no simple rollback, and you also need to handle compensation failures.

So let's talk about our solution. First we need to talk about the architecture of Istio. As we mentioned in the beginning, Istio is a service mesh that handles service-to-service communication. You can see that there are multiple pods in this system, and there are two planes. The first is the control plane, which runs Istio's own applications. The second is the data plane, where your applications run. In the data plane, each pod has a proxy; Istio uses the Envoy proxy to take over the communication. Here you can see that the ingress traffic comes into the applications and the proxy takes it over, and you can see that there is no direct communication from service A to service B. It happens only through the proxies, so the east-west communication always happens between the proxies.

Another feature that we'll be using is Envoy's filter chains. This is a really useful feature of Envoy that lets you enhance or block the data communication.
For instance, you can have HTTP filters that add more information, update the headers, or change the data itself. You can even block the traffic or break the circuit with these filters at any time.

Our solution can be illustrated like this. Again, we have the user service, the product service, and the banking service. In these pods we have the Envoy proxies provided by the Istio service mesh. We attach a filter to these Envoy proxies, and these filters communicate with our own application, which we will call the propeller app. The filters provide information about the transaction, and the propeller app just stores the transaction information in a Hazelcast cluster.

So let's see it in action. Our application is still running, and we can also see the Kiali console. This is the previous run of our demonstration: we accessed the e-commerce service, and the e-commerce service accessed the user, product, and banking services in turn. Here we will first create a Hazelcast cluster. This Hazelcast cluster will have three replicas, and we will access it through a Hazelcast cluster service. Our propeller application will also have three replicas, and it will just store the transaction data in the Hazelcast cluster. So first let's apply the configuration for the Hazelcast cluster. Okay, our Hazelcast cluster is being created, so let's apply the propeller app configuration. They are all being created. Let's check the pods. Nearly all of them are running. Let's check one last time: the last instance of the propeller app is also ready. Everything is running, and you can see two containers in each pod: one is the actual application container, and the second is the Envoy proxy.
What we will also do, as shown in the diagram, is apply a filter to the Envoy proxy. So let's first apply it and then discuss the details. It is the propeller filter YAML. Yes, it's created.

So let's discuss the propeller filter. This is the part that the saga runs on. Its kind is EnvoyFilter, which is a custom resource provided by Istio. It's an HTTP filter, so it runs at the HTTP level, and it is inserted before the operation, so it runs before each operation on the pods. It's a Lua filter. You have multiple options here, but I use Lua to implement this filter.

There are two main functions. One runs on each request into the pod, and the second runs on each response. On the request, we just get the transaction ID from the headers and send it to the propeller app. You can see here that there's a put with a key and a value, just an entry, and here is the internal address of the propeller app.

The second function is the on-response function. On each response we check the HTTP status of the response. If it is a failure, we call the compensation actions. How do we do that? We get the transaction ID from the header again and ask the propeller app for the information about the current transaction. Then, for each endpoint, we call the compensation action for this transaction. So let's go and check our Envoy filters. Yes, this filter is applied. So let's try to run our demo again.
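The request and response logic of the filter can be modeled outside Envoy to make the flow easier to follow. The sketch below is written in Python rather than the Lua used in the talk, and it rests on several assumptions that are not in the source: the header name `x-transaction-id`, the `TransactionStore` class standing in for the propeller app and its Hazelcast map, treating any status of 500 or above as a failure, and skipping the failing service when compensating.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

TX_HEADER = "x-transaction-id"  # hypothetical header name


class TransactionStore:
    """In-memory stand-in for the propeller app and its Hazelcast map."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[str]] = defaultdict(list)

    def put(self, tx_id: str, endpoint: str) -> None:
        self._entries[tx_id].append(endpoint)

    def get(self, tx_id: str) -> List[str]:
        return list(self._entries.get(tx_id, []))


class SagaFilter:
    """Models the two filter callbacks attached to one service's proxy."""

    def __init__(self, service: str, store: TransactionStore,
                 compensate: Callable[[str, str], None]) -> None:
        self.service = service
        self.store = store
        self.compensate = compensate

    def on_request(self, headers: Dict[str, str]) -> None:
        # Record that this service takes part in the transaction.
        tx_id = headers.get(TX_HEADER)
        if tx_id is not None:
            self.store.put(tx_id, self.service)

    def on_response(self, headers: Dict[str, str], status: int) -> None:
        # Assumption: only 5xx responses count as failures.
        if status < 500:
            return
        tx_id = headers.get(TX_HEADER)
        if tx_id is None:
            return
        # Compensate every other service recorded for this transaction.
        for endpoint in self.store.get(tx_id):
            if endpoint != self.service:
                self.compensate(endpoint, tx_id)


store = TransactionStore()
compensated: List[Tuple[str, str]] = []


def record_compensation(endpoint: str, tx_id: str) -> None:
    # A real filter would issue an HTTP call to the endpoint's compensation URL.
    compensated.append((endpoint, tx_id))


headers = {TX_HEADER: "tx-1"}
filters = {name: SagaFilter(name, store, record_compensation)
           for name in ("user", "product", "banking")}

for name in ("user", "product", "banking"):
    filters[name].on_request(headers)

filters["user"].on_response(headers, 200)
filters["product"].on_response(headers, 200)
filters["banking"].on_response(headers, 500)  # the failing service

print(compensated)  # [('user', 'tx-1'), ('product', 'tx-1')]
```

Keeping the participant list in a shared store is what lets the proxy of the failing service find out which other services to compensate, without the services themselves knowing about each other.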
Again, we have the same application. As you have most likely noticed, we haven't redeployed our application; we have just applied the filter. You can see that our current product stock is seven and the available user credit is 7,000. So let's try to purchase this product again. This time I will get four of these products. Anyway, let's continue with this one and talk about it later. And let's take 5,000 from our user credits. When I click on Buy Now, if the distributed saga works, then the product stock and the available user credits shouldn't change: the product stock should stay at seven and the available user credits should stay at 7,000. Let's click on Buy Now. Yes, we have the internal server error, and when we go back, we can see that the distributed saga worked, because we have rolled back the updates to the available user credits on the user service and to the product stock on the product service.

We can also check here; there's a detailed representation of the operation. You can see that the propeller app is communicating with the Hazelcast cluster, and that the banking service failed. The graph is still not fully updated; in a few seconds we will see the banking service communicating with the propeller app and then with the product service and the user service. But let's discuss what happens. The banking service app returns an HTTP 500 response. In the Envoy proxy we intercept that response, and since it is a 500, we access the propeller app and get the details of the transaction from the Hazelcast cluster. With the information taken from the cluster, we call the compensation actions on each service, which here are the product and user services. So the compensation actions ran, and we rolled back the transaction. We can check again.
Yes, here is the updated version of the graph. It's a pretty complicated one, but you can see that the e-commerce service contacted the propeller app and then contacted the banking service. The banking service contacted the propeller app, and then the banking service called the compensation actions of the user service and the product service.

This is the end of my talk, and I hope you enjoyed it. I just need to say that.
...

Alparslan Avci

Software Developer @ Zapata Computing

Alparslan Avci's LinkedIn account


