Conf42 Chaos Engineering 2025 - Online

- premiere 5PM GMT

Chaos Engineering for Resilient Microservices


Abstract

Chaos Engineering is the art of proactively breaking things to build resilient systems; my talk will show you how to design microservices that thrive under failure, not crumble.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, everyone. My name is Muhammad Ahmad Saeed, and today I'm going to talk about chaos engineering for resilient microservices.

First, let's look at a typical microservices architecture. As we can see in the diagram, we have multiple microservices, and each microservice has dependencies. The two main kinds of dependencies are databases and other microservices. For instance, in this diagram microservice one depends on database one and communicates with microservice two via a queue, while the communication between microservice two and microservice three happens via direct HTTP calls. The point to take away is that every microservice has dependencies, whether that is a database or another microservice; there are also other types of dependencies, which we will look into later.

Now let's look at our goals. We have three main goals here. The first is to uncover weaknesses in our system, which includes identifying single points of failure: what happens if one of the microservices goes down, or if a dependency such as a database is overloaded or responding slowly? What is the behavior of our microservices then? That is the first goal, to identify the weaknesses in the system. The second is to improve system robustness, which means making the system more stable and keeping its performance within a defined range. The performance of a microservice should not get worse and worse over time; whatever happens in the system, it should stay within that defined range. The third is to handle unexpected failures gracefully. There can be many unexpected failures in a microservices architecture; the most typical ones are network issues, dependency failures, bad configurations, and data consistency issues. In case of any failure, we should ensure that it is handled gracefully and that one failure does not take the whole service down.

Why is chaos engineering critical for microservices? First, distributed complexity: microservices have many moving parts, such as APIs, databases, message queues, and third-party services, and a failure in one service can ripple through the system. Second, network dependencies: as we saw in the previous slide, microservices communicate over networks, and networks are prone to latency, packet loss, and outages; a lot can go wrong there. Third, dynamic scaling: microservices often run in containerized environments like Kubernetes that scale dynamically, which makes it harder to predict failure modes. One example question is: did this service scale properly when there was a lot of load on the system? Fourth, frequent deployments: microservices are deployed continuously, for instance microservice one every day and microservice two every week, and that increases the risk of introducing bugs or regressions. With proper chaos engineering, we can make sure that if there is a failure somewhere in the system, it does not take the whole system down with it.
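To make the "one failure should not take the whole service down" point concrete, here is a minimal Python sketch of guarding a call to a downstream microservice with a timeout and a fallback. The service name and URL are hypothetical, for illustration only; a missing guard like this is exactly the kind of weakness a chaos experiment tends to uncover.

```python
# Minimal sketch: guarding a call to a downstream microservice so that a slow
# or failing dependency does not take the calling service down with it.
# The service name and URL below are hypothetical.
import requests

RECOMMENDATIONS_URL = "http://recommendation-service:8080/recommendations"

def get_recommendations(user_id: str) -> list:
    """Fetch recommendations, but degrade gracefully if the dependency is unhealthy."""
    try:
        # A strict timeout turns "the dependency is slow" into a fast, handleable error.
        resp = requests.get(RECOMMENDATIONS_URL, params={"user": user_id}, timeout=0.5)
        resp.raise_for_status()
        return resp.json()["items"]
    except (requests.Timeout, requests.ConnectionError, requests.HTTPError, KeyError):
        # Fallback: return an empty (or cached) result instead of propagating the failure.
        return []
```

Under a chaos experiment that slows or kills the downstream service, a caller written this way keeps serving, just with degraded results.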
Let's look at what we can test in a microservices architecture. The first area is service failures: we simulate the failure of an individual microservice. The most common example, at least in the systems I have seen, is the authentication service; as soon as the authentication service is down, it takes most of the other services down with it, because they depend on it. Here we test how the system handles service unavailability or crashes: if one service is crashing, do the other services crash or suffer an outage as well? The second is network issues: we introduce latency, packet loss, and network partitions, and test how the system behaves under slow or unreliable network conditions. The third is dependency failures, such as databases, message queues and brokers, or third-party APIs; again, the goal is to make sure any dependency failure is handled gracefully. The fourth is load and scalability: if one service receives excessive traffic, does it auto-scale and handle the load properly, and how do the services that depend on it behave? The fifth is data consistency: in case of data corruption or delays in data replication, we verify how the system maintains consistency during failures. The sixth is configuration changes: we test how the system reacts to misconfiguration or sudden changes in configuration. Does the bad configuration propagate and take everything down, or is it handled gracefully? The last is cascading failures: we simulate scenarios where one failure triggers a chain reaction across multiple services.

Now let's look at the principles of chaos engineering. The very first step is to define a steady state, which means identifying the normal behavior of the system using metrics such as latency, error rate, and throughput. In normal circumstances those metrics will not be perfect; errors will never be exactly zero, so we first establish the baseline the system is currently at. The next step is to hypothesize about failures: what do we think should happen when a failure occurs? We predict how our system will behave. The third step is to introduce controlled chaos, which means simulating failures such as network delays or server crashes, always in a controlled manner. The fourth is to observe and measure: after we have introduced the chaos, we check the response of the system, and the most important thing is to ensure it can self-recover, because we are doing all of this on production. The last principle is to automate and integrate into CI/CD, and this is a very important step. Once we have some chaos experiments, we have to make sure we run them continuously and frequently; that is the only way to ensure our microservices stay resilient, because we keep testing them in our pipelines.
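As an illustration of the experiment loop just described (steady state, hypothesis, controlled chaos, observe, clean up), here is a minimal sketch. The functions measure_error_rate, inject_network_latency, and remove_fault, and the threshold values, are hypothetical placeholders for your own metrics query and fault-injection tooling.

```python
# Minimal sketch of a chaos experiment: measure a steady state, inject a fault,
# then verify the hypothesis that the system stays within bounds and clean up.
# The three callables passed in are hypothetical placeholders.
import time

ERROR_RATE_BASELINE = 0.01   # 1% errors is our observed "normal" (steady state)
TOLERANCE = 0.02             # hypothesis: even under the fault we stay below 2%

def run_experiment(measure_error_rate, inject_network_latency, remove_fault):
    steady_state = measure_error_rate()
    assert steady_state <= ERROR_RATE_BASELINE, "System not in steady state; abort"

    inject_network_latency(target="payment-service", delay_ms=300)  # controlled chaos
    try:
        time.sleep(60)                        # let the fault act, then observe
        observed = measure_error_rate()
        print(f"error rate under fault: {observed:.3%}")
        assert observed <= TOLERANCE, "Hypothesis rejected: a weakness was found"
    finally:
        remove_fault()                        # always clean up so the system can recover
```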
Let's look at the popular chaos engineering tools. The first, and probably the most famous, is Chaos Monkey. Chaos Monkey was developed by Netflix, and what it does is randomly terminate instances in production, so that if one service is terminated or killed, we see what happens to its dependencies and how the system behaves overall. The next is Chaos Mesh, an open-source platform built for Kubernetes. It is really powerful and flexible, supports diverse fault injection, and has a web UI for managing experiments, so we can create, run, and monitor different experiments through that UI. The next is Gremlin. Gremlin is not open source; it is a commercial platform offering a wide range of attacks that can run against various infrastructure components, including Kubernetes and cloud providers. It has a very user-friendly interface, an extensive attack library, robust capabilities, and integration with monitoring tools; it is a very powerful tool and is used across the industry. The last is Toxiproxy, which is mostly focused on network conditions. It is an open-source TCP proxy for simulating network and system conditions for chaos and resiliency testing; it allows fine-grained control over network impairments and can be used with various tools. All four of these tools are really powerful and quite well known; a small Toxiproxy sketch follows the case studies below.

Let's look at some real-world case studies of chaos engineering. The first example is Slack. As you all know, Slack is a very popular messaging platform used by millions of users across the world. Slack conducts regular chaos experiments to test the resilience of its services and infrastructure overall, and it also runs incident response drills, using chaos engineering to simulate incidents and then checking the responses to those incidents. By using chaos engineering to test and maintain its infrastructure, Slack was able to minimize downtime and maintain a reliable platform for every user; I think their current reliability is somewhere around 99-point-something percent. Another real-world example is Azure Chaos Studio, a managed service built by Microsoft that uses chaos engineering to help measure, understand, and improve service resilience. Here we can set up experiments, configure which faults to run, and select the resources to run our faults against. It is used by a lot of companies to run experiments and improve the resilience of their services.
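For the Toxiproxy tool mentioned above, here is a minimal sketch that adds a latency toxic through its HTTP API. It assumes a Toxiproxy server on localhost:8474; the proxy name, ports, and upstream address are hypothetical, so check the Toxiproxy documentation for the exact fields supported by your version.

```python
# Minimal sketch: injecting latency between a service and its database via Toxiproxy.
# Assumes a Toxiproxy server on localhost:8474; proxy name, ports, and upstream
# address are hypothetical examples.
import requests

TOXIPROXY = "http://localhost:8474"

# 1. Create a proxy that sits between the service and its database.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "orders_db",
    "listen": "127.0.0.1:26379",
    "upstream": "db.internal:5432",
}).raise_for_status()

# 2. Add roughly 1000ms (+/- 100ms jitter) of latency to traffic through that proxy.
requests.post(f"{TOXIPROXY}/proxies/orders_db/toxics", json={
    "name": "slow_db",
    "type": "latency",
    "attributes": {"latency": 1000, "jitter": 100},
}).raise_for_status()

# 3. Point the service under test at 127.0.0.1:26379 instead of the real database
#    and observe its behavior; once the observation window is over, remove the proxy
#    to restore normal conditions.
requests.delete(f"{TOXIPROXY}/proxies/orders_db").raise_for_status()
```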
Let's look at the benefits of adopting chaos engineering for microservices. The first, for any business, is the financial benefit. In case of downtime, the business suffers. Consider a bank as an example: if the banking system is down, it disrupts all of the bank's customers and can result in a significant financial loss. By using chaos engineering to ensure the stability and resiliency of our services, we prevent outages and therefore prevent that financial loss. The next are the technical benefits: reduced incidents, less firefighting, a better on-call experience, and fewer technical issues in the system overall. The third is the customer benefit. It is pretty obvious that if the system is down, the customers are affected, and if customers are not satisfied with the platform, they will simply jump to another one. By keeping the system stable and reliable, with good uptime, we provide a better experience and keep customers satisfied.

Let's look at the best practices for doing chaos engineering with microservices. The first is to start very small: select some non-critical services and begin by running chaos experiments against those. This is very important because we do not want to take a very important service down while doing this exercise, so start with a service that has less usage and is non-critical. Once we have selected the service, we run the experiment in a controlled manner, and by controlled manner I mean using feature flags and making sure we can roll back to a stable state if something goes wrong. Maybe we expected that the failure of one dependency would not have any severe consequences, but the reality turns out to be the opposite; that is why we need feature flags and the ability to roll back. Then we have to monitor everything. There are a lot of observability tools available, such as Prometheus, Grafana, and the ELK stack, so we have to make sure we are monitoring properly. The next practice is automation: chaos engineering should be a regular part of the DevOps pipeline, running continuously on the pipelines, not a once-a-year or once-a-month exercise; it has to happen regularly, and that is the only way to make it truly useful (a minimal sketch of such a pipeline check follows at the end of this transcript). Finally, we should have a recovery plan: we have to make sure we can roll back, and that self-healing mechanisms exist, so that if we face a failure, those mechanisms help us recover from it.

And that's it for today. Thank you, everyone. I hope this was useful. Bye!
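As a companion to the best practices above, here is a minimal sketch of a chaos check that could run in a CI/CD pipeline: it kills a non-critical container and asserts that the service recovers within a time budget. The container name and health URL are hypothetical, and it assumes the Docker SDK for Python and access to a Docker daemon.

```python
# Minimal sketch of a pipeline chaos check: kill a non-critical container and
# assert that self-healing brings the service back within a time budget.
# Container name and health URL are hypothetical; requires `pip install docker requests`.
import time
import docker
import requests

HEALTH_URL = "http://localhost:8081/health"   # hypothetical non-critical service
RECOVERY_BUDGET_S = 60

def test_service_self_heals_after_crash():
    client = docker.from_env()
    client.containers.get("recommendation-service").kill()   # inject the failure

    deadline = time.time() + RECOVERY_BUDGET_S
    while time.time() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=1).status_code == 200:
                return                                        # self-healing worked
        except requests.RequestException:
            pass                                              # still recovering, keep polling
        time.sleep(2)
    raise AssertionError("Service did not recover within the time budget")
```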
...

Muhammad Ahmad Saeed

Software Engineer

Muhammad Ahmad Saeed's LinkedIn account


