Conf42 Chaos Engineering 2024 - Online

Embracing Resilience: Unleashing the Power of Chaos Engineering in CD Pipelines


Abstract

Unlock resilience in your CD pipeline! Join our talk on Chaos Engineering—proactively injecting controlled failures to fortify systems, identify vulnerabilities, and foster a culture of continuous resilience. Elevate your software delivery game with chaos.

Summary

  • Welcome to the session on unleashing the power of chaos engineering in CD pipelines. Sarthak and Saranya are senior software developers at Harness. Saranya will show a demo of the strategies they follow. I hope the agenda is interesting enough to keep you with us till the end.
  • Chaos engineering involves intentionally injecting controlled disruptions into systems. This process allows teams to identify weaknesses and vulnerabilities before the system reaches the production environment. Adding a chaos step in the CD pipeline ensures continuous validation of your system's resilience and reliability.
  • Saranya: I'm going to explain how we can add chaos as a step in CD pipelines. The demo uses an online boutique shopping application with a microservice-based architecture. For the demo purpose, I have just added a sleep of 10 seconds; you can add your own commands as per your own requirements.
  • The boutique CPU hog experiment puts a load on the CPU in the target cluster where the application is deployed. The expected resilience score is the minimum resilience score that we expect the application to have. After multiple runs, we can refine this value.
  • We have this chaos step, but where is the rollback step? The failure strategy defines what to do when this particular step fails. You also have other options like mark as success, retry and so on, and you can add other steps in parallel or in serial.
  • A chaos step can be added natively using Harness Chaos Engineering and Harness CD pipelines. Instead of manual intervention, it can also be triggered using webhooks. And we hope we have convinced you to add one more chaos step to your pipelines and ensure the continuous resilience of your application.

Transcript

This transcript was autogenerated.
Hi everyone, welcome to the session on unleashing the power of chaos engineering in CD pipelines. I'm Sarthak and with me is Saranya. We are both senior software developers at Harness and also maintainers of LitmusChaos, which is an open source tool that allows you to practice chaos engineering. We have attached our LinkedIn and Twitter profiles in case you want to reach out to us after this session. So, the agenda: I'll be talking about chaos in CD pipelines. I know it is something that not many organizations practice and that is new for many SREs and DevOps engineers. There are organizations that don't even practice chaos engineering as a standalone task, and they are surely missing out on a lot because of this. I'll also share some interesting stats that I've pulled from Google, which may get you thinking, followed by why chaos is important and how it can help you build a better product. And at the end, Saranya will show a demo of the strategies we follow and how we do it at Harness to make our product resilient. I hope the agenda is interesting enough to keep you with us till the end. All right, let's get started. I know this is the honest reaction of DevOps engineers and SREs who are peacefully running their CD pipelines. You have done a great job building a deployment pipeline for your application, but trust me, adding a chaos step is worth it, and I bet you will be convinced by the end of this talk. Chaos in CD pipelines, in simple terms, means adding disciplined chaos to your pipeline and checking how the system reacts to these disruptions. The goal is to ensure that the CD pipeline remains stable, reliable and capable of delivering software smoothly even when unexpected challenges arise. Let's look at some interesting stats. Downtime is expensive: the Gartner IT Key Metrics Data report states that the average cost of downtime is around $5,600 per minute, which is a lot. And it is not only the cost, but the customer impact as well. Poor system reliability and unexpected failures can lead to a decline in customer satisfaction and loyalty; according to Zendesk, 39% of customers will avoid using a product or service after a bad experience, for obvious reasons. So the question is how to improve these numbers, and the answer is quite simple: make your system resilient and improve MTTR, the mean time to repair. Chaos engineering can help you achieve this, and injecting chaos in your CD pipelines helps you automate the process. Let's see why a chaos step in the deployment pipeline is useful. The first reason is early issue detection. Chaos engineering involves intentionally injecting controlled disruptions into systems, and this process allows teams to identify weaknesses and vulnerabilities before the system reaches the production environment. By identifying and addressing issues early, teams can prevent these problems from causing significant outages or failures in the live production systems and avoid their impact on users. For example, suppose a chaos experiment reveals that a certain component of the system becomes unresponsive under high load conditions. Injecting this fault early, say in QA, allows the team to optimize the component before it impacts the end user, and this is a very common situation. Next, we have enhanced resilience.
Anyone who has read about chaos engineering must know that this is the primary aim of chaos engineering: to make systems more resilient. Chaos experiments are designed to test how well a system can adapt to unexpected conditions and continue to function without major failures. By deliberately introducing chaos and monitoring system behavior, teams can implement strategies that enhance the system's overall resilience. For example, if a chaos experiment simulates a sudden increase in traffic or a server failure, the team can use the insights gained to implement auto-scaling mechanisms or other solutions to tackle it. Then we have improved incident response. By intentionally introducing chaos in your CD pipeline, you create controlled scenarios where things can go wrong. This provides an opportunity to identify weaknesses in your system's response mechanisms and improve your incident handling procedures. For example, consider a scenario where you simulate a sudden surge in user traffic during the chaos step. This allows your team to assess how well the system handles unexpected spikes and how quickly it can scale resources to maintain performance. Any issues discovered during this scenario can be addressed, which leads to an improved incident response when similar situations occur in the real world. Then we have continuous validation. For me, this is the most important one. Adding a chaos step in the CD pipeline ensures continuous validation of your system's resilience and reliability. Regularly subjecting the system to controlled chaos helps validate that it can withstand unexpected events and disruptions throughout its lifecycle, not only once. Practicing chaos engineering once or twice and thinking your application is resilient is like hitting the gym on New Year's and thinking you will be fit for the rest of the year; that's not how it works. Continuous validation ensures that you don't just test for resilience once and forget about it. Instead, you regularly subject your system to different chaos scenarios and validate that it adapts and responds well to these challenges. It's like giving your system a regular workout to ensure it stays in top shape, reducing the risk of unexpected failures when it matters most. So when you deliver your product to the customer, you are confident that you have done the testing, that it is not going to cause any disruptions, and that it will be a smooth experience for the customers. Then we have increased adoption of chaos. Even though chaos is not a new term in the industry anymore, and I'm pretty sure whoever is watching this session is well aware of what chaos engineering is now, which was not the case maybe three or four years back, there are still teams and organizations that are reluctant to use it in their systems. Adding and automating chaos through CD pipelines and starting out with simple chaos experiments will help them gain confidence and allow them to adopt chaos engineering practices. By gradually injecting chaos scenarios into the CD pipeline and noticing the positive impact on system reliability, the team becomes more comfortable and confident in adopting chaos engineering practices.
This increased acceptance leads to a culture where chaos is seen as a means of continuous improvement rather than a potential risk. I know the fear is still there in the market that chaos can cause disruptions, that our system may not be healthy enough to handle all of this, and that we don't need to do chaos testing in our system, but it is not like that. To start off, you can begin with some simple chaos experiments, integrate them into your CD pipelines, and you will see the results. So these are the five important reasons you can benefit from if you integrate chaos into your CD pipelines. Just a meme here; no developers were hurt in making this meme, it's only here to lighten things up. Enough of theory now, time for the demo. Hope the demo gods are with us. Over to you, Saranya. Thanks, Sarthak. Hey everyone, this is Saranya, and I'm going to give a brief demo, or more of a walkthrough, of how we can add chaos as a step in CD pipelines. So, in the name of the demo gods, let's get started. Before going to the pipeline itself, let me explain the environment. This is the application being deployed: an online boutique shopping demo application with a microservice-based architecture. It has various microservices such as a cart service, a checkout service and a currency converter service. Basically, you can do the very basic functionality of online shopping: change the quantity, add items to the cart, and place an order. This is the application we are going to deploy. Now let me show you the CD pipeline. This is the pipeline we have created, and the first step is obviously the deployment step, a rollout deployment. The next step is the observe deployment step, where we generally observe how the application behaves, whether it is healthy or not. If it is not, we can simply roll it back; otherwise we go further to the chaos step. For the demo purpose I have just added a sleep of 10 seconds, but you can add your own commands as per your own requirements. Then comes the chaos step. Here I have added a chaos experimentation step with the boutique CPU hog experiment, which puts a load on the CPU in the target cluster where the application is deployed. This is the expected resilience score; I'll come back to it in a while. And this is the chaos infrastructure detail: the target infrastructure where the application has been deployed, the namespace, and the fault, which is pod CPU hog. Coming to the expected resilience score, let me give you a quick refresher on how chaos engineering is generally carried out. The very first step is to identify the steady state, the steady-state hypothesis, that is, how the application behaves when it is healthy. First you need to identify that, then we introduce a fault and check whether the SLOs are met or not. If yes, the application is resilient; otherwise a weakness is found, we improve upon it, and then we run this complete cycle again and again. In connection with this first step, the steady-state hypothesis, we have the expected resilience score.
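For readers who want to see what the pod CPU hog fault described above looks like as configuration, here is a minimal sketch of a LitmusChaos fault definition. The names, namespace, labels, service account and durations are illustrative assumptions rather than values from the demo environment, and the exact schema can differ between LitmusChaos versions.

```yaml
# Minimal sketch of a LitmusChaos pod-cpu-hog fault (illustrative values only).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: boutique-cpu-hog            # hypothetical name
  namespace: boutique               # hypothetical namespace
spec:
  engineState: active
  appinfo:
    appns: boutique                 # namespace of the target application (assumed)
    applabel: app=cartservice       # label selector for the target pods (assumed)
    appkind: deployment
  chaosServiceAccount: litmus-admin # service account with permissions to run the fault (assumed)
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # how long the CPU load is injected, in seconds
              value: "60"
            - name: CPU_CORES              # number of cores to stress in each target pod
              value: "1"
```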
The expected resilience score is the minimum resilience score that we expect the application to have for the chaos step to be considered successful. You can give it any number, but to give you an idea of how to decide on this number: if you click here, you can see all the previously run experiments and take a cue from their last resilience scores. For this experiment it is 100, and for the Lambda function timeout experiment it is 50%. Based on these last resilience score values we can decide, and then refine it; after multiple runs we will obviously get a feel for what value to set there. Then, coming back to this particular step and the diagram Sarthak already showed, you must be wondering: we have this chaos step, but where is the rollback step? If you click on this chaos step and go to the advanced section, we have the failure strategy, that is, what strategy to adopt when this particular step fails. In this pipeline we have chosen the rollback stage, so if the step fails it will simply roll back to the healthy deployment. If I come here, this is a different pipeline where I can show how to add it. This is the chaos step, and if I go to the advanced section, in the failure strategy you can see we have this rollback stage, and we also have other options as a failure strategy, such as manual intervention, where after a timeout you can mark it as success or ignore it, as per your own requirement. In QA you can simply ignore, abort or mark it as success, but in a prod environment it is advisable to roll back to a safer deployment. Here you also have other options like mark as success, retry and so on. In addition to this, I also want to let you know that you can add other steps in parallel or in serial, so you can add another chaos step: if I click here, you choose an experiment, choose the expected resilience score, and add it like this. Coming back to our original demo pipeline, I think we are clear about this particular pipeline, and due to time constraints I won't be able to run it because it takes quite some time, but I can explain the failure and success cases from already executed pipelines. We have run it multiple times, so let me first go to this failed pipeline. Here you can see that this chaos step has failed because the expected resilience score is 90, but the resilience score we got after the execution is 66; that's why this particular step failed. As a result, the rollback step got triggered, because, as I already showed you, we have chosen rollback as the failure strategy. The rollback step ran, and after that a health check to ensure the deployment is healthy was also executed. This is what happens when the chaos step fails. Now let's go to the successful one. Here you can see the pipeline execution has been successful, because the expected resilience score is 90, whereas the actual resilience score is 100.
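As a reference, the rollback-on-failure behaviour described above is declared as a failure strategy on the step. The following is a rough sketch of how this can be expressed in Harness pipeline YAML; the step type, identifiers and the expected resilience score field are assumptions for illustration and should be verified against the Harness documentation for your version.

```yaml
# Rough sketch of a chaos step with a rollback-on-failure strategy in a Harness
# pipeline (step type and field names are assumed; verify against your Harness version).
- step:
    type: Chaos                       # assumed step type name
    name: boutique-cpu-hog            # hypothetical step name
    identifier: boutique_cpu_hog
    spec:
      expectedResilienceScore: 90     # minimum score for the step to pass (assumed field name)
    failureStrategies:
      - onFailure:
          errors:
            - AllErrors
          action:
            type: StageRollback       # roll the stage back to the last healthy deployment
```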
If you want to see this particular chaos execution in detail, you can just click here, and this brings us to the chaos execution view. Here you can see the steps: first the install chaos step, then the actual fault execution, and here you can see all the required probes. The first one is the cart service availability check, then the boutique website latency check, and the pod status check. All these probes have passed, resulting in a score of 100%. If you wish to see the logs, they are also available here, and the fault configuration can be found here as well. So this is how you can inspect the chaos experimentation in detail. Before wrapping up, I have two more things to share. Everything I showed in this demo can be done natively using Harness Chaos Engineering and Harness CD pipelines, but if you want to integrate the chaos step externally, APIs are already available and can be used. For example, we have integrated it with GitLab. Here you can see the same steps, the deploy step and the chaos step, and in this case it failed, so the rollback happened. So a similar thing can be done using GitLab as well; if you click here, you can get all the details of this particular step, such as the logs, and find out why it failed. So this can also be done using the available APIs. One last thing: in the pipeline I showed, I triggered the pipeline manually, but it can also be triggered automatically based on a webhook, based on some condition. Here you can see it got triggered by one such webhook, namely cart service deploy changes. So whenever there is a change in the deployment, the pipeline automatically gets triggered, and you can see the detailed execution; in this case it passed. So instead of manual intervention, it can also be triggered using webhooks. That's how you can integrate chaos as a CD pipeline step, and I hope we have convinced you to add one more chaos step to your pipelines and ensure the continuous resilience of your application. With this, I would like to thank you for watching us till the end.
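The resilience probes mentioned in the demo (cart service availability, website latency, pod status) are typically declared alongside the fault and are what the resilience score is computed from. Below is a hedged sketch of what such probes can look like in LitmusChaos, declared under the experiment spec in the ChaosEngine; the URLs, namespaces and timings are placeholders, and the exact probe schema varies between LitmusChaos releases.

```yaml
# Sketch of LitmusChaos resilience probes similar to the ones in the demo
# (URLs, namespaces and timings are placeholders; schema varies by release).
probe:
  - name: cartservice-availability-check
    type: httpProbe                      # hit an HTTP endpoint and compare the response code
    mode: Continuous                     # evaluate repeatedly while the fault is running
    httpProbe/inputs:
      url: http://cartservice.boutique.svc.cluster.local:7070   # placeholder service URL
      method:
        get:
          criteria: "=="
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
  - name: pod-status-check
    type: k8sProbe                       # query the Kubernetes API for resource state
    mode: EOT                            # evaluate once, at the end of the test
    k8sProbe/inputs:
      version: v1
      resource: pods
      namespace: boutique                # placeholder namespace
      fieldSelector: status.phase=Running
      operation: present
    runProperties:
      probeTimeout: 5
      retry: 2
```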
...

Saranya Jena

Senior Software Engineer @ Harness

Saranya Jena's LinkedIn account Saranya Jena's twitter account

Sarthak Jain

Senior Software Engineer @ Harness

Sarthak Jain's LinkedIn account


