Conf42 Site Reliability Engineering 2021 - Online

Improving resilience through continuous SLO validation using chaos engineering


Abstract

In this session, Uma Mukkara will talk about how Site Reliability Engineers can use Chaos Engineering to do continuous validation of Service-level Objectives and thereby improve the resilience of the systems they are operating on.

Summary

  • Chaos Native provides chaos engineering solutions for improving the reliability of cloud native and hybrid services. You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud.
  • In the cloud native space, reliability is a bigger challenge. What is chaos engineering? It's about breaking things on purpose. Practice chaos engineering as an extension to your existing development or DevOps processes. The end result is that you build complete digital immunity for your services.
  • Chaos engineering is used to guarantee that no matter what, your service is okay. The best practice of chaos engineering is not to do it only in Ops, but to introduce chaos as a culture into your entire DevOps.
  • Litmus has been adopted by some of the big enterprise DevOps teams. You can use Litmus to do real, practical chaos engineering in your DevOps. Use it for use cases such as SLO validation and management.
  • LitmusChaos is a popular open source chaos engineering platform. It's conveniently hosted for free at Chaos Native Cloud. The entire suite of experiments is available on Chaos Native Cloud, or you can host it on premises.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with Chaos Native. Create your free account at Chaos Native Litmus Cloud. Welcome to Conf42 SRE 2021. I am Uma Mukkara, and I will be speaking today about how you can use chaos engineering to do continuous validation of SLOs, the service level objectives, and thereby improve the resilience of the services you are operating. Before we start, a little bit about who we are. Chaos Native provides chaos engineering solutions for improving the reliability of cloud native and hybrid services. I am Uma Mukkara. I'm the CEO of Chaos Native and also a co-creator and maintainer of the popular open source chaos engineering project LitmusChaos. So let's talk about service uptime, which is the primary deliverable of an SRE. Why is reliability so important, and why is there such a big function called SRE, or site reliability engineering, around it? Digitization, or digital transformation, is a reality today. Ecommerce traffic is increasing multifold, and retaining customers is really important. Retaining your users' satisfaction is really important. A few transaction drops could cause a lot of damage to your credibility, and you could lose those customers just because you dropped a few transactions out of thousands, or sometimes millions. So in general, in the modern era, end users' expectations of services have increased. We are also delivering software faster, which really means the software needs to be more reliable than ever. So we need faster changes to the software and more reliable services; at least that is the expectation. Testing mechanisms have also improved. People are doing a great amount of good quality testing through DevOps, but that is not sufficient. The proof is that we continue to see service outages now and then. So the idea here is to use surprise testing in production: continue to break things in production so that you find the weaknesses early and fix them, and you continuously improve the reliability of the service. These are some example companies that are using chaos engineering to improve their service reliability. And in the cloud native space, there is another reason why reliability is a bigger challenge. In traditional DevOps you build and ship at a certain interval, and in cloud native the same is being done much faster: you build fast and you ship fast. Also, the number of microservices or application containers you need to deal with has increased many fold. What this really means is that you are getting more binaries, or more application changes, into your service environment, and you are also getting them faster. So the chance that something else can fail and affect your service is higher now, in fact multifold higher, maybe ten times, maybe a hundred times. So reliability is a bigger question: what happens if something else fails, will I continue to work properly? That's the question an application or a service would be asking in the cloud native space. So, to summarize, the microservices application that you're developing in cloud native is less than 10% of the code in your entire service.
Your service depends on other cloud native services, other cloud native platforms such as Kubernetes, and the underlying infrastructure services. So you need to validate not just the negative and functional scenarios within your application, but also whether your service will continue to run fine if a fault happens in any of the other 90% of dependent services or software. That's the bigger challenge we are dealing with in cloud native, and the answer is: practice chaos engineering. What is chaos engineering? I'm already doing some testing; I'm doing some negative testing, or failure testing as it is called. Mostly that's about application-related negative scenarios. Here we are talking about introducing chaos testing, which is about dependent component failures and how those will affect your application service. So we are saying that you power up your product engineering with the addition of chaos engineering, and it is always an incremental process. You cannot change the reliability of the system through chaos engineering in a quarter or two quarters. It's an incremental process, a continuous process. And just like any other engineering process, you need to practice chaos engineering as an extension to your existing development or DevOps processes. The end result is that you build complete digital immunity for your services: whatever happens, whatever fails, I will continue to run fine. That's the promise you are trying to achieve. So if you want me to summarize, what is chaos engineering? It's about breaking things on purpose in DevOps. It could be in production or pre-production, or at the time of development itself. But trying to do this as an engineering process is what makes chaos engineering very, very effective. You try to cover the entire set of chaos or fault dependency scenarios, you design chaos experiments, you try to automate all these chaos workflows, and you try to collaborate across your DevOps. If you are in Ops, you try to collaborate with Dev and vice versa, and you try to integrate chaos engineering into your existing tools. That becomes complete chaos engineering, and it will incrementally result in better metrics related to service operations. So this is a very simple way of saying how to do chaos engineering, and the section below talks about how you do that in cloud native. Introducing a fault is one way to describe chaos, but chaos engineering is really asking: are my service level objectives continuing to be met? That's the real end goal. If your SLOs are being met, then you're good. If not, then there is a problem which you need to fix. And how you would do this in cloud native is by following the same principles that you follow for your application development and operations: using operators and custom resources. One example that I will be talking about a little later is LitmusChaos. LitmusChaos follows this approach to do chaos engineering in a completely cloud native way. So the chaos engineering summary is: if you have any challenges related to service failures, or being able to reproduce a service failure, or being unable to recover from a service failure fast enough, then you probably need chaos engineering. And if you invest in chaos engineering, these are the benefits, or the return on your investment: you will have a faster way to inject a failure or identify a failure scenario, you will have reduced MTTR and better MTTF, and you will have increased time between failures, which is exactly what you want: fewer outages.
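To make the "operators and custom resources" point concrete, here is a minimal sketch of declaring a pod-delete fault as a Litmus ChaosEngine object and submitting it with the Kubernetes Python client. The field layout follows Litmus's v1alpha1 CRDs as commonly documented, but treat the exact fields as approximate; the namespace, app label, and service account names are hypothetical placeholders.

```python
# A minimal sketch of the "operators and custom resources" approach: a pod-delete
# fault declared as a Litmus ChaosEngine object and submitted with the Kubernetes
# Python client. Field names follow Litmus's v1alpha1 CRDs as I understand them;
# the namespace, app label, and service account below are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "checkout-pod-delete", "namespace": "default"},
    "spec": {
        "engineState": "active",
        "chaosServiceAccount": "pod-delete-sa",   # hypothetical service account
        "appinfo": {                              # the application under test
            "appns": "default",
            "applabel": "app=checkout",           # hypothetical target label
            "appkind": "deployment",
        },
        "experiments": [
            {
                "name": "pod-delete",
                "spec": {
                    "components": {
                        "env": [
                            {"name": "TOTAL_CHAOS_DURATION", "value": "60"},
                            {"name": "CHAOS_INTERVAL", "value": "10"},
                        ]
                    }
                },
            }
        ],
    },
}

# The Litmus chaos-operator watches ChaosEngine objects and runs the declared fault.
api.create_namespaced_custom_object(
    group="litmuschaos.io",
    version="v1alpha1",
    namespace="default",
    plural="chaosengines",
    body=chaos_engine,
)
```

The point of the declarative form is that the fault definition can be stored, reviewed, and scheduled like any other Kubernetes manifest, rather than living in ad hoc scripts.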
So what are some of the chaos engineering use cases? We have all heard of game days, where you try to surprise the operations team, and that's generally how you start chaos engineering. But once you buy into the benefits of chaos engineering, you try to introduce it into your dev pipelines, CI pipelines, or your quality engineering pipelines and test beds. And if you're looking at Ops as an SRE, you will use chaos engineering for continuous validation of your service level objectives. That's really the goal you would look for as an SRE. Let's look at what a service level objective is and what service level objective validation really means in a bit more detail. So if you ask me what an SLO, or a service level objective, is, it's really as simple as telling whether my service is operating optimally and correctly, as expected, and typically SLOs are observed. You have good monitoring dashboards and monitoring systems, and the answer you're trying to get is: is my service running optimally right now, can I say that? And also about history: was there a problem last day, last week, last month, how has my service been performing in the recent past? So that is SLO observation. Then what is SLO validation? SLO validation is really about: so far so good, but how will my service be in the next minute? What happens if something goes wrong? Can I guarantee my service will stay up? So that's validating an SLO. How do you do that? You continuously pull in some fault against your service: a dependent failure is scheduled, and then you validate whether your service continues to perform, or whether your SLO is met. That's the idea of validating an SLO and making sure that my service will continue to run fine no matter what happens. In other words, chaos engineering is used to guarantee that no matter what, your service is okay. And the best practice of chaos engineering is not to do it only in Ops, but to introduce chaos as a culture into your entire DevOps. Of course, there will be some initial inertia from various segments of your DevOps, but the idea is that once, through game days and a significant introduction of the benefits to your organization, you will be able to convince people that chaos engineering is in fact a good practice, and you can start introducing it into your quality engineering efforts, your pipelines, and so forth. And of course, on the operations side, you will do continuous validation of SLOs through chaos engineering. The typical process is: you find a suitable chaos engineering platform. Don't try to do everything on your own; there are tools and platforms available for you to get started very quickly, and much of the work is done by these platforms. Then you start spending time identifying the chaos scenarios, building them properly, designing them properly, implementing them properly, and then start automating. As you start automating, you will find more chaos scenarios, or the need for more chaos scenarios, and then the DevOps process will kick in for chaos engineering as well. So, in summary, the idea of improving reliability is to take an approach of chaos engineering underneath and start improving your reliability in incremental steps.
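As an illustration of the "inject a fault, then check whether the SLO still holds" loop described above, here is a hedged sketch of one SLO check against a Prometheus-style monitoring system. The Prometheus endpoint, the metric names in the query, and the 99.9% availability objective are all hypothetical placeholders, not anything prescribed in the talk.

```python
# A hedged sketch of one SLO check: after a fault has been injected, ask the
# monitoring system whether the availability objective still holds. The Prometheus
# URL, the metric names, and the 99.9% objective are hypothetical placeholders.
import requests

PROMETHEUS = "http://prometheus.monitoring.svc:9090"   # hypothetical endpoint
SLO_AVAILABILITY = 0.999                                # hypothetical objective

# Ratio of successful requests over the last 5 minutes (hypothetical metric names).
QUERY = (
    'sum(rate(http_requests_total{job="checkout",code!~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="checkout"}[5m]))'
)

def slo_met() -> bool:
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return False  # no data during chaos is itself a finding worth investigating
    availability = float(result[0]["value"][1])
    print(f"availability={availability:.5f} objective={SLO_AVAILABILITY}")
    return availability >= SLO_AVAILABILITY

if __name__ == "__main__":
    print("SLO met" if slo_met() else "SLO violated: a weakness to fix")
```

Run on a schedule alongside randomly scheduled faults, a check like this turns "so far so good" observation into continuous validation.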
So let's look at one such platform which will help do chaos engineering. I am a LitmusChaos maintainer, so of course I'll be talking about Litmus here. Litmus has been around for about four years now, and it's been adopted by some of the big enterprise DevOps teams. It's been in good usage: there are continuous downloads of Litmus, about a thousand-plus every day, which shows that many people are using it on a daily basis in their CI pipelines and so forth. More importantly, Litmus is very stable, with 2.0 general availability done recently. It's got a lot of experiments readily available through ChaosHub, and it has a very dynamic community with lots of contributors and many vendors coming in to add significant features and so forth. So you can use Litmus to do real, practical chaos engineering in your DevOps. Overall, you have something called Chaos Center in Litmus. That's where the team, the DevOps persona, whether you're a developer, an SRE, or a QA, can come in and design and develop a chaos workflow or a chaos scenario using your privately hosted chaos hubs, or you can pull the chaos experiments from the public hub if it's accessible from your environment. You end up designing and implementing a chaos workflow, and it can be targeted against various types of resources, including cloud platforms, bare metal, or VMware resources, apart from Kubernetes itself. The typical approach is: you have a lot of experiments of various types, and you have a good SDK as well. If you want a new experiment, you can write one, and if you already have chaos experiment logic, you can pull it in through a Docker container and push it into the LitmusChaos platform very easily. You use these experiments to build a Litmus workflow, you write the steady state hypothesis validation using Litmus probes, and you use it in any of the following use cases: SLO validation or management, which is what we just talked about; continuous chaos testing in your quality engineering or game days; or validating that your observability system is working fine. That last one is another very important use case that I have seen people using chaos engineering for. You have great investments in observability, and you don't know whether those are going to help you when there is a real service outage. How do you know that you've got everything you're going to need when there is a failure? So why don't we introduce a failure, and continue to introduce failures, and see whether your observability platforms, your investments, are yielding the right returns. Also, many of us will have scale testing or performance testing; try to introduce chaos and see whether things will be okay there or not. So these are some of the use cases that you can use chaos engineering for. Let's look at how chaos engineering happens with Litmus. You have Chaos Center, as I said a little before, and your goal is to write fault templates or chaos workflows into a database. It could be a Git-backed database or the database that is provided by Chaos Center itself, like MongoDB or Percona. You will have a certain team of people who will be writing chaos experiments or chaos workflows, and there will be a certain set of members who will be either just viewing what's happening or scheduling such chaos workflows. So Chaos Center allows everybody to collaborate and work together, like in any other typical dev environment.
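The steady state hypothesis mentioned above is written as Litmus probes. Here is a hedged sketch of what an httpProbe attached to a pod-delete experiment might look like, expressed as the Python dictionaries you would embed in a ChaosEngine spec. The probe schema (httpProbe/inputs, runProperties) follows the v1alpha1 format as I understand it and may differ across Litmus versions; the health check URL is a hypothetical placeholder.

```python
# A sketch of a steady state hypothesis expressed as a Litmus httpProbe, attached
# to a pod-delete experiment entry. The probe field names follow Litmus's v1alpha1
# schema as I understand it; the health check URL is a hypothetical placeholder.
http_probe = {
    "name": "checkout-availability-check",
    "type": "httpProbe",
    "mode": "Continuous",   # keep checking throughout the chaos window
    "httpProbe/inputs": {
        "url": "http://checkout-svc.default.svc:8080/healthz",
        "method": {
            "get": {"criteria": "==", "responseCode": "200"}
        },
    },
    "runProperties": {"probeTimeout": 5, "interval": 2, "retry": 1},
}

# Fault and hypothesis validation travel together as one "Lego block":
experiment_with_probe = {
    "name": "pod-delete",
    "spec": {
        "components": {"env": [{"name": "TOTAL_CHAOS_DURATION", "value": "60"}]},
        "probe": [http_probe],
    },
}
# This dict would replace the experiment entry in a ChaosEngine spec, as in the
# earlier ChaosEngine sketch.
```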
So once you have the fault templates, you're going to schedule them against various resources, validate resilience, and generate reports. More importantly, Litmus also has additional advanced features like auto remediation. If the blast radius happens to be larger than expected, or your chaos is getting out of control, you can take remediation actions through Litmus. We also have command probes that run during chaos or post chaos; you can take any action that you want, and as an action you would initiate a remediation task here to control things or bring back the services quickly. So the typical process, to summarize, is: you introduce the platform, you develop chaos scenarios, you automate them, and once you find a particular experiment beneficial, you put it into your regular QA as well. That is shift-left chaos testing. What approach do you take, and where in the stack do you start chaos engineering? Typically you start with the infrastructure layer, which is the easiest most of the time. Then you go into your message queues or proxy servers, like Kafka et cetera, the middle layer and API servers. Then you get into your databases and stateful applications, and finally you have your actual application layer itself. Let's look at how SLO validation happens in LitmusChaos. Litmus has a Lego block, as I call it: you have an experiment, a chaos experiment or Litmus experiment, and the experiment has two parts. One is about the fault itself: how do you declaratively specify a fault, how long should the fault last, and what are its parameters? The other, the probe, is the steady state hypothesis validation: what can I keep observing before the chaos, during the chaos, and after the chaos? There are multiple types of probes that Litmus gives you, and together they are what make a chaos experiment. You can consider that as a Lego block, which is declaratively very efficient for tuning a given fault and a given steady state hypothesis validation. You have many such Lego blocks, or chaos experiments, in the Litmus ChaosHub. You use them to build a meaningful chaos scenario, like a Lego toy, and you schedule that. Once you schedule it, the steady state hypothesis validation is already built into the workflow, and you just observe the result. If the resilience score provided by the Litmus workflow is good, that means you're good and your service continues to do fine; otherwise you have an opportunity to go and fix something. So the summary for SREs here, as far as SLO validation is concerned: take a look at the chaos coverage across your service stack, try to design and implement chaos tests across the service stack, and try to schedule them with a surprise. You have to run them continuously, with some randomness in what gets scheduled. You've got tens or hundreds of chaos scenarios and you don't know which one is going to get scheduled; that's the surprise. But definitely something is going to get scheduled, a fault is always happening, and you are continuously validating. If it is validated, that's exactly what you want: your service is good. If it's not, that's also good news; you found a weakness and you're going to fix it. So that's how chaos engineering can be used to continuously validate your service resilience.
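To show how the outcome of such a run might be observed programmatically, here is a hedged sketch that reads back the ChaosResult object the Litmus operator writes and checks its verdict and probe success percentage. The status field names and the "<engine-name>-<experiment-name>" naming convention are assumptions about the v1alpha1 schema; verify them against your Litmus version.

```python
# A hedged sketch of observing the outcome: Litmus records each run in a ChaosResult
# custom resource whose status carries a verdict and a probe success percentage.
# Field names and the "<engine-name>-<experiment-name>" result name are assumptions
# about the v1alpha1 schema; check them against the Litmus version you run.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

result = api.get_namespaced_custom_object(
    group="litmuschaos.io",
    version="v1alpha1",
    namespace="default",
    plural="chaosresults",
    name="checkout-pod-delete-pod-delete",   # hypothetical engine/experiment names
)

status = result.get("status", {}).get("experimentStatus", {})
verdict = status.get("verdict")                      # e.g. "Pass" or "Fail"
probe_success = status.get("probeSuccessPercentage")

if verdict == "Pass":
    print(f"steady state held (probe success {probe_success}%), SLO validated")
else:
    print(f"weakness found (verdict={verdict}); fix it before it becomes an outage")
```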
How can you get started? LitmusChaos, again, is a popular open source chaos engineering platform, conveniently hosted for free at Chaos Native Cloud, and you can sign up and get started. The entire suite of experiments is available on Chaos Native Cloud, or you can host it on premises. We also have an enterprise offering where you get enterprise support with some additional features. So with that, I would like to thank you for listening. You can reach out to me at my Twitter handle. Thank you.

Uma Mukkara

CEO @ Chaos Native



