Abstract
Imagine this: you’re a Site Reliability Engineer (SRE) at a tech giant, responsible for the overall health of a system running in prod. Numerous alerts, server crashes, Jira tickets, incidents, an avalanche of responsibilities that sometimes simply feels like a ticking time bomb. These are just some of the daily struggles an average SRE has to go through. But why should it be like that? Well, it shouldn’t, thanks to a term coined by Gartner in 2016. AIOps, meet audience. Audience, meet AIOps.
Let’s extend this scenario. On top of all of the above-mentioned issues, our poor SRE needs to watch out for potential security breaches and make sure nothing ever slips through the cracks. By conducting proactive experimentation, continuous verification and improvement, he makes sure that the system is able to withstand the turbulent and malicious times we’re living in. Do these notions ring any bells? They sure do! Chaos Engineering, meet audience. Audience, meet Chaos Engineering.
What’s our angle, you’re wondering? AIOps and CE are two concepts that are often kept separate. In this talk, we will discuss (and show you!) how the two practices combined can significantly increase cyber resiliency while maintaining full end-to-end (E2E) transparency and observability of your entire system.
For this session, we have prepared and analyzed several use cases, followed the main principles, summarized best practices, and prepared a live demo built on a combination of CE and AIOps tools.
Above all, we are SREs. As such, during this session we will stay close to the SRE principles and best practices that we used to achieve our goals: reduce organizational silos, measure everything, learn from failures, analyze changes holistically, and so on. As we proceed with our talk, the audience will be able to identify how these relate to AIOps as well as to CE, and finally, how it all ties together.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, and thank you for watching today's session, Chaos Experiments Under the Lens of AIOps. My name is Michaelem. I'm an SRE and DevOps engineer, and I have a background in software engineering, AI, and industrial automation. So naturally, I would say, all of these areas pointed me towards AIOps, which is my current specialization.
This is our agenda for today. I'm going to define the basic concepts surrounding AIOps, followed by an in-depth view of how it can be applied to chaos engineering. I think the highlight of this session for many will be our live demo, after which we'll wrap up and summarize everything that we've talked about here today.
Before we start, let's set the scene. These are our two focus concepts. AIOps is a term used to indicate the application of ML and analytics to IT operations in order to prevent system degradation and failure, in simple words. And then we have chaos engineering, abbreviated CE throughout this presentation, which is, as you know, all about experimenting and testing our system's resiliency.
Now imagine this: a hardworking SRE, a site reliability engineer, wakes up one day and thinks about all the tasks that await him. So many tools, tickets, incidents, mail. Basically an avalanche of responsibilities. That sounds quite overwhelming, right? To be honest, I think all of us sometimes have a feeling like that. But then, as he sips his morning coffee, an idea comes to his mind: why the hell should it be like that? If only there was a way to smartly and efficiently organize and automate all these tasks. Well, there is one, and it's what we'll start with today.
With today's rapidly growing and increasingly complex stacks, these systems have become extremely critical. As I mentioned, an average SRE today needs to deal with tons of errors, warnings, tickets, critical alerts. It just doesn't end, does it?
So how can we help our poor site reliability engineer with his daily struggles? Well, for starters, let's introduce AIOps: artificial intelligence for IT operations. But what exactly is AIOps? How does it work? We can segment it into three areas: observe, engage, act. To the left, we see the ingestion of historical and real-time data in the form of logs, metrics and text into an ML model, which ultimately produces actionable insights: anomaly detection, performance analysis, and so on. So, in simple words, we want to predict, or even better, preempt failure before it even occurs.
AIOps collects all kinds of data: network, application, storage. As we said, the goal is to predict failure, identify the root cause of an error, and reduce alert noise. Furthermore, auto-remediation and adaptive self-healing are also important concepts, which refer to the ability to resolve a failure before it occurs; that means enabling self-healing before a problem even appears. The paradigm, if you think about it, is shifting here from reactive to proactive: we don't just detect errors, we prevent them. Also, remember that the AIOps model is continuously collecting data, continuously learning from it, and therefore continuously optimizing itself.
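To make the "predict rather than react" idea concrete, here is a toy sketch of the kind of check an anomaly detector performs: flag a sample that lands far above its recent baseline. This is illustrative only, not the model behind Splunk Observability or any other AIOps product, and all the values are made up.

```python
# Toy illustration only: a rolling-baseline anomaly check, not the actual
# model behind any AIOps product. Metric values and thresholds are made up.
from collections import deque
from statistics import mean, stdev

WINDOW = 60          # number of recent samples forming the baseline
THRESHOLD_SIGMA = 3  # flag points more than 3 standard deviations above the norm

baseline = deque(maxlen=WINDOW)

def check_latency(sample_ms):
    """Return True if the new latency sample looks anomalous."""
    anomalous = False
    if len(baseline) >= 10:  # need a minimal history before judging
        mu, sigma = mean(baseline), stdev(baseline)
        anomalous = sample_ms > mu + THRESHOLD_SIGMA * max(sigma, 1e-6)
    baseline.append(sample_ms)
    return anomalous

if __name__ == "__main__":
    stream = [120, 118, 125, 119, 122] * 4 + [900]  # sudden latency spike at the end
    for value in stream:
        if check_latency(value):
            print(f"anomaly: {value} ms is well above the recent baseline")
```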
But let's extend our scope a bit now. How do we experiment?
Inside aiops? We have our aIops solution,
but now we want to test its resiliency, robustness and
reliability. How do we do that?
Let's read two the following definition a discipline of
performing security experimentation on a distributed system in
order to build confidence in the system's capability to withstand turbulent
and malicious conditions. So what is this?
As we already know, this refers to chaos engineering.
We want to test our AOP solution by conducting chaos
experiments. Now we need to ask ourselves, how do we
plan chaos experiments? So this diagram here
illustrates the continuous cycle of hypothesis and verification.
What is this? We have these steady state posture which we get through observability.
Don't worry, I'll get to this concept soon. We form a hypothesis,
sort of is my system resilience to the disruption of
xy services? This is an example hypothesis. We put
it to the best through continuous verification, at the end of which
we summarize the lesson learned and implement mitigation. This is a
continuous cycle, as at this point we start again with the whole process
to further experiment our system. We talked about aiops and
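As a rough sketch of that cycle in code, the loop below checks steady state, injects a fault, verifies continuously, and records the outcome. The function names, thresholds and targets are placeholders for whatever observability query and chaos tool you actually use; this is not the workflow any specific product generates.

```python
# Sketch of the hypothesis/verification cycle, with placeholder callables.
# `measure_p95_latency_ms` and `inject_latency` stand in for whatever
# observability query and chaos tool you actually use.
import time

STEADY_STATE_P95_MS = 300  # hypothetical steady-state bound for p95 latency

def run_experiment(measure_p95_latency_ms, inject_latency):
    # 1. Steady state: confirm the system is healthy before injecting anything.
    assert measure_p95_latency_ms() < STEADY_STATE_P95_MS, "system not in steady state"

    # 2. Hypothesis: "the system stays within its latency bound while the
    #    cart service experiences extra network latency."
    inject_latency(target="cartservice", latency_ms=4000, duration_s=200)

    # 3. Continuous verification during the blast-radius window.
    violations = []
    for _ in range(10):
        p95 = measure_p95_latency_ms()
        if p95 >= STEADY_STATE_P95_MS:
            violations.append(p95)
        time.sleep(20)

    # 4. Lessons learned: either the hypothesis held, or we have mitigation work.
    return {"hypothesis_held": not violations, "violations": violations}
```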
We've talked about AIOps and chaos engineering, but where do these meet? There are several touchpoints, but for today's scope the most important one is observability. There are a couple of notions relevant for observability. We have different sources of data, illustrated on the left side, and basically observability is the ability to measure a system's current state based on the data it generates, such as logs, metrics, and traces, the so-called golden triangle of observability. The golden signals, in turn, are latency, traffic, errors and saturation. In our case, as you will soon see, latency will be the one we use for our upcoming live demo. So let's get started.
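Since latency is the signal we care about, here is a generic illustration of how a service can emit a latency histogram through the OpenTelemetry Python SDK. The online boutique already ships with its own instrumentation, so this is not taken from the demo; it uses a console exporter so the sketch runs standalone, whereas in the demo the data flows to Splunk Observability through the OpenTelemetry Collector.

```python
# Generic illustration of emitting a latency metric with the OpenTelemetry
# Python SDK; the demo app ships with its own instrumentation. A console
# exporter is used here so the sketch runs without a collector.
import random
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("boutique.frontend")
latency_ms = meter.create_histogram(
    "http.server.duration", unit="ms", description="Request latency"
)

for _ in range(20):
    latency_ms.record(random.uniform(80, 200), {"http.route": "/cart"})
    time.sleep(0.1)
```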
So far we have learned about AIOps, we have learned about CE, and we have learned about observability. So what now? Now we can run a chaos experiment. There are two questions running through our SRE's mind. One: how does AIOps react if I run a chaos experiment? Two: is AIOps capable of recognizing a running chaos experiment through observability? Now we have all the necessary ammunition to formulate our hypothesis, which goes as follows: we assume that if we run a chaos experiment, AIOps will be able to detect that there is one. The experiment will then either fail or be under control. These are the possibilities.
Just a quick overview in terms of our architecture. We have the online boutique, a mock-up web store composed of eleven services; Splunk Observability, a platform for end-to-end monitoring, which monitors our boutique; the Locust load generator, used to simulate active users on our boutique, navigating and clicking all over the place; and then we have Litmus Chaos, which will introduce chaos into our system, into our boutique. As we said, latency will be our target. While the observability component collects data from the boutique via the OpenTelemetry Collector, it is the AIOps component, which sits on top of Splunk Observability, that is responsible for detecting and predicting increased latency.
Well, I think we're all set now. Let's go. This is
my environment. I'm running everything in minicube. This here is
my online boutique store, a cloud native microservices demo
app. What I want to show you quickly is that
it consists around, as we can see here, around ten microservices
simulating a web based ecommerce application. And it's all running inside
my mini cyber. Also this is,
if you're interested, the GitHub project for the
boutique store. I personally find it pretty handy.
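If you want to verify the same thing from a terminal rather than the UI, a few lines with the Kubernetes Python client list the boutique's deployments. This assumes your kubeconfig already points at the Minikube cluster and that the demo was installed into the default namespace; adjust the namespace if yours differs.

```python
# Quick check that the boutique's microservices are up in Minikube, assuming
# kubectl/kubeconfig already points at the cluster and the demo was installed
# into the "default" namespace (adjust if yours differs).
from kubernetes import client, config

config.load_kube_config()  # uses the current kubectl context (minikube)
apps = client.AppsV1Api()

for dep in apps.list_namespaced_deployment(namespace="default").items:
    ready = dep.status.ready_replicas or 0
    print(f"{dep.metadata.name:25s} {ready}/{dep.spec.replicas} ready")
```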
On the other hand, we have Splunk Observability, a platform that provides monitoring across infrastructure, apps and user interfaces. It basically provides end-to-end monitoring for the entire system throughout its entire lifecycle. What we can see here is a view of our infrastructure and of all our microservices in the boutique store. What we see marked in red means there is probably an issue that has already been detected. So what is it doing here? It's basically trying to identify the root cause of the issue. It starts with some symptoms on the frontend and tries to track them down all the way to the root cause, which is the payment service.
What does that mean? It means that perhaps our users can browse through the catalog and put items into their baskets, but might have issues when proceeding with payment. It's important to mention that the AIOps component sits on top of Splunk Observability too, which gives us the opportunity to apply AI/ML data analysis in order to predict all sorts of events, such as failure, system degradation and so on.
Another relevant component running in our Minikube is Locust. Locust is an open-source load-testing tool, which I'll use to simulate active users in my boutique. I can easily choose the number of users I want to simulate, in this case 30, with a spawn rate of two. The moment I click "Start swarming", it will basically start simulating all of these users in the boutique store.
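For reference, a Locust scenario for the boutique is just a small Python file. The one below is a sketch rather than the exact file used in the demo; the routes mirror the boutique's frontend, and the product ID is an example from the demo catalog that may differ in your deployment.

```python
# locustfile.py: a minimal sketch of the kind of load used in the demo.
# Routes mirror the boutique frontend; the product ID is just an example
# from the demo catalog and may differ in your deployment.
from locust import HttpUser, task, between

class BoutiqueUser(HttpUser):
    wait_time = between(1, 3)  # think time between simulated clicks

    @task(3)
    def browse_product(self):
        self.client.get("/product/OLJCESPC7Z")

    @task(2)
    def view_cart(self):
        self.client.get("/cart")

    @task(1)
    def home(self):
        self.client.get("/")
```

Running `locust -f locustfile.py --host http://<frontend-address> -u 30 -r 2`, or entering the same numbers in the web UI, reproduces the 30 users with a spawn rate of two used here.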
The last relevant component I must show you is Litmus Chaos. This is also an open-source platform, a chaos engineering platform that we use to introduce some chaos. So now that we've gone through all of our components, let's start an experiment. What I want to do is inject latency through Litmus Chaos into my cart service microservice, and basically see whether Splunk will be able to detect it.
What I created here is a very simple dashboard that just tracks the latency. As you can see, you can also set the time span you want to focus on: past day, past week, past hour. And just to show you how I created the alert condition and how it works: right now I specified "sudden change", which, as it says here, is useful for indicating an unexpected increase in latency. But I could have just as easily chosen "historical anomaly". What this would do is basically use the latency patterns from the past, if such patterns exist, in order to detect and predict whether something is off or not.
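Splunk Observability implements these conditions for you, so the snippet below is only meant to make the difference concrete, with made-up numbers: a sudden-change style check compares the current value against the immediately preceding window, while a historical-anomaly style check compares it against the same period in past weeks.

```python
# Toy contrast between the two alert conditions, with made-up numbers.
# Not Splunk's actual implementation of either condition.
from statistics import mean, stdev

def sudden_change(current, recent_window, k=3.0):
    # Compare against the immediately preceding window.
    return current > mean(recent_window) + k * stdev(recent_window)

def historical_anomaly(current, same_hour_past_weeks, k=3.0):
    # Compare against the same time of day/week in past data.
    return current > mean(same_hour_past_weeks) + k * stdev(same_hour_past_weeks)

recent = [120, 118, 125, 122, 119, 121]   # last few minutes (ms)
history = [420, 450, 430, 445, 460]       # same hour on past Mondays (ms)
print(sudden_change(470, recent))         # True: a spike vs. the recent baseline
print(historical_anomaly(470, history))   # False: normal for this hour historically
```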
So I think we're ready now to start the experiment. Let me go into Litmus Chaos. You click here on Litmus workflows; you want to create a chaos workflow, so you click "Schedule a workflow" and select the agent. We want to create a completely new workflow, although in theory you could also use a template if you have one. Here we click "Add new experiment". As you can see, you can inject all kinds of chaos: container kill, CPU hog, network loss. But what we will do right now is inject network latency.
Before we proceed, we need to tune this a bit. I need to specify the target here, which is cartservice; this is the name of my microservice. We don't have any probes right now. The duration of the chaos experiment, let's say 200 seconds, and the network latency, let's make it 4,000 milliseconds so we can clearly detect it. Also, never forget to click on advanced options and enable "cleanup chaos". This basically cleans up the chaos and restores your environment after the chaos experiment is over.
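Under the hood, the workflow the UI schedules boils down to Litmus custom resources. Purely as a sketch, the equivalent pod-network-latency experiment could be applied with the Kubernetes Python client roughly as below; the field names follow the litmuschaos.io/v1alpha1 ChaosEngine CRD, while the namespace, app label and service account are assumptions about this particular install, not values taken from the demo.

```python
# Rough Python equivalent of the experiment configured in the Litmus UI:
# inject 4000 ms of network latency into the cart service for 200 s.
# Namespace, app label and service account are assumptions about this setup.
from kubernetes import client, config

config.load_kube_config()

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "cartservice-network-latency", "namespace": "default"},
    "spec": {
        "engineState": "active",
        "chaosServiceAccount": "litmus-admin",  # assumed service account name
        "appinfo": {"appns": "default",
                    "applabel": "app=cartservice",
                    "appkind": "deployment"},
        "experiments": [{
            "name": "pod-network-latency",
            "spec": {"components": {"env": [
                {"name": "TOTAL_CHAOS_DURATION", "value": "200"},  # seconds
                {"name": "NETWORK_LATENCY", "value": "4000"},      # milliseconds
            ]}},
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="default", plural="chaosengines", body=chaos_engine,
)
```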
Now that we've set this all up, we can proceed.
We are scheduling it right now. Click on
finish. And now
let's see the workflow.
So now what he's doing is setting up the chaos environment,
after which he will start conducting
latency issues.
So what we can see here, what we will see here,
and maybe we can already see it actually, is that in locust
we have all the requests that our simulated users
are executing, as well as the failures here,
which will increase even more after these cows
is injected.
Okay, you also see here the percentage
of failure, which is something pretty handy. Now let's
go check out if
our dashboard, if our splunk observability platform detected
anything. And yeah, actually, as you can see here right now,
this is these latency dashboard we can see this red triangle
already indicates that there was an alert.
Let me quickly dig through my emails to see if I've received something. Yeah, I did. As you can see here, a Splunk Observability critical alert: it says the latency in the last eight minutes is more than three deviations above the norm. So it basically alerted me that there has been an increase in latency. If we look at the graph, you'll notice that the alert actually comes before the peak, before the significant increase in latency that is larger than anything seen before. So what we're doing here is using our AIOps model to predict that something is off with the latency before we actually have the error. That way we can even prevent it.
In reality, we can do even more than just alerts. Let me show you the settings. If you go here, to the alert message, you can actually specify a runbook. This is pretty interesting, because within this runbook you can give Splunk Observability actions to perform to remediate the issue. For instance, it can rebuild or reset the node that's failing or having issues. This is actually what we're talking about when we say we're shifting from a proactive to a predictive paradigm.
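Alert notifications in Splunk Observability can also be sent to a webhook, which is one way to turn a runbook into automation. The sketch below is a minimal, assumption-heavy illustration: the payload field and the pod label selector are invented for the example, and the "remediation" is simply deleting the unhealthy pod so its Deployment recreates it. A real runbook would be considerably more careful.

```python
# Minimal sketch of an alert webhook that performs a crude remediation:
# delete the unhealthy pod so its Deployment recreates it. Payload fields
# and the label selector are assumptions, not Splunk's documented schema.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        service = payload.get("service", "cartservice")  # assumed payload field
        pods = core.list_namespaced_pod("default", label_selector=f"app={service}")
        for pod in pods.items:
            # Deleting the pod lets the Deployment controller spin up a fresh one.
            core.delete_namespaced_pod(pod.metadata.name, "default")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```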
So, if you remember, we posed a question earlier, and the question was: how does AIOps react if I run a chaos experiment? As we can see, in this case it detects the latency increase and promptly alerts me. So let's put a checkmark on that. The other question we posed, if you remember, was: is AIOps capable of recognizing a running chaos experiment? For that, I've built another really simple dashboard. This dashboard basically contains a counter for every time a specific pod is launched within the litmus namespace, and that ultimately shows me every chaos experiment that was run. So here I set it to the last week: it detects every chaos experiment I ran and also gives me a count.
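The same idea can be checked imperatively: count the chaos-related pods Litmus has launched in its namespace. The "litmus" namespace and the name-based filter below are assumptions about this particular install; inspect your own cluster's pods and labels and adjust the filter accordingly.

```python
# Same idea as the dashboard, done imperatively: count the chaos-related pods
# Litmus has launched. The "litmus" namespace and the name-based filter are
# assumptions; check your own cluster (kubectl get pods -n litmus) and adjust.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(namespace="litmus").items
chaos_pods = [p for p in pods if "chaos" in p.metadata.name or "runner" in p.metadata.name]

print(f"{len(chaos_pods)} chaos-related pods found in the litmus namespace:")
for p in chaos_pods:
    print(f"  {p.metadata.name:50s} started {p.status.start_time}")
```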
So, in conclusion, we've proven our hypothesis as AIops
is actually capable of detecting a running chaos experiment.
Okay, let's wrap everything up. We talked today about
aiops, which is necessary to preemptively predict failure
and system degradation. We talked about chaos engineering,
necessary to inject chaos into the system and testing its resiliency,
and finally, observability. Observability provides us
full transparency of the system through end to end monitoring.
Now, we have tested and confirmed I hypothesis today, which claims
that aiops can leverage observability in order to identify
when a chaos experiment is running. So basically,
AIops is able to detect that. We have shown this today with
our live demo. Furthermore, through this continuous
cycle of hypothesis and experimenting, trust in the system is
built. And with every experiment, its reliability increases.
A final takeaway I would like to point out from today's session: start simple and scale fast. You don't know where to start? So what? Start from a simple experiment, see how it goes, see how the system reacts, and as you proceed, you can scale; you basically build more and more on top of that. Well, it seems it's time to close the curtains. Thank you for watching this talk, and I hope you got something out of it. Until next time, cheers.