Abstract
Imagine this: you’re a Site Reliability Engineer (SRE) at a tech giant, responsible for the overall health of a system running in prod. Numerous alerts, server crashes, Jira tickets, incidents, an avalanche of responsibilities that sometimes simply feels like a ticking time bomb. These are just some of the daily struggles an average SRE has to go through. But why should it be like that? Well, it shouldn’t, thanks to a term coined by Gartner in 2016. AIOps, meet audience. Audience, meet AIOps.
Let’s extend this scenario. On top of all of the above-mentioned issues, our poor SRE needs to watch out for potential security breaches and make sure nothing ever slips through the cracks. By conducting proactive experimentation, continuous verification and improvement, he makes sure that the system is able to withstand the turbulent and malicious times we’re living in. Do these notions ring any bells? They sure do! Chaos Engineering, meet audience. Audience, meet Chaos Engineering.
What’s our angle, you’re wondering? AIOps and CE are two concepts that are often kept separate. In this talk, we will discuss (and show you!) how the two practices combined can significantly increase cyber resiliency while maintaining full end-to-end (E2E) transparency and observability of your entire system.
For this session, we have prepared and analyzed several use cases, followed the main principles, summarized best practices, and prepared a live demo built on a combination of CE and AIOps tools.
Above all, we are SREs. As such, during this session we will stay close to the SRE principles and best practices that we used to achieve our goals: reduce organizational silos, measure everything, learn from failures, analyze changes holistically, and so on. As we proceed with our talk, the audience will be able to identify how these relate to AIOps as well as to CE, and finally, how it all ties together.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, and thank you for watching today's session, Chaos Experiments Under the Lens of AIOps. My name is Michaelem. I'm an SRE and DevOps engineer, and I have a background in software engineering, AI, and industrial automation. So naturally, I would say, all of these areas pointed me towards AIOps, which is my current specialization.
This is our agenda for today. I'm going to define the basic concepts surrounding AIOps, followed by an in-depth view of how it can be applied to chaos engineering. I think the highlight of this session for many will be our live demo, after which we'll wrap up and summarize everything that we've talked about here today.
Before we start, let's set the scene. These are our two focus concepts. AIOps is a term used to indicate the application of ML and analytics to IT operations in order to prevent system degradation and failure, in simple words. And then we have chaos engineering, abbreviated CE throughout this presentation, which is, as you know, all about experimenting and testing our system's resiliency.
Now imagine this: a hardworking SRE, a site reliability engineer, wakes up one day and thinks about all the tasks that await him. So many tools, tickets, incidents, mail. Basically an avalanche of responsibilities. That sounds quite overwhelming, right? To be honest, I think all of us sometimes have a feeling like that. But then, as he sips his morning coffee, an idea comes to his mind: why the hell should it be like that? If only there was a way to smartly and efficiently organize and automate all these tasks. Well, there is one, and it's what we'll start with today.
With today's rapidly growing and increasingly complex stacks, these systems have become extremely critical. As I mentioned, an average SRE today needs to deal with tons of errors, warnings, tickets, critical alerts. It just doesn't end, does it?
So how can we help our poor site reliability engineer with his daily struggles? Well, for starters, let's introduce AIOps: artificial intelligence for IT operations. But what exactly is AIOps? How does it work? We can segment it into three areas: observe, engage, act. To the left, we see the ingestion of historical and real-time data in the form of logs, metrics and text into an ML model, which ultimately produces actionable insights: anomaly detection, performance analysis, and so on. So, in simple words, we want to predict, or even better, preempt failure before it even occurs.
AIOps collects all kinds of data: network, application, storage. As we said, the goal is to predict failure, identify the root cause of an error, and reduce alert noise. Furthermore, auto-remediation and adaptive self-healing are also important concepts, which refer to the ability to resolve a failure before it occurs; that means enabling self-healing before a problem even appears. The paradigm, if you think about it, is shifting here from reactive to proactive: we don't just detect errors, we prevent them. Also, remember that the AIOps model is continuously collecting data, continuously learning from it, and therefore continuously optimizing itself.
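To make the "predict rather than react" idea concrete, here is a toy sketch of the kind of check an anomaly detector performs: flag a sample that lands far above its recent baseline. This is illustrative only, not the model behind Splunk Observability or any other AIOps product, and all the values are made up.

```python
# Toy illustration only: a rolling-baseline anomaly check, not the actual
# model behind any AIOps product. Metric values and thresholds are made up.
from collections import deque
from statistics import mean, stdev

WINDOW = 60          # number of recent samples forming the baseline
THRESHOLD_SIGMA = 3  # flag points more than 3 standard deviations above the norm

baseline = deque(maxlen=WINDOW)

def check_latency(sample_ms):
    """Return True if the new latency sample looks anomalous."""
    anomalous = False
    if len(baseline) >= 10:  # need a minimal history before judging
        mu, sigma = mean(baseline), stdev(baseline)
        anomalous = sample_ms > mu + THRESHOLD_SIGMA * max(sigma, 1e-6)
    baseline.append(sample_ms)
    return anomalous

if __name__ == "__main__":
    stream = [120, 118, 125, 119, 122] * 4 + [900]  # sudden latency spike at the end
    for value in stream:
        if check_latency(value):
            print(f"anomaly: {value} ms is well above the recent baseline")
```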
But let's extend our scope a bit now. How do we experiment?
Inside aiops? We have our aIops solution,
but now we want to test its resiliency, robustness and
reliability. How do we do that?
Let's read two the following definition a discipline of
performing security experimentation on a distributed system in
order to build confidence in the system's capability to withstand turbulent
and malicious conditions. So what is this?
As we already know, this refers to chaos engineering.
We want to test our AOP solution by conducting chaos
experiments. Now we need to ask ourselves, how do we
plan chaos experiments? So this diagram here
illustrates the continuous cycle of hypothesis and verification.
What is this? We have these steady state posture which we get through observability.
Don't worry, I'll get to this concept soon. We form a hypothesis,
sort of is my system resilience to the disruption of
xy services? This is an example hypothesis. We put
it to the best through continuous verification, at the end of which
we summarize the lesson learned and implement mitigation. This is a
continuous cycle, as at this point we start again with the whole process
to further experiment our system. We talked about aiops and
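As a rough sketch of that cycle in code, the loop below checks steady state, injects a fault, verifies continuously, and records the outcome. The function names, thresholds and targets are placeholders for whatever observability query and chaos tool you actually use; this is not the workflow any specific product generates.

```python
# Sketch of the hypothesis/verification cycle, with placeholder callables.
# `measure_p95_latency_ms` and `inject_latency` stand in for whatever
# observability query and chaos tool you actually use.
import time

STEADY_STATE_P95_MS = 300  # hypothetical steady-state bound for p95 latency

def run_experiment(measure_p95_latency_ms, inject_latency):
    # 1. Steady state: confirm the system is healthy before injecting anything.
    assert measure_p95_latency_ms() < STEADY_STATE_P95_MS, "system not in steady state"

    # 2. Hypothesis: "the system stays within its latency bound while the
    #    cart service experiences extra network latency."
    inject_latency(target="cartservice", latency_ms=4000, duration_s=200)

    # 3. Continuous verification during the blast-radius window.
    violations = []
    for _ in range(10):
        p95 = measure_p95_latency_ms()
        if p95 >= STEADY_STATE_P95_MS:
            violations.append(p95)
        time.sleep(20)

    # 4. Lessons learned: either the hypothesis held, or we have mitigation work.
    return {"hypothesis_held": not violations, "violations": violations}
```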
We've talked about AIOps and chaos engineering, but where do these meet? There are several touchpoints, but for today's scope the most important one is observability. There are a couple of notions relevant for observability. We have different sources of data, illustrated on the left side, and basically observability is the ability to measure a system's current state based on the data it generates, such as logs, metrics, and traces, the so-called golden triangle of observability. The golden signals, in turn, are latency, traffic, errors and saturation. In our case, as you will soon see, latency will be the one we use for our upcoming live demo. So let's get started.
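Since latency is the signal we care about, here is a generic illustration of how a service can emit a latency histogram through the OpenTelemetry Python SDK. The online boutique already ships with its own instrumentation, so this is not taken from the demo; it uses a console exporter so the sketch runs standalone, whereas in the demo the data flows to Splunk Observability through the OpenTelemetry Collector.

```python
# Generic illustration of emitting a latency metric with the OpenTelemetry
# Python SDK; the demo app ships with its own instrumentation. A console
# exporter is used here so the sketch runs without a collector.
import random
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("boutique.frontend")
latency_ms = meter.create_histogram(
    "http.server.duration", unit="ms", description="Request latency"
)

for _ in range(20):
    latency_ms.record(random.uniform(80, 200), {"http.route": "/cart"})
    time.sleep(0.1)
```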
So far we have learned about AIOps, we have learned about CE, and we have learned about observability. So what now? Now we can run a chaos experiment. There are two questions running through our SRE's mind. One: how does AIOps react if I run a chaos experiment? Two: is AIOps capable of recognizing a running chaos experiment through observability? Now we have all the necessary ammunition to formulate our hypothesis, which goes as follows: we assume that if we run a chaos experiment, AIOps will be able to detect that there is one. The experiment will then either fail or be under control. These are the possibilities.
Just a quick overview in terms of our architecture. We have the online boutique, a mock-up web store composed of eleven services; Splunk Observability, a platform for end-to-end monitoring, which monitors our boutique; the Locust load generator, used to simulate active users on our boutique, navigating and clicking all over the place; and then we have Litmus Chaos, which will introduce chaos into our system, into our boutique. As we said, latency will be our target. While the observability component collects data from the boutique via the OpenTelemetry Collector, it is the AIOps component, which sits on top of Splunk Observability, that is responsible for detecting and predicting increased latency.
Well, I think we're all set now. Let's go. This is
my environment. I'm running everything in minicube. This here is
my online boutique store, a cloud native microservices demo
app. What I want to show you quickly is that
it consists around, as we can see here, around ten microservices
simulating a web based ecommerce application. And it's all running inside
my mini cyber. Also this is,
if you're interested, the GitHub project for the
boutique store. I personally find it pretty handy.
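If you want to verify the same thing from a terminal rather than the UI, a few lines with the Kubernetes Python client list the boutique's deployments. This assumes your kubeconfig already points at the Minikube cluster and that the demo was installed into the default namespace; adjust the namespace if yours differs.

```python
# Quick check that the boutique's microservices are up in Minikube, assuming
# kubectl/kubeconfig already points at the cluster and the demo was installed
# into the "default" namespace (adjust if yours differs).
from kubernetes import client, config

config.load_kube_config()  # uses the current kubectl context (minikube)
apps = client.AppsV1Api()

for dep in apps.list_namespaced_deployment(namespace="default").items:
    ready = dep.status.ready_replicas or 0
    print(f"{dep.metadata.name:25s} {ready}/{dep.spec.replicas} ready")
```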
On the other hand, we have Splunk Observability, a platform that provides monitoring across infrastructure, apps and user interfaces. It basically provides end-to-end monitoring for the entire system throughout its entire lifecycle. What we can see here is a view of our infrastructure and of all our microservices in the boutique store. What we see marked in red means there is probably an issue that has already been detected. So what is it doing here? It's basically trying to identify the root cause of the issue. It starts with some symptoms on the frontend and tries to track them down all the way to the root cause, which is the payment service.
What does that mean? It means that perhaps our users can browse through the catalog and put items into their baskets, but might have issues when proceeding with payment. It's important to mention that the AIOps component sits on top of Splunk Observability too, which gives us the opportunity to apply AI/ML data analysis in order to predict all sorts of events, such as failure, system degradation and so on.
Another relevant component running in our Minikube is Locust. Locust is an open-source load-testing tool, which I'll use to simulate active users in my boutique. I can easily choose the number of users I want to simulate, in this case 30, with a spawn rate of two. The moment I click "Start swarming", it will basically start simulating all of these users in the boutique store.
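For reference, a Locust scenario for the boutique is just a small Python file. The one below is a sketch rather than the exact file used in the demo; the routes mirror the boutique's frontend, and the product ID is an example from the demo catalog that may differ in your deployment.

```python
# locustfile.py: a minimal sketch of the kind of load used in the demo.
# Routes mirror the boutique frontend; the product ID is just an example
# from the demo catalog and may differ in your deployment.
from locust import HttpUser, task, between

class BoutiqueUser(HttpUser):
    wait_time = between(1, 3)  # think time between simulated clicks

    @task(3)
    def browse_product(self):
        self.client.get("/product/OLJCESPC7Z")

    @task(2)
    def view_cart(self):
        self.client.get("/cart")

    @task(1)
    def home(self):
        self.client.get("/")
```

Running `locust -f locustfile.py --host http://<frontend-address> -u 30 -r 2`, or entering the same numbers in the web UI, reproduces the 30 users with a spawn rate of two used here.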
The last relevant component I must show you is Litmus Chaos. This is also an open-source platform, a chaos engineering platform that we use to introduce some chaos. So now that we've gone through all of our components, let's start an experiment. What I want to do is inject latency through Litmus Chaos into my cart service microservice, and basically see whether Splunk will be able to detect it.
What I created here is a very simple dashboard that just tracks the latency. As you can see, you can also set the time span you want to focus on: past day, past week, past hour. And just to show you how I created the alert condition and how it works: right now I specified "sudden change", which, as it says here, is useful for indicating an unexpected increase in latency. But I could have just as easily chosen "historical anomaly". What this would do is basically use the latency patterns from the past, if such patterns exist, in order to detect and predict whether something is off or not.
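Splunk Observability implements these conditions for you, so the snippet below is only meant to make the difference concrete, with made-up numbers: a sudden-change style check compares the current value against the immediately preceding window, while a historical-anomaly style check compares it against the same period in past weeks.

```python
# Toy contrast between the two alert conditions, with made-up numbers.
# Not Splunk's actual implementation of either condition.
from statistics import mean, stdev

def sudden_change(current, recent_window, k=3.0):
    # Compare against the immediately preceding window.
    return current > mean(recent_window) + k * stdev(recent_window)

def historical_anomaly(current, same_hour_past_weeks, k=3.0):
    # Compare against the same time of day/week in past data.
    return current > mean(same_hour_past_weeks) + k * stdev(same_hour_past_weeks)

recent = [120, 118, 125, 122, 119, 121]   # last few minutes (ms)
history = [420, 450, 430, 445, 460]       # same hour on past Mondays (ms)
print(sudden_change(470, recent))         # True: a spike vs. the recent baseline
print(historical_anomaly(470, history))   # False: normal for this hour historically
```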
So I think we're ready now to start the experiment. Let me go into Litmus Chaos. You click here on Litmus workflows; you want to create a chaos workflow, so you click "Schedule a workflow" and select the agent. We want to create a completely new workflow, although in theory you could also use a template if you have one. Here we click "Add new experiment". As you can see, you can inject all kinds of chaos: container kill, CPU hog, network loss. But what we will do right now is inject network latency.
Before we proceed, we need to tune this a bit. I need to specify the target here, which is cartservice; this is the name of my microservice. We don't have any probes right now. The duration of the chaos experiment, let's say 200 seconds, and the network latency, let's make it 4,000 milliseconds so we can clearly detect it. Also, never forget to click on advanced options and enable "cleanup chaos". This basically cleans up the chaos and restores your environment after the chaos experiment is over.
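Under the hood, the workflow the UI schedules boils down to Litmus custom resources. Purely as a sketch, the equivalent pod-network-latency experiment could be applied with the Kubernetes Python client roughly as below; the field names follow the litmuschaos.io/v1alpha1 ChaosEngine CRD, while the namespace, app label and service account are assumptions about this particular install, not values taken from the demo.

```python
# Rough Python equivalent of the experiment configured in the Litmus UI:
# inject 4000 ms of network latency into the cart service for 200 s.
# Namespace, app label and service account are assumptions about this setup.
from kubernetes import client, config

config.load_kube_config()

chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "cartservice-network-latency", "namespace": "default"},
    "spec": {
        "engineState": "active",
        "chaosServiceAccount": "litmus-admin",  # assumed service account name
        "appinfo": {"appns": "default",
                    "applabel": "app=cartservice",
                    "appkind": "deployment"},
        "experiments": [{
            "name": "pod-network-latency",
            "spec": {"components": {"env": [
                {"name": "TOTAL_CHAOS_DURATION", "value": "200"},  # seconds
                {"name": "NETWORK_LATENCY", "value": "4000"},      # milliseconds
            ]}},
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="default", plural="chaosengines", body=chaos_engine,
)
```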
Now that we've set this all up, we can proceed.
We are scheduling it right now. Click on
finish. And now
let's see the workflow.
So now what he's doing is setting up the chaos environment,
after which he will start conducting
latency issues.
So what we can see here, what we will see here,
and maybe we can already see it actually, is that in locust
we have all the requests that our simulated users
are executing, as well as the failures here,
which will increase even more after these cows
is injected.
Okay, you also see here the percentage
of failure, which is something pretty handy. Now let's
go check out if
our dashboard, if our splunk observability platform detected
anything. And yeah, actually, as you can see here right now,
this is these latency dashboard we can see this red triangle
already indicates that there was an alert.
Let me quickly dig through my emails to see if I've received something. Yeah, I did. As you can see here, a Splunk Observability critical alert: it says the latency in the last eight minutes is more than three deviations above the norm. So it basically alerted me that there has been an increase in latency. If we look at the graph, you'll notice that the alert actually comes before the peak, before the significant increase in latency that is larger than anything seen before. So what we're doing here is using our AIOps model to predict that something is off with the latency before we actually have the error. That way we can even prevent it.
In reality, we can do even more than just alerts. Let me show you the settings. If you go here, to the alert message, you can actually specify a runbook. This is pretty interesting, because within this runbook you can give Splunk Observability actions to perform to remediate the issue. For instance, it can rebuild or reset the node that's failing or having issues. This is actually what we're talking about when we say we're shifting from a proactive to a predictive paradigm.
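Alert notifications in Splunk Observability can also be sent to a webhook, which is one way to turn a runbook into automation. The sketch below is a minimal, assumption-heavy illustration: the payload field and the pod label selector are invented for the example, and the "remediation" is simply deleting the unhealthy pod so its Deployment recreates it. A real runbook would be considerably more careful.

```python
# Minimal sketch of an alert webhook that performs a crude remediation:
# delete the unhealthy pod so its Deployment recreates it. Payload fields
# and the label selector are assumptions, not Splunk's documented schema.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        service = payload.get("service", "cartservice")  # assumed payload field
        pods = core.list_namespaced_pod("default", label_selector=f"app={service}")
        for pod in pods.items:
            # Deleting the pod lets the Deployment controller spin up a fresh one.
            core.delete_namespaced_pod(pod.metadata.name, "default")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```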
So, if you remember, we posed a question earlier, and the question was: how does AIOps react if I run a chaos experiment? As we can see, in this case it detects the latency increase and promptly alerts me. So let's put a checkmark on that. The other question we posed, if you remember, was: is AIOps capable of recognizing a running chaos experiment? For that, I've built another really simple dashboard. This dashboard basically contains a counter for every time a specific pod is launched within the litmus namespace, and that ultimately shows me every chaos experiment that was run. So here I set it to the last week: it detects every chaos experiment I ran and also gives me a count.
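The same idea can be checked imperatively: count the chaos-related pods Litmus has launched in its namespace. The "litmus" namespace and the name-based filter below are assumptions about this particular install; inspect your own cluster's pods and labels and adjust the filter accordingly.

```python
# Same idea as the dashboard, done imperatively: count the chaos-related pods
# Litmus has launched. The "litmus" namespace and the name-based filter are
# assumptions; check your own cluster (kubectl get pods -n litmus) and adjust.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(namespace="litmus").items
chaos_pods = [p for p in pods if "chaos" in p.metadata.name or "runner" in p.metadata.name]

print(f"{len(chaos_pods)} chaos-related pods found in the litmus namespace:")
for p in chaos_pods:
    print(f"  {p.metadata.name:50s} started {p.status.start_time}")
```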
So, in conclusion, we've proven our hypothesis as AIops
is actually capable of detecting a running chaos experiment.
Okay, let's wrap everything up. We talked today about
aiops, which is necessary to preemptively predict failure
and system degradation. We talked about chaos engineering,
necessary to inject chaos into the system and testing its resiliency,
and finally, observability. Observability provides us
full transparency of the system through end to end monitoring.
Now, we have tested and confirmed I hypothesis today, which claims
that aiops can leverage observability in order to identify
when a chaos experiment is running. So basically,
AIops is able to detect that. We have shown this today with
our live demo. Furthermore, through this continuous
cycle of hypothesis and experimenting, trust in the system is
built. And with every experiment, its reliability increases.
A final takeaway I would like to point out from today's session: start simple and scale fast. You don't know where to start? So what? Start from a simple experiment, see how it goes, see how the system reacts, and as you proceed, you can scale; you basically build more and more on top of that. Well, it seems it's time to close the curtains. Thank you for watching this talk, and I hope you got something out of it. Until next time, cheers.