Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to my talk, Manage Alert Overload with AIOps. In this talk, you will learn how you can leverage AI and large language models to solve one of the biggest problems of being on call: alert flooding and alert fatigue.
My name is Birol and I am the CEO and co-founder of iAlert. iAlert is a software company based in Germany that provides an incident response platform covering the entire incident response lifecycle: preparing for incidents, responding to incidents, communicating incidents, and learning from incidents. Some of the capabilities we offer are sophisticated alerting, managing on-call schedules and overrides in iAlert, and communicating incidents through status pages.
Our solution is used by small and large companies alike across the world. DevOps teams, ITOps teams, and many service providers use iAlert to improve their operational efficiency, respond faster to incidents, and provide better service uptime to their customers.
Before we dig into the talk, let's first start with the why: why is it important to effectively manage alert overload? Unmanaged alert overload leads to high MTTA and high MTTR, makes you less efficient, and increases stress. And that is a situation where you don't need additional stress, because we're talking about situations where you're probably experiencing a major incident and being paged at 3 a.m. The last thing you need is additional stress from a mismanaged alert configuration where you're being flooded with alerts.
In this webinar, I'm going to show you how you can effectively manage alert overload using something we call intelligent alert grouping. To set the stage, I'd like to show the results of one of our customers: after only two weeks, that customer was able to reduce alert volume by 93 percent and improve their response time by more than four hours. Every minute you save in response time is a minute of improvement in your MTTR, and every minute of improved MTTR leads to better uptime. Uptime is probably one of the North Star metrics that every SRE cares about.
All right.
Before I actually start talking about intelligent alert grouping, I would like to show you other use cases where we have already been successfully leveraging AI. We usually like to cluster these around the incident response lifecycle. One use case we implemented probably six months ago is an AI assistant for on-call scheduling. This is a simple chat interface where you can just lay out your requirements, and the AI assistant gathers all the input that's necessary for the schedule creation process. Let's say you have a complex schedule, for example a follow-the-sun schedule: you can just describe it using natural language. So that's one area where we've been using AI.
Another area, probably the most obvious one, is when you want to communicate incidents to your customers, maybe on a status page. When you want to communicate an incident, first and foremost you don't want to spend too much time on the communication itself, because what you actually want is to reduce the business impact and solve the problem. Of course, communicating incidents also helps you focus on the problem, because you then have fewer people asking for the status and fewer customer support requests. We've embedded AI and large language models into our software so that you can provide just a few words and it will create the entire incident communication for you, using a message and a tone that's appropriate for the audience, which makes communicating incidents very easy.
Another area where we're using AI is in the post-incident phase: the creation of post-mortems. Post-mortems are commonly used across tech teams that see an incident as an opportunity and try to learn from it. A post-mortem document essentially lays out what happened, the incident timeline, what actions were taken, what the business impact was, what the root cause was, and what we are going to do in the future to avoid this incident. In other words: how can we learn from this incident?
And if you have ever been in a major incident, you know that, first of all, there's a lot of data being collected. One type of data is machine generated, like monitoring tools sending constant updates, or maybe other automation tools doing things. Then you usually collaborate in a dedicated chat channel, what we would usually call a war room, where you exchange a lot of information, coordinate tasks, and really work on the incident together. All of this data contains a lot of information, which is usually the foundation for your post-mortem document.
Now you can either go ahead and recreate the timeline yourself, piecing together what was said and what actions were taken, or, and in this area it almost feels irresponsible not to leverage AI, have an AI scan through all the data, go through your chat history, look at all the machine-generated data, and reconstruct the incident timeline, the actions that were taken, and the root cause of the incident based on the chat messages, giving you an 80 percent version of your post-mortem document. That way you can spend the remaining 20 percent on the most crucial phase, which is learning from this incident and discussing with your team how to get better. We think that's an area that is meant to be improved with the help of AI.
But today we're not going to talk about those incident stages; we're going to talk about the response phase. In particular, we're going to talk about how you can leverage AI to reduce alert fatigue and manage alert overload.
This time I'm going to do something different. Usually I would first lay out how we did it and talk about some theoretical concepts behind how we implemented intelligent alert grouping. But this time I would like to first show you intelligent alert grouping in action, which is probably the most interesting part, so you can just see how it works, with the hope that it's clearer when I then explain how we did it, because you will have seen what the end result is. So let's start with a small demo of intelligent alert grouping.
But before I start, let's quickly define what we mean by intelligent alert grouping. Intelligent alert grouping is the process of identifying alerts that refer to the same issue and consolidating them into one alert. This reduces noise and prevents alert fatigue: it keeps you from being overwhelmed by having to look at multiple alerts and figure out whether each one is a new issue or whether it refers to the same issue you were paged about a few minutes ago.
So let's get into the demo. This is iAlert; right now there are zero alerts. What I'm going to do now is create a few sample alerts and see how they behave. Alert grouping is turned on right now. Let's switch to Postman. I'm using Postman here just for this demo, but the alerts could come from any monitoring tool; we are agnostic when it comes to monitoring and have 100-plus integrations. For the sake of this demo, and to make it a little smoother without having to jump between tools, I'm going to use Postman to trigger a few alerts.
The first alert is about a coffee machine being down in an office. The alert says: coffee machine on floor two needs a caffeine boost maintenance. Let's send this alert to iAlert, and of course what we'd expect is that a new alert is created, right? That's exactly what happens, because it's the first alert and there's nothing to group. Now let's create another alert for the same problem, but use different wording: coffee machine on floor two is taking a coffee break. It's still the coffee machine that's not working properly, but we've rephrased it a little bit.
So let's send it to iAlert. As you can see, there is still only one alert, and there's a small animated indicator here that says two alerts have already been aggregated into this alert. If we looked into this alert, we'd see which alerts have been aggregated. Although the two alerts are only somewhat similar on a textual level, the AI is able to understand that, okay, these actually refer to the same problem.
Let's move on with another example. Let's try to be funny and, instead of talking about coffee, talk about Java, which also means coffee: no Java from the Java machine, floor two coffee maker is down. Let's send this to iAlert. Again, here we would expect no new alert to be created, because it's the same coffee machine on floor two, so there's no need for a new alert. And as you can see, the indicator is now at three, so three events have already been aggregated into this alert.
Alright, let's provide a counterexample. Let's say something else on floor two is broken: the water cooler on floor two is broken. Let's send this to iAlert. Here we expect that a new alert is created, because the water cooler and the coffee machine are different things, even though they are on the same floor. So you can see that even though some elements are the same on a textual level, the underlying semantics are different, because one alert talks about a coffee machine and the other talks about a water cooler. That's why a new alert has been created. And the reason this matters: you might not even be the recipient of the second alert; it might belong to a completely different team. You don't want to aggregate an alert that wasn't intended for your team, because otherwise that team might never know something isn't working, and in the worst case the alert goes completely unnoticed and you're informed by your customers that something is broken.
Now let's move on with another example. You're probably thinking: okay, these are good examples and they are easy to understand, but they don't exist in the world of tech. My alerts are not about coffee machines or water coolers; my alerts are about hardware, about Kubernetes clusters, about pods being terminated, about CPU running high, and so on. Fair enough. So let's use a more technical alert.
One that says: the hard drive is full. Let's send this to iAlert, and as expected, a new alert is created: the hard drive is full. Now let's rephrase the same problem with different words; this time we're not going to reuse any of the words from the previous alert. The alert is: the storage capacity has reached the limit of 90%. As you can see, there are no common words between "the hard drive is full" and "the storage capacity has reached the limit of 90%". I'm going to send this alert and quickly refresh. As you can see, both alerts have been aggregated into one, so no new alert was created.
I think this example shows perfectly what we mean by capturing the semantics of an alert: we're not only looking at the textual similarity between alerts. Here you can see that alert grouping is still in progress; the grouping window is set to five minutes, I believe, so if there are no new alerts, the grouping will end in a few minutes. You can also see the entire payload of the events that came in. You might still be skeptical, because these are natural sentences and maybe not representative of real alerts. You might be thinking: my alerts are way bigger, they contain a lot more information, and they might even be a little cryptic in nature. So let's try that.
What I'm about to do now is use actual live alerts from a production system. These were alerts where we know there was clearly a case of alert flooding, which means that in a short period of time, maybe 10 or 15 minutes, lots of alerts were created; I think in this example it's even five minutes or so. The first thing I'm going to do is disable intelligent alert grouping, so we first see the effect of alerts being created without grouping. Then we'll recreate the same alerts with grouping enabled. For this, I have configured a runner in Postman that will send 31 events to our events API. These are actual alerts from Prometheus, so again, real alerts that were generated in a short period of time.
I'm going to run them here. Although in production these alerts were generated over a time period of five minutes or so, I'm going to run them immediately, with a 50 millisecond delay, so we don't have to wait. So let's run this. This is done: 31 events have been sent to iAlert. Let's switch to the alerts overview. As you can see, we have literally 31 new alerts; none of them were grouped, and this is exactly how they were also created in production. Some of them look similar at first sight, things like blackbox probe HTTP failure on different instances, and some of them were on the same instance. Others are clearly different issues; for example, there is an alert regarding a RabbitMQ node that's down.
So let's now try the same example with alert grouping enabled. For this, I'm going to modify the alert source settings in iAlert and enable alert grouping. There is a default grouping window, which doesn't matter much for our purposes right now; it's five minutes, and alert grouping will always happen within the selected window. And there is another important parameter: the similarity threshold. I'll talk later in the presentation about what exactly this threshold means, but to show you the effect of setting the threshold, we have a small preview. The threshold is a number between 0 and 1: if it's closer to 1, fewer alerts are grouped, and if it's very low, almost all alerts are grouped. We're going to leave it at the default of 0.75. And here, based on past alerts, you get an idea of how this grouping would affect future alerts.
Okay, so let's save this and now run the same runner again. We are again creating 31 alerts. This time I actually forgot to resolve the previous alerts first, but it doesn't matter. If we go to the alert overview, we can now see: previously we had 31 alerts, and now we have 34. That means only five new alerts have been created. And you can see here that many alerts have been aggregated, for example these blackbox probe HTTP failure alerts, because there are many alerts that are semantically similar to them. Other alerts, to the contrary, are not aggregated, for example this "hardware service status is not healthy" alert.
You know what I'm going to do? I'm quickly going to resolve everything so we get a clean slate. That way it's clearer how many alerts you would end up with when intelligent alert grouping is enabled.
So let's do this again. The list was empty, and now we have submitted 31 new alerts again. Instead of ending up with 31 alerts, we now have, one, two... just a handful of alerts. As you can see, this also works with real production alerts. In this case, for example, where blackbox probe HTTP failure alerts ended up in separate groups, it's probably because they came from completely different instances; the grouping actually looks at the entire payload, and I'm going to explain later what pre-processing is involved before we use AI to do the actual deduplication. But this problem here, for example, is clearly a completely separate problem. "Hardware service status is not healthy" is also a separate problem, and this "firewall outside subnet running out of IPs" is a completely different problem as well.
So that was the demo. Now let's move back to the slides, and I would like to walk you through how you can build this exact same behavior into your own alerting system, or maybe even use it somewhere else for a similar use case.
All right.
First, I'm going to talk about a few concepts that are required to understand the overall process of how we built this.
Let's start with vector embeddings. A vector embedding is a mathematical representation of any kind of data in a high-dimensional space. Each vector represents a specific piece of data; it could be a word, an entire sentence, or an image. That piece of data is essentially transformed into a vector, and a vector is a point in a high-dimensional space. These vectors, or points in that space, capture the semantic relationships to each other: the closer two vectors are, which we measure using mathematical distance functions, the more likely it is that they refer to the same underlying concept. These embeddings are used almost everywhere in AI applications. When you use ChatGPT, for example, your prompts are first transformed into a series of numbers. Similarly, we will also transform alerts into a series of numbers.
An embedding model is a special type of pre-trained machine learning model that learns to represent complex data, such as words, sentences, or images, in a lower-dimensional space. This is actually where most of the magic happens: we use these pre-trained models to transform alerts into these vectors.
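To make this concrete, here is a minimal sketch of how two alert texts become vectors and how their closeness can be measured. It assumes the sentence-transformers library and the general-purpose all-MiniLM-L6-v2 model purely for illustration; any comparable embedding model works the same way, and this is not necessarily the model used in production.

```python
# A minimal sketch: embed two alert texts and measure how close they are.
# Assumes the sentence-transformers package; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

alert_a = "The hard drive is full"
alert_b = "The storage capacity has reached the limit of 90%"

# Each alert becomes a point (vector) in a high-dimensional space.
vec_a, vec_b = model.encode([alert_a, alert_b])

# Cosine similarity close to 1 suggests the same underlying issue.
similarity = util.cos_sim(vec_a, vec_b).item()
print(f"similarity = {similarity:.2f}")
```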
Once we have everything in place, we can set up our alert deduplication pipeline, which consists of three steps. The first step is pre-processing the alerts. The second step is turning these alerts into vectors using an embedding model. And then we apply the deduplication logic, which I'll get to in a second.
Let's first start with pre-processing the alerts. Pre-processing involves normalization and cleaning. During normalization, we just make sure that all the alerts we have follow a common format. For us, as a platform that sits on top of many observability and monitoring tools, this is more or less already the case: we integrate with a hundred-plus tools and convert all their alerts into a common format. In the cleaning phase, we remove everything that's not required for the task at hand. For example, if your alerts have unique IDs, those IDs play no role in answering the question "are these two alerts the same or not?", because the IDs will differ anyway. We also remove any syntactical elements; for example, if alerts are represented in JSON, we strip all the JSON structure and reduce the alerts to essentially plain text. This not only reduces costs by reducing tokens, which are the primary cost driver when using and developing LLM apps, but it also removes everything you wouldn't consider relevant for deduplication.
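As an illustration, here is a small sketch of that cleaning step: dropping fields such as unique IDs and timestamps that never help answer whether two alerts describe the same problem, and flattening the JSON payload to plain text. The field names are hypothetical; real payloads differ per monitoring tool and integration.

```python
# Sketch of the cleaning step: strip fields that never help answer
# "do these two alerts describe the same problem?" and flatten JSON to plain text.
# The field names here are hypothetical; real payloads differ per integration.
import json

IGNORED_FIELDS = {"id", "fingerprint", "timestamp", "generatorURL"}

def alert_to_plain_text(raw_alert: str) -> str:
    alert = json.loads(raw_alert)
    parts = []
    for key, value in alert.items():
        if key in IGNORED_FIELDS:
            continue
        if isinstance(value, dict):  # e.g. nested labels/annotations
            parts.extend(str(v) for v in value.values())
        else:
            parts.append(str(value))
    return " ".join(parts)

raw = json.dumps({
    "id": "a1b2c3",
    "timestamp": "2024-05-01T03:00:00Z",
    "summary": "The hard drive is full",
    "labels": {"severity": "critical", "instance": "db-01"},
})
print(alert_to_plain_text(raw))  # -> "The hard drive is full critical db-01"
```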
We then vectorize these plain-text documents using an embedding model. In our case, we use a self-hosted embedding model, but you could use one of the general-purpose embedding models provided by OpenAI, for example; they have APIs for their text embedding models, and you could use those. The reason we're using a self-hosted model is that this is a high-throughput use case for us: we're processing millions of events on any given day, and we don't want to introduce an external dependency that requires an external HTTP call, simply for performance and stability reasons. We've therefore selected one of the embedding models available on Hugging Face. In this case, it's a general-purpose model that was trained on a wide variety of text, including common datasets such as Wikipedia but also data from Stack Overflow, which is why this general-purpose model is also capable of capturing technical content.
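If you don't want to host a model yourself and throughput is less of a concern, the hosted embedding APIs mentioned above work the same way conceptually. Here is a sketch using OpenAI's embeddings endpoint; the model name is just an example, and an API key is assumed to be configured.

```python
# Sketch of the hosted alternative: calling an embeddings API instead of running
# a self-hosted model. Requires an OPENAI_API_KEY in the environment; the model
# name is just an example.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The hard drive is full"],
)
vector = response.data[0].embedding  # a list of floats: the alert as a point in space
print(len(vector))
```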
Once we have transformed the alerts into vectors, we store them in a vector database. This is of course optional, but storing them in a vector DB is more efficient, because vector databases are optimized for storing and, more importantly, querying vectors. One operation we're going to use is calculating the distance between vectors, which represents our deduplication logic. We have an incoming alert flow: every alert is vectorized and stored in the vector database. For each new incoming alert, we query the vector database and check whether, within a time window of, let's say, five minutes, there are any vectors close to the one we're looking at right now. This closeness is measured through a similarity function; it could be any function that measures the distance between vectors, for example cosine similarity. We then set a threshold for this distance metric, which returns a number between zero and one: one meaning the vectors are essentially the same, very close to each other, and zero meaning they are very far away from each other.
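Putting the pieces together, here is a minimal in-memory sketch of that deduplication step: for each incoming alert, look at the groups opened within the grouping window, compute cosine similarity, and either attach the alert to the closest group or open a new one. The model choice is illustrative, the 0.75 threshold mirrors the default from the demo, and in production you would replace the plain Python list with a vector database query.

```python
# Minimal in-memory sketch of the deduplication logic. A production setup
# would query a vector database instead of scanning a Python list.
import time

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
GROUPING_WINDOW_SECONDS = 5 * 60                 # the five-minute window from the demo
SIMILARITY_THRESHOLD = 0.75                      # the default threshold from the demo

open_groups = []  # each entry: {"vector": ..., "alerts": [...], "created_at": ...}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def process_alert(text):
    now = time.time()
    vector = model.encode(text)

    # Only groups opened inside the grouping window are candidates.
    candidates = [g for g in open_groups if now - g["created_at"] < GROUPING_WINDOW_SECONDS]
    best = max(candidates, key=lambda g: cosine_similarity(vector, g["vector"]), default=None)

    if best is not None and cosine_similarity(vector, best["vector"]) >= SIMILARITY_THRESHOLD:
        best["alerts"].append(text)  # same underlying issue: aggregate into the existing alert
    else:
        open_groups.append({"vector": vector, "alerts": [text], "created_at": now})  # new issue

for alert in [
    "The hard drive is full",
    "The storage capacity has reached the limit of 90%",
    "RabbitMQ node rabbit@mq-1 is down",
]:
    process_alert(alert)

# How many groups remain depends on the embedding model and threshold you choose.
print(len(open_groups), "alert group(s) after deduplication")
```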
The main advantage of this approach is that we're capturing the underlying semantics of an alert instead of looking at textual similarity. You can even use this if you're monitoring your service on multiple levels: you can apply intelligent alert grouping across multiple monitoring and observability tools. Things you have to consider are, of course, selecting a model that's appropriate for your domain, and then making sure it works as expected, because these models can behave differently. For example, if a model wasn't trained on non-English data, it might behave differently if your alerts are not in English. And of course, look at the threshold and play with it, because the threshold on its own doesn't mean anything: a threshold of 0.75, for example, could be perfectly fine for one specific area, but for another you might want 0.9. I showed you one way to do this: testing the threshold on past alerts, where we used the slider and the grouping preview updated simultaneously based on those past alerts.
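A similar offline check is easy to script yourself: replay a batch of historical alerts at several candidate thresholds and compare how many groups each threshold produces with what your team would consider correct. A small sketch, again with an illustrative model and made-up sample alerts:

```python
# Sketch: replay historical alerts at several thresholds and compare how many
# groups each threshold produces. Model choice and sample alerts are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def count_groups(alert_texts, threshold):
    vectors = model.encode(alert_texts)
    group_vectors = []  # one representative vector per group
    for vec in vectors:
        similarities = [
            float(np.dot(vec, g) / (np.linalg.norm(vec) * np.linalg.norm(g)))
            for g in group_vectors
        ]
        if not similarities or max(similarities) < threshold:
            group_vectors.append(vec)  # nothing close enough: open a new group
    return len(group_vectors)

past_alerts = [
    "Blackbox probe HTTP failure on instance api-1",
    "Blackbox probe HTTP failure on instance api-2",
    "RabbitMQ node rabbit@mq-1 is down",
]

for threshold in (0.6, 0.75, 0.9):
    print(f"threshold {threshold}: {count_groups(past_alerts, threshold)} group(s)")
```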
And last but not least, we also collect feedback on alerts. It's a simple thumbs up, thumbs down, where you can say: okay, this was correctly grouped, or this was falsely grouped. We make these metrics transparent, and based on these metrics you can further fine-tune the model.
All right.
Thank you for listening. We do have a dedicated guide for AI and incident management, which also covers the other use cases I briefly hinted at in the beginning of this talk. Feel free to scan the QR code and download the guide. These guides are also available on guides.eiler.com in an HTML version, where you don't have to download a PDF, but if you're more comfortable with PDFs, just scan the QR code and download the PDF guide. Since this is an online conference, I'll be around; if you have any questions, feel free to drop them in the Slack channel. Thank you.