Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hello, and welcome to my talk: incident response, incident management, and alerts, and where they fit in cloud operations.
Now, if you just read that and you're thinking to yourself,
aren't all of those the same thing, then you've come to the right
place, because they are not the same thing. And that's exactly the purpose
of this talk. I'm very passionate about the idea that
these are not the same thing. Before we begin,
I want to use a little analogy. This analogy comes
from the fact that I really want a Tesla. And some
of my coworkers recently got Teslas, and I'm a little bit jealous.
So maybe I want to emphasize the power of the Tesla more than this particular analogy requires, but really, it's applicable.
So when I was growing up, one of my first cars was, well, it wasn't a Honda Accord, it was actually an Acura, but it's very similar. And with the 1998 Honda Accord, when something breaks, it's pretty clear, and we have a good understanding of what to do when that happens. As for the types of things that can break in the 1998 Honda, there are a lot of things that can go wrong, but they're nothing compared to what we see in the Tesla.
Why is that? Well, technology has changed. In the Tesla, not only are we dealing with some of the same mechanical challenges you find in the Honda, we have a whole bunch of other technologies that are bundled in here. So, yes, things can go wrong with your brakes and your tires and your motor, even though the nature of the motor is extremely different. But the Tesla is also running a whole bunch more software that you never found in the Honda. And the complexity of that software is such that it's also running in a cloud native type of way, with containers and microservices, et cetera. So when something breaks in your Tesla, to find out exactly what went wrong you can't just rely on an alert or a check engine light. You have to dig deeper. And to dig deeper, you need better tools for it. So that's what this talk is all about.
The nature of applications has changed in much the same way as the nature of cars, if you think of your 1998 Honda as your monolith and your 2020 Tesla as your microservices, cloud native application. And as a result, how we approach fixing those problems also has to change. So my name is Chris Riley. I am a senior technology advocate at Splunk. What does that mean? Basically, I was a developer.
I really enjoyed software development, but it wasn't my forte.
So I'm a bad coder turned advocate. I couldn't give
up the lifecycle and what it takes to build
better applications faster. So now
I talk about it. If you scan that QR code, you'll get access
to other information about me. Please connect.
I love to hear from people who have attended my sessions,
both good and bad. If you have feedback on how to do better,
please let me know. That's how I improve, but also
just reach out to say hi. And there's a few fun
little games I have on my social profile there.
So back to what we were talking about, and that is that
transformation is a given in the technology
space. It is hard even in a six-month span to keep current on all the things
that are going on in the DevOps and the application development
market. So as a result, we have to think
about change as a constant. Now we
can't just expect to do what we did historically
with modern applications. That also
has to change. And ideally the way we monitor
and support our applications should be ahead of
the transformation of the application itself. Most organizations, we find, are currently in this lift and shift and refactor stage of application development. There are a lot of companies that are in that cloud native world: they're born cloud native, and they kind of have all the practices I'm talking about today ironed out from day one, where they've actually embedded their monitoring as a feature of their application. But most people don't have that luxury. We're going from shifting our workloads to the cloud to then breaking them apart with the strangler pattern, making them more like a cloud native application. So because
this is happening, we have to acknowledge the fact that as
we transform, we can't just think about supporting
our applications the same way. What we used to do in a monolithic application architecture is go to that one server, look at the alerts on that server, pull them up, and kind of see what's going on. And generally
because these servers were set it and forget it,
the things that went wrong were predictable.
Like the 1998 Honda. I don't know if
you've experienced this, but back when I was maintaining servers in data centers, we always had that one server, the Billy Bob server, and every two weeks we had to restart that Billy Bob server. And when we restarted it, everything was magically better. Did you try turning it off and on again? We did, and it fixed the problem. We weren't really looking at what was happening in the application, and alerts actually were enough context to address the problem, because that infrastructure never changed. The number of things that could go wrong was fairly limited.
Now we're talking about applications with distributed architectures,
microservices based architectures. So instead of one
server, we have many, many services, all doing
very small, very specific things, and each service
has a contract with the rest of the organization, but the service itself doesn't really care. All it cares about is that it takes in some sort of input and outputs some sort of data. But anybody can consume that service. Well, because the web of these things can interconnect in unknown ways, it's not possible for me as an operator to go to the Billy Bob server, because there is no equivalent Billy Bob service. There is no one service that
I can just always go to look at the events for that service,
restart the service, and be good. We are not able
to conceptualize and hold in our minds the
architecture of these applications and the
entire nature of the application like we could in monolithic days.
So we can't expect to go and find the problem.
The problem has to find us, and it has to find us
in a way that has good enough context that we can resolve the
incident quickly.
So that's where, when we start to look at the relationship
between the services, we have more of a visual, kind of like this.
And again, if we were to approach it with an alerting
approach, we could probably go and find an alert that would
give us some sort of detail. But it may be the wrong source,
because we can have an API gateway throw an alert and
say things are broken, only to find out way too far down the road that it had nothing to do with the API gateway. It wasn't the API gateway that was causing the problem at all; the problem just cascaded down, and the API gateway was the first to scream at us.
So this becomes very complicated.
And as you can see, alerts are not sufficient
in this type of scenario. But also
because of the complexity of these modern applications,
the life of those who are on call becomes more
difficult. Nobody wants to be paged at night.
Being on call sucks. You could be a developer who's on call,
you could be an SRE, a cloud operations engineer.
Everything we do, and we are very guilty of talking about features and functionality of technology, but everything we do is in the service of making the on-call experience better. That's ultimately what we're trying to do here. We can get all convoluted about monitoring and observability and all that stuff, but really all we're trying to do is make that on-call experience better. What does better mean? Well, first of all, it means hopefully you're never woken up at night. That may still happen. Second, it means that if you are on call and you are woken up, you're the right person. It actually gets to the person who should be resolving the problem. And if you aren't the right person, hopefully you're given enough data that you can resolve it yourself or know exactly who to go to to resolve it.
So logs simply are not enough. You cannot just be on call and go from an alert to the logs. Everything has to change. Well, because of that, we get this beautiful new term, and it's called observability. And if you're really cool, you abbreviate it o11y, maybe even pronounce it "olly."
That's the upper echelon of the terminology club there.
If you're a good curmudgeon like me, and you hear these terms,
first thing you do is say, no, we don't need another term,
it's just monitoring. Yes, you're right,
observability is just monitoring.
But the reason we use the term observability is the nature
of monitoring has changed, which means the things that we need to do to support our applications better have also changed.
And in the world of observability,
alerts are not incidents and incidents
are not tickets. Now, if you remember at the very beginning I said,
you're in the right place if you think that these are the same things,
because they're not. They are fundamentally different and
they need to be different, because whereas alerts potentially got us to a log, and that was good enough, we now have to live in this world of incidents. But incidents imply something completely different from ticketing and using a ticketing system.
Well, let's go back to the car analogy and see how this comes
together. So alerts are
like your check engine light. Your car has told you something is wrong,
a system inside of your vehicle has screamed "I am broken" or "something is not correct here" to tell you something is wrong. That's all you generally get with an
alert. You get a pointer to an issue.
Now that's valuable. If you're driving down the
road, check engine light comes on, you're probably going to be more hyper
aware of your environment and maybe start to
think about taking action. So you pull over to the side of the road. What's the next step? Well, from the alert itself, unless you have smoke coming from your hood and you know you're overheating, or there's something else obvious that you can address, you don't know enough to fix the problem, so you need help. Well, that's
incident response. Incident response is your roadside
assistance. It's who you call to come and
support you in finding the problem and resolving it.
Now maybe they get there and they can jump your
battery, they can replace your battery, they can give you coolant,
whatever it is, maybe they can fix the problem. That's the ideal scenario. Or maybe, with their
expertise they understand that this has escalated and they
need to bring in help. How do they do that? They take you to the
shop. So you go into the shop and they plug
you into a computer and they get diagnostic codes.
Or in the case of the Tesla, they get a whole flood of information
coming out of the system. So incident response
is that whole lifecycle of mobilization and you
want it to happen as quickly as possible so you can
resolve the incident. Well, where does that put incident management? Well, the foundation of incident management came from ITSM practices, and it's driven with a ticketing system. In the States, we have something called Carfax. Carfax is the equivalent of incident management. It is a history of all the things that have happened to your automobile. So once your car has been brought into the shop, a Carfax record will be created, an incident will be created, and it will be a historical record of everything that has happened, and potentially you can use that information in the future.
That is what incident management is and should be.
Many organizations will try to use incident management and ticketing as their incident response tool, though, and that's not the most effective way to do it. And the reason it's not the most effective way to do it is that a lot happens in the lifecycle of an incident: from the alert, to the creation of an incident, which is not just an alert (as a matter of fact, an incident can be many alerts put together), to paging and escalation, getting to the right person at the right time, giving them all the context, and then finally moving that into maybe ChatOps or a ticketing system or whatever it is to keep that system of record.
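To make that paging and escalation piece a little more concrete, here is a minimal sketch of what an escalation policy might look like in code. This is a hypothetical illustration, not any particular vendor's API; the responder names, timeouts, and the acknowledged callback are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EscalationStep:
    responder: str          # who gets paged at this step (hypothetical on-call target)
    timeout_minutes: int    # how long to wait for an acknowledgment before escalating

@dataclass
class EscalationPolicy:
    name: str
    steps: List[EscalationStep]

    def page(self, incident_summary: str, acknowledged) -> str:
        """Walk the steps in order until someone acknowledges the incident."""
        for step in self.steps:
            print(f"Paging {step.responder}: {incident_summary}")
            if acknowledged(step.responder, step.timeout_minutes):
                return step.responder          # right person reached, stop escalating
        return "spray-and-pray fallback"       # nobody answered: blast the whole team

# Example policy: primary on-call first, then secondary, then the team lead.
policy = EscalationPolicy(
    name="checkout-service",
    steps=[
        EscalationStep("primary-oncall", 5),
        EscalationStep("secondary-oncall", 10),
        EscalationStep("team-lead", 15),
    ],
)
```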
Now, if that view didn't look complex, this view looks even more complex. This is an entire architecture of the lifecycle of an alert: from some sort of system or tool, to creating an incident, mobilizing a responder, and potentially taking action within the incident itself. So what we get with an incident
over alerts is less noise, hopefully depending
on the nature of your systems. I'm still blown away by the number of systems I see where they're like, "Oh yeah, we get 1,000 alerts a day and we just kind of ignore them." Why don't you go fix them? That should be your first effort. But the idea of an incident is less noise, because incidents are aggregations of alerts. There are rules and logic that determine when something actually becomes an incident, when it is significant enough to be an incident. Incidents also give us more context.
Alerts trigger off of something very specific in an application or the infrastructure, so the amount of detail is very limited. An incident, on the other hand, can be based on a history of incidents and can be linked and correlated with Confluence articles, runbooks, et cetera. Which gets us to the point where incidents can drive action: because there is more context and detail as part of the incident, there can be greater action and greater execution as a part of that.
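As a rough illustration of that "rules and logic" idea, here is a minimal sketch of how alerts might be aggregated into an incident. This is a hypothetical example, not any specific product's behavior; the grouping key, time window, and severity threshold are assumptions made up for the sketch.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Alert:
    service: str       # which service fired the alert
    severity: int      # 1 = info ... 5 = critical
    timestamp: float   # epoch seconds

@dataclass
class Incident:
    service: str
    alerts: List[Alert] = field(default_factory=list)

def aggregate(alerts: List[Alert], window_seconds: float = 300, min_severity: int = 4) -> List[Incident]:
    """Group alerts by service within a time window, and promote a group to an
    incident only if at least one alert in it is severe enough."""
    incidents: List[Incident] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        # Attach to an open incident for the same service within the window...
        for incident in incidents:
            if incident.service == alert.service and \
               alert.timestamp - incident.alerts[-1].timestamp <= window_seconds:
                incident.alerts.append(alert)
                break
        else:
            # ...otherwise this alert starts a new candidate incident.
            incidents.append(Incident(service=alert.service, alerts=[alert]))
    # Only keep groups that crossed the significance threshold.
    return [i for i in incidents if any(a.severity >= min_severity for a in i.alerts)]
```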
Now again, all we're really trying to do here is make the life of on call suck less, which is great because it also makes the application run better. So in a typical incident lifecycle, we have the process of acknowledgment.
You touch a whole bunch of tools, you might reach out to a whole bunch
of people. Usually there's two methods of interacting
with other people. The first method is what we call spray and pray,
where you blast out to the entire organization. You hope somebody picks it up.
Very common in a NOC-type environment. And it used to work okay, but it was never super effective. What happened is there was a lot of back and forth between people, you were touching a lot of tools to find the information you needed, and the number of touch points was tremendous. So it usually lasted about
6 hours and five reroutes, et cetera.
This is totally based on the nature of your application, but this is actually a
consistent average we have seen with customers.
Then once you get to the source of the data, you've found the right person, or maybe you do what's called lazy mobilization. That's where you always find that one person who fixes everything, but you quickly put them on the path to burnout. And then you start towards resolution. Resolution is not just,
oh, we need to restart the service, resolution is,
we restarted the service and now everything is green. Because usually,
and especially in a microservices world, the source
of the problem will cascade into other services and cause
issues down the road. And it takes time for those all to come online.
So with true incident response, versus one driven by a ticketing system, we're trying to normalize and flatten that curve. We're trying to get acknowledgment to be as quick as possible; that's the mobilization, the roadside assistance. We're trying to touch as few tools as possible; that's context, understanding, getting to the right person, and then having the tools, like runbooks, to resolve the incident as quickly as possible.
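One way to see whether you're actually flattening that curve is to measure it. The sketch below is a hypothetical example of computing mean time to acknowledge (MTTA) and mean time to resolve (MTTR) from incident timestamps; the record fields and sample values are assumptions for the illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class IncidentRecord:
    triggered_at: float     # epoch seconds when the incident was created
    acknowledged_at: float  # when a responder acknowledged it
    resolved_at: float      # when everything was green again

def mtta_minutes(records: List[IncidentRecord]) -> float:
    """Mean time to acknowledge: how fast mobilization happens."""
    return mean(r.acknowledged_at - r.triggered_at for r in records) / 60

def mttr_minutes(records: List[IncidentRecord]) -> float:
    """Mean time to resolve: from trigger until the system is healthy again."""
    return mean(r.resolved_at - r.triggered_at for r in records) / 60

history = [IncidentRecord(0, 300, 7200), IncidentRecord(0, 120, 3600)]
print(mtta_minutes(history), mttr_minutes(history))  # prints 3.5 and 90.0 (minutes)
```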
So this is how it all comes together. Our observability practice, which is monitoring, is what gives the alerts and the context to our incident response practice, which is mobilization and action, which in turn feeds our incident management and ticketing process, which is our record and tracking of incidents over a long period of time.
And another way to look at this is as kind of a hierarchy of knowledge, or insight, or ultimately success, which starts with the alerts. We get tons of alerts, and some of them are useful, some are not. But these can roll up into incidents, where we have more meaning and we can start to mobilize and do something meaningful with the alerts and get that in front of the right person. Then, through troubleshooting,
the incident links us directly to a dashboard.
And once we are at that dashboard, we start to get insight into what's actually going on. We may dive into a log at that point, or, in a microservices environment, we may be looking at traces and spans for the details of what's gone wrong. Everything we need to resolve the incident as quickly as possible should be right there. Once we have insight and we know what's going on, we take action. So we want to compress the time between all of these steps, and we also want to reduce the noise, which would be shrinking each of these steps.
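To give a flavor of "everything should be right there," here is a minimal sketch of an incident payload enriched with deep links to a dashboard, a log query, a trace, and a runbook. The field names and URLs are hypothetical placeholders, not any particular tool's format.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class EnrichedIncident:
    title: str
    service: str
    # Links the responder can jump to directly instead of hunting through tools.
    context_links: Dict[str, str] = field(default_factory=dict)

incident = EnrichedIncident(
    title="Checkout latency above SLO",
    service="checkout",
    context_links={
        "dashboard": "https://dashboards.example.com/checkout",      # service health view
        "logs": "https://logs.example.com/search?service=checkout",  # pre-filtered log query
        "trace": "https://traces.example.com/trace/abc123",          # example failing request
        "runbook": "https://wiki.example.com/runbooks/checkout",     # what to do about it
    },
)
```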
So if you came here thinking that incident response, alerts, incidents, and tickets were all kind of one and the same, hopefully you now understand that they are not. And what I'm encouraging you to do, if you have not established a practice of incident response, if you're just living in your monitoring tools and staring at dashboards hoping that at some point in time something interesting happens on your dashboards, or you're using a ticketing system and hoping that somebody is really fast at typing tickets and puts enough detail in there that somebody will grab it and resolve the incident, is to build one. A true incident response strategy is necessary for the SRE practice. It's necessary for any sort of cloud operations, because the nature of applications in the cloud has changed. They're distributed. You don't have the Billy Bob server that you can always go to to resolve the problem in the same way you've always resolved it. You don't have the 1998 Honda.
You are now driving the Tesla. Congratulations. Now you
have to support it in a different way. So hopefully that
was compelling. Please again, reach out if it makes sense.
And thank you so much for joining me.
Hopefully you enjoyed it. Join me at
another virtual event. The content here is fantastic. I know
there's a ton of information out there, so for you to take a little
bit of your time and spend it with me means a lot.
So I hope to see you at another event and have a great day.