Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Thank you so much for joining me. This talk is about incident management: talk the talk and walk the walk. When I was in high school, the common belief was that if you actively listen in class, you'll have 50% of the exam prep already in your pocket. I want to show you how I adapted this belief into an actual proactive approach that you can take, one that will help you manage incidents more efficiently, in a more structured way, and eventually preserve much-needed hours of sleep.
So first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer and I work for Wix. I have 15 years of experience in the tech industry, which means a lot of production incidents. Recently I joined the AWS Community Builders program. I live in Israel and I help organize DevOps events, like DevOps Days Tel Aviv and the StatsCraft monitoring conference. I'm a mentor in courses and communities, including communities for technical women in tech. I'm a DevOps culture fan; I think this is what helps companies achieve great things. And I'm the lead singer in a cover band, as you can see in this picture, which is a lot of fun. Okay, so what are we going to cover today? Incident management in general; the structured flow, as I see it, that you can follow; the mindset that you should have while dealing with production incidents; and how you can be proactive and come prepared for incidents. So let's start. So, first of
all, incident management is a set of procedures and actions
taken to resolve critical incidents. And it's
basically an end to end process that defines how incidents are
detected and communicated, who is responsible to handle them,
what tools are used for investigation and response, and what steps
are taken for resolution. And the thing with incidents is that we need to reframe our perspective. Because after working enough years in the industry, we know that everything fails, right? All the time. So since failures are a given, we can't be in an ad hoc, putting-out-fires mindset. We need to reframe the mindset to be structured and say, okay, we know this is going to happen, but at least I'm prepared to deal with it. So a business mindset is needed to grasp the overall impact of incidents and mitigate damages.
Because without a structured incident management process, or even without handling incidents properly in general, we could potentially lose valuable data, downtime could lead to reduced production and revenues, and the business could be held liable for breach of service level agreements. Because as we know, each business has its own SLAs defined, its promised number of nines. So it's very important to treat incident management with all seriousness. That's why we need to reframe our perspective, have a business mindset, and make incident management a structured process, because a structured process can lead to incident reduction, improved mean time to resolution, and eventually to cost reduction, since downtime was reduced or eliminated entirely.
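As a quick aside, "mean time to resolution" (MTTR) is simply the average time from detection to resolution across incidents. A minimal sketch, with made-up sample data:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to resolution, in minutes, over a list of
    (detected_at, resolved_at) datetime pairs."""
    if not incidents:
        return 0.0
    total = sum((resolved - detected).total_seconds()
                for detected, resolved in incidents)
    return total / len(incidents) / 60.0

# Hypothetical sample data: two incidents, 30 and 90 minutes long.
incidents = [
    (datetime(2023, 5, 1, 3, 0), datetime(2023, 5, 1, 3, 30)),
    (datetime(2023, 5, 2, 14, 0), datetime(2023, 5, 2, 15, 30)),
]
print(mttr_minutes(incidents))  # 60.0
```

Driving that average down, or shrinking the list of incidents feeding it, is exactly what the structured process buys you.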
So wait, a structured process for an incident? But how could it be? We have a lot of unknowns. Sometimes it's incident X for one reason, sometimes it's something else. So how can it be structured if it's not consistent? Well, yes, it can be a structured process, because there are pillars that you can follow through. I'm going to cover each one of them: identification and categorization, notification and escalation, investigation and diagnosis, resolution and recovery, and eventually incident closure, and what you should do then. And I also put a reference link here to an article by OnPage; you can deep dive into these pillars later on. Okay. So during an incident you should really
keep calm and ask yourselves these questions (I'm also going to address the "keep calm" part further on in the presentation). First of all, in the identification and categorization pillar: do I understand the full extent of the problem? If so, awesome, I can dive right in and notify people if I need to. Because sometimes, if it's a crucial issue, I know that I need to update person X, or even customer success, to alert users that the application is down, depending on the issue and its full extent. And if I don't know or don't understand the full extent of the problem, then I should gather more information that will help me understand what's going on and what steps I need to take based on that. Next up: can this wait and be handled in business hours? Because maybe you got paged at 4:00 A.M., but maybe it's not that important, and the alert is falsely labeled as critical when it's actually minor. So we should address that. And if we're not sure whether this can wait for business hours or not, we should really ask, use the information that we gathered in order to understand that, and escalate if we need to. And if we saw that the incident is not really critical, that it's minor, we should really change the severity or the runbook accordingly. Another thing we need to check: was I notified about this alert or this issue through the proper or expected channels? If so, awesome. If not, maybe I should add a note to self to fix that, because if I heard about an issue from a user complaint and not from PagerDuty or OpsGenie or tools like that, then we should really handle that.
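The questions in this pillar can be sketched as a tiny decision helper. The severity names and rules below are illustrative assumptions of mine, not a standard; real thresholds belong in your runbooks:

```python
def triage(alert_severity, users_affected, during_business_hours):
    """Decide what to do with an alert, following the questions above.

    The rules here are illustrative placeholders; tune them to your
    own severity scheme and escalation policy.
    """
    if alert_severity == "critical" and users_affected > 0:
        return "page_now"            # wake someone up, notify stakeholders
    if alert_severity == "critical" and users_affected == 0:
        return "relabel_severity"    # falsely labeled critical: fix alert/runbook
    if during_business_hours:
        return "handle_now"
    return "defer_to_business_hours"

print(triage("critical", 0, False))  # relabel_severity
```

The point is not the code itself but that the triage decision is explicit and repeatable rather than improvised at 4:00 A.M.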
The next pillar is notification and escalation. So who should be notified about this incident? Here we have two routes: during an incident, and in general. During the incident, you should decide by incident importance. Again, if it's critical, if the application is down and affects a lot of users, then we should alert the support or customer success teams to communicate to customers if need be. And in general, maybe there are other teams or key focal points that care about the system, and we need to keep them posted about the system's health and status. The next pillar
is investigation and diagnosis. So, what information is relevant toward incident resolution? We should really focus on what's important and relevant and put the unimportant stuff aside, because focusing on non-relevant information will throw you off route and make you lose valuable time. When you deal with an incident, it doesn't matter if it's during business hours or not; it will cost you valuable time either way. So you should really focus on what's important right now. Did I find the root cause? Do I understand the root cause of the issue? If so, awesome, I can progress accordingly. If not, I should investigate more, and if I see it's taking a lot of time, I should really escalate to other team members, team leaders, or other teams to help me understand the root cause. We don't want to lose valuable time and have the system in downtime if I could prevent that by just asking for help, right? And we should also prioritize root cause over surface-level symptoms. Let's say I got an alert: service is down on server X. Okay, I restarted it. Nice. Then it happens again. And then it happens again. At that point I shouldn't just start the service and go back to sleep, or just continue with my day. If it keeps happening, I need to check what's going on, right? I need to find the root cause of why the service got stopped. Because our focus is the environment, the system's health, so we need to make sure we know what's going on and not just put a band-aid on the scenario. And the last pillar is resolve
and recovery. So which possible remediation step is the best one to take? Maybe I found the issue, and maybe there are a lot of things I can do about it, right? The thing is that you should choose the fastest solution that eliminates downtime without compromising the system's health and stability. Because, yeah, we all want to go back to sleep when something happens, since we care a lot about our quality of sleep, but at this point in time we should really care more about the system's health. We should do whatever is good for the system's health, because it will come back to bite us if we don't. And this is what we're here for, right? We are either DevOps engineers or SREs. What is an SRE? A site reliability engineer. If I don't care about, or don't take care of, the site's reliability, I'm not really doing my job.
So there's that. Next up: are there any action items needed after the issue got resolved? Maybe it was the middle of the night, and there wasn't really time to go into the developer's code and fix the issue properly, so maybe a patch was done. Okay, if there was a patch done and management knows about it, everyone knows about it and agreed that it should be done at that point, all good. But we should still permanently fix the issue, because we want to, a, prevent a recurring issue, and b, make sure the system's health is good. And if it's just a patch and not a permanent solution, then probably the system's health is not that great.
And last but not least is closure. Once the incident got resolved, do I need to notify anyone of this incident's resolution? We need to be end-to-end communicators. So if at the beginning we alerted customer success or support teams that there's an issue, we now need to tell them, okay, the issue got resolved; please make sure that, a, you communicate it to the users, and b, you let us know if anything comes up. Maybe we think the issue got resolved, but something happens and users still experience issues, right? So they are our QA of some sort, to make sure that everything works okay, but they also need to communicate to the users that the system should now be back to normal. Were the alerts okay, or do they need tweaking? As I said, maybe we got an alert in the middle of the night and it's not that critical; we need to fix the alert and tweak it. So we should do that. Is a relevant incident runbook in place?
If it's outdated, maybe it needs to be updated, right? Runbooks are things that we have during an incident, or should have during an incident, that help us resolve an issue, mostly when we need to apply some sort of judgment to an incident. Let's say, if this happens, I need to do this, but if the other log shows X, then I need to do that, right? There are a lot of things and scenarios where judgment is needed, and in those cases we should really have runbooks in place. If we don't have runbooks in place, please, we all need to write them down. And even if we do have runbooks in place, we need to update them to make sure they are up to date. Could I help prevent similar incidents from happening again? Maybe I noticed something that could be tweaked, changed, or fixed in order to prevent similar incidents from happening. If so, create a task for yourself in Jira or Monday or whatever tool you use, and help prevent the next incident from happening.
And also, of course: does this incident require a postmortem? If yes, then jot down the notes as soon as possible, while it's still fresh in your mind. I think we all know that we are human beings and we remember things better while something is still happening, not an hour or two later, or the next day. Just to put some emphasis on that, there was a study conducted by Bluma Zeigarnik, a Russian psychologist. She found that we remember more details during an ongoing task than after its completion. So, in favor of a better postmortem process, write the notes down as soon as possible. And if there isn't a need for a postmortem, still share the knowledge, either through a runbook or a daily brief, or even do a mental check with yourself to make sure that everything was handled as smoothly as possible, and if not, ask what could be done better.
Okay, let's talk a bit about war room conduct. A war room is when you have, I would say, more than three or four people handling an issue, right? In that case, we should really have an incident manager who divides the work and tells people what to do. This person should be calm and collected, see things clearly, and not be afraid to reduce people's involvement if it doesn't serve the purpose. Because if, let's say, we called this guy in to help with a certain thing, and now that certain thing is finished, okay, now, guy, please go away. We don't need the extra noise, because too many people can be too noisy. It should be kept minimal and dynamic. And there's that. And I want to tell you a story about
that. At one of my previous jobs, I was new at the company, and there was a critical AWS issue that created a lot of bad stuff for us, downtime, not good stuff. And I was new, right? So I didn't speak up, because I didn't think I had something to contribute; I didn't know anything yet, I was new. But I was there in the war room, which was on Zoom, so I was just there quietly. And I saw that they were just going places that were not really helpful: one pulls the rope this way, another thinks about that, another talks about that, and everyone is doing their own thing and not really coming together. So at that point I jumped in and said, guys, I don't see that this is coming or going anywhere, let me help. And then I took the liberty of being the incident manager. I told this guy, okay, you check the logs, check X. I said, I don't see that we have a runbook for a proper startup of the application, one that shows the flow that needs to happen in a specified order, so please write down the process for us to start the application in the order that's needed, right? So I took this role upon myself, and then things really started to progress toward resolution. So an incident manager is very much needed in a war room, because we need an organized way of doing things, as always. Okay. And speaking
of an incident manager, there are a lot of qualities that you should have when you handle an incident, whether it's in a war room or on your own. There are a lot, of course; I'm not going to mention every one here, because I can't cover all qualities, but let's cover the ones that I think are important, with some tips from me on how to perfect them. The first one is think on your feet: be an impromptu action taker. Sometimes the issue will be something that you're familiar with, but sometimes it will be in uncharted territory, and you need to think on your feet and be ready for anything. In order to practice that, you can participate in brainstorming sessions at work. Whenever possible, jump in and participate in these sessions, because these kinds of ping-pong scenarios will help you practice this quality.
The next one is differentiating between relevant and non-relevant information. As I mentioned before in the war room story, this one said that, that one said that, and people talked and talked and talked. I'm like, guys, we're not progressing toward resolution; we need to focus on what matters right now. So it's a very important trait to have, to differentiate between what's important for fixing the issue and what's not. And basically, the more you know about how a system works, the more your ability to separate the relevant from the non-relevant information increases. Operating under pressure. So let me also
tell you a story from another job. I had just joined there, so I was new in that position as well, and I had my first on-call, my first on-call shift. And there was a big issue, I mean, a lot of alerts on the screen. Like 100 alerts. It was crazy. I looked at the screen and I'm like, okay, I'm new, I don't know what to do yet. Let's call the guy that's been there for two years, and he will help me, right? Because he knows what to do; he's familiar with what's going on there. So I called him, and he sat next to me, and I'm like, okay, now what do we need to do? And then, I remember, he just looked at the screen and was like, wow, there are so many alerts. And I'm like, dude, dude, snap out of it. He was totally out of it, and I'm like, dude, it's not helpful; we need to snap out of it and see what we can do to fix this, right? And the thing is that stress is a symptom of being out of control, and collecting relevant data will help you decrease your stress levels. When you know what to do, you're in control. In general, you should really keep a cool head and snap out of the uncertainty cloudiness. When it comes to multiple-participant incidents, it's also something you should think about: the stress level increases, because everyone is stressed and wants to fix the issue, right? So keep a cool head, start gathering information that will help you solve the issue, and regain your control.
Methodical work. Time is of the essence, right? And there is pressure to solve things fast, as I just mentioned in the previous bullet. But the thing is that methodical work will help you reach faster incident resolution. As I showed you before: follow the structured process, follow the questions that you need to ask yourselves, and it will help you regain control and progress toward faster incident resolution. Be humble; if you're stuck, ask for help. It's okay not to know how to fix an issue on your own. That's okay. But you need to understand that it's not your time to shine. People say, I will fix the issue, I will be the hero, and that's that. But no: your time to shine will be when you help the company avoid losing money and avoid downtime, right? You will have a lot of opportunities to prove yourself in your day-to-day. The best way to prove yourself in incidents is to take a step back and escalate an issue if you don't know what to do and can't resolve it on your own, because that way you have the business's interest at heart. Remember the business mindset? It's exactly that.
Problem solver. If you have a problem-solver, whatever-it-takes, can-do approach, you can basically do anything, because being positive is the way to go. If you start from that point, and not from a negative one like "I'm not sure this is salvageable," your ability to get things done increases. So always have a positive, can-do approach. Sense of
ownership and initiative. If you're on call and you escalated something to another person, that's good, right? We just talked about it. But you are still on call, which means that even if you escalated something, you're still responsible, and you need to handle things end to end. So after escalation, wait ten minutes, fifteen minutes, whatever it takes, and then ask: hey, what's going on? Do you need help? Do you know what to do? Make sure that you know what's going on and that it's really being handled, because maybe you escalated, but the other person didn't understand it correctly and isn't really handling it, and now nobody is handling the issue. So communication is very important: make sure that if you escalated, someone is really handling it, and that it's an end-to-end process. The next quality: good communicator. You really need to explain
the issue to others who will help you, and communicate the issue for escalation purposes. So being a good communicator is very important, and communication guidelines can be established. Let's say you're not great at communication: you don't know who to talk to, or you don't have the tendency to update people, right? If your company or your department sets communication guidelines, then you will know exactly which channels should be used, what content is expected in those channels, and how communication should be documented. If you have this all laid out for you, then you know exactly what should be communicated, and it will help you be a better communicator. Lead without authority: it's mostly relevant in a war room scenario with more than two people involved.
And remember that if you're nice and confident, if you make people feel at ease and project an everything-under-control facade, then people will listen to you and follow your lead. And I think the most important thing is caring. You need to care about what's going on. You need to care about production, about your team members, about your company. If you care, then you will go the extra mile, and you will be able to do everything I mentioned here: the structured process, and also the proactive approach that I'm going to show you right now. Okay, so we
covered the mindset that you should have, the business mindset, when working on production and handling production incidents. We covered an incident flow, a structured one, that will help you handle an incident better. Now let's talk about being proactive: the proactive approach that you should take in order to come prepared for the incident that will happen. Because, as the song by the Fugees goes, right? Ready or not, here I come, you can't hide. If you're not ready, it doesn't matter; PagerDuty or OpsGenie or VictorOps or whatever tool you use will call you anyway when you're on call. So you'd better be ready. So how can we be ready, right? The proactive approach after the fact, after an incident took place, looks something like this, in my opinion.
So first of all, on-call shift handoffs. I'm not sure this is something that's done at every company. Let's say I finished my shift. By the way, "shift" is the word I was looking for before: on-call shift. So after an on-call shift, there were several issues. If they were minor, then okay; but if there was something special or something recurring, I should document it in an on-call shift handoff, a summary that I post in my team's channel. Then the on-call after me can read what's going on, and that way he or she or they can be updated on what's happening in production. It will help them have a better shift of their own, because if they hit an issue, and the issue is basically recurring because I had it too, then they will know better how to handle it. So it's good for audit purposes, because it is documented in the Slack channel, but it's also good for your team members' success, because you want to help them do their job better and have a smoother shift.
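A handoff summary like this can be as lightweight as a formatted message for the team's channel. This sketch only builds the text; the field names are my own assumption, not a standard:

```python
def format_handoff(on_call, period, notable_issues, recurring):
    """Build an on-call handoff summary for posting to the team channel.

    notable_issues: short one-line descriptions of what happened.
    recurring: issues the next on-call should watch out for.
    """
    lines = [f"On-call handoff: {on_call} ({period})"]
    if notable_issues:
        lines.append("Notable issues:")
        lines.extend(f"  - {issue}" for issue in notable_issues)
    else:
        lines.append("Notable issues: none")
    if recurring:
        lines.append("Recurring (watch for these next shift):")
        lines.extend(f"  - {issue}" for issue in recurring)
    return "\n".join(lines)

print(format_handoff(
    "hila", "Mon-Tue",
    ["disk-full alert on db-1, cleaned logs, opened ticket"],
    ["service-x restart loop"]))
```

Even a template this small helps, because it makes "nothing notable happened" an explicit statement rather than silence.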
Postmortem notes: as I mentioned before, write them down as soon as possible, and even if there's no meeting, do a mental check, a retro with yourself, and see what you could have done better. New tasks: again, prevent the next incident. Do you have something in mind, based on what you saw in the incident, that could help stabilize the environment? Open a Jira or Monday ticket and fix it, to prevent the next incident. Modify alerts: maybe you saw some false-positive alerts, and I think we've all seen them in our careers, alerts that come up and after a couple of minutes get closed. So don't just leave that, right? And don't just wait for the next on-call to fix the alerts, because maybe they will wait for the next on-call, who will wait for the next on-call, and then it will never happen and we will all suffer from these alerts. So please fix them. Incident runbooks: I mentioned this before. If you don't have incident runbooks in place, please write them down, and update them along the way. This will help you have a smoother process, right? Because you're already prepared; you know what to do in a certain scenario and for certain issues.
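To make the runbook idea concrete, here is a sketch of one entry as structured data. The alert name, fields, and steps are hypothetical, but they capture the if-this-then-that judgment described earlier:

```python
# A hypothetical runbook entry for a "service down" alert.
RUNBOOK = {
    "alert": "service-x down",
    "steps": [
        {"check": "tail /var/log/service-x.log for OOM messages",
         "if_true": "increase the memory limit, then restart",
         "if_false": "restart service-x"},
        {"check": "did this alert fire more than twice this week?",
         "if_true": "stop, investigate the root cause, escalate to the owning team",
         "if_false": "close the incident, note it in the handoff"},
    ],
    "last_updated": "2023-05-01",  # keep this current, as the talk stresses
}

# Print the checks in order, as an on-call engineer would walk them.
for i, step in enumerate(RUNBOOK["steps"], 1):
    print(f"{i}. {step['check']}")
```

Whether this lives as a wiki page, YAML, or code matters less than that the judgment calls are written down before 4:00 A.M., not during.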
Automation: let's say you found some candidates for self-remediation, some issues that could be remediated automatically by a process or flow. If so, open a ticket and make it happen.
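A self-remediation candidate like a service that occasionally needs a restart could be sketched like this. The health check and restart actions are passed in as callables (hypothetical wrappers around a real probe and, say, a systemctl call), and the bounded retry means the band-aid escalates to a human instead of looping forever:

```python
import time

MAX_RESTARTS = 3  # after this many tries, a human should look at the root cause

def self_remediate(is_healthy, restart, sleep=time.sleep):
    """Try the known fix a bounded number of times, then escalate.

    is_healthy and restart are callables so the logic stays testable;
    in real use they would wrap a health probe and a service restart.
    """
    for _ in range(MAX_RESTARTS):
        if is_healthy():
            return "healthy"
        restart()
        sleep(5)  # give the service time to come up
    if is_healthy():
        return "healthy"
    # The band-aid stopped working: page a human, file a root-cause ticket.
    return "escalate"
```

The escalation branch matters as much as the fix: it is the automated version of "prioritize root cause over surface-level symptoms."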
And once an issue is handled, share the knowledge. I mean, people could really benefit from your line of thought and how you fixed things. And this kind of knowledge sharing is more in depth than an on-call handoff, because in a handoff you just write down a summary, whereas actually sharing the knowledge shows people how you figured things out: what was your line of thought, what was the flow you followed. It will really help others understand what's going on and come better prepared for incidents. So we covered the proactive approach
for after the fact, after an incident took place. Now let's discuss what you can do in your day-to-day that will help you come prepared for incidents. First of all, the on-call shift handoffs that I mentioned before: you should read all of them, okay? Not only the handoff written by the person whose shift is adjacent to yours, but your entire team's. It shouldn't take long; each one should just be a paragraph or so. This can help you understand what's going on in production when you're not there, when you didn't make the changes yourself. It's very important, because that way you will always be up to date with what's going on in production.
Escalation points of contact. You support several services at work, right? And you know the needed pieces of information related to your realm; my realm is infrastructure. That's great, but you should also know other realms as well, to have the full picture. Let's say there's an issue with service X. I've checked my side of things and I don't see any issue, but I know that John is the one handling X on the code side, the developer side. So I should escalate to him to check things on his end. Identifying services' escalation points on a day-to-day basis, and not only ad hoc when an incident occurs, will save time and money on incident management, and save someone else's hours of sleep, right? Because if I don't know who's handling the developer side of service X, then I need to wake my team member, wake my team leader, and ask, hey, who's responsible for that? Not nice, right? I can prevent that by coming prepared, already knowing that these services are handled by these people. That will save me time during the incident: instead of chasing my tail and figuring this out on the spot, I know exactly whom I can escalate the incident to. Understanding system architecture. So if
I know the weaker areas in the infrastructure, or maybe in the code, the vulnerabilities, the sensitive spots and blast radius scopes, that helps me understand the severity of an incident and what needs to be done, whether escalation or root cause analysis. So understanding and really learning how our infrastructure works, and where its vulnerabilities are, will really help us come prepared for any incident. Coupled with that is learning application flows, because that way we know the business impact if something bad happens: we have a service, and we know whether this service affects a lot of users or just a few. So business impact is very important here, and also for escalation purposes. If I know the application flows, if I know that this service communicates with this one, and goes to this, and goes to that, then I can do a root cause analysis, follow the flow, and see: okay, these logs look okay here, it's okay here, oh, here I have some issues. If I don't know the flow, I wouldn't be able to go down this path. So learning application flows is very, very important. Team members' tasks. So as we
know, production changes are made not only by me or by you; your team members also contribute to the changes that happen in production. And believe me, it would be easy for me to just lay low and deal only with my own tasks. But I'm responsible for production; I need to know what's going on, so I need to know what my other team members are doing and what changes they introduce to the environment. Because I'm responsible for the environment, and I need to know what's going on. So it's very important. And coupled with that: deployments or changes in production. Ask about the changes and their possible impact. And as I said on the previous slide, OpsGenie or PagerDuty doesn't care whether you did the deployments or the changes yourself; it will call you anyway if you're on call. So you'd better understand and know what happened in production, in order to handle incidents better.
And last but not least, be a go-to person. As they say, if you build it, they will come. If you are known to be proactive and to know what's going on in the system, people will come to you. You'll get push notifications, so to speak, and it will decrease your need to fetch updates on your own, because people will come to you. So there's that.
And I'll say this: in order to talk the talk and walk the walk when it comes to incident management, have your qualities in check (if you know you're going to be stressed out, work on that, and do the same for the other qualities), have the structured process in place, and be proactive to prevent the next incident from happening. And remember: fewer incidents means less downtime, less downtime means business success, and business success is eventually your success. So thank you so much. If you have any questions about incident management or any other SRE topics, I will be more than happy to help. Thank you so much.