Transcript
Hi everyone, my name is Amir Shaked, and I'm excited to talk to you about a very important part of incident management: what you do afterwards. This is how we can use effective debriefing to create both a learning culture and better stability in our systems over time.
Every production environment fails; we too see failures, some of them severe, all the time. The question is, what do you do afterwards?
Today I'm going to cover our journey through a process of change towards creating a healthy and supportive learning culture by taking those failures and building on them. Every production system I've worked on has experienced issues, sometimes due to code changes, other times due to third-party providers having their entire infrastructure crash. That led us to constantly seek ways to improve how we do things, how we protect the system, and, at the end of it, how we provide a better service to our customers.
When I joined the company, we set a destination: we wanted rapid deployment and rapid changes, being able to provide the most adequate and up-to-date solution. In our case, it's a world of moving target defense, where the scope of features changes all the time due to the threat actors. So being able to deploy changes quickly is a major factor in our ability to provide a competitive product. In fact, oftentimes a good DevOps culture can be the differentiator and act as a competitive edge for your company. We wanted to have zero downtime, and to have errors or mistakes happen only once. We would have a chance to learn from them, but never see them twice.
However, the starting point we got off to wasn't that bright. We saw a few repeating issues: for example, minor things causing failures in production that shouldn't have been happening, due to code changes or configuration changes, and being too prone to incidents caused by the underlying cloud environment we were using, which affected our stability. Those two factors were very concerning at the time, when we looked at where we were and how fast we expected to grow and scale, since what is a minor issue today can potentially be catastrophic, and the future comes in very fast when everything is growing rapidly. Now, while those two were concerning, the real issue, the last item, was what was preventing us from improving the most, and that was fear of judgment.
Every time we dove into trying to understand an issue we had, there was pushback. Why do we ask so many questions? Do we not trust people? Why don't we just let them do their job? They know what they're doing. And that's a problem. If you have team members who are afraid, who feel they're being judged, or who are just generally insecure in their work environment, they are going to underperform. And as a team, you will not be able to learn and adapt.
With that starting point and the destination we had in mind, we set off to establish a new process within the team for how we analyze every kind of failure: what we do during the analysis, how to conduct the debrief we're talking about, and the follow-up, which is just as important. Now, why am I focusing so much on the process? The reason is that a bad system will beat a good person every time, just like the great quote says. Assuming you have the right foundation of engineers on your team, if you fix the process, good things will follow. So I'll start with an example of an issue that we had.
We had a customer complaining that something was misbehaving in their environment, and they thought it might be related to us. So they called support. Support tried to analyze and understand it, and after a while they realized they didn't know what to do with it. So they paged the engineering team, as they should have done. Now, the engineering team woke up in the middle of the night, since they are in a different time zone. They worked to analyze what was going on. They found the problem. They fixed the problem, with a bit of resentment at having to do it in the middle of the night, then went back to sleep and moved on to other tasks the next day. Now, if this is the end of it, we are certain to experience similar issues again and again, from other similar root causes, in the future.
So we did a debrief, and I'll explain in a bit how we did it and what exactly we asked, but apparently code was deployed into production by mistake. How can that be? An engineer merged their code into the main branch. It failed some tests and it was late, so he decided to leave it as is and keep working on it tomorrow, knowing it wouldn't be deployed. What he didn't know was that there was a process, added earlier that week by a DevOps engineer, that automatically deployed that microservice when there was a need to auto-scale. And that night we had a significant increase in usage, which automatically spun up more services using the new auto-scale feature. And those came up with the buggy code.
Therefore, we can ask: why is this happening? What can we do better? How can we avoid seeing such an issue, and any potentially similar issues, in the future? The way to do that is to set aside time to analyze after the fact and to look for the actual root cause that initiated this chain of events, so lessons can be taken and learned from it.
Back to our example. We can focus a lot on why the buggy code was merged into production, or why the auto-scale was added and different engineers didn't know everything that was changing. But if we focus too much on why somebody did something, we will miss the bigger picture, which is: wait a minute, the entire deployment process was flawed. How can it be that merging code into production doesn't ring a bell that it's going to be deployed? And when I say merging into production, I mean merging into the main branch. Every branch and repository name should have some kind of meaning, and that was the root cause in this case. So if you fix the process, and again, going back to our example, align all engineers on the idea that the main branch always equals production, regardless of whether it's automatically deployed or not, that will change the way they approach merging that code and how it's tested, and they won't leave buggy code in the main branch. So if you fix the process, and don't over-judge why a specific employee did or didn't do something, since they were just trying to do their job, you will have an improvement later on.
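To make that process fix concrete, here is a minimal sketch of the idea, assuming a deployment or scale-up step that can ask the CI system about the state of main. The function names are placeholders for illustration, not our actual pipeline.

```python
# Illustrative sketch: main must always be deployable, so anything that
# spins up new instances (including auto-scaling) should check the latest
# test status of main before using a new image.
# The functions below are hypothetical placeholders, not a specific CI API.

def latest_main_build_passed() -> bool:
    """Placeholder: query your CI system for the test status of main's HEAD."""
    return False  # treat main as not deployable until CI says otherwise

def deploy_new_instance(image_tag: str) -> None:
    """Placeholder: hand the image to your orchestrator or auto-scaler."""
    print(f"deploying {image_tag}")

def scale_up(image_tag: str, last_known_good: str) -> None:
    # Even an automatic scale-up event goes through the same gate:
    # if main's tests are red, fall back to the last known-good image
    # instead of shipping whatever happens to sit on main.
    if latest_main_build_passed():
        deploy_new_instance(image_tag)
    else:
        deploy_new_instance(last_known_good)

if __name__ == "__main__":
    scale_up(image_tag="service:main-abc123", last_known_good="service:v1.4.2")
```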
So how do you learn from the incident? There are four steps. It starts with the incident, and the more mature you become as an organization, the more the team will create an incident out of supposedly minor things, just to follow up and learn from them. You provide the immediate resolution to the issue, and that will obviously be different for every issue. Then, 24 to 72 hours afterwards, really depending on how much time people had to sleep or whether there was a weekend in the middle, you're going to do the debrief, and we're going to talk about what exactly happens within that meeting in the next slide. And then, two weeks after the debrief is conducted, we do a checkpoint to review the action items that came from it and make sure things were incorporated. This is especially important for the immediate tasks that came out of it, and it's a follow-up to see that they were implemented right away. Now, what is so special about conducting a debrief? We need to remember it's not just a standard retrospective, since it usually follows an incident that may have been very severe or had a lot of impact. Tensions are high, people are worried or concerned, maybe even judgy. So we need to keep that in mind when we're going into the meeting. When do we do the debrief? Every time we think the process or the system did not perform well enough. So any incident is a good case to pick and try to learn from.
We ask a lot of questions. The first question I like to ask is: what happened? Let's have a detailed timeline of events from the moment it really started, not just the moment somebody complained or the customer raised the alarm, but from the moment the issue started. Going back to our example, that's when somebody merged the code. That's when the story really started to unfold that day. Or maybe it's when the third-party provider failed, or something changed in the system. Try to go back to the most relevant starting point, not the entire history, obviously.
What is the impact? This is also a very important factor in creating a learning environment: trying to understand the full impact helps convey a message to the entire engineering team. Understanding the impact in cost, customers affected, or complaints, and getting as full a scope as you can, is vital to helping everyone understand why we're delving into the problem and why it is so important. It's not just that they woke up or that we disturbed their regular work schedule. It really brings everyone together on why it is important to find the root cause and fix it.
Now, after we have the story and the facts, we want to start to analyze and brainstorm how to handle things better in the future. The first two questions I like to ask here are about time. Did we identify the issue in under a certain amount of time? Let's say five minutes. Why five minutes? It's not just an arbitrary number; we want to have a specific goal for how fast we do things, and having that in mind helps us understand whether we achieved it or not. So, did we identify the issue in under five minutes? Sometimes we did, sometimes we didn't. Did we fix the problem in under an hour? Completely fix it? Maybe we fixed it in under ten minutes. Did we need to do anything at all? Maybe it was resolved completely automatically and there was no point in us trying to analyze anything. Once we answer no to any of these questions, the follow-up should be: okay, now that we understand the full picture, what do we need to do? What do we need to change? What do we need to develop so we'll be able to answer yes to those two questions of identifying and fixing?
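To make those questions concrete, here is a minimal sketch of a structured debrief record built around them. The field names are illustrative and not taken from our actual template.

```python
# Illustrative sketch of a structured debrief record built around the
# questions in this talk. Field names are made up for the example.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Debrief:
    what_happened: str                 # detailed timeline, from when the issue really started
    impact: str                        # cost, customers affected, complaints
    identified_in_under_5_min: bool    # did we detect it within the goal?
    fixed_in_under_1_hour: bool        # did we resolve it within the goal?
    action_items: List[str] = field(default_factory=list)  # what to change so both answers become "yes"

d = Debrief(
    what_happened="Failing code merged to main; auto-scale deployed it overnight.",
    impact="One customer affected; engineer paged in the middle of the night.",
    identified_in_under_5_min=False,
    fixed_in_under_1_hour=True,
)
d.action_items.append("Gate deployments on main's test status.")
```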
Now, this part, while seemingly simple, led to a drastic culture change over time. It set a framework in a way that helps convey to everyone what the most important thing is and what we should focus on: the process and the system. It's not about anyone specific. Whoever caused the incident is irrelevant; tomorrow it could be somebody else. So it's the repetition, and the specific questions and the way they are constructed, that really help drive a process of change and learning.
Now, any culture change takes time. It's never simple. We had some things we had to resolve along the way, and I already mentioned a few of those throughout this talk, but I want to jump into a few more. First is lack of trust. You have a new manager on board, coming in, trying to instill a new process, trying to change the culture. It takes time. The lack of trust could be in the process, or in the questions. People could ask themselves: is there a hidden agenda behind that? Maybe it's just another way to play the blame game, in a different format. Now, this can be completely avoided and resolved if we do it properly and consistently.
Another thing that can happen when you're trying to understand a problem is people saying he or she did something and they are at fault, going into the blame game. When you see that happening, you need to stop that direction of discussion and put everybody back on track with what you are trying to achieve and the specific questions we are asking. It's not why they did something; it's what happened and what do we need to change.
Not following up on action items is something that can be really annoying. We do the process, we do the review, we set up action items to fix everything and get better, and then you have the same problem all over again a few weeks or months later. How did that happen? Basically, we set the action items, but nobody followed up on them. The resolution there can be very simple. We did the debrief, and then we set a checkpoint every two or three weeks, or whatever time frame is relevant for you, to make sure those items get handled. Also, what we specifically did was label each Jira ticket with "debrief", so we can also do a monthly review of all the debrief items to see what was left open and whether it's still relevant. Maybe something can be moved or pushed further down the backlog. But it gave us the ability to track whether things were actually being followed up on or not.
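As a small illustration of that kind of tracking, here is a sketch using the community jira Python package; the server, credentials, and label here are placeholders, not our actual setup.

```python
# Illustrative sketch: list open tickets labeled "debrief" for the monthly
# review, using the community "jira" package. Server URL, credentials, and
# the label name are placeholders for the example.
from jira import JIRA

jira = JIRA(server="https://example.atlassian.net",
            basic_auth=("reviewer@example.com", "api-token"))

open_debrief_items = jira.search_issues(
    'labels = debrief AND statusCategory != Done ORDER BY created ASC'
)
for issue in open_debrief_items:
    print(issue.key, issue.fields.summary, issue.fields.status.name)
```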
Another critical move we made was implementing proper communication on a wide scale. You need to make sure everybody knows that there was a debrief. We were publishing all our debriefs internally, very widely inside the company, exposed to everyone, with the details of what happened and what we are going to do to make it better. This helps bridge the gap of trust and shows that everything is very transparent and visible while we try to fix problems. We also saw that if we're not asking the right questions during the process, we might be narrowing in too much on the problem, when we should look more broadly at how things could be connected and relevant in a bigger picture, not just focusing on smaller things. Again, going back to the example that I gave: not focusing only on the repository that was changed, but asking, do we have other repositories? Is this a global issue across every repository that we manage?
Now, there are four main things I would like you to take from this session on how to conduct a debrief, besides the questions and the repetition. The first is: avoid blame. Avoid it altogether, and if you see blame starting to happen within the meeting, you have to interfere and stop it, politely. Always be calm, but you need to stop it, to make sure the meeting stays on track and goes the way you want it to. Because if there is a vision of how it's going to happen, then later on, once you've instilled the change, it will happen on its own, without you even needing to be involved in the process and those meetings after a while.
Go easy on the why questions. There is the model of the five whys. It's a risky one, because when you're trying to understand why somebody did something, the more you dive into it and the more times you ask why, the more you might create a sense of resentment or self-doubt in your employees. It can sound like you're being critical and judging them for how they behaved and why they were doing certain things.
Be consistent. This is crucial. The key to creating a process and proving you really do mean what you say is consistency: in the process, the questions, and everything else. And, like I said, keep calm. It's always important, especially when you're looking at things that failed, to stay calm and show there is a path forward. It always helps in creating a better environment for change.
Now, these are some of the most notable learnings we had from those processes over the years. I'll touch briefly on a few of them, but they really are the highlights of the action items that we learned along the way. Also, you can scan the template of our debrief format at this URL if you want to see it, and I'll go over the examples briefly.
The first one: obviously, humans make mistakes. You need to fix the process and not try to fix people. That won't work; they work hard and they're smart, but everybody makes mistakes. So assume mistakes can happen, and fix the process to protect against potential mistakes. Gradual rollout, while it often appears to be some holy grail of microservices and large-scale production systems, is a great tool, but it's not a silver bullet and it will not resolve everything. There are a lot of potential problems that you will still miss even when you have gradual rollout. So don't try to fix everything with this hammer.
Establish crisis mode processes. This is very important. There are levels of issues, and when the shit hits the fan, you want to know who to call, who to wake up, and how things should operate. Having those crisis mode processes defined ahead of time really helps everyone understand what they should do in the moment. Feature flagging is really helpful in terms of handling errors quickly. One of the things you can do is disable a certain feature instead of rolling out or rolling back an entire system, especially when you have thousands of Docker containers. Flipping a feature flag and changing the config on all of them can be much faster than redeploying thousands of servers.
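As a minimal sketch of the idea, assuming flags are delivered to containers through something as simple as environment variables managed by your config system; the flag name here is made up for the example.

```python
# Illustrative sketch: a feature flag read from an environment variable that a
# config-management system could update on every running container. Flipping
# the flag disables the code path without redeploying anything.
import os

def feature_enabled(name: str) -> bool:
    # Convention assumed for the example: FEATURE_<NAME>=on/off in the environment.
    return os.environ.get(f"FEATURE_{name.upper()}", "off") == "on"

def handle_request() -> str:
    if feature_enabled("new_scoring_model"):   # hypothetical feature name
        return "result from new code path"
    return "result from old, known-good code path"

if __name__ == "__main__":
    os.environ["FEATURE_NEW_SCORING_MODEL"] = "on"
    print(handle_request())
```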
Now, we also try to avoid replacing something when we need to change it, since a lot of components are loosely coupled. Every time we want to change something, we look into whether it is really one change or a dual change. If you're adding and subtracting at the same time, you're actually doing two changes at once. So try to limit that: add something, see that everything works, subtract later, and see that everything still works. Splitting those can really help you focus and narrow down potential issues. Looking for the 10x breaking point is also very important: looking ahead, trying to understand when something can fail and what the breaking point of that specific part of the system will be.
And last but not least, treating config as code. Configuration, metadata, everything that can affect the system: if you treat it as code, and add the PR reviews, the deployment processes, and the rollback processes around it, it really helps stabilize everything around the system.
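A minimal sketch of one piece of that, assuming configuration lives in the repository as JSON and a check like this runs on every PR; the file name and required keys are invented for the example.

```python
# Illustrative sketch: a CI check that validates a config file on every PR,
# so configuration changes go through the same review/rollback flow as code.
# File name and required keys are assumptions for the example.
import json
import sys

REQUIRED_KEYS = {"max_connections", "timeout_seconds", "feature_flags"}

def validate_config(path: str) -> list:
    errors = []
    with open(path) as f:
        config = json.load(f)          # a malformed file fails loudly here
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if config.get("timeout_seconds", 0) <= 0:
        errors.append("timeout_seconds must be positive")
    return errors

if __name__ == "__main__":
    problems = validate_config(sys.argv[1] if len(sys.argv) > 1 else "service-config.json")
    if problems:
        print("config validation failed:", *problems, sep="\n  ")
        sys.exit(1)
    print("config OK")
```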
Thank you for listening. It was a pleasure talking about this, and I hope I gave you something to work with and a new process that you can use for your production environment. Feel free to ping me on any of these mediums and ask questions about the subject.