Transcript
This transcript was autogenerated. To make changes, submit a PR.
I love TV series about firefighters. The firefighters' devotion and leadership skills inspire me. I know it's totally different, but I see DevOps teams as firefighters, maybe because they are first responders in production incidents and it seems they are heroes that save the day. Then come the developers to aid. Hi, I'm Einat Mahat, an R&D group leader at Sky, fighting production incidents for over a decade. Since I was a junior developer, I was drawn to production-intensive dev teams. Later on, while leading such teams, we had a lot of action and I felt like a firefighters' lieutenant, metaphorically speaking, of course.
And then it hit me: yes, we are great, we are true heroes, but some of the fires we actually set ourselves. This reminded me of the arsonist firefighter. I saw it in some episode, and yes, it's a real thing.
In a nutshell, it's a firefighter that sets fires
only to become a hero. I'm not suggesting that any of
you set fires on purpose or that you do it
to become heroes, but I think that
as developers, there are several things we can and should
do to prevent the next fire rather than setting it and putting it out. Fighting fires
for a living can lead to burnout and sometimes even
cause trauma. In a former group of mine, where we had major incidents too often, one of the team members still gets anxious every time he hears his phone ring.
Let's talk about a few ways that might help
us to prevent the next fire.
Monitoring is a crucial component of incident
prevention. The purpose of monitoring is to provide visibility
into the performance, availability, and the general health
of systems. Monitoring is an art: collecting and analyzing data from various sources and crafting the right dashboards that will have exactly the right data, enough but not too much, to provide visibility but not overwhelm us.
Investing in monitoring will be helpful for early
detection of issues before they become incidents,
enabling us to proactively avoid issues and
prevent incidents. Good monitoring can also help us during an incident: the teams can use the dashboards to quickly identify issues, resolve them, and minimize their impact.
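To make this concrete, here is a minimal sketch of what instrumenting a service for monitoring can look like, using the Python prometheus_client library. The metric names, labels, and port are illustrative assumptions, and your stack may use a different monitoring tool entirely.

```python
# Minimal monitoring sketch: expose request metrics that a monitoring
# system (e.g., Prometheus) can scrape and dashboards can be built on.
# Metric names and the simulated workload are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled",
                   ["status"])
LATENCY = Histogram("app_request_latency_seconds",
                    "Request latency in seconds")

def handle_request():
    with LATENCY.time():                       # record how long the work took
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(status=status).inc()       # count requests by status

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at :8000/metrics
    while True:
        handle_request()
```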
Monitoring is great, but the issues should be brought to us via alerts: push, not pull. Alerts can be a powerful tool for incident prevention and response. By configuring alerts correctly
to trigger when certain conditions are met, they provide
teams with early warning of potential issues and
enable them to take actions before those issues
become serious problems. Alerts typically generate
notifications, which may be delivered via
email, SMS, or a collaborative tool
like Slack. Some alerts will be sent to the responsible teams, while others will be sent to the engineer on call, depending on alert severity. To ensure that the person receiving the alert can take action even in the middle of the night, the alert should provide sufficient information. When it comes to alerts, same as in monitoring, the trick is to know the balance between too much and too little.
We want the alerts to detect the issues but not cause
alert fatigue. Furthermore,
having too many false alerts can make it harder
to take genuine alerts seriously.
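As an illustration, here is a sketch of the kind of alert we are after: a clear threshold, enough context to act on, and a push notification. The threshold, webhook URL, dashboard and runbook links are all assumptions, and in practice this logic usually lives in the alerting tool itself (Prometheus Alertmanager, Datadog, and so on) rather than in hand-rolled code.

```python
# Sketch of an alert that pushes (not polls) when a condition is met.
# The error-rate source and the Slack webhook URL are hypothetical.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder
ERROR_RATE_THRESHOLD = 0.05  # 5% error rate (assumed threshold)

def check_error_rate(current_rate: float) -> None:
    if current_rate <= ERROR_RATE_THRESHOLD:
        return
    # Enough information to act on, even in the middle of the night:
    # what happened, how bad it is, and where to look first.
    message = (
        f":fire: checkout-service error rate is {current_rate:.1%} "
        f"(threshold {ERROR_RATE_THRESHOLD:.0%}).\n"
        "Dashboard: https://grafana.example.com/d/checkout\n"
        "Runbook: https://wiki.example.com/runbooks/checkout-errors"
    )
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=5)

check_error_rate(0.08)  # would trigger a Slack notification
```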
A postmortem, also known as a lessons learned session or a debrief, is a common practice.
We discuss what happened in an incident or event,
good or bad, and learn from it.
The goal of a postmortem is to identify the
root cause of the incident, understand what
went wrong, and determine what can be done to prevent
similar incidents in the future. The culture around
it should be one of facts and learnings rather than finger
pointing. The postmortem process typically includes
three steps. The first is gathering data
about the incident, including the timeline of events, the impact
on users and business operation, and any other
relevant data. The second step is to identify
the root cause of the incident, what went wrong, and why,
and the third is creating action items
based on the findings and following up on them. By conducting
postmortems on a regular basis and implementing the
lessons learned from each incident,
teams can reduce the frequency and severity of incidents.
It's important to know that production incidents can still
happen even if you're not on call,
so it might be beneficial to read most
debriefs and learn from the experience of others.
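To show what the third step can produce, here is a small sketch of how postmortem action items might be captured so that following up on them is concrete. The fields and values are illustrative, and most teams would track this in their ticketing system instead.

```python
# Illustrative structure for postmortem action items: each item has
# an owner and a due date that can be checked later. Values are examples.
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False

action_items = [
    ActionItem("Add alert on queue depth", "dana", date(2023, 7, 1)),
    ActionItem("Document DB failover in runbook", "omer", date(2023, 7, 15)),
]

# Follow-up: surface the items that were agreed on but never closed.
overdue = [a for a in action_items if not a.done and a.due < date.today()]
for item in overdue:
    print(f"OVERDUE: {item.description} (owner: {item.owner})")
```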
Most of us know what a postmortem is, but there is also a premortem. It fits well when performing critical changes in production, but not only then. Before we go to production, we gather the items at risk and talk about how we can prevent them from catching fire. The technique is to imagine that a project
has already failed and then identify the reasons for
the failure. This approach helps teams
anticipate potential issues and develop mitigation
strategies before they become actual problems, preventing incidents. The process typically
involves three steps with the team and
stakeholders. One, imagine failure.
We are in production and there is a failure. What can it be?
Brainstorm potential issues and consider worst-case scenarios. Two, identify risks. What are the major risks, their potential impact, and their priorities? Three, develop mitigation
strategies. What could have been done to
prevent this, or mitigate the issues? By conducting a premortem,
teams can be better prepared to address potential issues
and reduce the likelihood of incidents or downtime.
It can also help teams build confidence
in their changes and improve the overall quality and
health of their systems. Rollout and rollback
plans are critical components of the development and deployment process,
designed to ensure that changes are implemented in
a controlled manner. The rollout plan usually includes
the order in which components will be updated,
the timing, and the required resources.
A good rollout plan will also identify potential risks, anticipate issues that may arise, and consider mitigations. There is a debate about whether or not everything should run as a gradual rollout, but rollback plans are often skipped. I think that
for critical changes, we should plan our exit
strategy. That's a rollback plan. A rollback plan
outlines the steps that will be taken to undo a
deployment and revert to a previous version of
the system. It should also include clear criteria
for when to initiate a rollback.
The purpose of these plans is to help ensure
that changes are implemented in a way that minimizes the risk of downtime. By having a clear and well-defined
rollout and rollback plan in place, teams can be
more confident in their ability to make changes while also
being better prepared to respond effectively when issues occur.
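As a sketch of what clear criteria for initiating a rollback can look like in a gradual rollout, consider the loop below. The stage percentages, the error-rate threshold, and the deploy and metrics functions are all hypothetical stand-ins for whatever your deployment tooling actually provides.

```python
# Sketch of a gradual rollout with an explicit rollback criterion.
# deploy_to_percentage() and get_error_rate() are hypothetical
# stand-ins for your real deployment and monitoring APIs.
import time

ROLLOUT_STAGES = [1, 10, 50, 100]   # percent of traffic (assumed stages)
MAX_ERROR_RATE = 0.02               # rollback criterion (assumed)

def deploy_to_percentage(percent: int) -> None:
    print(f"Routing {percent}% of traffic to the new version")

def get_error_rate() -> float:
    return 0.001  # would query real monitoring in practice

def rollback() -> None:
    print("Rolling back to the previous version")

def gradual_rollout() -> bool:
    for percent in ROLLOUT_STAGES:
        deploy_to_percentage(percent)
        time.sleep(1)  # in reality: a soak period at each stage
        if get_error_rate() > MAX_ERROR_RATE:
            rollback()  # the exit strategy, decided in advance
            return False
    return True

if __name__ == "__main__":
    gradual_rollout()
```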
Runbooks are a set of procedures we write for common things that might break, and for how to troubleshoot and tackle them. Runbooks are designed to help teams
perform tasks consistently and efficiently and
can be used by both developers and operation teams.
In general, runbooks contain step-by-step
instructions for carrying out common tasks and may
also include visual aids to help clarify complex
procedures. The main benefits of runbooks are that they
help ensure that tasks are performed consistently
and accurately, even in high pressure situations.
By creating and maintaining runbooks, teams can reduce the
risk of errors and increase the speed and efficiency
of their operations. Creating solid runbooks is helpful during an incident, but we can also use them to think about potential errors and prevent incidents.
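For illustration, a runbook entry can be as simple as the sketch below. The service, symptoms, and steps are made up; the point is only the shape: symptom, checks, actions, escalation.

```
Runbook: checkout-service returns 5xx errors (example only)
1. Check the checkout dashboard for error rate and latency spikes.
2. Check recent deployments; if one correlates, consider rollback.
3. Inspect application logs for repeated exceptions.
4. If the database connection pool is exhausted, restart the service.
5. If not resolved within 15 minutes, escalate to the on-call DBA.
```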
I believe in continuous improvement in life, at work, and specifically when talking about systems and incidents. Continuous improvement in this context includes regularly reviewing and refining our processes, tools, and systems to ensure that we
are continuously improving our ability to
prevent incidents and efficiently respond to them.
This involves conducting regular
retrospectives to identify areas for
improvement, implementing a process of regularly
reviewing and updating runbooks, monitoring and analyzing
incident trends to identify patterns and
root causes, and seeking feedback from users
and stakeholders to identify areas where
we can improve. By continuously improving our
processes and systems, we can help ensure
that we are always learning from
incidents and working to prevent them from happening again
in the future. We talked about
seven ways to prevent fires, and there are
many, many more.
Unfortunately, we will not be able to prevent
all fires, so let's talk about the most important thing during an incident: communication. During an incident, there is a lot going on at the scene.
There are many people that need to work together.
I use these three guidelines to ensure clear and effective
communication during incidents.
Have an incident communication plan.
Keep all stakeholders informed about the status,
technical people, management, customers and
others. Decide who will be responsible
for communicating and what information will be
shared. Assign clear roles
and responsibilities to individuals involved in
resolving the incident. This helps to ensure that everyone knows what is expected of them. Provide regular updates; this helps to keep everyone informed
about the status. Use clear and concise language.
Avoid technical terms when talking to nontechnical stakeholders. It is important to be transparent about the incident: you must communicate, update on your progress, and make sure everyone is aligned.
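As a small example of a regular, clear, and concise update, a status message might look like the sketch below. The incident, numbers, and format are entirely made up; the channel and cadence are just one option.

```
[Incident #1234] Checkout errors - update 14:30 (example only)
Status:  Mitigating. Error rate down from 8% to 2%.
Impact:  Some customers see failed payments; retries succeed.
Actions: Rolled back the 14:10 deploy; monitoring recovery.
Next update: 15:00, or sooner if the status changes.
```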
Well, we did not put out any fires. We did not even save a kitten stuck in a tree.
But from my experience fighting production incidents, called fires in this session, I can definitely say we can all prevent the next fire instead of setting it ourselves. Don't be an arsonist firefighter; be a true hero.
Thank you.