Transcript
This transcript was autogenerated. To make changes, submit a PR.
Getting real-time feedback on the behavior of your distributed systems, and observing changes, exceptions and errors in real time, allows you not only to experiment with confidence, but to respond instantly to get things working again.
For those maintaining reliable systems, today's focus will be on incident management, and more specifically, how customer impact can be minimized through incident management and postmortems. Before we start, I would like to introduce myself. I'm Ayelet Sachto. I'm a strategic cloud engineer, I'm also co-leading PSO efforts in EMEA, and currently I'm an SRE in GKE SRE London. Incident management is not new to me, as I have been living and breathing production at large scale for almost two decades now, most of them covering production on-call. Today I'll share a few things that can make the life of an SRE or an on-caller a bit easier. I know it would have made mine much easier. So, what is an incident?
Incidents are issues that are escalated because they are too big for you to handle on your own, and they require an immediate and organized response. Remember, not all pages become incidents. Incidents mean loss of revenue, customer data and more, which all comes down to impact on our customers and our business. We want to avoid having too many incidents, or incidents that are too severe, because we want to keep our customers happy; otherwise they will leave us. Before we drill down, let's recap our terminology. A service level indicator, an SLI, tells you at any moment in time how well your service is doing: is it performing acceptably or not? A reliability target for an SLI is called an SLO, a service level objective. It aggregates the SLI over time: over this window of time, this is my target, and this is how well I'm doing against it. Most of you are probably familiar with SLAs. A service level agreement defines what you are willing to do, for example provide a refund, if you fail to meet your objective. To achieve that, we need our SLOs, our targets, to be more restrictive than our SLAs.
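To make that concrete, here is a minimal sketch, with made-up numbers that are not from the talk, of how an availability SLI can be computed from event counts and compared against an SLO over a window:

```python
# Minimal sketch with hypothetical numbers: an availability SLI compared
# against a 99.9% SLO over a 28-day window.

GOOD_EVENTS = 9_990_214      # e.g. successful requests in the window
TOTAL_EVENTS = 10_000_000    # all requests in the window
SLO_TARGET = 0.999           # 99.9% availability objective

sli = GOOD_EVENTS / TOTAL_EVENTS              # how well the service is doing
error_budget = 1 - SLO_TARGET                 # allowed fraction of bad events
budget_spent = (TOTAL_EVENTS - GOOD_EVENTS) / (TOTAL_EVENTS * error_budget)

print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}, "
      f"error budget consumed: {budget_spent:.1%}")
```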
The scope we use to examine and measure user happiness is the user journey: your users are using your service to achieve a set of goals, and the most important ones are called critical user journeys.
We know that failures will happen, but how can we make them hurt less? How can we reduce the impact? To answer this question, let's look at the production lifecycle: the time when we are not reliable and our users are not happy. That time is made up of the time to detect, the time to repair, and the time between failures. So to reduce the impact, we can tackle each one of those parts: reducing the time to detect, reducing the time to resolve or mitigate the incident, and reducing the frequency of incidents, i.e. increasing the time between failures.
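As a rough illustration with hypothetical numbers that are not from the talk, the impact over a period can be modelled as the number of incidents multiplied by the time each one is user-visible, which is why all three levers matter:

```python
# Hypothetical yearly impact model: incidents_per_year * (TTD + TTR) minutes of impact.
def yearly_impact_minutes(incidents_per_year: int, ttd_min: float, ttr_min: float) -> float:
    return incidents_per_year * (ttd_min + ttr_min)

baseline = yearly_impact_minutes(incidents_per_year=12, ttd_min=15, ttr_min=45)  # 720 min
faster_detection = yearly_impact_minutes(12, 5, 45)                              # 600 min
faster_mitigation = yearly_impact_minutes(12, 5, 20)                             # 300 min
fewer_incidents = yearly_impact_minutes(6, 5, 20)                                # 150 min

print(baseline, faster_detection, faster_mitigation, fewer_incidents)
```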
So how can we do that? We do that with a combination of technology and human aspects, like processes and enablement. At Google we found that once a human is involved, an outage will last at least 20 to 30 minutes, so automation and self-healing systems are a good strategy: in general, they help reduce both the time to detect and the time to resolve. Let's zoom in on each one of those parts separately. The time to detect, also called TTD, is the amount of time from when an outage occurs to when some human is notified or alerted that an issue is occurring.
As part of drafting our SLOs, our reliability targets, we also want to do risk analysis and identify what we should focus on in order to identify issues and minimize the TTD. A few additional things we can do: align your SLIs, your indicators of customer happiness, as closely as possible to the expectations of your users, which can be real people or other services. In addition, our alerts need to be aligned with our SLOs, our targets, and we want to review them periodically to make sure they still represent our customers' happiness. The second thing is having quality alerts, measured using different measurement strategies; it's important to choose what works best for getting the data, whether from streams, logs or batch processing. In that regard, it's also important to find the right balance between alerting too quickly, which can cause noise and alert fatigue, and alerting too slowly, which may affect our customers.
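One common way to keep alerts aligned with SLOs while balancing speed against noise, described in the SRE workbook though the talk does not prescribe a specific method, is multiwindow burn-rate alerting: page only when the error budget is being consumed quickly over both a long and a short window. A minimal sketch:

```python
# Sketch of a multiwindow burn-rate check for a 99.9% SLO.
# The error rates passed in are assumed to be the fraction of bad events
# measured over a 1-hour and a 5-minute window.

SLO_TARGET = 0.999
BUDGET = 1 - SLO_TARGET  # 0.1% allowed error rate

def burn_rate(error_rate: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    return error_rate / BUDGET

def should_page(err_1h: float, err_5m: float) -> bool:
    # Page when the budget burns ~14x too fast over 1h AND the last 5 minutes
    # confirm it is still happening (reduces noise from short blips).
    return burn_rate(err_1h) > 14.4 and burn_rate(err_5m) > 14.4

print(should_page(err_1h=0.02, err_5m=0.03))    # True: fast, sustained burn -> page
print(should_page(err_1h=0.02, err_5m=0.0005))  # False: already recovered -> no page
```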
Note that noisy alerts are one of the most common complaints we hear from operations teams, traditional DevOps and SRE alike. Another recurring issue is having alerts that are not actionable. We want to avoid alert fatigue, and we want pages only for things that need immediate action. To achieve that, we want only the right responders to get the alerts: the specific owning team. One of the questions that most often follows is: if we only page on things that require immediate action, what do we do with the rest of the issues? Remember, we have different tools and platforms for a reason. Maybe the right platform is a ticketing system, or a dashboard. Maybe we need only the metric, for troubleshooting and debugging in a pull mode.
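A minimal sketch of that routing idea, with illustrative severities and channels that are not from the talk: page only on actionable, urgent issues and send everything else to a ticket queue or a dashboard.

```python
# Illustrative routing of a detected issue to the right platform.
def route(needs_immediate_action: bool, needs_human_eventually: bool) -> str:
    if needs_immediate_action:
        return "page the on-call for the owning team"     # urgent and actionable
    if needs_human_eventually:
        return "open a ticket for the owning team"        # actionable, not urgent
    return "record the metric / show on a dashboard"      # pull-mode troubleshooting data

print(route(True, True))    # page
print(route(False, True))   # ticket
print(route(False, False))  # dashboard only
```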
The second part is TTR, the time to repair. It begins when someone is alerted to the issue and ends when the issue is mitigated. The key word here is mitigated: this doesn't mean the time it took you to submit code that fixes the problem; it's the time it took the responder to mitigate the customer impact, for example by shifting traffic to another region.
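Mitigation might look like draining the unhealthy region rather than landing a code fix. The client, method and region names below are hypothetical placeholders, just to show the shape of such a runbook step:

```python
# Hypothetical sketch: mitigate by shifting traffic away from an unhealthy region.
# TrafficManager, set_weights and the region names are placeholders, not a real API.

class TrafficManager:
    def set_weights(self, weights: dict[str, int]) -> None:
        print(f"routing weights updated: {weights}")

def drain_region(tm: TrafficManager, unhealthy: str, regions: list[str]) -> None:
    healthy = [r for r in regions if r != unhealthy]
    share = 100 // len(healthy)
    weights = {r: share for r in healthy}
    weights[unhealthy] = 0          # stop sending users to the failing region
    tm.set_weights(weights)         # impact is mitigated; root-causing can happen later

drain_region(TrafficManager(), unhealthy="region-b",
             regions=["region-a", "region-b", "region-c"])
```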
Reducing the time to repair is mostly about the human side that I mentioned: we want to train the responders, have clear procedures and playbooks, and of course reduce the stress around on-call. So let's expand on each. Unprepared on-callers lead to longer repair times, so you want to have on-call training, like disaster recovery testing per on-caller, shadowing, or running wheel-of-misfortune exercises. Remember that on-call can be stressful, so having a clear incident management process can reduce that stress, as it clears up any ambiguity and clarifies the actions needed. For that purpose, let me briefly introduce how we can manage an incident. Google's incident management protocol, IMAG, is a flexible framework based on the Incident Command System (ICS) used by firefighters and medics.
It's a structure with clear roles, tasks and communication channels. It establishes a standard, consistent way to handle emergencies and organize an effective response. By using such a protocol, we reduce the ambiguity, make it clear that it's a team effort, and reduce the time to repair. A few other things you can do: prioritize and set aside time for documentation, and create playbooks and policies that capture procedures and escalation paths. Playbooks don't have to be elaborate at first; we want to start simple and iterate, which provides a clear starting point. A good rule of thumb that we advise our customers, and that you might be familiar with, is "see it, fix it", and letting new team joiners update the playbooks as part of their onboarding process. Remember, if the responders are exhausted, that will affect their ability to resolve the issue. We need to make sure that shifts are balanced, and if not, use data to understand why and reduce toil.
We also want to have as much quality data as possible. We especially want to measure things as closely to the customer experience as possible, as that will help us troubleshoot and debug the problem. For that, we need to collect application and business metrics, and to have dashboards and visualizations focused on customer experience and critical user journeys. That means dashboards that are aimed at a specific audience, with a specific goal in mind: a manager's view of SLOs will be very different from a dashboard that needs to be used for troubleshooting during an incident.
The third part is the time between failures, which begins at the end of one outage and ends at the beginning of the next. Other than architectural refactoring, addressing the failure points that come out of the risk analysis, and process improvement, what else can we do? We want to avoid global changes and adopt advanced deployment strategies, considering progressive and canary rollouts over the course of hours, days or weeks. This allows you to reduce the risk and to identify an issue before all your users are affected. These can be integrated into a continuous integration and delivery pipeline, with automated testing, gradual rollouts and automatic rollbacks. CI/CD saves engineering time and reduces customer impact; it allows you to deploy with confidence.
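A minimal sketch of that idea, where the stage sizes, thresholds and hook names are illustrative rather than from the talk: roll out to progressively larger fractions of users, check an error signal at each stage, and roll back automatically if it regresses.

```python
import time

# Illustrative progressive rollout with an automatic rollback gate.
# deploy_to_fraction, error_rate and rollback stand in for your CI/CD hooks.

STAGES = [0.01, 0.05, 0.25, 1.0]    # canary -> 5% -> 25% -> everyone
ERROR_THRESHOLD = 0.001             # abort if more than 0.1% of requests fail
SOAK_SECONDS = 3600                 # let each stage soak before widening

def progressive_rollout(deploy_to_fraction, error_rate, rollback) -> bool:
    for fraction in STAGES:
        deploy_to_fraction(fraction)
        time.sleep(SOAK_SECONDS)        # hours, days or weeks in practice
        if error_rate() > ERROR_THRESHOLD:
            rollback()                  # only a small slice of users was exposed
            return False
    return True
```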
Another unsurprising point is having robust architectures: redundancy, no single points of failure, and graceful degradation methods. We should also adopt development practices that foster a culture of quality and create an integrated process of code review and robust testing. Remember, it's all about resilience.
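As a small illustration of graceful degradation, with a hypothetical recommendation service and cache that are not from the talk: when a non-critical dependency fails, serve a degraded but useful response instead of an error.

```python
# Hypothetical sketch: fall back to cached or default content when a
# non-critical dependency is unavailable, instead of failing the whole page.

def get_recommendations(user_id: str, recommender, cache) -> list[str]:
    try:
        return recommender.fetch(user_id, timeout_s=0.2)    # fast path
    except Exception:
        cached = cache.get(user_id)
        if cached is not None:
            return cached                                    # slightly stale, still useful
        return ["most-popular-item-1", "most-popular-item-2"]  # generic fallback
```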
So in addition to training our responders and running disaster recovery exercises, we also want to practice chaos engineering, finding issues before they find us, by introducing fault injection and automated disaster recovery testing.
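A tiny sketch of the fault-injection idea; the decorator and failure rate are illustrative, and dedicated chaos tooling would normally drive this. The point is to inject failures into a dependency call in a controlled experiment and verify the system degrades the way you expect.

```python
import random

# Illustrative fault injection: wrap a dependency call so that a controlled
# fraction of calls fail, then observe whether fallbacks and alerts behave as expected.

def inject_faults(failure_rate: float):
    def decorator(func):
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise TimeoutError("injected fault for chaos experiment")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.1)   # fail ~10% of calls during the experiment
def call_payment_service(order_id: str) -> str:
    return f"payment ok for {order_id}"   # stand-in for the real dependency call
```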
Lastly, we want to learn from those failures and make tomorrow better than today. For that, our tool is the postmortem. A postmortem is a written record of an incident, and it should capture the actions needed. At Google, we found that establishing a culture of blameless postmortems results in more reliable systems and is critical to creating and maintaining a successful SRE organization. For that, it's important to assume good intentions and to focus on fixing the systems that allowed the incident to happen, not the people. Implementing postmortems starts with educating the team about blameless postmortems, running postmortem exercises, and crafting a policy, so that we learn from incidents and effectively plan work to prevent them from happening again in the future. We touched on many things we can do to reduce the impact, both from the technical and the human aspects. But how do we know what to focus on? We want to be data-driven in our decisions, and we want to prioritize what is most important for us. That data can come from the risk analysis process and the measurements I mentioned before, but we also want to rely on data collected from postmortems. Once we have a critical mass of postmortems, we can identify patterns. It's important to let the postmortems be our guide, as our investment in failures can lead us to success. With our customers, we encourage them to create a shared repository of postmortems and to share them broadly across internal teams.
We have a lot of public resources, developed by different teams at Google, where you can learn more about incident management and reducing customer impact. We of course have the books; we have a Coursera course; The Art of SLOs, which was developed by the CRE team; and blog posts, talks and webinars developed by the DevRel team. I've curated a few resources to get you started, and in the final link you can find a publicly available breakdown of resources by level and resource type, including the cheat sheet.
Finally, there is a wonderful gift you can give any presenter: the gift of feedback. As we are virtual, and because we are all about data, I will kindly ask you to go to the feedback link and share your take. I was Ayelet Sachto, and you are welcome to connect with me on Twitter or LinkedIn. I will soon be sharing a new white paper on incident management. Thank you for listening, and enjoy the rest of Conf42.