Transcript
Hi. This session is titled Get Ready to Recover with Reliability Management. I'm Jeff Nickoloff. I'm currently working with Gremlin, but I have 20 years in industry. I've worked with some of the largest organizations on the planet, and I've been working with mission critical, business critical systems for a very, very long time. Some of those companies include Amazon, Venmo, and PayPal, and several others.
In my time working on those critical systems,
I've been involved with hundreds of incidents,
both as a person on call,
as well as someone who's responsible for communicating across an organization, running incidents, communicating with customers, digesting and translating between engineering information, technician information, and customer impact. And that's a
long way of saying I've seen a lot of different kinds of incidents,
but I think the type of incidents that bring us together, that really drive us all to be interested in a session like this, in a conference like this, are really those with the most pain. And I don't necessarily mean those with the greatest customer impact, although those are very important.
I'm talking about situations that I like to call scrambling
in the dark. Now, this might be literally in the dark, where it's the middle of the night and you're tired, but it doesn't have to be. It might also mean situations where you don't know, your customer doesn't know, your technicians don't know, no one's sure, you're having communication problems, miscommunication. You have to do just-in-time research. Scrambling in the dark might look like your technicians trading dashboards like trading cards. There's an increased desperation to the situation.
And those are the moments where the customer impact is one thing, but it's really those moments that create a bit of a crisis of conscience, or rather a crisis of confidence: for yourself, for your team itself, for your business partners, and in your engineering organization. People begin to ask the hard question of, can I rely on our systems?
And many, many uncomfortable questions fall out
of these times where we're scrambling in the dark.
And so this is a problem we need to solve as much as, and with the same urgency as, the things we typically talk about, like time to recovery or time to response, those somewhat more concrete metrics. But in my experience, more often than not, these have similar solutions. It really comes down to preparation. You need to get ready for the incident before the incident. You're going to have incidents; that is a given.
And especially for systems that undergo regular
change, high velocity change, either in the system itself, the system
design, or in the business context in which you're operating the
system, either side of the coin can
change a system in ways that are difficult to anticipate.
But I don't want to be another one of those people who stands up and says, well, you need to be better prepared, or just be better prepared. Preparation is important, but it's important not to minimize the level of effort and the expense that goes into preparation. Preparing for incidents is nontrivial.
As soon as you dive into it, you have to ask yourself,
what does it mean to be prepared? What level of readiness
do we need and what do we need to prepare for?
And the level of effort increases with the complexity
of your system. The number of individual components of the system,
the number of people on your team, the number of teams you have,
the different ways that your system might interact with other systems
or your partner systems, your upstream dependencies,
or with your customers. How many different ways does your
system interact with your customers? How many different ways can
it fail? And that's a hard story to tell when you're asking for funding to invest in programs so that you can be better prepared, so that your teams can feel better prepared, so that you can be more confident before, during, and after an incident. Because this is table stakes for getting to those places where you can begin to talk about time to recovery.
It all starts with the people and understanding the space.
And it's critical not to minimize the
level of effort it requires to be better prepared.
So when we talk about preparation, your investment really falls into two categories. You have your detective controls: identifying when your system is failing, potentially identifying what parts of your system are failing, and potentially identifying a likely resolution. That all falls into a world that we typically think of as automation. Automation looks like programs or systems that provide continuous monitoring and automated recovery. Things like resilience platforms, like Kubernetes: this is specifically what Kubernetes is intended to provide. Automated release control, automated rollback control, automated recovery. If a pod or process fails, it will automatically reconcile the desired state back to having it run.
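To make that reconciliation idea concrete, here is a minimal conceptual sketch of the desired-state control loop pattern that platforms like Kubernetes implement. It is not Kubernetes code; the hook names (get_desired_replicas, get_running_replicas, start_replica, stop_replica) are placeholders for whatever your platform actually provides.

```python
import time

def reconcile(get_desired_replicas, get_running_replicas, start_replica, stop_replica):
    """One pass of a desired-state reconciliation loop (conceptual sketch only)."""
    desired = get_desired_replicas()   # e.g., the replica count declared in a deployment spec
    running = get_running_replicas()   # what is actually alive right now
    if running < desired:
        for _ in range(desired - running):
            start_replica()            # automated recovery: replace failed processes
    elif running > desired:
        for _ in range(running - desired):
            stop_replica()             # converge back down after a rollback or scale-in

def control_loop(get_desired_replicas, get_running_replicas, start_replica, stop_replica,
                 interval_seconds=10):
    """Continuous monitoring: re-check and repair drift on a fixed cadence."""
    while True:
        reconcile(get_desired_replicas, get_running_replicas, start_replica, stop_replica)
        time.sleep(interval_seconds)
```

The point of the sketch is the shape of the investment: a declared desired state, a monitor, and an automated corrective action running continuously.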
Those types of investments in automation can be very powerful, but they can also end up being very tool or stack dependent. And this makes them a little bit more fragile, and it takes a little bit more engineering effort to pursue robust solutions over the lifetime of your team and your product and your company. They need a little bit of love, and that's okay. These are very powerful tools, but they are expensive to implement, maintain, and continuously improve.
They're not bad. They're critical. Right. The other side of this is getting your team on autopilot. And what I mean by that is bringing a high degree of consistency into your incident response: the skill set and context that your technicians bring, a consistent and logical way of following through and problem solving the incident, a fairly deterministic and consistent set of remediation options. How do we recover? Getting that on autopilot.
And when you're not on autopilot, what that looks like is incident responders and technicians with a high degree of variation in their readiness to handle the incident. Some people understand some systems more than others. Some people are more familiar with recent changes to a system than others will be. Sometimes people have different problem solving responses, different problem solving workflows. At other times, different people will be familiar with different sources of truth. This might look like dashboards.
It might look like awareness of specific alarms, or maybe non-alarming monitors that have also been set up, that are in place to try to help triage and identify a path to recovery. There is a missed opportunity if your team can't use them with consistency. So when I say getting on autopilot,
I mean bringing a high degree of readiness
to the people on your team, making sure that
they're aware, making sure that they understand the systems and how
those systems fail, making sure that they understand the tooling and everything else that is available to them,
and making sure that they get reps, making sure that they get practice,
making sure that they've seen the various kinds of
failure before they show up in
an incident.
This is an investment, and a regular investment, into the human side of things. But either way, you're going to end up investing in both of these things. The question is what to spend on each side of these things, and then also identifying what things to prepare for, which kinds of incidents to prepare for. Most systems can become quite complicated quite quickly with the number of dependencies,
the number of ways things can break,
and under which kinds of conditions different parts of the
system may break or may need different kinds of love in
order to recover.
In some cases, you might be asking yourself, what can we change about the system before an incident, to either reduce the probability of an incident or to speed recovery?
But again, coming back to the complexity
and the level of effort that goes into just preparing,
it's going to be very difficult to prepare for everything.
That's a very long tail that we'll all have
to be chasing. So the question really comes down
to not just
do we invest in automation versus autopilot,
but which types of incidents,
which types of failures should we invest in preparing
for? And that's a
nontrivial question. You can answer it trivially: some people might be more familiar with different types of failures than others, and so they'll naively lean towards the things that they are familiar with failing, or the last thing that they ended up being paged for. There's typically a strong bias for that. But if you're standing in a position where you have the opportunity to choose, there's a better way. I want to
talk about the relationship between incident management and incident response on one side, and reliability, the reliability of your systems and running a reliability program, on the other. These two things are definitely separate efforts, but there's an inherent relationship between the two.
Your reliability program: we put these things in place so that we can proactively, not retrospectively looking at what has been breaking, but proactively identify and regularly assess what incidents we are at risk for, the probability of those risks, and the severity if these things break. And we use that information to inform what incidents we should prepare for. And we use the information that comes out of managing incidents, how prepared we are to handle these incidents, how long it takes to recover, what the financial impact has been the last three times, or however many times this type of failure has happened. We use that as an input back into the reliability program so that we can prioritize what to change and how to measure. So, a reliability program.
This is a very high level, abstract idea, but in general, what this looks like is being able to enumerate the components in your system, being aware of the ways that they might fail, being aware of your dependencies, being aware of the value, usually by rate, of how valuable certain systems are, and then regularly measuring and determining what types of operational conditions those components can survive and specifically where they tip.
And then obviously there's a whole thing around identifying,
funding and staffing for engineering
improvements so that you can hit reliability goals.
But I really want to zoom in on the mechanism, the high level core mechanisms of a solid reliability program. There's a tool called failure mode and effects analysis. This is a pretty robust framework. It's rare that I've seen it implemented in the SaaS and software space in a deep way, but it's a really important system, even if you're taking only high level inspiration from it. A failure mode and effects analysis is
a robust and opinionated framework for
cataloging the components in your system,
the failure modes for each of those components,
the probability of those failures (and that term starts to get a little bit fuzzy and really dependent on your business), and the severity impact of that type of failure. For many groups, this might look like financial impact; this might look like downstream impact. If this fails, you can begin to talk about cascading failures, although failure mode and effects analysis is really not so much concerned about that; it's usually first order impacts. But if you can get to money, it helps you craft a better story later.
And these analyses also typically discuss, and present an opportunity for you to determine, whether or not you can detect the type of failure. But the big idea is you get this information, you build out a big table, it might be a spreadsheet, whatever it is, and this helps you identify all sorts of risks in your system, to really identify where the risks are.
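As a rough illustration of what that table can look like if you keep it in code instead of a spreadsheet, here is a minimal sketch using the classic FMEA convention of scoring severity, occurrence, and detectability on 1-10 scales and multiplying them into a risk priority number. The components and scores below are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FmeaRow:
    component: str
    failure_mode: str
    severity: int     # 1 (negligible) .. 10 (catastrophic)
    occurrence: int   # 1 (rare) .. 10 (near-certain)
    detection: int    # 1 (caught immediately) .. 10 (customers find it first)

    @property
    def rpn(self) -> int:
        # Classic FMEA risk priority number: severity x occurrence x detection.
        return self.severity * self.occurrence * self.detection

# Made-up example rows; in practice you would enumerate every component and failure mode.
table = [
    FmeaRow("payment-service", "dependency timeout", severity=8, occurrence=6, detection=4),
    FmeaRow("image-resizer", "out of memory", severity=3, occurrence=7, detection=2),
]
for row in sorted(table, key=lambda r: r.rpn, reverse=True):
    print(f"{row.component:20} {row.failure_mode:22} RPN={row.rpn}")
```

Sorting by that number is one simple, consistent way to surface where the risk actually sits.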
I want to talk for a moment about failure modes and your detective controls, because this goes directly to your incident preparedness. As your reliability management program is enumerating the types of failures for each component and whether or not they can survive them, another big question for
each of those types of failure modes is, can you detect it before
your customers do, or how
quickly can you detect it?
And that's really because if you can't,
these are clearly going to be gaps in your
preparedness. If you can't detect
whether or not a failure mode has happened,
you're going to have poor response time. If you
can't detect whether or not this failure has happened,
your incident responders, when they do respond, are going to
have a more difficult time identifying the nature of the failure.
And so it would be naive for me to stand up here and say, make sure that you've got detective controls for everything. This is one of those cases where you want to look at that breakdown and ask: does this type of failure mode warrant investment into detective controls?
And it's important to be able to test your detective controls regularly. I don't mean at one point in time, but to regularly create failure conditions in whatever environment, to verify that your detective controls operate the way that they're intended.
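Here is a sketch of what that kind of regular verification might look like, assuming you already have some way to inject the failure and some way to ask your monitoring whether an alert fired; all three hooks (inject_failure, clear_failure, alert_fired) are hypothetical placeholders, not a real API.

```python
import time

def verify_detective_control(inject_failure, clear_failure, alert_fired,
                             timeout_seconds=300, poll_seconds=10):
    """Inject a known failure condition and confirm the detective control notices it.

    Returns detection latency in seconds, or None if nothing fired before the timeout,
    which would indicate a gap in preparedness for this failure mode.
    """
    started = time.time()
    inject_failure()                          # deliberately create the failure condition
    try:
        while time.time() - started < timeout_seconds:
            if alert_fired():                 # did monitoring catch it?
                return time.time() - started
            time.sleep(poll_seconds)
        return None
    finally:
        clear_failure()                       # always roll the experiment back
```

Run something like this on a schedule and the detection latency itself becomes a metric you can track over time.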
The next step, and it's not really a step, but the other part that I'd already discussed a little bit: when we bring it back to how we prioritize what to invest in, the real big question is, well, where's our biggest risk? And when I say risk here, I mean not just the probability of failure, but the probability multiplied by the severity of the failure. If you have something that is very expensive if the failure occurs, but is extremely unlikely, then it might be a lower priority to prepare for than a type of failure that happens three or four times a day, is likely to continue, and has a more mild cost associated with it.
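One minimal way to make that probability-times-severity comparison explicit, assuming you can estimate a frequency and a per-incident cost for each failure mode (the numbers below are invented):

```python
from dataclasses import dataclass

@dataclass
class Risk:
    failure_mode: str
    incidents_per_year: float   # estimated or measured frequency
    cost_per_incident: float    # severity expressed in dollars (illustrative assumption)

    @property
    def expected_annual_cost(self) -> float:
        # Risk = probability (here, frequency) multiplied by severity (here, cost).
        return self.incidents_per_year * self.cost_per_incident

# Invented numbers: a rare-but-expensive failure versus a frequent-but-cheap one.
risks = [
    Risk("primary database failover fails", incidents_per_year=0.5, cost_per_incident=250_000),
    Risk("cache node loss degrades search", incidents_per_year=1200, cost_per_incident=150),
]
for risk in sorted(risks, key=lambda r: r.expected_annual_cost, reverse=True):
    print(f"{risk.failure_mode:40} expected annual cost ${risk.expected_annual_cost:,.0f}")
```

With these invented numbers, the frequent, low-cost failure actually carries the larger expected annual cost, which is exactly the kind of comparison that should inform where you prepare first.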
But you have to do that reflection activity. You have to actually ask yourself,
how likely is something to happen? And that
typically requires some type of experimentation. You should test it.
Can this happen? Under which conditions can it happen?
And when it does, dive into the business. Look at your volumes. If it's a revenue type business, how much revenue is associated with these interactions?
If this type of failure might result
in some breach of contract,
it's important to understand the penalty for those types of violations and bring
that in. Let the business inform your
engineering decisions.
And so there are a lot of different ways to do this. It's easy to say probability and severity and talk about risk; I've seen it done a lot of different ways. And one of the concerns there is having inconsistency in your organization. If you have ten different groups in your organization and each of the ten groups is doing it slightly differently, it becomes very difficult to prioritize for your organization because you're often comparing apples to oranges.
So regardless of what happens, regardless of how you move forward: consistency in measurement, consistency in those metrics that you're using to drive prioritization decisions that dictate how you're going to spend your money in improving your preparation. Consistency is key.
And this is one of the problems that we're solving at Gremlin that I'm so passionate about: our new product, Reliability Management. Product scoring is really central to it. And at minimum, this is something that we've learned from the vast experience of building reliability programs with companies. More often than not, from my experience and other conversations I've had with people at Gremlin, these are the dimensions that our customers find great success with. And like I said, the specific scoring mechanism you use is less important than having consistency in scoring. So what we've done is we've gone ahead and built a consistent scoring mechanism on their behalf.
And so this is just an example. We do regular testing for redundancy, scalability, and surviving dependency issues. We combine those into an easy to understand score, and we help present this to customers in a way that they can understand reliability issues between different services. Now, if you were to dive in, you can see specific conditions, and you can use those types of failures to inform the types of incidents that you should be prepared for. But the real power here is being able to know and identify: what things can we survive? What things can we not survive? For those things that we can't survive, what's our impact? So however you end up implementing it, this is what I believe to be a fantastic example of what you should end up with at the end of the day.
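As a purely illustrative sketch of what that kind of roll-up could look like, here is a hypothetical composite: a pass rate per test category combined into a single 0-100 service score. This is not Gremlin's actual scoring formula; the categories mirror the ones mentioned above, and the equal weighting is an assumption.

```python
# Hypothetical composite score; not Gremlin's actual formula.
TEST_CATEGORIES = ("redundancy", "scalability", "dependencies")

def service_score(results):
    """Average pass rate across the three test categories, as a 0-100 score.

    `results` maps each category to the pass/fail outcomes of its recent test runs,
    e.g. {"redundancy": [True, True], "scalability": [True, False], "dependencies": [True]}.
    """
    per_category = []
    for category in TEST_CATEGORIES:
        outcomes = results.get(category, [])
        # An untested category counts as 0 so gaps in coverage stay visible in the score.
        per_category.append(sum(outcomes) / len(outcomes) if outcomes else 0.0)
    return 100.0 * sum(per_category) / len(per_category)

print(service_score({"redundancy": [True, True],
                     "scalability": [True, False],
                     "dependencies": [True]}))   # roughly 83.3
```

Whatever the exact formula, the value comes from applying the same one to every service so the scores are comparable.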
That's why I'm so excited about what we're building here at Gremlin.
At Gremlin, this has been our focus since the beginning, but we're really making it explicit now that our mission is to help teams standardize and automate reliability one service at a time, and to help them understand at the service level, in a consistent, repeatable way, what they can tolerate and what they can't, so that they understand how to prioritize their improvements, either in product engineering or in incident response preparedness, their incident preparedness. Thank you.
That's all I have today, but if you have any other questions, I would love
to see them in the chat. Thank you for everything.