Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to Conf42. My session today is on managing service
reliability by managing risks. Risks are vital for any SRE,
so I'll cover this broadly,
moving to the next slide here.
So first of all, let's quickly understand:
are your SLOs realistic? If you see here, there's an application
which has multiple services, and then there are sub-services within
those. But are we really sure the SLOs and the SLIs defined can be
met? What are the risks associated with this? What would happen if
one of the services gets impacted? Or what would happen if the
underlying cloud infrastructure becomes unavailable for a moment? Do
we have an auto-healing mechanism in place? There are so many risks
that people are unaware of, and so there needs to be a rigorous
mechanism for capturing the risks in any scenario or any service.
That's what we define as risk analysis.
It helps you to prioritize and communicate.
Services can obviously be made more reliable if you know the risks.
For example, in the previous diagram there are so many different
applications and services. There could be an impact on a dependency,
on capacity, on operations or on the release cycles.
So these are the underlying risks, and what they enable us to do is
understand what it could take to recover. Mean time to detect
becomes one of the vital criteria, and it is part of the risk. When
you create a risk analysis, you look at your different reliability
metrics: MTTD, which is your mean time to detect; MTTR, your mean
time to repair; what percentage of users get impacted; and the
probability of occurrence. That's the value of risk analysis. So how
do you define the risks? The way I've seen most of our customers and
builders do it is to use a risk catalog. What do we mean by a risk
catalog? A risk catalog is basically a structured way to capture all
your risks so that you can then prioritize them: identifying
different counters, brainstorming within the team about what could
happen, looking at past data, and seeing what the various metrics
are for those risks. So a very essential part of your reliability
management is to first define your risks.
Create a catalog, brainstormed with multiple team members and
multiple teams: developers as well as system engineers and
cloud architects. This catalog should cover risks across
infrastructure and software. It should also define your key SLIs,
your MTTD and your MTTR. One prominent way I've seen customers and
our teams do it is to map the user journey: starting from the points
of interface the user touches and tracing back through the different
paths the user takes. Imagine a retail commerce site. Users might be
interacting with, let's say, a catalog service where they can see
all the different products and catalogs. The user journey might then
be to select a product, add it to a cart, possibly add multiple
products, and then finally check out and possibly ship. So you look
at the user journey and at the different services the user touches
or where the user gets impacted. And obviously one very key point
is to look at past incidents. Past incidents give you a lot of data
to include in your risk catalog. So I'll quickly show you a sample
risk catalog.
This is from Google SRE.
Here you can see what a risk catalog looks like. For example, it
starts with a configuration mishap which reduces capacity. Then we
have MTTD, the mean time to detect, so you could expect systems to
detect this within 30 minutes, and then what the possible time to
repair could be. Again, this comes from discussions, brainstorming,
production incidents and collected data, and you refine these
metrics as you get more and more information. Then you look at the
likely impact. Again, this 20% number comes from experience, from
looking at your teams and your own release cycles, and you can
estimate this number from there. And then you finally look at your
incidents per year and your bad minutes per year. Similarly, look at
a new release: when you are ready with the release, and a rollback
if it is required, what's the likelihood? We then have the same
parameters, or reliability metrics, and so on and so forth. You can
see others, like unique breakdown or unnoticed growth. There could
possibly be an outage within the cloud as well, which is a major
dependency, and there could be other scenarios; these are all
possible, like operators being slow. So basically what it gives you
is a fair number of elements. This list can be endless: you can add
many risks, and the good thing is that you can refine, prioritize,
reduce or include, and continuously keep enhancing or revising it
based on your own observations, your own experience and your data
points. So that is a very high-level view of a typical risk catalog.
What it ultimately helps with, once you create your risk catalog, is
rating your risks. Now, as I said, we saw one measure earlier, bad
minutes per year.
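
As a rough illustration of how a catalog row can turn into bad minutes per year, here is a minimal sketch. It assumes a commonly used formulation, (time to detect + time to repair) x fraction of users impacted x incidents per year. The Risk class, its field names and every number below are hypothetical and are not taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One row of a risk catalog (field names are illustrative)."""
    name: str
    mttd_minutes: float        # estimated mean time to detect
    mttr_minutes: float        # estimated mean time to repair
    impact_fraction: float     # share of users affected, 0.0 to 1.0
    incidents_per_year: float  # estimated frequency of occurrence

    def bad_minutes_per_year(self) -> float:
        # Assumed formula: outage duration, scaled by the share of users
        # affected, multiplied by how often the incident is expected.
        duration = self.mttd_minutes + self.mttr_minutes
        return duration * self.impact_fraction * self.incidents_per_year


# Hypothetical estimates; in practice these come from brainstorming
# and past incident data, and are refined over time.
catalog = [
    Risk("Configuration mishap reduces capacity", 30, 120, 0.20, 4),
    Risk("New release breaks a service", 10, 60, 0.50, 6),
    Risk("Cloud provider outage", 5, 240, 1.00, 0.5),
]

# Rank the catalog so the biggest expected consumers come first.
ranked = sorted(catalog, key=Risk.bad_minutes_per_year, reverse=True)
for risk in ranked:
    print(f"{risk.name}: {risk.bad_minutes_per_year():.1f} bad minutes/year")
```

Keeping the estimates as plain fields makes it easy to revise a single number after an incident and see how the ranking shifts.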
But for the way to rate, look at this user journey again: the user
looks to be happy when it is in blue, but it all goes into a
challenge when it is in orange. There are various parameters: the
time to detect, which is your MTTD and is very critical; the time to
repair, which again is very critical; and the time between failures.
As I said, you first define your risks, prioritize those risks based
on this data, and then collaborate between the teams. Start with
estimates, then collect more data and enter it. Take time to detect:
we've seen that observability in particular, along with things like
log ingestion and log management, really helps in reducing your time
to detect. And then we have automation, particularly for the time to
repair. The fact of the matter is, as you rate your risks, you can
also start planning how to mitigate them. Is this a risk that is
okay for me to carry? Or is this a risk which definitely needs to be
addressed, for which there needs to be some mechanism?
So these are very vital elements. Automation really helps. And
again, time between failures: how do you make your systems and your
releases more stable? You can look at embedding different options so
that you are increasing the time between failures. So far I've
broadly covered the risk catalog, risk analysis, why risk is
critical, and how you rate your risks. Now I'll quickly cover
accepting risk, which is a very vital part of your SRE work: which
risks should be prioritized, where we should focus and put in
engineering effort, or whether we should bring in a larger team, or
maybe the product management team, to check whether extra
engineering effort is required. Otherwise there could be an impact
on the error budget. Here you can see certain entries. For example,
"operator accidentally deletes" is one of the risks, and it comes to
129. Is it acceptable? Possibly yes, because it falls within my
error budget. The ones in blue are the ones which I could possibly
accept, so I've just marked them "yes". The ones in pink, or peach,
are risks which cannot be accepted. There is a possibility that the
first three will not be accepted at all. Each would definitely
impact the error budget and would therefore have a major impact.
So I need to spend some time and put in some engineering effort in
order to address or mitigate this kind of risk. The ones in yellow
should not be accepted either. There could be a major issue, or this
could be a major consumer of the error budget, so there has to be a
mechanism. That is what this amber color marks: something which
needs addressing, maybe as a second or third priority, but it does
need to be addressed. And then the ones in green are ones which I
could possibly accept. Even if these occur, they may not
significantly consume my error budget. So risk mitigation and risk
cataloging are a very vital element.
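
The accept-or-address decision can be sketched as a comparison between each risk's bad minutes per year and the yearly error budget. This is only an illustration under stated assumptions: the 99.9% SLO, the 25% single-risk threshold, the function names and the sample figures are all hypothetical, and a real team would pick its own cut-offs.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def error_budget_minutes(slo: float) -> float:
    """Yearly error budget in minutes for an availability SLO such as 0.999."""
    return (1.0 - slo) * MINUTES_PER_YEAR


def classify(bad_minutes: float, budget: float,
             single_risk_share: float = 0.25) -> str:
    """Bucket a risk by how much of the error budget it is expected to use.

    The single_risk_share threshold is an assumption for this sketch.
    """
    share = bad_minutes / budget
    if share > 1.0:
        return "cannot accept"      # alone it would blow the whole budget
    if share > single_risk_share:
        return "should address"     # a major consumer of the budget
    return "accept"                 # fits comfortably within the budget


budget = error_budget_minutes(0.999)  # hypothetical 99.9% SLO

# Hypothetical bad-minutes estimates for three catalog entries.
estimates = {"risk A": 600.0, "risk B": 200.0, "risk C": 60.0}
decisions = {name: classify(minutes, budget)
             for name, minutes in estimates.items()}

for name, decision in decisions.items():
    print(f"{name}: {estimates[name]:.0f} of {budget:.1f} minutes -> {decision}")
```

The three buckets loosely mirror the colors on the slide: risks that cannot be accepted, risks that should be addressed as a lower priority, and risks that can be carried.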
I've worked with many customers,
helping them build such risk catalogs
and then putting a risk analysis mechanism
in place to prioritize these risks.
One very key element which I've also observed is the role of chaos.
Chaos engineering can play a very vital role here, because when you
are estimating the risks for different scenarios or different areas,
chaos lets you identify certain blind spots. You can attempt some
failures, inject some faults, and then really understand the risk.
Is your observability really there? What is your MTTD? What is your
MTBF? What is your MTTR? Do your catalogs hold up? Another vital
point is that it can also uncover different risks, certain unnoticed
risks or blind spots, which you never realized could even happen.
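
One way to feed fault-injection results back into the catalog is to measure detection and repair times per experiment and flag anything that was never detected. The following is a minimal sketch; the experiment records, fault names and timestamps are invented for illustration, and no particular chaos tool is assumed.

```python
from datetime import datetime
from statistics import mean

TIME_FORMAT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timestamps in TIME_FORMAT."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60


# Hypothetical experiment log: when the fault was injected, when
# monitoring detected it (None if it never did), and when service recovered.
experiments = [
    {"fault": "kill catalog instance", "injected": "2024-01-10 10:00",
     "detected": "2024-01-10 10:04", "recovered": "2024-01-10 10:19"},
    {"fault": "add latency to checkout", "injected": "2024-01-11 14:00",
     "detected": "2024-01-11 14:12", "recovered": "2024-01-11 14:42"},
    {"fault": "fill disk on log node", "injected": "2024-01-12 09:00",
     "detected": None, "recovered": "2024-01-12 11:00"},
]

# Faults that monitoring never caught are blind spots worth adding to the catalog.
blind_spots = [e["fault"] for e in experiments if e["detected"] is None]

# Observed MTTD and MTTR over the experiments that were detected.
detected = [e for e in experiments if e["detected"] is not None]
observed_mttd = mean(minutes_between(e["injected"], e["detected"]) for e in detected)
observed_mttr = mean(minutes_between(e["detected"], e["recovered"]) for e in detected)

print(f"Observed MTTD: {observed_mttd:.1f} min, MTTR: {observed_mttr:.1f} min")
print(f"Blind spots: {blind_spots}")
```

Comparing these observed values with the estimates already in the catalog shows which rows were too optimistic.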
So chaos engineering definitely comes in, and it is a very vital
part of this. What helps is identifying and injecting certain fault
areas and then using the results as part of your risk analysis. So
as an SRE you can use chaos and many such principles to build and
refine your risk catalog, analyze the risks, prioritize them and
then define your engineering efforts. With this,
I come to an end. It's a fascinating discussion. I hope you
enjoyed my talk. For any questions
or queries, I'm always available. If there is anything
you need, please feel free to
reach out. These are my coordinates. Thank you. Thank you very much.