Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hi, and welcome to our talk. This is Reducing Trauma in Organizations with SLOs and Chaos Engineering. My name is Mandi Walls. I am a DevOps advocate at PagerDuty. And I'm Julie Gunderson, a senior reliability advocate at Gremlin. Mandi and I are really excited to be here with you today because we actually both worked together at PagerDuty, and now, at different organizations, we really see ways that you can combine some of these practices to make for really reliable organizations. So, Mandi, awesome.
Yeah, thanks, Julie. One of the things we talk about at PagerDuty is what we call full service ownership, and it's about focusing on the reliability of the services that your team creates once they get into their production life, whatever that is, whether it's internal production or external customers. So part of knowing how your services are performing is setting goals around that performance, and that is really key to keeping all of your users happy, whether they are external customers or internal customers.
So we're going to talk about using tools like
service level indicators and service level objectives and
how they can help you focus on what your users need. And then adding
to that, using chaos engineering practices to make
sure you're hitting those goals as you're working on your service and as you're making
those improvements. One of the things that we often talk about
is the cost of downtime. So downtime costs money, and there are quantifiable and unquantifiable costs. On the quantifiable side, you've got your lost revenue, so you can just talk to your accounting or your sales folks; there's lost employee productivity; there are customer chargebacks; and there's breaching those SLAs. That's where we really want to focus this talk. But just so the other things don't get lost, we also want to remember the unquantifiable costs, such as damage to your brand reputation and employee attrition. And so with that, Mandi, why don't you kick us off with some information on SLOs and SLIs?
Yeah. So there's a lot of vocabulary on this one, so let's just set some baselines; we'll go to the next slide there. We're not going to talk too much about SLAs. That's sort of the realm of lawyers, right? If you've worked for a software vendor like both of us do, you probably have some SLAs that your legal team and maybe your insurance company and a bunch of other people get together to set, so that it's part of the contractual agreement between your company and your customers. And there might be places where there's, like Julie mentioned, chargebacks or some kind of remuneration for outages and things like that. That's beyond what most of our SRE kind of practitioners can get into, so we're going to focus on the other pieces of this. The first one here is our indicators. Our service level indicators are the metrics that we're going to work towards, the things where we're going to figure out how important they are to our users and where they need to be. And then our objectives are going to be the parts of those metrics that we can set goals for. They're going to be the places where we know that beyond a certain point we're going to start losing users. So we're looking at sort of the tolerance level for where we can experiment and push a little bit of risk and still keep the customer base happy.
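To put some rough shape on that vocabulary, here's what an SLI/SLO pair might look like written down; the service behaviors and numbers below are made up for illustration, not anything from PagerDuty or Gremlin.

```python
# Hypothetical SLI/SLO pairs for an example storefront service.
# The indicator is what we measure; the objective is the goal we set on it.
slos = [
    {
        "sli": "proportion of /checkout requests served without a 5xx error",
        "slo": "99.9% over a rolling 30 days",
    },
    {
        "sli": "proportion of search queries answered in under 300 ms",
        "slo": "99% over a rolling 30 days",
    },
]
```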
So then we have the rest of the time, right? We think about, okay, well, we've got some sort of uptime; that's kind of what we're measuring against. We will have some sort of goals that we're going to set. And the rest of that, whatever's left out of the pie, is going to be our error budget. And you can kind of see from this sort of silly diagram: the pie chart is 99%, things are good, and 1% is our sort of wiggle room, that place where we're able to maybe try something out. We're not sure it's going to work 100%, or it might be a little bit out of range for what our goals actually are. But it gives you this place to measure yourself against your goals and implement changes in a way that you're still preserving your customer experience. So it gives you a measurement for maybe you need to improve things, like removing services from load balancers before you restart them, or using blue-green deploys, or all these kinds of things, to maintain this sort of error budget goal that you set for yourself. And when we add chaos
engineering into the mix, we can look at the SRE pyramid. You really start with the monitoring and observability, right? Just like what Mandi was talking about with the SLOs and the SLIs. And then you move into incident response, and then we get to the post-incident analysis. So let's say you've had an incident and you've run a blameless postmortem on it. One of the things that you want to do is then obviously work towards the fixes that you have. You want to actually have an internal time that you've agreed to as a team to work towards remediation efforts so that those incidents don't occur again. But then you want to test that. You want to repeat those incidents with chaos engineering, and you want to automate that so that you can make sure that those fixes have worked and that they continue to work. Then, in your testing and release procedures, you want to bring your chaos engineering practices into that.
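As a rough sketch of what automating one of those repeated incidents might look like, here's a hypothetical regression-style chaos test; `chaos`, `shop`, and `inject_dependency_latency` are stand-in names, not any particular vendor's API.

```python
import time

def test_checkout_survives_slow_payment_gateway(chaos, shop):
    """Replays a past incident: the payment gateway got slow and checkout timed out.
    The fix added a timeout and a retry; this check verifies the fix keeps working."""
    # Hypothetical chaos client: add 2 seconds of latency to payment-gateway calls.
    with chaos.inject_dependency_latency(target="payment-gateway", delay_ms=2000):
        start = time.monotonic()
        order = shop.place_order(cart_id="test-cart")   # hypothetical app client
        elapsed = time.monotonic() - start

    assert order.status == "confirmed"
    assert elapsed < 5.0  # the internal target the team agreed to after the postmortem
```

Run on a schedule or in the release pipeline, a check like this keeps verifying the remediation instead of trusting that it worked once.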
And then you move to capacity planning and development, and then product, which, again, goes back to what Mandi was talking about with error budgets and setting those. And so then we go ahead and talk about how we center on that customer experience. Yeah, really, Julie, the whole point of the whole exercise is
about keeping your users happy. Right. We're looking for the things
that they care about. And if we go to the next slide,
we have to figure out what users
actually care about. Right. You might have users that are really sensitive
to slow loading pages. You might have users that are sensitive to
large payloads on mobile because maybe their
mobile traffic is too heavy for what their connectivity is.
There might be lots of different behaviors that they have, and you want
to be able to test that and take a look at it for what
the user behavior looks like. Well, and Mandi, we've seen it, too, with the state-of-X-Y-Z reports, multiple reports. We've seen it in our Gremlin reports and in the PagerDuty reports: users won't wait. If that app or website takes a long time to load, they're out of there, and they're on to a competitor who's really built that user experience in. Yeah,
absolutely. So we're looking at maybe there's a place where certain user behaviors only apply when folks are logged in, or maybe they have different behavior when they're sort of a guest. Maybe there are certain things that are super important, like your search function and your shopping cart, but maybe managing or updating billing information isn't quite as important, because people aren't using it all the time. So you're really looking for the things that people are really gravitating to or using a lot in your applications, and the behaviors that they really want to have. Be fast, right? Like Julie mentioned, people are going to leave, right? There are certain places where, okay, I'm not going to change my bank if their app is slow, but if I'm shopping for something or I'm looking for some music or something like that, if you're not responsive, I'm going somewhere else, right?
So some of the things that you might find that your users care about,
right? No errors on your main page or some
module loads first. And you can actually see this as
a user. As you're floating around the Internet
and you pull up a site, you can notice which modules
load first. YouTube is very obvious, right? The video player loads first and then the rest of the page builds around it, because that video load, and the actual video itself, is the main part of the experience. So when you're thinking about that for your own services: how the page loads, the pieces that build in, where they come from, how long they take, it's all part of your user experience. So lots of things to think
about when you're working with your users and what they care about. You might
actually have to ask them about the
things that they care about. Sometimes that's something that we tend to forget. We want to develop things for ourselves that we think our users
are going to love, right? But then upon releasing it, we find
out that our users are not using that new feature and they're using some random
thing that we thought nobody would like. And it's just really
understanding that user experience.
Definitely. And there are lots of things that you probably already have in your metrics that are important but not primary to your user experience. Some stuff is going to be maybe early warning, right? CPU utilization, memory allocation, those things are super important in that they will impact user experience down the road. But you want to broaden your definition of the things that you want to look at. As you're setting these SLIs, you're looking for the actual behavioral aspects of your application, of your services, when they hit the user. The user has no idea that your CPU utilization is at 85%. They just know that your queries are slow, that they're not getting their searches back when they're searching on your site. So thinking about what the user experience is, how that's going to translate into the metrics that you're collecting, and what metrics you might have to add, is another super important step of the whole exercise. And then that's kind of how we talk about, like we talk
a lot about metrics. And some people get a little bit scared when it comes to chaos engineering because maybe they don't have a ton of baseline metrics, right? And that's okay, because you can actually use chaos engineering, which is the practice of injecting failure into your systems, to understand and to validate your monitoring and your metrics. And so when we want to know, okay, our SLOs, our SLIs, are they right? We can go ahead and practice that. We can imagine what that customer experience is going to look like, let's say, if the email server goes down. So I check out, I purchase my item, but I don't get that immediate feedback as a customer with that email confirmation. That's okay. Was I still able to check out? Was I still able to complete my purchase? Were those things able to happen within the defined timeline that we have agreed to? So we want to inject that failure proactively so that we can validate: are these objectives that we have set the right ones for our team? Are they what we should be holding the expectations to, and how does that map to the customer experience?
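A rough sketch of that email-server experiment as an automated check might look like this; again, `chaos` and `shop` are hypothetical stand-ins for whatever chaos tooling and application client you actually use.

```python
def test_purchase_completes_when_email_service_is_down(chaos, shop):
    """The SLO question above: if the confirmation-email service is unavailable,
    can customers still check out within the time we've agreed to?"""
    # Hypothetical chaos client: drop all traffic to the email service.
    with chaos.blackhole(target="email-service"):
        order = shop.checkout(cart_id="test-cart")       # hypothetical app client

        # The purchase itself is the core experience and must still work.
        assert order.status == "confirmed"
        assert order.checkout_seconds < 3.0              # example objective
    # The confirmation email is allowed to lag; it just needs to arrive
    # once the dependency recovers, which can be checked after the experiment.
```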
And so, Mandi, talk to us a little bit about that.
Yeah, there's a bit of a place where, before you embark on this journey, there's some operational maturity that you'll want to have in place before you go down the path to these SLIs and your SLOs. So some of
the things we've already talked about around user experience and those
kinds of things, you want to already have in your pocket, right? You want
to have a good idea of the impact of things
like new features or changes or
degradations and how users are responding to those. And you
also want to have the mechanics via your chaos
engineering tools to be able to work with that. So you want to have
a lot of telemetry already available, right? You might have open
source or commercial solutions or whatever it is, but you're going to be collecting
your user facing metrics, right? That's going to give you perspective on
what users are experiencing throughout your services.
You might have a set of synthetic monitors that are going to tell you the things you know about the things you know, right? That's the place where you've already sort of pre-populated what needs to be known about those potential components. You probably have some logging set up so that you're tracking things on a post hoc basis, so that you can collate behaviors and other events as they happen throughout the ecosystem. It's a good place to do that; you get all the text on it, all the timestamps, all that fun stuff. And then you can kind of put those two together with a tracing tool, which gives you a place to start dealing with the complexity, especially if you've got a widely distributed system. If you've got a monolith, you might already be
in a good enough place, right? You can kind of cheat a little bit
because things aren't moving around a whole lot. But if you do have a
widely distributed ecosystem, you're going to want to have some tracing so that
you can follow user requests throughout all of your
services. And then observability tools are going to help
you underpin all of these components, give you a more complete
picture of the ecosystem.
In a more generic sense, it helps you sort of track down the unknown unknowns, the things that you weren't expecting, because you're poking sort of that black box of your software with some inputs and seeing what pops out the other side. And that gives you a lot more ability to say, okay, this user behavior is indicative of this set of requests and this set of otherwise sort of hidden back-end requests and things like that. So I know then, as these users come in with this particular use case and behavior, they're going to hit all of these things, and I can start tracking those down for my SLIs and my SLOs. Another thing: unfortunately, Julie and I have both seen places where folks don't exactly have a really good picture of all their dependencies.
And this is super important. You want to be able to know your services and what they're consuming. You want to know if they're eating bad stuff, right? If you've got a back-end dependency whose SLO is really low, your services can't have a more stringent requirement, because your back ends aren't up to performing to that requirement. So it gives you a place to really start thinking about other teams that you're working with, things that you can do more defensively. Maybe you can red-button something when it goes out of your range for tolerance, those kinds of things, and do some more advanced techniques to protect your service from things that aren't up to your requirements.
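To see why a dependency's SLO caps your own, here's a back-of-the-envelope calculation; the services and numbers are made up for illustration.

```python
# If your service is only "up" when all of its hard dependencies are up,
# the best availability you can promise is roughly the product of theirs.
dependency_slos = {
    "auth-service": 0.999,    # hypothetical back ends your service calls
    "catalog-db":   0.9995,
    "search-index": 0.995,
}

ceiling = 1.0
for availability in dependency_slos.values():
    ceiling *= availability

print(f"Best-case availability: {ceiling:.4%}")   # roughly 99.35%
# So a 99.9% SLO for your own service isn't realistic here without defensive
# measures like caching, graceful degradation, or dropping a dependency.
```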
Well, and another thing, too, is that sometimes you don't realize that a service is actually in a critical path. You may think, this is not a critical path, if the Redis cart goes down we're fine, right? But then when you test, when you're purposefully injecting this failure, you might find a critical path that all of a sudden makes you realize you need to redefine your SLOs based on that. Absolutely. Let's take
a look at, it's going to look like math, right? If we go to the next slide there, we've got kind of a generic model for goals. You have your service level indicator, which is going to be some text, and then you have your service level objective, which is going to be some numbers, probably. And then we've got a period of time, which is represented by T. Our SLI is going to be the number of good things that happened divided by the number of all the things that happened, and we're going to multiply by 100 to get a percentage. And then our error budget is going to be whatever's left over. So if I say, okay, my service level objective for 500 errors on my main page is 99%, then my error budget is the 1% of those requests that can be out of that range before my customers start to get really unhappy. And we have some examples on the next slide of how these numbers sort of fit together, right?
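Written out, the model on that slide looks roughly like this; the event counts are the same ones used in the example on the next slide.

```python
# Generic model over some time period T:
#   SLI (%)      = good_events / total_events * 100
#   error budget = 100% - SLO
total_events = 100_000    # all requests to the main page during T
good_events  = 99_000     # requests that did not return a 500

sli = good_events / total_events * 100               # 99.0
slo = 99.0                                           # the objective we set
error_budget_pct    = 100.0 - slo                    # 1.0% of requests
error_budget_events = total_events * error_budget_pct / 100   # 1,000 requests
```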
The bigger the pool of events, the more wiggle room you can kind of get at the same percentage points. It's just math, right? So at 100,000 events, if I have 99,000 good events, that means I can have 1,000 events where maybe I'm trying something different out, maybe I'm doing some experimentation, but I know that I have that error budget, that sort of wiggle room to do a bit of things that are maybe outside our goal parameters, and our users are going to cope with those in an okay way. They're going to be more tolerant. Yeah. So I know that none of us expected a math lesson today, so thanks for that, Mandi. But really, reliability is obviously very important to organizations now, so, right, we're perfect 100% of the time, our web requests have zero milliseconds of latency all the time.
Right? But not necessarily, not in the real world. So that's why we talk about SLAs, SLOs, SLIs. So maybe we have an SLA that 90% of web requests have a latency under 500 milliseconds for the month, or else the customer gets their money back. Then we set a buffer in: now for our SLO, we've got a 95% SLO. So we've got this 5% buffer between our SLA and our SLO, and that's what we can use to play with. That's what we can use to experiment with. That's where we can start getting creative with maybe new features that we want to release to our customers. But it's really important that we are staying within those ranges that we have set for our organization. And so when we kind of look at this
in play, here's a little bit of a scenario that you can run through, right? You can look at the SLO scenario in staging; you can do that with Gremlin. Maybe an instance downtime occurs, right? Datadog is picking up that instance, they're calculating it, and then PagerDuty is firing off that alert, letting you know that that SLO has been breached. So these are ways that you can put all of these things together, to use the tools that are at your disposal to make sure that you are maintaining not only those contractual obligations that you have, but that customer experience.
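As a very rough sketch of the glue in that scenario, here's what a periodic SLO check might look like; `fetch_latency_slis` and `send_alert` are hypothetical stand-ins for whatever your monitoring and paging tools provide, not Datadog or PagerDuty API calls.

```python
SLO_TARGET = 0.95   # 95% of requests under 500 ms, per the example above

def check_slo(fetch_latency_slis, send_alert):
    """Compare the measured SLI for the current window against the SLO
    and page the team if it has been breached."""
    good, total = fetch_latency_slis(threshold_ms=500)   # hypothetical metrics query
    if total == 0:
        return
    sli = good / total
    if sli < SLO_TARGET:
        send_alert(                                       # hypothetical paging hook
            summary=f"Latency SLO breached: {sli:.2%} < {SLO_TARGET:.0%}",
            severity="critical",
        )
```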
Yeah. One thing to remember, though, as you're working on these, is that while your SLA is your public-facing customer contract, that thing that the lawyers put together for you, your SLOs and your SLIs are really for you. They're for your team to work against and to budget and prioritize for. They're not meant to be a cudgel or any kind of punishment, because we don't want to disincentivize people from making changes. Making changes, shipping features, getting all that stuff out there for our customers is how we get more users. It's how we provide them with delightful things. So we don't want to punish people or beat them over the head with their SLO if they're not meeting it. However, it is a good place to revisit after a postmortem or to talk about during a review.
Where are we on the error budget for this quarter? Where are we on the error budget for this particular feature? So that you can be really conscientious about the changes that you're making and the work that you're doing and how it impacts your users. Yeah, we've seen it where some organizations will stop releasing new features if they're getting close to that, right? When you're looking at that overall math equation and you know you're close to breaching, you're going to say, okay, we're going to pause and we're going to work on the reliability of what we have now. We're not going to make any more changes, so that you can make sure that you're keeping up with what your goals are. And you can also then use chaos engineering to test out the new features and to make sure that you're focusing on those SLOs. Some people say that you can only do chaos engineering in production, that that's the only way, and if you're not doing it in production, you might as well just not do it at all. And that is absolutely not true.
I mean, we've had experiences where we've practiced chaos engineering in tabletop experiments, where we're just writing ideas on a piece of paper. Mandi had some fun with that at one of the summits a little while back. But you can actually adopt the practice in development, right? If you think about it, you're architecting for failure. You're keeping that in mind, and you can get confident then in testing and development. Then you can move to staging, and you can start small, and you can expand your blast radius as you are releasing these new features. And then finally you can move on to production, and you can start small with these experiments, and then you can increase the magnitude, you can increase the blast radius. So, in all reality, this is just how we do development, right? You don't actually have to overthink it. You just want to work iteratively like you would with code, move up your environments like you would with code. We all know how to do this.
And so, Mandi, I'm going to pass it to you to talk a little bit about working with upstream dependencies. Yeah, upstream dependencies can be tough, right? If they are things that are owned by your organization, you might have some ability to put some pressure on your colleagues and other business units to say, look, man, users really love this thing that we're consuming off of you, but you're not up to what they're expecting. You can have those kinds of discussions. When you have external dependencies, you have third-party pieces, and you're looking to see if they even publish what they're going to present to you, whether they've got published SLOs or, if it's something that you're buying, whether they have a contractual SLA. You're looking at those because then you're going to use that as part of your own math to say, well, service A is reliant on service B, and service B can only ship us this particular availability. We can't be better than that, right? If you want to be better than that, you have to think about defensively coding around bad performance, looking at turning things off, or taking things out of the user experience if they're not performing. So really looking at it from the user's perspective to say, would they rather not see something than have it be slow? Or do I need to look at alternatives?
Do I need to consume this from another provider? And really being proactive about that. I love the term defensive coding, too, because that's really what we're thinking about: again, kind of going back to that architecting for failure. And I know we mentioned this earlier, so I'm just briefly going to touch on using chaos engineering to validate those dependencies and those critical paths. There are specific attacks that you can run. Maybe you can inject some latency, since SLOs are time sensitive, and see what happens when an application that's required for my application to serve its core function gets slow. Because again, we want to serve our customers. And even though an application might continue to work in some capacity, it might not be the capacity that supports the goals that we have set for ourselves.
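One way that kind of latency attack can be wired up as a repeatable experiment; `chaos.inject_dependency_latency` and `measure_request` are again hypothetical helpers standing in for your chaos and load tooling.

```python
def latency_experiment(chaos, measure_request, samples=200):
    """Add latency to a critical upstream dependency and check whether
    our own latency SLI still meets a 95% / 500 ms objective."""
    # Hypothetical chaos client: slow every call to the recommendations service.
    with chaos.inject_dependency_latency(target="recommendations", delay_ms=300):
        durations_ms = [measure_request("/home") for _ in range(samples)]

    good = sum(1 for d in durations_ms if d < 500)
    sli = good / samples
    print(f"SLI under latency attack: {sli:.1%} (objective: 95.0%)")
    return sli >= 0.95
```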
Absolutely. And that comes into things like unplanned work as well, right? Your incidents can indicate that work needs to be done on your reliability. Your SLOs and your error budgets, you really need to make them part of your postmortem process. You can sit down and say, we had this particular incident happen, and this is what we blew out of our error budget for this particular service. And then you can make that decision: what work needs to be done in our next sprint to prioritize fixing this thing? How is it going to affect our error budget going forward? Can we even ship new things for the duration of this time period if we're far beyond our error budget based on the last incident? Planning around those things, focusing on the user experience, is a real downstream tool of setting these goals, these objectives with your indicators, so that you can really plan defensively for the changes that you need to make to make the whole experience better. So your lifecycle
is going to be: you start with your user behavior, you look for your reliability and your performance metrics, and the things that your users care about. Then we're going to set all of our goals and our SLIs and establish our SLOs, and we're going to work over time keeping those SLOs in the green, right? And with that, we're going to practice our chaos engineering. We introduce a new feature. It's going to go through testing, and it may also go through chaos testing as well, so that we know that it's not pushing our service out of tolerance for what our users expect from us. Right. So it's a maturity process. Make sure you're prioritizing the user experience first; that's the whole reason we're going through the whole process. You're going to quantify what's good and bad via your experiences. Work with your error budgets. That's really just to tell your team where you are on your time frame, and then it all feeds back into work prioritization, how you prioritize work and organize it. And if those goals are not working for you anymore, change them.
And so we've got some great resources for you. There are the talks that you can find at SLOconf, and Google's SRE book, which is available online if you haven't read it. It's an amazing book. I haven't read it cover to cover yet; it's one I'm working through, kind of like Lord of the Rings. You can also check out Gremlin for free so that you can practice with chaos engineering. But Mandi, I just realized we didn't tell people how to get a hold of us. You can find me on Twitter at Gund, and Mandi, I'm lnxch. And so thank you for taking the time to hang out with us today, and hopefully your trauma is a little bit reduced.
Yeah, thanks very much.