Transcript
Hi, I'm Emily Arnott, the content marketing manager at Blameless.
And today I'm going to share with you something pretty exciting, a new definition
of reliability. In this talk, I'm going to break down a
surprisingly difficult question. What is reliability?
We're going to look at why we need to align on a single
definition, even though it can be kind of nebulous. We're going
to zoom out a little bit and take a look at how this definition applies
to the real world to get a little bit more context. Then we'll
zoom back into the technical and look at three different services
and how this definition of reliability applies to them and
can help you prioritize. Then we'll take a look at the real
gritty question of how do you measure this new reliability?
How do you know you're moving towards it? So what is
reliability? We can all agree that it's very important, right?
It lives in the very center of what we do, site reliability
engineering. But once you actually start asking people, it becomes a
surprisingly nebulous question. If you ask ten different engineers,
give me your definition of reliability, you'll probably get ten different
answers. Maybe someone will say it's basically just your uptime,
right? Like if you're reliably available,
that means you're reliable, and that makes sense, but maybe you
should include the error rate. Like, sure, the site is up, but if it's giving
the wrong information, that kind of makes it unreliable.
And then if you're kind of thinking along those lines, well,
the speed matters too, if the site is giving the right information,
but at a snail's pace, are people really going to think of it
as reliable? Maybe some of the more SRE inclined
people could cite Google's SRE book and say that it all
comes down to the customer's expectations. They set the context
and the standard of what is reliable enough based on
their expectations. But that kind of just opens up more questions.
Like what customers? How do you know these
expectations? How do you know when they're happy? Do any of these really
cover everything? I think even if you can answer all of these questions,
I'm going to put forth that this still isn't holistic enough.
It still doesn't consider one key ingredient that we're going to get into.
And this stuff is important. Orgs of all sizes are realizing
that reliability should be a focus. In many cases,
it should be priority number one. And even if they don't know what reliability
exactly is, they know when it's missing. Here are
some examples of recent major outages among major tech
corporations. Within minutes, decades of goodwill,
of customer service, of customer happiness, can be wiped out
by one bad incident. The costs can be astronomical
for big companies and even more devastating
for small companies that don't have the leeway to recover from a major outage.
If you aren't considering these sorts of factors,
incidents can easily overwhelm you and put you deep
into the red. So what is our working definition
of reliability? What is the one that I want to explain to you? Well,
it starts with product health, which is something that I think
we all understand. It's what's being spit out by your monitoring
data. It's everything that you have in terms of how
your system is functioning, how fast it's functioning, how accurate
it is. But we like to contextualize that with customer happiness.
As it says in the Google SRE book, these numbers really mean nothing in a
vacuum. They have to be compared to what your customer's expectations
are. And then we have our third ingredient, which I think is all
too overlooked in the technical world, and what we consider kind of
the secret sauce of this new perspective,
sociotechnical resilience. This is the ability of your
teams to respond quickly, confidently, and effectively
to any incident that may occur. And that's what we're really going to be focusing
on in this presentation. So think about the real
world, for example, and you'll find that this definition is probably
already deeply entrenched in how you
make decisions, how you judge the reliability of something,
where you put your confidence. So consider flying.
We'll set up our three buckets here, and then think about all
the different assumptions you make when you choose an airline. When you step onto
a flight, in terms of customer happiness, you feel
like the airline is prioritizing your needs,
that they have some sort of picture of what would make you a happy,
satisfied customer. And the choices they make are aligned
with that. You also assume that the airline systems are working properly,
that whatever computer programs are assigning you
tickets, are printing your boarding pass,
are making sure that there is in fact a seat for everyone who bought
a ticket for that flight, are displaying the correct gates.
All of these little things that build up the workings of the airline and the
airport are functioning properly. You trust these things.
You also trust that all the systems in place to stock the airline,
the actual airplane itself, are consistently functioning.
That there's going to be food and water, that the bathrooms
are going to work, that there's going to be toilet paper, all these little things
that you never even consider you're implicitly
assuming are happening behind the scenes consistently.
And when it comes to these stocking choices, when it comes
to the things they prepare, once again, there's this implicit
assumption that the airline is making these choices based on what
you want, that the airline understands what makes
you a happy customer and is working along those lines.
But there's a major factor here, and this is something
so implicit that we don't really even think about it. But we also put
a lot of faith in the fact that all the people who are working at the airline,
who are flying the airplane, who are serving you, know what
they're doing. You assume that the pilot knows how to fly,
right? That goes without saying. You also assume that the crew will
show up on time and be ready to work. You assume that the crew
will be in good spirits, that they'll cooperate with each other
and with you, that they'll be able to execute on things that
you need them to do throughout the entire flight. These things go without
saying, and yet our understanding of what makes a good flight
is so rooted in them. It's so essential to the operation of the
entire industry. The airport staff itself also needs to know what to do.
We extend this sort of thinking about the sociotechnical resilience
and capability to everyone we interact with in the process
of flying the plane. So let's take
a look at when this breaks down. In the previous holiday season,
if you tried to fly out to see family or friends or go on a
vacation, you may have had a rough time. You wouldn't be alone.
There was a once in a generation winter storm
that passed through the continental United States.
Over 3000 flights were canceled. We can
see here a dramatic spike in cancellations
over many years. And we can also
kind of examine how these things failed through the lens of
our three pronged definition of reliability. So we bring up
our buckets again. So something like bad weather we
can kind of contextualize as product health. And this is something that's not
always in the control of the people working at the airline.
Something like bad weather certainly isn't. High demand is
also a customer oriented factor that puts a lot of strain on
these systems. Suddenly there were so many people wanting
to fly out for the holidays. There were way more flights needing to
be scheduled and filled and processed than
on any typical day. And then as a result, the systems start hitting
their limits. The algorithms and programs they have
set up to do things like automatic seat assignment, coordinating
different schedules and layovers, making sure that everyone can
actually get on their flight, start breaking down
under this unreasonable amount of strain and these sudden
changes. As a result, flights start getting canceled.
And this creates a huge domino effect where now
different connecting flights are being canceled, which causes flights
that they were meant to connect to, to be effectively canceled. And people
who have complex journeys of multiple layovers suddenly
have their whole house of cards falling apart. And what this resulted in
was a lot of unhappy customers, in part because this was communicated
poorly. People weren't finding out that their flights were canceled until
it was way too late to make alternative plans, after they had already
arrived at the airport. And now they're just stuck. It really
broke down the trust between the customers of
the airline and the airline itself. The airlines couldn't maintain a standard of
communication; there was inconsistent messaging. People received completely
conflicting requests and confirmations
and were told one thing by one person and a different thing by another person.
And then on the sociotechnical side, downstream
of all of these changes in the customers' demands and expectations and happiness,
and of the breakdown of the systems they were relying on, we see a
sociotechnical resilience crisis. In many cases,
things that were done automatically by the systems now had to be performed
manually by people who probably weren't trained to do these
sorts of things. They weren't trained to have to process these
crisis situations. Also, as a result of the pandemic,
a lot of airlines significantly reduced their staffing.
They laid off people. They shrank in size as travel
wasn't really an option for a few years. And now, as demand has spiked
back to where it was before, if not higher, they're finding themselves understaffed.
And people are stretched too thin and unable to perform all
of their normal duties, let alone all of the additional duties of this crisis.
And flying is just one example where we
can see a crisis of unreliability play
out in these three distinct areas. But it is really everywhere.
In fact, I would say if you think of any sort of service you're unhappy
with, you can very quickly start contextualizing it in these three
categories and see that it really is always this threefold
breakdown that leads to an undesirable outcome.
For example, let's say you have a terrible cell phone provider. Well, what makes
them so terrible? Sure, maybe the phone
itself is not so great. It's missing features,
it crashes, it's slow. Maybe the
networking functionality isn't very good. Your cell phone
provider doesn't have good service where you live. Maybe it's a
customer happiness issue where they're prioritizing profits
too much at the cost of having good customer service,
good customer feedback loops, responding to the needs and
market demands. And maybe it's a sociotechnical resilience
failure where even if they have the best intentions,
the people who are meant to support you simply do not
have the training to deliver on what you want. How often have
you called technical support and found that the person on the other end really
doesn't know how to answer your questions? They've been undertrained,
they're probably overstressed, overworked and underpaid,
and they can't live up to the expectations you have for them.
There are many other examples. A car: when you're choosing
a car, sure, you're thinking about the product health of the car, that is,
its functionality. Is it top of the line in its features,
its fuel economy? You're maybe also
thinking about how well the vision of the car
company, and what they want to deliver to you, matches your expectations and
what would make you happy. But you also really consider, is this car
repairable? Am I going to be able to find mechanics that know how to fix
it? Am I going to know that the people who I'm
going to rely on when my car breaks will be trained enough to
know what's going on? Think about your apartment building. An elevator goes
out of service. Well, that's immediately a system health issue,
but it also becomes a customer happiness issue, that you need to trust
that your building understands how debilitating it
is to not have elevators, how inconvenient that is,
understands how much of a priority it should be, and that they
actually have the capability in house: someone who's either
trained to fix it or trained to know who to ask to fix it,
and has the time and the confidence to move
this along quickly. If you don't have confidence in all three of these
areas, a lot of these services that we take for granted in
our lives can easily fall apart. So now that we've
kind of seen how naturally this definition occurs in the real world,
I want to challenge people to think: why don't we have the same
standards for our technical solutions? Why aren't we
considering the programmers behind the products we choose,
the operators that are resolving bugs and incidents,
and generally just understanding that in
order to have a good product, there needs to be a confident team of people
behind it. So in order to kind of illustrate what this looks like in
the tech example and how it can kind of help you strategize and prioritize
in terms of improving your products, let's imagine three services,
and this is going to be a pretty simplified example. Nothing ever
is so cut and dry in the real world, but hopefully it can illustrate the
way that this sort of definition can inform your strategic thinking.
So we have three services, A, B, and C, and we're going to
look at them in these three buckets of reliability.
So in terms of service health, that is all of your
typical monitoring metrics. Service A is pretty good. It has a lot
of uptime, it always runs smoothly, it's always returning the
correct data. Service B, it's okay, maybe it
has the occasional outage, maybe it has some slowdowns,
but it's chugging along. And service C, that's bad.
It has frequent major outages, it works inconsistently,
even when people are able to get requests through to
it, it's very slow. So you're thinking, all right,
I have some free engineering cycles, I want to shore
up the problems of my product and make my customers happy.
Which one should I look at first? You're going to say service C,
right? That seems like an obvious choice, but let's throw in the next
ingredient and contextualize this based on what are the user
expectations for each of the services. Well,
Service A, it's kind of popular. It's a feature of
your product that, let's say, around half of your users
make use of somewhat infrequently.
It's not super critical for them to enjoy your product,
but it's certainly an expectation they have that
they'll be able to use it when they need to.
Service B is something that's actually in very
high demand. It's something that every customer, no matter
who they are, no matter what their use cases are, is interacting with and they
need it to work or their experience is as good as ruined.
And service C, let's say that's like a brand new feature.
It's something that is only being rolled out to certain customers.
The usage of it isn't all that high yet, and nobody
really is hinging their continued usage of your product
on whether or not service C is functioning. So now you're thinking a little
differently, right? If you improve service C,
sure, that's great, but maybe not that many customers
are even going to notice. It's going to have pretty small returns on
your customer satisfaction, whereas A, and especially B,
really do need to see some attention to keep up with these increased
customer demands. So now maybe you're thinking,
yeah, Service B, that's the one I should take a look at. Finally,
let's look at this bucket of sociotechnical resilience.
Service A is a product that is very different
than anything else that's offered in your company.
It's something where the engineers aren't that trained, they aren't that
prepared. They haven't dealt with many outages of a service
like it. There are no runbooks in place, there's no
typical escalation procedure for it, there are no communications
that you've already written up around it. Whereas Service B and Service C,
even though service C is new, it's very closely modeled after
something like service B. And all of your engineers are very confident
that they're able to resolve any issues that
arise with it. They've been trained, they've been through it
all before. All of this is kind of old hat to them.
So now you're kind of thinking, well, maybe I should be spending these engineering cycles
on shoring up Service A proactively, writing some guides
on how to fix it. Sure, its system health is good now,
but as demands change and users' expectations change,
suddenly it could become unacceptable, and you have
to kind of proactively prepare for that. Now, at the end of the
day, it's not entirely clear just from this diagram
which service you should prioritize. And obviously, in real life,
things aren't going to be so cut and dry with such clean,
single metrics of mid, high or low.
But I hope you can see how this can be a guiding framework
to really uncover and highlight where there
are deficiencies. It's not just as simple as looking at monitoring
data, but understanding this bigger picture of how things would actually unfold
when something breaks.
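To make that concrete, here's a minimal sketch of how you might turn those rough ratings into a single priority signal. The services, ratings, and weighting below are hypothetical, purely for illustration; your own numbers would come from monitoring data, user research, and team surveys.

```python
# A minimal sketch of turning the three buckets into a rough prioritization
# signal. The services, ratings, and weights are hypothetical.

# Map qualitative ratings to numbers.
RATING = {"low": 1, "mid": 2, "high": 3}

services = {
    # product_health, customer_demand, team_confidence (sociotechnical resilience)
    "A": {"product_health": "high", "customer_demand": "mid", "team_confidence": "low"},
    "B": {"product_health": "mid", "customer_demand": "high", "team_confidence": "high"},
    "C": {"product_health": "low", "customer_demand": "low", "team_confidence": "high"},
}

def priority_score(s):
    """Higher score = more urgent to invest in.

    Risk is driven by weak product health and weak team confidence,
    amplified by how much customers actually depend on the service.
    """
    health_gap = 4 - RATING[s["product_health"]]        # 1 (healthy) .. 3 (unhealthy)
    confidence_gap = 4 - RATING[s["team_confidence"]]   # 1 (confident) .. 3 (unprepared)
    demand = RATING[s["customer_demand"]]                # 1 (niche) .. 3 (everyone uses it)
    return demand * (health_gap + confidence_gap)

for name, s in sorted(services.items(), key=lambda kv: -priority_score(kv[1])):
    print(f"Service {name}: priority score {priority_score(s)}")
```

Notice that the ranking depends entirely on the weighting you choose, and that's really the point: the framework doesn't hand you an answer, it structures the conversation about where the biggest gaps are.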
So when we're talking about these buckets in the technical context,
they're all made up of smaller
factors. And usually these are kind of questions you can be asking
about your service. So let's break these down a little bit further
in the first bucket, product health. These are the things we know very well.
All the data that comes out of your product as it runs,
the things that your telemetry is capturing,
the stability of the code base (sometimes that requires some meta-analysis),
all of this sort of objective, cold data about how
your product is actually functioning.
This can be embedded in the program as it runs,
or it can be sort of a more black-box approach where you're querying
your product in a production environment and seeing how well it responds
in terms of latency or the error rate,
the traffic you're getting, the saturation of your resources, these being
your typical four golden signals. These are
all very measurable, very trackable facts
about the way that your service is functioning.
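As a rough illustration of what "measurable, trackable facts" can look like, here's a toy sketch that computes the four golden signals from a batch of request records. The record format, window, and CPU number are invented; in practice your monitoring stack usually surfaces these for you.

```python
# A toy sketch of computing the four golden signals (latency, traffic, errors,
# saturation) from a batch of request records. The record format and the sample
# data are made up for illustration.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    status: int       # HTTP status code
    timestamp: float  # seconds

def golden_signals(requests, window_seconds, cpu_utilization):
    latencies = [r.latency_ms for r in requests]
    errors = [r for r in requests if r.status >= 500]
    return {
        "latency_p95_ms": quantiles(latencies, n=20)[-1],              # 95th percentile latency
        "traffic_rps": len(requests) / window_seconds,                 # requests per second
        "error_rate": len(errors) / len(requests) if requests else 0,  # fraction of failed requests
        "saturation": cpu_utilization,                                 # e.g. fraction of CPU in use
    }

# Example: 60 seconds of synthetic traffic.
sample = [Request(latency_ms=20 + i % 50, status=500 if i % 40 == 0 else 200,
                  timestamp=i / 10) for i in range(600)]
print(golden_signals(sample, window_seconds=60, cpu_utilization=0.62))
```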
Customer happiness starts to become a little bit more nuanced,
but it really boils down to, are your customers satisfied?
Is the product healthy enough that they're happy to continue to use it?
What does their user experience look like? When we think about
users, it's maybe not so helpful to think about an individual
as much as a function. What are the steps that they take
when they use your product? How critical are
each of those steps to their overall satisfaction? Which
steps link together such that if one breaks,
another one will break? Breaking down this user journey can really
make these questions about user happiness a lot more tangible.
You can start putting numbers to them and say, well, the login page
has to work 99% of the time in under five milliseconds,
or else users will start feeling like it's too slow or too unstable.
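Here's a minimal sketch of what checking a target like that could look like, using the login example above. The 99% target and the latency budget are just the example numbers from this talk, and the attempt data is synthetic.

```python
# A minimal sketch of checking the hypothetical login target described above.
LATENCY_BUDGET_MS = 5      # the talk's example budget; yours will differ
TARGET = 0.99              # 99% of login attempts should succeed within budget

def slo_compliance(attempts):
    """attempts: list of (succeeded: bool, latency_ms: float)."""
    good = sum(1 for ok, latency in attempts if ok and latency <= LATENCY_BUDGET_MS)
    return good / len(attempts)

# Synthetic data: 990 fast successes, 6 slow successes, 4 failures.
attempts = [(True, 3.2)] * 990 + [(True, 9.5)] * 6 + [(False, 4.0)] * 4
compliance = slo_compliance(attempts)
print(f"compliance={compliance:.3f}, meets target: {compliance >= TARGET}")
```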
You'll have to really understand this idea of importance. Probably logging
in is more important than searching, is more important
than adding something to a favorites list.
But there's no universal prescriptive
formula to this. It's something that you have to get from your users, from observing
their behavior, from surveying them, to build up
this kind of statistical understanding of importance. This one is
definitely more of a nebulous question. But do they feel confident in your
product? Do they feel like it will continue
to work to their satisfaction? Does your business have a good reputation
for a reliable product? Do they feel supported,
informed? Are there places that they can ask questions? Are there
FAQs where they can learn about failure modes
and understand what they can expect from the product? Do they feel
connected to you?
Finally, if we look at this sociotechnical resilience bucket,
this is where we turn inwards and think about the teams that are operating,
supporting, improving, and developing
your products. So think about when
something goes wrong. How effective is your team? Is there
a lot of toilsome work that's repeated every time? Are they moving swiftly?
Are they moving effectively? Are they repeating work because
there isn't good communication between people, and people are trying the same
solution independently over and over again? Is there
clear service ownership when something goes wrong?
Are the right people called in? Is there
a clear understanding of who should be the subject matter expert,
of who knows the answers to the questions?
Are teams aligned on their priorities and responsibilities?
When something breaks, do you have a clear sense of how important
that breakage is? Is there a clear severity scale, and
is there a consistent response that's proportional to that problem?
Are the on call loads balanced? Are there people that are overworked,
overstressed? You have to kind of look at this not just through the lens
of on call shifts, but the actual expected number
of incidents for different types of services, for different teams
of different specializations. In the end,
you want something that feels fair and transparent,
where everyone feels like they're putting in roughly the same amount of work.
Are people burnt out? Are there too many incidents? Are people
endlessly fighting fires? Do they feel like they're disconnected from the work
that they want to do, which is probably more planned,
novel feature work? Are people equipped with the
tools and knowledge that they need? When something breaks, are they scrambling to
figure out what to do? Do they have no way of consulting previous
cases? Are there runbooks in place for them to work through?
Or is everything coming from the top of their head?
And does your team still function if somebody is suddenly away?
Are there critical single points of failure, where if
so-and-so from the login team is missing,
well, they're the only one that understands how that code works at all,
and if it breaks, nobody else has any clue what they're doing
and everything falls into disarray? These are some
of the questions you can start asking yourself to start getting a picture
of where your overall reliability is now
with this new definition. And it's in this last bucket that
we really want people to focus because the first two things,
I think most organizations already have some function to capture.
But this looking inwards and really understanding, what are
the stress points for your teams? Where do your teams feel confident and
where do they feel insecure? That takes a lot
of deliberate work, a lot of reflection,
and should really motivate a lot of change.
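Take the on-call load question from a moment ago as an example. Even a crude tally can surface the stress points; here's a small sketch, with hypothetical names and hours, that flags anyone carrying well more than the average incident load.

```python
# A rough sketch of one way to check whether on-call load feels fair: compare
# actual paged incident-response hours per responder against the team average
# and flag large deviations. Names and numbers are hypothetical.

incident_hours = {
    "alice": 31, "bob": 9, "carol": 12, "dan": 28, "erin": 11,
}

def load_outliers(hours, tolerance=1.5):
    """Flag anyone carrying more than `tolerance` times the average load."""
    average = sum(hours.values()) / len(hours)
    return {name: h for name, h in hours.items() if h > tolerance * average}

print("average hours:", sum(incident_hours.values()) / len(incident_hours))
print("overloaded:", load_outliers(incident_hours))
```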
And what does it do for you once you're aligned on this, once you agree
that we really should prioritize looking at these three buckets?
What benefits should you expect to get from this alignment
on reliability? Well, first off,
the value of just aligning in the first place is huge.
So often in this world of distributed teams and microservices,
where system architecture is so complex, it creates a
lot of inconsistent communication, prioritization and
understanding between different teams. But if you have this singular definition,
where any incident can be viewed through the lens of
constantly thinking about what is the overall impact on our
product's health, what is the impact of that on our customers' happiness,
and how equipped are our teams to remedy that,
then all different issues across all services can be
put into this one singular language, which creates a whole
lot more alignment, motivation, and engagement to improve these
things. This motivation for impactful changes
is also absolutely huge because it's
sometimes hard to get your teams to want
to invest in reliability. There is an eagerness to want to
launch new features, to gain that competitive edge,
to get ahead of competitors on having all the latest
and greatest tech. But by showing
that this definition links into the
health of your product, the happiness of your customers,
and the confidence and capability of your teams, it becomes
clear, like what could really be more important than that?
Finally, it prioritizes where these changes are
needed by turning the question of "is our product
reliable?" into something that's a lot more measurable, that's a lot more
monitorable, that's easier to assess.
It creates a situation where you can see what improving reliability
actually looks like in practice. It's not just this sort of vague
goal of let's make our product more reliable,
but it'll show you clear targets: we need
to fix this service so it meets customer expectations; we need to train
this team so we don't have another twelve-hour outage.
It points you towards the changes that are most impactful
and needed most.
So I mentioned a lot that one of the goals of thinking reliability
like this is that it makes it more measurable, it transforms it into something
quantitative from something totally qualitative.
And how you do that exactly will vary a lot from organization to
organization. But what we think this definition opens up
is sort of a question-and-answer format
that points you to the numbers that will be most valuable.
So here's some examples you can think about.
What are the sources of manual labor for each type of incident?
Think about all of your most common types of outages
or slowdowns or errors or whatever.
When those things happen and they're being resolved,
where is there a lot of toil? Where is there a lot of steps being
taken manually? You can look at
each of your engineers that work on call and think about how many hours
they have actually spent in the trenches responding
to incidents, not just on-call shifts, but actual on-call
work where they were paged and had to suddenly
jump to their computer. How much time has your
team spent fixing each service? Think about all your
different service types and just come up with a total number of hours
spent fixing them. So it's not always trivial to
get these numbers. It's not like there's going to be a program that can spit
them out for you. And it's not like these numbers will actually tell
you the entire story, but instead it's in the process
of gathering this data and having these conversations
that you can start to really figure out where your strategic priorities should lie.
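As one possible starting point for that tabulation, here's a small sketch that counts the manual steps attached to each recurring incident type and totals the hours spent fixing each service, then sorts both to surface outliers. The incident types, steps, and hours are all made up for illustration.

```python
# A small sketch of the tabulation exercise described above. All data here is
# invented; in practice it would come from incident reviews and time tracking.

manual_steps = {
    "cache node falls over": ["page dba", "restart node", "rerun warmup script",
                              "coordinate with networking", "verify dashboards"],
    "bad deploy":            ["roll back", "notify support"],
    "queue backlog":         ["scale workers", "drain queue"],
}

hours_fixing = {"search": 14, "checkout": 61, "recommendations": 7}

# Most toilsome incident types first.
for incident, steps in sorted(manual_steps.items(), key=lambda kv: -len(kv[1])):
    print(f"{incident}: {len(steps)} manual steps")

# Services consuming the most repair time first.
for service, hours in sorted(hours_fixing.items(), key=lambda kv: -kv[1]):
    print(f"{service}: {hours} hours spent fixing")
```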
So let's say you're trying to
list off all the different sources of manual labor for each type of
incident. And one has a lot.
Oh, jeez. This thing, it goes down every couple of weeks. And every
time it does, we have to do this, we have to restart that, we have
to run this script, we have to coordinate with this guy, that guy has
to run this script. It becomes very
tedious and toilsome. And once you're starting to tabulate these lists,
once you're starting to quantify it, that leaps right out. You can see
that that's an outlier and you should go, yes, it's time to
invest in automating this. When you're looking at the number
of incident hours, you can see easily, wow,
the team for this incident type is always busy. We should
consider investing a few more people into this to
pick up that slack, to make
sure that there's nothing slipping through the cracks, to make sure this
team isn't going to get burnt out. Once again,
just trying to quantify things tells you this story that you can
investigate further. And this
one, every time you log some hours on
a service, fixing it, repairing it,
upgrading it, you become more experienced with it, you become more confident,
you're capable of dealing with a greater range of issues that
can emerge from it. So if you find an outlier, again,
where, geez, this service, thankfully, hasn't gone down much
at all, and as a result our teams have really no experience at all in fixing it,
that's a sign that maybe you should proactively practice. You can run some
drills, you can write some runbooks, you can prepare resources
and processes that will get people confident even
in the absence of the real world crisis.
So measuring this thing, this brand new definition of
reliability, it isn't trivial. It's something that varies from
organization to organization, from product to product, but it's something
that's really worth investing your time in. And the goal, again, isn't to
come up with one magic formula that answers every question,
but to reveal what the right questions are,
to find outliers and to decide priorities in the
process of investigating the numbers behind them.
So, in conclusion, our new definition
of reliability is your system's health, which we know
very well, contextualized by the expectations
of your users, what counts as adequate numbers,
and then prioritized based on your engineers' sociotechnical
resilience: where do they have confidence, and where do they
need training and help in order to keep everything running smoothly?
Here are a few citations around the cost
of downtime in tech companies and also the crises among
airlines last holiday season, so feel
free to investigate that further if you want to see just
how important this question has become to so many
organizations. But I hope you've enjoyed my talk.
I hope I've opened your eyes to a new way of
thinking about reliability and motivated you into investing
some serious time into building up that practice
in your organization, too. Have a wonderful day.
I've been Emily Arnott.