Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Ramon Medrano. I'm a site reliability engineer at Google. I work in the identity team, in the third site that we have, and I have been doing this for the last decade, believe it or not. I'm very thrilled to be speaking to you today about how we measure reliability in production. We are going to do a small workshop: creating a small SLO for a distributed system, and along the way giving some hints about the questions we need to answer when we are in the business of creating SLIs, SLOs, et cetera. So let's cover all that.
So the first introduction I want to make is that the most important feature of any system or service is the reliability it has towards its customers and clients. In my opinion, this includes security as well, because any system, for example in the cloud, or any online shopping website, et cetera, will have to convince the users to trust it with their data: payments, or, if you have a storage system, their own files and data, et cetera.
So the systems need to be reliable, in the sense that they need to be available for the users any time they need them, and they need to be secure, so we are not leaking any data to external actors. The second introduction is what SRE is. I think everyone signing up to this conference more or less knows what it is, but just a 30-second introduction: it is what you get when you treat an operations problem, like running or operating a distributed system, as a software problem. That means you are going to get automation; you are going to write software to manage these operations instead of just churning through tickets or interrupts. Getting to the matter, one of the most difficult questions,
or the most nuanced questions that we have when we are applying the SRE practice to a system, whether one being created or one that already exists, is: what level of reliability do we need? Answers like "100%" are not correct. Answers like "we shall see" are not correct either, for two reasons. First, 100% is not achievable: anything in a distributed system is subject to break, and any problem we have is going to show up to the users at some point, as we shall see. Second, "we shall see" is no good because then we can't set expectations on the system. So we are going to walk through a process to answer this question. So before
we go to the process, here is some terminology that we are going to use through the talk. First of all, we have CUJs. CUJ stands for critical user journey. It means the definition of the functionality that your users care about. For example, if you have a shop, you care about people being able to browse your catalog, you care about people being able to put things in a cart, and you care about people being able to pay. You might also care about people, for example, tracking their orders while they are being shipped to them. So this is core functionality of your service or your platform, something that is important to the user. Then we have SLIs.
SLIs are metrics: service level indicators that describe what the user experience is with regard to some functionality. That could be functionality as complex as checkout, or as concrete as storing one element in my Redis cache, because you might want to have SLIs for these subcomponents of your system as well. Then SLOs: they are the objectives
we have for the service level, the objectives we have for the user experience in different parts of the application. For example, we might want an objective of three nines: 99.9% of requests to the functionality of storing a small value in Redis are served correctly. Or 99.99% of carts are successfully checked out after the user decides to do so. And then
we have SLAs. An SLA is an agreement, a contract between you, for example, as a service provider, and your customers, indicating that if your service level indicator goes under the SLO for some time, you will give them a refund or some credits for your platform, or whatever it is, however you bill your clients. So, with the terminology introduced,
we can go to SLOs, and why SRE cares so much about SLOs all the time. Basically, SLOs are the lingua franca that we use across the whole business cycle of a product or service. We start from the concept; then we have a business description: okay, we want to do machine learning. How do we want to do that? In which service do we want to introduce it? Is this a new service, or an upgrade to a service that already exists? Then, when we have a definition for the business of what this project, service, or platform means, we go to development.
So we are going to write code, design components, land them in production, start to take some traffic, et cetera. Then we launch the service, and then we have operations: we are going to have to do weekly or daily rollouts, or whatever your cadence is. We are going to have to monitor, for example, that the new versions are correct. We are going to have to make sure we have data integrity, running backups, et cetera. And finally, all of that goes to the market. In the market, if we have a service that produces revenue, for example, we are going to have to manage that. If we have internal infrastructure, we are going to have to discuss with our clients whether it is performing correctly, whether we need more functionality, and so on. All of this gets aligned through SLOs. SLOs get discussed with the product team, with the development team, and with the SRE or DevOps team, so we have an agreement: this is the level of reliability that we want, that makes sense for the business, and that is reasonable to implement within some time frame or cost. The thing is, if everything goes
through SLOs, how are we going to create an SLO? That is what we are going to look at in the next minutes. First of all, one plug: if you want to play with SLOs, if you want a small test bed service (the one we are going to discuss on the next slide, our Hipster Shop) and you want to play with getting SLIs, defining SLOs, seeing how they evolve, creating some load, et cetera, you can use the Cloud Operations Sandbox. It is based on GCP, and you get to deploy a small distributed system, send it some synthetic load, and see, for example, how injecting faults affects the SLIs: what the reliability of the different pieces is. The system
that we are going to use as the running example is a distributed system that we call Hipster Shop. It is available on GitHub, and you can deploy it in many places. For example, from Cloud Shell in GCP you can just run all these services and have a small shop made of different services, written in different languages, interacting and sending RPCs to each other. There is even a database, so you can play with different classes of SLIs and with different classes of distributed systems and languages as well. So first
of all, how do we start creating an SLO? The first thing we need to think about is the CUJs, the critical user journeys. The critical user journeys are the interactions, the functionality, that our customers and users deeply care about: the interactions, APIs, or functionality that define the success of our product. We need to list them and order them by business impact. For example, in our shop there are three things that we want to provide to the user: we want users to be able to browse our catalog, to check out whatever they have selected in their cart, and to add products to their cart. If we order them by business impact, the most important thing for us is for people who already have a cart to check it out, so we can actually proceed with the sale. Add to cart is the second one, because we want people to be able to create carts that can be checked out later. And finally, browse products. This is a simple example list; in a different business the list will be different, or you might have different CUJs at the same priority, depending on how your product comes to be. So, a critical user journey.
I think one word that needs emphasis is "user". You need to think of the CUJs that you are defining from the point of view of your customers. A CUJ should not be something internal: for example, if you happen to run a Redis cache, you don't want a CUJ that involves the Redis cache explicitly, because that would leak your abstraction to your customers. Now, if you are an infrastructure provider within a company (for example, you are like me, running the authentication service for Google), you might have CUJs that involve infrastructure, in the sense that your users are going to be other products. For example, Gmail needs sign-in to work, for issuing credentials to access people's mailboxes. That is fine: in this case, your user is the product that is calling you as an infrastructure service, so your user would be Gmail. Gmail could say: as Gmail, I want to see user credentials being generated properly, so my product can continue to the mailboxes. That could be a variant of an infrastructure CUJ in this case.
Then, once we have the CUJs, we need to create indicators of the health or success of those CUJs. When we say we are going to look at the checkout service for our customers, what are the metrics that we can use to describe how successfully this service, this CUJ that we are providing to the user, is doing? We need metrics that are as simple as possible, but sufficiently rich that they capture exactly what the users are expecting us to provide them. There is a balance there that we need to strike.
So, SLIs. We have, first of all, different types of SLI depending on the services, platforms, or programs we are running. If it is a transactional service, a classic one that serves RPCs to other services (for example, an endpoint that people, even actual persons, can send transactions to), we have the classic availability, latency, and quality SLIs. We might say: we want so many requests to succeed in less than x milliseconds. That is the classic SLI. Then we have data processing, which is, for example, a pipeline that iterates through databases, or processing that is asynchronous. There you might have indicators about the freshness of the data you are producing. You might have indicators about coverage: for example, if you are summarizing data, you might want an indicator like "each run of our batch job covers 90%, 80%, 99% of our customers." And throughput: how many rows per second, for example, you are processing. If you are running storage, you might have throughput as well (how many queries, how many rows, how much data you are processing) and the latency you take to process queries. For example, if you run an infrastructure service that is a data lake, you might want to say: okay, we are able to process this many queries per second, and each query takes so many milliseconds to optimize and execute.
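To make the transactional SLI types concrete, here is a minimal sketch, assuming a hypothetical `Request` record and an illustrative 300 ms threshold (neither comes from the talk), that computes availability and latency indicators over a batch of requests:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code of the response (hypothetical record)
    latency_ms: float  # end-to-end serving latency in milliseconds

def availability_sli(requests):
    """Proportion of requests served successfully (non-5xx responses)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests, threshold_ms=300.0):
    """Proportion of requests served faster than the threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)
```

With ten requests of which one returned a 500 and two were slower than the threshold, the availability SLI is 0.9 and the latency SLI is 0.8.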
Then, once we have the type of SLI that we are going to use, we want to write the specification: going into detail for this particular service instance. For availability, for example, we can say the SLI is the proportion of valid requests, in the sense that we serve a 200, that are served within x latency. We might include latency and availability in the same SLI or not, depending on how we want to shape this description of our service. And then we need to implement it. Implementing means: given the service that we have and the components it has, how are we going to calculate the metric in a non-abstract way? Are we going to use events that our application logs into the APM, for example? Are we going to use logs, which come with slightly more latency but are more precise? Are we going to instrument our applications to export metrics directly, so a system like Prometheus can scrape them and do the calculation? Are we going to instrument our clients, or are we going to treat our front-end services as a proxy for that? The latter gives us less complexity, but we don't capture the latency that last-mile networking adds to the user experience. Those are the decisions we need to take in order to implement, concretely, the SLI that we want to measure from the users and show to the teams.
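As a minimal sketch of the "instrument the application" option, here is an in-process counter of responses by status code, using only the standard library; in a real deployment you would typically export such counters with a client library (for example `prometheus_client`) and let Prometheus scrape them:

```python
from collections import Counter

class ResponseCounter:
    """In-process cumulative counter of responses by status code.

    This is the raw data an availability SLI needs; a metrics endpoint
    would expose these counts for a scraper to read periodically.
    """
    def __init__(self):
        self.by_status = Counter()

    def record(self, status: int) -> None:
        """Call once per response served, with its HTTP status code."""
        self.by_status[status] += 1

    def snapshot(self) -> dict:
        """Cumulative counts so far, as a scraper would read them."""
        return dict(self.by_status)

# Hypothetical traffic: four successes and one server error.
counter = ResponseCounter()
for status in (200, 200, 200, 503, 200):
    counter.record(status)
```

The design choice here matches the talk's trade-off: counting at the front end is simple, but it cannot see latency added between the front end and the user.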
So in our case, for the checkout CUJ that we were discussing, we are going to focus on these two components of our application. We have many components, but this CUJ covers specifically our front end and the checkout backend service, which is the one that runs the business logic for the checkout and which will, in turn, call other things. For example, when you do a checkout, you typically want to call the payment service, and an email service to confirm to the user that the order was successful. The SLI that we are going to implement here is an availability SLI, since this is a transactional service. The way we are going to implement it is the classic proportion of valid checkout requests that are served successfully, where a successful request is one that returns a 200. We are going to actually implement it by instrumenting the front end: we are going to use this metric, the checkout service response counts, and we are going to exclude the 500s, the server errors, since those are not successful requests. This example uses the Istio service mesh, but wherever you propagate that metric is where you are going to make the calculation.
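Assuming the response counts are exposed as cumulative per-status-code counters (the usual shape of such metrics; the exact Istio metric layout is not shown in the talk), the availability SLI over a window can be computed from two snapshots of those counters:

```python
def availability_from_counters(start: dict, end: dict) -> float:
    """Availability over a window, from two snapshots of cumulative
    per-status-code response counters taken at the start and end
    of the window. 5xx responses count as failures."""
    delta = {code: end[code] - start.get(code, 0) for code in end}
    total = sum(delta.values())
    errors = sum(n for code, n in delta.items() if 500 <= code < 600)
    return (total - errors) / total
```

For example, if the window saw 990 additional 200s and 10 additional 500s, the SLI for that window is 0.99.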
Then, once we have the SLI, comes the SLO, and this is the hard part. Calculating and implementing an SLI just produces a descriptive metric of a system, and your developers and your SRE team will have expertise in that, so they can agree: this is the indicator. But how high do we want this metric to go? In this example we have the classic three nines: 99.9% of the checkout requests should be successful for this SLI. 99.9% is going to be the target for the SLI that we defined before. If we are over, we are good; if we are under, we might even have some contractual SLA obligations to fulfill. I say this is the hard part because it involves cost. In the end, with an SLO, think that every nine you add cuts your error budget by ten. With this three-nines SLO, we have 0.1% of our requests as a budget for failure: if we fail that many requests, we are still good. So we can use this margin to say, for example, we want to do our rollouts faster, or we want to take some risks on schema changes, or whatever the team is prioritizing. If we add one more nine, that sounds great, because we go to four nines, but our budget becomes one tenth of that. Therefore the complexity of the operations multiplies by ten-ish, and the cost of maintaining and operating the system becomes ten times more expensive. So we have to be very careful to have SLOs that are achievable and that are representative of the users' expectations towards the system.
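The error budget arithmetic is worth writing down. A small sketch, with a hypothetical monthly request volume, showing that each extra nine divides the budget by ten:

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of requests allowed to fail while still meeting the SLO."""
    return (1.0 - slo) * total_requests

monthly_requests = 1_000_000  # hypothetical traffic, not from the talk

three_nines = error_budget(0.999, monthly_requests)   # ~1,000 failures allowed
four_nines = error_budget(0.9999, monthly_requests)   # ~100: one tenth the budget
```

That factor of ten in allowed failures is why each additional nine roughly multiplies operational cost by ten.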
Typically, one thing that has worked well for me in the past has been to put product, developers, and SRE in the same room and ask: okay, what SLO should we set? The business and product side will come up with "well, we need the highest possible," because that sounds great. And then you specify the cost: sure, we can do five or six nines, but it is going to cost you this much headcount, this much development time, this much complexity in the deployment of the code, whatever it is. And then things come to a balance: okay, we can accept this cost for developing and operating the system towards this SLO, which is something within the users' expectations.
So, summarizing the process to create appropriate SLOs. First of all, we need to list the user journeys and order them by business impact. It is very important that our product teams are involved here, because they are the ones who know very well what the business impact is and what the expectations of the users are. At this point, you can also get some indication of the criticality of different things: the user journeys at the top of the list are probably going to receive higher SLOs, because they have more impact on the business, while the user journeys further down the list are probably less relevant, accessories, et cetera, so you might want to allow more headroom in their SLOs. Second, you need to determine the indicators that describe the success of these CUJs. Depending on the CUJ you are considering, or the component involved in the CUJ, the type of the SLI is going to be different, and the implementation might be different as well. Then you need to go back to product and development and ask: okay, what targets do we want for these SLIs? What are the objectives that we want to meet? Complexity and cost are going to be an important component of that discussion as well.
You are also going to need to define a measurement period: you might want sliding windows, or to consider only the calendar month, depending on the characteristics of your services. Then you can implement everything: write the code to export the metrics, do the calculation, have a batch pipeline that processes the logs, whatever it is you do. And finally, you are going to have to deploy some alerts. The nice thing about the alerts is that they become pretty simple to implement, because you have an SLO and an SLI: you trigger an alert whenever the SLI is under the SLO for some period, like, I don't know, one hour, fifteen minutes, five minutes. The higher the SLO, usually the smaller the window.
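A sketch of that alert rule, assuming the SLI is sampled once per scrape interval and the window is represented as a list of recent samples (real deployments often use more sophisticated multi-window burn-rate alerts, but the principle is the same):

```python
def should_page(recent_slis, slo: float) -> bool:
    """Fire when the SLI has stayed under the SLO for the whole window.

    `recent_slis` holds one SLI measurement per scrape interval within
    the alerting window; an empty window never fires.
    """
    return len(recent_slis) > 0 and all(sli < slo for sli in recent_slis)
```

With a 99.9% SLO, a window of measurements all below 0.999 pages; a single good measurement within the window does not.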
If you want to know more about SRE, about the practice, about how to implement these things in your company, you have these books. We now have a family of four. The first book is the one that defines the general principles of SRE. The second one, the workbook, is really focused on how to implement the first book in existing organizations. And the other two are more specific: if you want to talk about security and reliability together, there is the third book for you. And the last one is a version of the workbook specifically tailored to large organizations, for example large enterprises, covering how you can steer the culture within the company to bring SRE in. Thank you for watching, thank you for listening. I would be happy to answer any questions you have, either in the chat or on Twitter or any other social network that you use. See you around.