Transcript
This transcript was autogenerated. To make changes, submit a PR.
Okay, let's start with the SLO implementation. I will start from the problems. The problems stated on my slide have been filtered down, and I think these three are common problems for a tech company. The first one: a third party or your business partner tells you that a service you depend on will be down "for a while." This is a problem because we don't know what "for a while" means. Is it 1 minute, is it 2 minutes, or is it two years? We have no standard for how long our services are allowed to receive errors from that third party. This kind of problem is the foundational reason why you should implement SLOs.
The second problem: your product team had a debate about whether a newly deployed service was stable or not. The product team and the operations team work very differently. The product team should be agile and dynamic, and it has to tolerate change, because in order to improve the product, it has to change. The operations team, on the other hand, wants stability; its job is to operate the software safely. So this is also a problem: the product team wants to push changes, new products, and new services very often, but the operations team wants to keep everything stable, running in a steady state. This second problem is also part of the foundation for implementing SLOs.
The third problem: there is confusion about the definition of "down" between your operations team and your product or engineering team. Like I said before, the product or engineering team wants to deploy fast, very often, and very dynamically, while the operations team wants to operate the software safely, carefully, and stably to maintain its reliability. So the SLO comes in to help you. So, what is an SLO?
An SLO is a target value or range of values for a service level that is measured by an SLI. By the way, I'm quoting this definition from the book called Site Reliability Engineering. The book is from Google, and I think it's a great book to start your site reliability engineering journey, so you should read it. In this definition we find another buzzword: the SLI. We cannot understand the SLO completely before we learn about the SLI, so you should start there. What is an SLI? An SLI is a carefully defined quantitative measure of some aspect of the level of service that is provided. That's a lot of buzzwords, but in my own words, it is the metric that you want to observe. The metric can be anything: your service latency, the non-error responses of your services, the saturation of your instances or servers. The metric can be anything, but a specific SLI is the metric you choose to observe. I will explain the characteristics of an SLI on the next slide. Now that you know the definition of an SLI, let's go back to the SLO definition. In my own words, the SLO is the target value or range of values for the metric that you want to observe.
For example, your SLI could be the latency of your service. Say the moving average of your latency is below 200 milliseconds; that is your SLI. Your SLO is a target for that SLI, so let's say my target average latency is below 100 milliseconds. Okay, in that case you are breaching your SLO by default, but we will not talk about that now. You can see the difference, right? The SLI is the metric that you want to observe, and the SLO is the target that your SLI should achieve. That is the definition of an SLO.
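To make the difference concrete, here is a minimal sketch in Python; the latency samples and names are hypothetical, just to show the SLI as the observed metric and the SLO as its target.

```python
# Minimal sketch: the SLI is the metric we observe (average latency),
# the SLO is the target we set for it (here, below 100 ms).
# The sample data and names are hypothetical.
latency_samples_ms = [85, 120, 95, 210, 78, 99, 150, 88]

def average_latency_sli(samples: list[float]) -> float:
    """SLI: the observed average latency in milliseconds."""
    return sum(samples) / len(samples)

SLO_TARGET_MS = 100  # SLO: target average latency

sli_value = average_latency_sli(latency_samples_ms)
print(f"SLI (average latency): {sli_value:.1f} ms")
print("SLO met" if sli_value < SLO_TARGET_MS else "SLO breached")
```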
Okay, but I think I know what you're thinking now. To start implementing SLOs, you may feel like: "I don't know which metric I should observe." There are a ton of metrics you can extract from your services, your products, or your infrastructure. The thing is, if you look at the Google SRE workbook and you still have no idea which metric to collect, you can follow the four golden signals. First, you can start collecting the latency of your services or systems. Second, you can collect your traffic metrics. Third, the errors metric. I've stated on my slide what an error means: the rate of requests that fail, either explicitly (like an HTTP 5xx) or implicitly (for example, an HTTP 200 success response coupled with the wrong content). In some applications, any unexpected result is also categorized as an error. That is the definition of errors, and you can start collecting the error metric from your services, your products, or your infrastructure as well. And the fourth one is saturation, which is a measure of how "full" your service is. So, going back to your worry: I know you may be confused about which metric to observe if you have no idea where to start. You can follow the four golden signals, as stated in the Google SRE workbook; these are the four metrics you should at least retrieve from your services or your infrastructure.
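As a small illustration of the error definition above, here is a minimal sketch; the response fields and the wrong-content check are hypothetical and would depend on your own application.

```python
# Minimal sketch of the "errors" golden signal as described above:
# a response counts as an error either explicitly (HTTP 5xx) or
# implicitly (HTTP 200 but with the wrong content).
def is_error(status_code: int, body: dict) -> bool:
    if 500 <= status_code <= 599:                           # explicit error
        return True
    if status_code == 200 and body.get("result") is None:   # implicit error: wrong content
        return True
    return False

responses = [(200, {"result": "ok"}), (503, {}), (200, {"result": None})]
error_rate = sum(is_error(code, body) for code, body in responses) / len(responses)
print(f"error rate: {error_rate:.0%}")  # 67%
```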
Okay, so you can start collecting. If you have no service in place to collect metrics yet, you can start by deploying one yourself, like Prometheus or a service mesh; you can also use a service mesh, if you have a Kubernetes cluster, to retrieve these kinds of metrics. There are a lot of ways to retrieve these metrics. The point is: if you have no idea which metrics to retrieve, the four golden signals can help you define which metrics should be in your SLI.
Okay, now you know what should be measured, and then what? The next step is to create your SLI. After you know which metric you want to build your SLI on and observe, you start creating the SLI. There are two steps to building an SLI. First, you write the SLI specification. The SLI specification contains the definition of what you want to observe, and you can detail it like my previous example: my SLI specification is that I want to measure the average latency within a month for my service A, let's say. That is one kind of SLI specification. The second point about SLI specifications is that they usually take the form of a percentile or a percentage between some events and the total events: the portion of the target events that you want to observe, divided by all the events that occurred. So let's say you want to measure requests with latency below 200 milliseconds within a month. Your SLI would be the number of requests with latency below 200 milliseconds divided by the total requests. The SLI is usually a percentile or a percentage; you can picture it that way.
After you build your SLI specification, you can go through the second step: the SLI implementation. You know what your SLI definition is, but now you have to think again about where and how you can get the metrics. If you're using Prometheus, you'll use PromQL, and you'll need to learn how to query and aggregate the data to fulfill your SLI specification. The generic formula for an SLI implementation is: good (or target) events divided by total events, times 100%. Like I said before, it's usually a kind of percentile or percentage. So once again, the SLI is good events divided by total events times 100%. Using my previous example: the sum of all requests with latency below 200 milliseconds, divided by the total requests, times 100%. That is my SLI implementation example; you can build yours from here.
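In formula form, the generic SLI just described, and the latency example, look like this:

\[
\mathrm{SLI} = \frac{\text{good (target) events}}{\text{total events}} \times 100\%,
\qquad
\mathrm{SLI}_{\text{latency}} = \frac{\#\{\text{requests with latency} < 200\,\text{ms}\}}{\#\{\text{all requests}\}} \times 100\%.
\]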
Here is another SLI example. In my SLI specification, I state that I want to measure the HTTP responses that return a non-error response, and I also define what a non-error response is: a 2xx or 3xx response to the client. For my implementation, I will count the HTTP responses that fulfill my SLI specification, as well as the total responses returned to clients. I can query this from my API gateway metrics, from a service mesh, or from Cloudflare or some other cloud proxy service that provides counts of HTTP responses. So my SLI is the sum of 2xx responses plus the sum of 3xx responses, divided by the total requests within a time window, times 100%. That gives me the percentage of my target events. That is my SLI.
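Here is a minimal sketch of that implementation, assuming you have already pulled response counts for one time window from your API gateway, service mesh, or proxy; the function and the counts are illustrative, not a real API.

```python
# Minimal sketch of the availability SLI described above:
# (2xx + 3xx responses) / total responses * 100%, for one time window.
def availability_sli(count_2xx: int, count_3xx: int, count_total: int) -> float:
    """Percentage of responses that were non-errors (2xx or 3xx)."""
    if count_total == 0:
        return 100.0  # no traffic in this window: treat it as fully good
    return (count_2xx + count_3xx) / count_total * 100.0

print(availability_sli(count_2xx=9_500, count_3xx=300, count_total=10_000))  # 98.0
```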
So this is the example. First things first: find your metrics, then build the SLI specification, then try to build the SLI implementation. And after you have an SLI, you should choose your time windows.
The first window to choose is the evaluation or aggregation time window. Imagine you have a normal web server serving HTTP requests. Client requests don't arrive at a constant rate: it's not one request exactly every second; sometimes it's one per second, sometimes one per millisecond. There is no constant interval; clients can send requests at any time. But for your SLI you have to choose an aggregation time: over what interval do you want to evaluate it? Per 1 minute of requests, per 5 minutes, or per 10 minutes? So you choose the aggregation time window for your metrics.
The second window to choose is your SLO time window. Imagine you have a web server that has been running for ten years; you don't want to measure the SLO going back ten years, that wouldn't make sense. So you choose your SLO time window: is it per week, per month, or per year? That is the window over which you evaluate whether your target is reached or breached. You should define that SLO time window as well.
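As a small worked example of how the two windows relate, using the 5-minute aggregation and the 30-day SLO window that appear later in this talk, each SLO evaluation covers:

\[
\frac{30\ \text{days}}{5\ \text{minutes}} = \frac{30 \times 24 \times 60\ \text{min}}{5\ \text{min}} = 8640\ \text{aggregated data points}.
\]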
After you've chosen your time windows and you have an SLI, you should set the boundaries. These are the steps to set your SLO boundaries. First things first, try to visualize it: retrieve your SLI metrics and plot them over your selected time window. After you visualize it, find the average movement of your SLI; say the average movement is somewhere between 90% and 95%. I will show you the example on a later slide, but bear with me. These are the steps to set the first boundary of your SLO from your SLI.
And the third step: put the boundary below the minimum point or the average movement. There is another point you should pay attention to for your SLO. First of all, if you put the boundary above your average movement, then by default your system is already breaching the SLO. You should take note of that: if the average movement of your metric, of your SLI, is, say, between 90% and 95% success, but you set your first SLO boundary at 99%, then by default your system can't achieve it; your SLI is breaching your SLO from the very first moment, and that is not good. So find the average movement of your SLI (it's easier if you visualize it first) and put the boundary below that average movement. The second point is that the boundary, your first SLO target for your SLI, usually starts from 90%. That is not a hard rule, but the standard is usually to start from 90%. After that, you can improve your SLI by tuning your services, and once the average movement of your SLI has increased, you can start incrementing the target to 95%, or whatever number fits you; 90% is just the general starting point. And the third point is that you can increase your SLO once your normal circumstances have also gone up. Again: if you put your first boundary above your moving average, then I believe your SLI will be breaching your SLO, your boundary, from the start.
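Here is a minimal sketch of that heuristic, assuming you already have a series of aggregated SLI values; the data and the rounding step are illustrative, not a prescription.

```python
# Minimal sketch: pick a first SLO target below the observed average
# movement of the SLI, so the system is not breaching on day one.
# The SLI history is hypothetical.
sli_history = [0.97, 0.93, 0.95, 0.91, 0.96, 0.94, 0.92]  # aggregated SLI values as fractions

average_movement = sum(sli_history) / len(sli_history)   # ~0.94
minimum_point = min(sli_history)                         # 0.91

# Put the boundary below the average movement (and the minimum point),
# then round down to a conventional step such as 90%.
first_slo = int(minimum_point * 100 // 5) * 5 / 100

print(f"average movement: {average_movement:.2%}, minimum point: {minimum_point:.2%}")
print(f"suggested first SLO: {first_slo:.0%}")
```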
But let me go through the example part for a better understanding. Let's see the example. I have an SLI specification: HTTP responses that return a non-error response, for every 5 minutes. So I will break down my SLI implementation, I mean my SLI specification: a 5-minute metric aggregation time, which is one dot on the graph, and 30 days as my SLO time window. That is my chosen set of time windows. Then, going back to the set-boundaries step, we visualize it so we can see the graph itself. If we look at the average movement of my metric, sometimes it's 100% in a given 5-minute window, sometimes it's below 85%, and it even drops to 75%. So, based on the average movement, I can set the boundary at 95%; you could also calculate the median of your metric and set the first boundary on the median, which is a better approach, but for this simple estimate you can just read the average movement off your graph. So I will set my first boundary, my SLO, to 95%. For a better understanding of this graph: every single metric dot on this graph is a 5-minute aggregation of HTTP non-error responses divided by total requests, averaged over that 5-minute interval. That is what one dot on this graph means. Keep that in mind.
So how do we know that our service, or our SLI in this case, has breached the SLO? If the number of metric points below the SLO is greater than (1 minus the SLO) times (the SLO time window divided by the aggregation time window), then we have breached the SLO. To understand this better, let's go back to our example: we have a 5-minute metric aggregation time (remember, that is one dot on the graph), a 30-day SLO time window, and a 95% SLO.
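Written out, the rule just stated is:

\[
\#\{\text{aggregation points below the SLO}\} \;>\; (1 - \mathrm{SLO}) \times \frac{\text{SLO time window}}{\text{aggregation time window}} \;\Longrightarrow\; \text{SLO breached}.
\]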
So we are allowed error responses for (1 - 95%) times 30 days. What do we mean by this? Our target SLO simply isn't achieved if non-error responses drop below 95%. So we allow errors to happen for just 5% of the 30 days. You can picture it: because our target is returning non-error responses to the client 95% of the time within the 30 days, we allow errors for 1 - 95%, which is 5%, of the 30 days. So we tolerate errors within 5% of 30 days, which is 36 hours. And because we have a 5-minute aggregation window, we have 36 hours divided by 5 minutes of allowed under-95% one-dot metrics, which is 432.
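Plugging the numbers in:

\[
(1 - 0.95) \times 30\ \text{days} = 1.5\ \text{days} = 36\ \text{hours},
\qquad
\frac{36\ \text{hours}}{5\ \text{minutes}} = \frac{2160\ \text{min}}{5\ \text{min}} = 432\ \text{allowed bad points}.
\]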
You got it, right? So look at the 5-minute metrics, the single dots on this visualized graph, that fall below 95%, below our target. If we count them manually (one, two, three, four, five, and so on; it's an estimate), and the sum of all the dots below our SLO turns out to be greater than 432 within the 30 days, then you have breached your 30-day SLO. That should make sense to you, right?
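Here is a minimal sketch of that counting check, assuming you already have the 5-minute aggregated SLI values for the 30-day window; the series here is hypothetical and would normally come from your monitoring system.

```python
# Minimal sketch of the breach check described above: count the 5-minute
# points that fall below the SLO and compare against the allowed number.
SLO = 0.95
SLO_WINDOW_MIN = 30 * 24 * 60   # 30-day SLO window, in minutes
AGGREGATION_MIN = 5             # 5-minute aggregation window

allowed_bad_points = (1 - SLO) * (SLO_WINDOW_MIN / AGGREGATION_MIN)  # 432.0

sli_points = [1.0, 0.98, 0.93, 1.0, 0.84, 0.99, 0.75, 1.0]  # hypothetical 5-minute SLI values
bad_points = sum(1 for p in sli_points if p < SLO)

print(f"allowed bad points: {allowed_bad_points:.0f}")      # 432
print("SLO breached" if bad_points > allowed_bad_points else "SLO not breached")
```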
But this problem can be understood better when we introduce the error budget concept. The error budget is actually one minus the SLO, but it gives you a greater capability to detect an SLO breach through the concepts of error budget and error rate. I will explain that later. Now, a quick recap of how to build your first SLO.
First things first, you should build your SLI specification. If you're still not sure about what should be measured, look at the four golden signals. Once you have an SLI specification, go on to build the SLI implementation as well: start thinking about where and how you can get those metrics and how to formulate them, so that your implementation matches your SLI specification. After you have an SLI implementation, visualize your metrics, your SLI. The next step is to set the time windows: set your metrics aggregation time window, the interval you want to observe, and also choose your SLO time window, the period over which you will evaluate whether the SLO is breached or not. After that, look at the average movement of your SLI; once you've visualized it, you can see that average movement. The next step is to set the boundary below the average movement of your SLI. I've already given you the reason why it should be below the average movement of your SLI, right? That boundary is actually your first SLO. Congratulations, you now have an SLO.
Now, if we go back to our problems, they should be solved after we implement the SLO, right? Take the first case: the third party or your business partner says their service will be down "for a while." It shouldn't be "for a while" anymore, because you know you have an SLO and you cannot return errors for more than x minutes. So you can talk to your partner: we have an SLO, so you cannot go into maintenance or downtime for more than, say, 5 minutes. As for the second problem: it is okay to have errors; we tolerate errors as long as the SLI metric stays above the SLO. And the third problem: it's also okay to return an unexpected result. When the product or engineering team reports that their service is intermittently returning errors or unexpected results, the operations team can say that it's okay as long as the SLI is still above the SLO. The operations team, the product team, and the engineering team now share the same definition of what "down" means: as long as we're not breaching our SLO, it's okay. The same goes for the second case: whenever the product team wants to deploy very fast and very agile, and the operations team is too scared to be that agile, the two teams can bargain: our SLI is still above the SLO, so let's do some experimental stuff, because errors are okay as long as the SLI metric stays above the SLO.
For the next topics, I will talk about error budgets and the error budget policy; these concepts will help you understand better when the SLO is breached or not. Next, I will talk about how to calculate the SLO for integrated services: say service A has an SLO and communicates with service B, which also has an SLO; how do we calculate the SLO across all of those integrated services? Next, I will talk about alerting on SLOs. And the last one is the interesting part: why canary releases can improve your SLOs while maintaining agility. By the way, for number one, number two, and number three, I've already written articles, so visit my Medium and follow so you can read them. For the fourth one, I'm still working on the article and doing some math on it. I think that's all for me. Thanks. See you.