Transcript
Hi, my name is Spoons and thanks for joining. I'm here to talk about driving
service ownership with distributed tracing.
Before I get started, just a little bit about myself. So my name is Spoons.
Although my full name is Daniel Spoonhauer, no one calls me that. I'm CTO
and a co-founder at Lightstep, where we provide simple observability for
deep systems based on distributed tracing. And I spend a lot of time at Lightstep
thinking and working on both service ownership and, not surprisingly,
distributed tracing. Before I helped found Lightstep, I was
a software engineer at Google. I worked both on Google's internal infrastructure team
and as part of Google's cloud platform team. I worked closely
with SRE in both cases to build processes and roll out tools to improve reliability and reduce the amount of work for teams. Subsequently, at Lightstep,
I carried a pager for many years, although they don't let me do that anymore.
So great. So before I get too deep, just to kind of set some context,
I want to talk about this question. What changed? I think this is a really
important question generally for SRE, but I want to talk
about what changed in the kinds of technologies that we use,
the kinds of architectures that we build, and the ways that we work together as well. Many engineering organizations are adopting services or other kinds of similar systems, often alongside DevOps practices. But what does that mean for how we work
together? So there's been a series of technical
changes as we move from bare metal to virtual machines, to containers,
and finally to orchestration tools like kubernetes.
And each of these has provided additional abstractions, right?
They've raised the level of the primitives
that we can work with. They've also introduced additional complexity, and more ways for those systems to fail. So there's a trade-off there. And of course, even though we have those abstractions,
someone still needs to understand what's happening underneath the hood. Partly enabled by these abstractions, we've also changed the way that we build these systems, right? So we've
moved to microservices, maybe we're using serverless.
These are all ways of building more loosely coupled
applications where different pieces of it can be deployed independently, can be scaled independently.
And speaking of independently, we're also adopting methodologies like agile and DevOps.
I'll talk a bit more about DevOps, but I really think of DevOps as a
way to allow teams to work more autonomously, more independently,
and to boost developer velocity. And while
that's good in a lot of ways, that's also created some challenges. So I
like to put this kind of in the form of this feedback loop, right?
So if we look at the bigger picture in any software system, and really
this goes for a whole bunch of other systems as well, we have a kind
of feedback loop, and on one half of this loop we've got control. Control is the systems, the levers, the tools that we use to effect change, and in a software system today, that's things like Kubernetes and associated tools. It could be things like service mesh, configuration management. These are all ways of effecting change.
But there's another half to this feedback loop as well, and that's how we
observe those changes. Observing is another
set of tools we can use to understand what's happening. And it's really an important
part of that feedback loop because it tells us what we should do next.
Right. And I think maybe there's some good reasons
for this, but the amount of investment
that we've seen in the tools on the top half of this loop, I think
has really outpaced the investment, and really the innovation as well, on the observability side of things. And I know we've had tools
that allow us to observe our systems for a long time. Maybe we didn't call
them observability tools, but I think with the shift to more loosely
coupled services, with the shifts that we've made to our organization,
it really requires a new way of thinking about how
we observe these systems. And really,
if we haven't adapted those systems, if we haven't innovated on those observability
systems, it's like we've built this amazing car that goes super fast.
We've got a great gas pedal, but we don't have a speedometer. So we have
no way of knowing how fast we're going. And this has consequences at all kinds
of different scales, both at the small scale, when we think about how we're going to do auto scaling and what metrics we're going to use to inform that, to how we decide when to peel off functionality into a new service, to thinking about the application architecture as a whole.
And often the idea here is that
we have given control and we've built out these control systems to allow
teams to move independently. But what's happened is
that we've lost the ability to understand performance or reliability of the system
as a whole. That kind of brings me to the crux
of what we've done here. Kubernetes and other control systems
like that have given each team more power, right,
more decision making, more control. But that control is now distributed,
right? And DevOps likewise means that each team has more control of their own service
when they push code. But each of those teams depends on a lot of other
teams, right? So say you're the team that's responsible for the service at the top of this diagram. You're beholden to end users, say, to deliver a certain amount of reliability and performance,
and you have control over that service, right? You decide when to roll out new
code, you decide when to roll it back as well. But you depend
on all of these services below you. And ultimately you're also responsible
for the performance of those services, right. If those services are slow,
they're part of the functionality that you provide.
Your service is going to be slow as well. But even though you have responsibility
for that performance, you don't have control. Right? You don't have any way of rolling
back those services other than reaching out to those teams and asking them to do
it. And it's not just other services. The things lower on the diagram here could be infrastructure, they could be managed services, things outside of your organization as well. But this gap that we've created,
for better or worse, and I think for worse, between control and
responsibility, it's really the textbook definition of stress.
And so kind of what I want to talk about today is how can we
use service ownership to lower that stress and
to fill that gap between control and responsibility?
Okay, how are we going to do that? Well, I'm going to talk a
bunch of specifics, but at a very, very high level. To me,
ownership really has two parts, and the two parts are, I think,
both really important. The first one is accountability, and maybe that's not
super surprising. Of course, if we give people ownership, we have to hold them accountable for it. But another really important part of that is giving folks agency,
right, the ability, the means to make things better.
And I'm going to touch on both of these a number of times throughout the
talk. At the same time, when we're talking about a loosely coupled
system, we really need to think about the way that we're observing
that system. And distributed tracing, as you may have guessed,
is going to play a really big role in how we
do that. Okay, so let me dive in and talk a bit about distributed tracing,
and I'll come back and then we'll see how we can build better service ownership
just to get everyone on the same page. Distributed tracing
was sort of built and popularized by a bunch of
larger Internet companies. Web search and
social media companies as a way for them to understand their systems
as they built out a microservice like architecture, even though
we didn't call it that at the time. But just to give you a picture
of what distributed tracing might look like, this is a trace.
So almost any distributed tracing tool will show you something like this,
or like one of these traces. And you can think of this kind of like a Gantt chart, right? So time is moving from left to right, and each of these bars represents the work done by some service as part of handling an end-user request. And then as you go from top to bottom, you're kind of going down the stack,
so you can see where time is being spent in servicing those requests.
As I'll explain a bit more in a minute, though, a trace is really just
a building block of what we can do with distributed tracing.
And so I'll speak a bit more about that, but I want to kind
of dive into what the building block is exactly.
So again, we have these bars that represent the work that's being done.
And you can think of the arrows that are going sort of down the
stack, right? These are the calls that are being made when one
service has delegated responsibility for the request to a service below it.
And likewise the arrows that are coming back up are when those things return.
So what the trace is really doing is encoding the causal relationships between callers and callees, right? It lets us know who's responsible for servicing requests at a given moment in time.
And therefore we can understand where to attribute failures.
We can understand where to attribute things like slowness.
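To make that building block concrete, here's a minimal sketch in Python of what a span might look like. This isn't any particular tracing library's API, just an illustration of how the parent links encode those caller/callee relationships.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One bar in the trace: work done by one service for part of a request."""
    trace_id: str             # shared by every span in the same end-user request
    span_id: str
    parent_id: Optional[str]  # the caller's span; None for the root span
    service: str
    operation: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

# A tiny example trace: service a calls b, and b calls c.
trace = [
    Span("t1", "s1", None, "a", "GET /checkout", 0.0, 120.0),
    Span("t1", "s2", "s1", "b", "compute-price", 10.0, 110.0),
    Span("t1", "s3", "s2", "c", "lookup-sku", 20.0, 95.0),
]

# The parent links are the arrows going down the stack: they tell us who
# delegated work to whom, and therefore where to attribute time or errors.
for span in trace:
    caller = next((s.service for s in trace if s.span_id == span.parent_id), "the end user")
    print(f"{span.service} spent {span.duration_ms:.0f}ms handling a call from {caller}")
```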
I'm not going to say too much more about tracing here.
If you want to understand more,
Google put out a paper a while back, about ten years ago, on Dapper, Google's internal system. It gives a lot of details about what was important at Google in kind of rolling out the system and some of those use cases, especially about the mechanics of how they collected and managed those traces.
And even though that's a bit old, I think there's a great kind of perspective
on the history of it. I also want to say I wrote
the book on distributed tracing, so if there's more you want to learn about it,
I strongly recommend this book. I don't get any
royalties from it. We donated all the royalties to a good cause,
but the book can obviously cover a lot more than I can cover today.
From the very basics of distributed tracing, to how to think about costs
and implementation, to getting value from tracing and even what
tracing might offer in the future. Okay, I said traces are just
the building block, though. What do I mean? Well, traces are really the raw material,
right? So distributed traces, you can think of them as structs, right? It's a data type, it's a collection of data, but they're not the finished product, right? And distributed tracing is really that process, that art and science of deriving value from those traces.
So just to give an example of that at Google,
and you can read about this in the Dapper paper, some of the most valuable uses of this tracing data were not looking at individual traces, but looking at aggregates. So there was at the time, and maybe there still is, a weekly MapReduce that was run that looked at the entire collection of traces for web search requests over the past week, and it detailed how each team and their service had contributed
to performance for web search. So whether that
service was searching web documents or news or images or video or whatever,
that report would essentially say what percent of the latency was
due to each of those services. And then that can be used to then prioritize
work done by those teams. And in fact, when it wasn't used,
we actually had teams that would go off and spend a month or more of time on optimization, which might have improved performance
for their individual service, but had no effect on the overall performance as
observed by users. So again,
distributed tracing, thinking about what the value is, and that's really what I
want to talk about in the context of service ownership today. So service
ownership, let's talk about what I think that means and
some of the benefits and maybe the risks associated with moving forward
and pushing that. So just
to define it, service ownership is really making teams
responsible for the delivery of their software and their services.
Right. And just to be concrete, it can include responsibilities like
incident response. I think that's what a lot of people think of, but it
also probably involves paying for the infrastructure that those services are
using, for the storage that they might be using. And of course, someone needs
to fix bugs. And I think service ownership is an important part of figuring
out how to triage and allocate bugs. Service ownership,
I think, comes up a lot in the context of DevOps. So I kind of
wanted to spend a minute just to compare those two.
And there's obviously a lot of overlap and
a tight relationship. But DevOps also means a lot of different things to different people.
So DevOps can mean an engineering culture, a culture
that's really based on cooperation between the
people that are developing and operating software. In some cases they
might be the same people, but not always. DevOps can also mean a set of
tools. So it might mean the latest CI/CD tool, it might mean some of the other infrastructure that you're using in order to provide a platform for the rest of your organization. I really like
to think a little bit higher. I mean those are great definitions as well,
but I like to think of DevOps as I mentioned, as a feedback loop between
developers and their users, right. And creating a
tight feedback loop so that those developers see immediately what the effects of the code that they're writing and deploying are, and they're able to take that and use it in terms of how they continue to do product definition and software development. So in a lot of ways I think
DevOps can be a bit broader and apply more to a larger
set of processes. And I think service ownership is really just thinking about, for a team that both develops and operates, what set of responsibilities they have in order to make sure that they're doing that reliably and according to what their customers expect.
Okay, so what are some of the good things about service ownership?
Well, one thing is by giving teams ownership over their
services, you allow them to be more independent and hopefully that'll raise the developer
velocity for your organization. They'll be able to coordinate
less and focus more on the functionality that they're providing.
It also is a way for organizations to hold engineering
teams accountable and to tie their performance to real business metrics.
I'll talk a bit more about this in a few minutes. But for application
developers, this is really about what their customers are
adopting and perceiving and how they're using the product. For platform engineers,
it's really thinking about the organization as a whole and how other
application developers, teams within the organization are leveraging tools and
infrastructure and delivering on those promises that they're making to their customers.
Now, there are obviously risks that come along with this as well. If you make teams more independent, you allow them to make independent choices, they might make different choices, right? And you can have divergence there: you have more frameworks, more tools, and that can have some downsides, right? So one of those downsides is that you have higher
vendor costs. You're not getting the same economy of scale that you might get
if you're just using one tool consistently throughout your organization.
It also might mean that there's more training not only for new team members,
but as developers transfer between teams, they need to learn a new set of tools, a new set of processes. And thinking about it from the point of view of the organization as a whole, or
maybe from a platform engineering team, it's harder to get a sense of the big
picture of your application. So if
we think about how to kind of balance these benefits and
risks and think about those trade offs on both sides,
they come from this idea of this independence, right? That we are allowing teams to
make their own choices. So I think the way that we're going to manage these
trade offs is by allowing for that independence, but at the same
time defining clear responsibilities and goals for those teams.
We allow them to make choices, but we give them some guardrails about how they
can make those choices. And then at the same time it's about
ensuring consistency, right? So maybe there are some kinds of tools that we
allow teams to make independent choices about, but maybe not for other tools.
And when we talk about measuring the results of their work, we want
to make sure that we're doing that consistently and we're measuring progress towards those
goals and we're holding those teams accountable. Okay,
so thinking about service ownership in this context now: how do we allow for this independence? How do we provide for this consistency? And like I said, this is going to come back to accountability and agency, like I mentioned at the beginning. Okay, how do we drive towards service
ownership? I kind of have three pieces to this puzzle that I
want to talk about, each in turn. Those are documentation, on-call (not surprising), and then service level objectives.
Okay, to start with documentation,
the first step is really creating consistent and centralized documentation
specifically around services in your application. As we grew
at Lightstep, and I'm sure as a lot of your organizations have, we looked to define responsibilities for services as the teams got larger, and we split teams as the number of services grew.
But before you can split those responsibilities, I think you need to know who the
experts are. I mean, knowing that is valuable in itself,
but those experts are really going to serve as the seeds for the clusters that will take over ownership of the services.
That documentation is also a way to share that expertise, right? And a way for others to find related information, whether that comes in the form of finding telemetry and dashboards, or finding alert definitions, or, when an alert fires, finding the playbook that helps you handle it. One of the things that we found useful was to use a template for this kind of documentation, right? So you know that you have some consistency, and an engineer or developer knows that when they go to it, they'll be able to find links to, say, a dashboard or to logs or to traces that help them understand what's happening.
The other thing about a template that I think is really interesting is that it allows
you to run reports over that and extract information from that documentation,
right? So now we can ask questions like how is expertise divided,
right? Are there certain people that are listed as experts for
more of the services? How does that change over time? And if we've written the
documentation in a totally ad hoc and unstructured way,
it's a really manual process to discover that. But if we've built a kind of
template, a form, we can extract that information much more easily.
The other thing is if we put all the documentation in one place, it's really
easy to audit how often it changes, right? And you can require
periodic updates. You can ask: when was the last time the documentation for service X was updated? Okay, that was too long ago. Someone on that team
is going to need to be responsible for updating it.
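Just to illustrate, here's a sketch of what that might look like, with a made-up template and made-up names rather than any particular tool: once each service's documentation follows the same structure, questions like how expertise is divided or which pages are stale become a few lines of code.

```python
from datetime import date

# A hypothetical, consistent template for per-service documentation.
SERVICES = {
    "checkout": {
        "owner_team": "payments",
        "experts": ["alice", "bob"],
        "dashboard": "https://dashboards.example.com/checkout",
        "playbook": "https://wiki.example.com/checkout-playbook",
        "last_updated": date(2020, 3, 1),
    },
    "inventory": {
        "owner_team": "fulfillment",
        "experts": ["alice"],
        "dashboard": "https://dashboards.example.com/inventory",
        "playbook": "https://wiki.example.com/inventory-playbook",
        "last_updated": date(2019, 6, 15),
    },
}

# How is expertise divided? Count how many services list each person.
expertise = {}
for name, doc in SERVICES.items():
    for person in doc["experts"]:
        expertise.setdefault(person, []).append(name)
print("services per expert:", {p: len(s) for p, s in expertise.items()})

# Which documentation is stale and needs its owning team to refresh it?
STALE_AFTER_DAYS = 180
today = date(2020, 9, 1)
stale = [name for name, doc in SERVICES.items()
         if (today - doc["last_updated"]).days > STALE_AFTER_DAYS]
print("needs an update:", stale)
```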
Now, centralized documentation is great. Even better is if you can make it machine readable,
right? So if you do that, you can use
the documentation actually as part of building and deploying these services, right? So you can use it to generate dashboard config, you can use it to define escalation policies, and to define how deployments work. And that's great: one, because it saves a lot of time, it makes it easier to define new services, there's less work to do there. But it also makes
documentation necessary as part of the day to day work
of a developer. It makes it necessary for them to get their job done.
And if the only way to add a service to the CI pipeline is
to add it to the documentation, then you can be sure that the documentation is
going to be up to date, right? And that's really what we want. We want
the documentation to be up to date, because if it's not up to date, people will
lose trust in it and it won't be valuable, they won't go to it,
and there's sort of a downward spiral that we'll be in.
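And as a sketch of what machine-readable documentation buys you, with hypothetical field names rather than a real deployment pipeline, the same entries can be the source of truth for generated config, so a service simply can't enter the pipeline without a documented owner.

```python
# A hypothetical machine-readable registry (same shape as the template above).
SERVICES = {
    "checkout": {
        "owner_team": "payments",
        "playbook": "https://wiki.example.com/checkout-playbook",
    },
}

def escalation_policy(name: str, doc: dict) -> dict:
    # Pages for this service route to the documented owning team's on-call
    # and link straight to the documented playbook.
    return {
        "service": name,
        "notify": f"oncall-{doc['owner_team']}",
        "runbook_url": doc["playbook"],
    }

def deploy_config(name: str, registry: dict) -> dict:
    # The only way into the deploy pipeline is through the documentation,
    # which keeps the documentation up to date as a side effect.
    if name not in registry:
        raise ValueError(f"{name} has no service documentation; add an entry first")
    return {"service": name, "escalation": escalation_policy(name, registry[name])}

print(deploy_config("checkout", SERVICES))
```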
The other thing to think about with keeping documentation up to date is really focusing on which documentation should be written by humans, right? Not all of it should be. Some of it really should be dynamic.
If we're talking about which team owns a given service, fine, like humans
need to update that. But one of the notorious problems that we
looked to tackle over and over again at Google was trying to record service
dependencies and it was just incredibly hard to get teams to do
that because it was constantly changing. It's a function of the software itself,
not of the humans involved. And so asking humans to do that, well, the conclusion we came to in the end was that it was never going to work, but doing it programmatically
makes a lot more sense, I think. And that's one of the ways that distributed
tracing can come into play. So this is a service
diagram that I pulled from our own system,
and Aggie is one of our internal services that we run as
part of Lightthep's product. And this is an automatically
generated diagram from a set of traces that tells us the dependencies of that,
not only the immediate ones, but the
transitive dependencies as well. So we can discover dependencies that are two, three, four,
even more hops away. And we can actually annotate that
with other information, like which of those services is actually contributing to latency for
my service. So say I just got paged for latency. Even without any additional information, I can already have a guess, just based upon this kind of dynamic documentation, about where I should start looking. Like I said,
when it comes to documentation, wikis are great for people processes, but don't try to record information about the software there, right? Because it's just going to change too fast and it won't be useful.
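As a rough sketch of how a diagram like that can be generated, using the same kind of toy span data as before and hypothetical downstream service names alongside Aggie: the caller/callee edges fall straight out of the parent links, and transitive dependencies are just a walk over that graph. Real tools aggregate this over many traces.

```python
# Each span as (span_id, parent_span_id, service). In practice you'd aggregate
# edges over many traces; one toy trace is enough to show the idea.
spans = [
    ("s1", None, "aggie"),
    ("s2", "s1", "query-service"),
    ("s3", "s2", "storage"),
    ("s4", "s1", "auth"),
]

service_of = {span_id: service for span_id, _, service in spans}

# Direct dependencies: an edge from caller to callee for every parent link.
edges = {}
for span_id, parent_id, service in spans:
    if parent_id is not None:
        edges.setdefault(service_of[parent_id], set()).add(service)

def transitive_deps(service):
    # Walk the call graph to find dependencies two, three, or more hops away.
    seen, stack = set(), list(edges.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(edges.get(dep, ()))
    return seen

print("aggie depends on:", transitive_deps("aggie"))
# -> {'query-service', 'storage', 'auth'} (set order may vary)
```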
So as a whole, why is
documentation important? Well, like I said,
it's a shared database of ownership, right? It's about recording who
is accountable in a way that everyone can see. But more than
that, you can also use it to automate a lot of mundane tasks. So having
up to date documentation can also be quite valuable just for reducing toil.
It can also be used to train new team members, obviously. But I
think one thing that was important to us at Lightstep in really improving a lot of our internal documentation was building confidence in the developer and engineering teams. When, in previous roles that I've been in, we've tried to change responsibilities, especially around production systems, developers can be pretty unsure.
This is kind of going back to this definition of stress.
They want to do a good job, right? They want to be delivering
great service, but if they don't feel like they have the information to do that,
well, that can be a really stressful situation for them. And so having
documentation goes a long way towards building that confidence, towards giving
them that certainty and making them comfortable with those changes in responsibility.
And of course, there's no place that developers probably feel more stressed, at least many developers, than around on-call rotations, right?
Obviously, like I said, one of the most stressful moments for a lot of engineers,
maybe not all of you, but certainly a lot of folks that I've worked with.
And if you're going to establish service ownership,
really, this is one part that you absolutely have to do
right. So just to kind of lay out what
I think on-call can mean, or at least what on-call has meant in organizations that
I've worked in, obviously incident response is a big piece of that,
but I think not the only one. And like I said,
this might not apply to every organization, but at least in one organization I've worked in, on-call has been responsible for a bunch of other things as well.
So one of those is communicating status internally within the organization and externally to customers. On-call is often responsible for
managing changes within production, whether that's deploying new code themselves
or being kind of a traffic cop for deployments and other infrastructure changes within the production environment. On-call
is often responsible for sort of passively monitoring dashboards
and also handling low urgency alerts, customer requests,
and other kinds of interrupt driven work. In one role
we thought of on-call as just the person who's getting interrupted all the time,
and they ended up just getting all the interruptions. But in addition to
that, they're also responsible for handoffs between on-call shifts, so transferring information to the next on-call, and in the case where there are incidents, writing postmortems so that we can address those. So thinking about how to improve all of these and to do these well, I think, is really going to be
critical to doing service ownership well. So I wanted to
kind of start with incident response, since that's
certainly the biggest one. And if you think about
service ownership, yeah, doing this well is really going to be important. And there's a
lot of ways that we can do incident response well or improve incident response
as it exists today. One of those is making pages more actionable,
making it easier to mitigate those problems
or ignore them if they're not problems. Another one is to deliver pages to the
right teams. And finally, we can also just reduce the number of pages overall.
So I said we can
make alerts more actionable. Really, that's about understanding root
causes, right? Like how do we get more quickly to what the root cause
is, or root causes are so that we can take action to address and mitigate
those things? And one of the things that we found
to be really useful at Lightstep is to actually annotate alerts, not only with the condition that was triggered, obviously, but to add in additional information
that helps us understand why that happened. Right. And so
if I were to receive this page, I know that latency has gone
up, but if I click on this link here, I also get an example.
This is evidence of latency going up. This is a slow request,
and now I can look and try to understand what's happening not only in this
service, but in services that are deeper down the stack. And maybe in
this case, I can look and see that the work that's being done
by the service is actually being sharded. Right. It's divided into a bunch of
pieces, and it turns out that those shards are not very equally balanced.
One of them is taking a lot longer, and that's really what's driving up latency
in this case. So being able to do this root cause analysis quickly without
digging through lots of information is a way to improve that experience of on-call.
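One way to picture that annotation, as a sketch rather than Lightstep's actual implementation: when the alert fires, pick an exemplar slow trace from the evaluation window and attach a link to it to the page, so whoever gets woken up starts from evidence rather than a bare threshold violation.

```python
# Hypothetical recent requests from the alert's window: (trace_id, latency in ms).
recent_requests = [
    ("trace-a1", 140.0),
    ("trace-b7", 5200.0),   # the slow one we want as evidence
    ("trace-c3", 180.0),
]

LATENCY_THRESHOLD_MS = 1000.0

def build_alert(requests):
    slow = [r for r in requests if r[1] > LATENCY_THRESHOLD_MS]
    if not slow:
        return None   # nothing to page about
    # Annotate the page with an exemplar: the slowest trace in the window.
    exemplar_id, exemplar_ms = max(slow, key=lambda r: r[1])
    return {
        "summary": f"latency above {LATENCY_THRESHOLD_MS:.0f}ms",
        "exemplar_trace": f"https://tracing.example.com/trace/{exemplar_id}",
        "exemplar_latency_ms": exemplar_ms,
    }

print(build_alert(recent_requests))
```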
Of course, you know, even better than having to dig through a bunch of that stuff is making sure that the right people are involved just from the beginning. I know one of the teams that I worked on
at Google, we were relatively high on the stack, and when we would get paged,
often the only thing that we could do was to turn around and page another
team to tell them that it was actually their problem. And I
wanted to give credit: this is from a talk that Luis Mineiro did last year at SREcon, based upon some work that he and others did at Zalando, which is an e-commerce company based in
Europe. And I think it's really cool work.
So, like I said, at Google, my team was often responsible for sort of
page routing in a way, which is a horrible thing for a human
to do, especially at 3:00 in the morning. And so what
they've built at Zalando is actually programmatically doing that
routing. So they still alert based upon symptoms, right, as you should alert
based upon things that their end users are observing. But what
they've done is that when that happens, they actually look at traces from
the application itself, and if there's an error that triggered that alert,
they look at all of the immediate dependencies of the service
that triggered the alert and say, do any of those dependencies also show
errors in this trace? If yes, repeat and go and look at each
of their dependencies. If any of those dependencies have errors, repeat, go and look
at the next service down the stack and keep going until we find a service
or services that don't have any immediate dependencies with errors.
And then page those teams, they found that this is the best place to start.
It might not always be the right place, but it's better than starting at the
top of the stack and going down one service at a time.
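Here's a small sketch of that routing logic, my reconstruction of the idea on the same toy span format rather than Zalando's actual code: starting from the service that alerted, keep descending into dependencies that also show errors in the trace, and page whichever erroring services have no erroring dependencies of their own.

```python
# Each span as (span_id, parent_span_id, service, has_error) for one trace.
spans = [
    ("s1", None, "frontend", True),
    ("s2", "s1", "checkout", True),
    ("s3", "s2", "payments", True),    # deepest service showing errors
    ("s4", "s2", "inventory", False),
]

span_by_id = {s[0]: s for s in spans}
children = {}
for span_id, parent_id, _service, _err in spans:
    if parent_id is not None:
        children.setdefault(parent_id, []).append(span_id)

def services_to_page(alerting_service):
    # Start from the alerting service's erroring spans and walk downward,
    # following only dependencies that also show errors in this trace.
    frontier = [s for s in spans if s[2] == alerting_service and s[3]]
    to_page = set()
    while frontier:
        span = frontier.pop()
        erroring_deps = [span_by_id[c] for c in children.get(span[0], []) if span_by_id[c][3]]
        if erroring_deps:
            frontier.extend(erroring_deps)
        else:
            # No erroring dependencies of its own: the best place to start.
            to_page.add(span[2])
    return to_page

print(services_to_page("frontend"))   # -> {'payments'}
```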
And yeah, like I said, I would have loved to have this kind of thing
on the team that I worked at. It's a great way to get information
to the right people. And like I said, this is sort of a function of
the way that we've distributed ownership and the way that we've
distributed the code itself, right? As we've broken apart the application into these more loosely coupled parts, the trace is really critical to understanding how to respond to these events.
I want to touch briefly on one other part of being on call, and that's writing, sharing, and reviewing postmortems.
Postmortems, I think, are a really important part, even if they're not the same adrenaline rush that being paged is. But they're really about addressing issues that might come up again and, maybe more importantly, about improving responses, because the same issues are not always going to come up over and over again. So how can we respond better to a novel issue next time?
And for post mortems to really be blameless,
establishing what happened in an objective way is really important.
And I've seen again and again that doing this through real telemetry,
especially in a distributed system, using tracing, is really important.
So I can think of a number of times
when in the writing or the reviewing of a post mortem,
there is essentially a disagreement about whose fault a latency problem is.
Right? Is it that service A is making an incorrect call or is configured incorrectly? Or is it that service B is too slow in servicing that request? And if you look at aggregates, if you're just looking at something like p50 latency,
those two teams can have a pretty different perspective of what's going on, especially if
they're not accounting for things like the network in between, and if they're not
really making sure that they're pairing up slow requests on one side with
the same kind of corresponding requests on the other side. And what tracing helps
you do is really understand those causal relationships, right? It allows you to
pair up a slow request from one service on one side with the handling of that same request on the other side, and really understand where that slowness is coming from.
And look at the logs, look at the request parameters to understand
what service needs to change in order to improve things.
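To make that concrete with a sketch and toy numbers: tracing lets you line up the caller's view of one specific request with the callee's view of that same request, so the time splits into time spent in service B versus network and everything in between, instead of two teams comparing unrelated percentiles.

```python
# The same request seen from both sides of the call, joined by trace and span ids
# (assuming reasonably synchronized clocks).
client_span = {"trace_id": "t9", "span_id": "c1",
               "service": "a", "start_ms": 0.0, "end_ms": 900.0}
server_span = {"trace_id": "t9", "parent_id": "c1",
               "service": "b", "start_ms": 250.0, "end_ms": 850.0}

# The trace is what guarantees these two views describe the same request.
assert server_span["trace_id"] == client_span["trace_id"]
assert server_span["parent_id"] == client_span["span_id"]

total_ms = client_span["end_ms"] - client_span["start_ms"]
in_b_ms = server_span["end_ms"] - server_span["start_ms"]
# Whatever b didn't spend is network, queueing, serialization, and so on.
elsewhere_ms = total_ms - in_b_ms

print(f"a waited {total_ms:.0f}ms; b worked for {in_b_ms:.0f}ms; "
      f"{elsewhere_ms:.0f}ms was in between")
```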
So yeah, obviously improving on call is important,
not just for the obvious reasons, right, that it has real impact on your customers' experience and on revenue and reputation
and things like that. But it has a cost internally as well,
right? Because time spent handling pages, writing post mortems,
handling those interrupts, that's time that developers and
engineers are not spending building new features or doing proactive optimization.
Right? So there's a cost to that. And then the
stress of being on call has a major impact, I think, on job satisfaction
for a lot of developers. And so think about that stress: it can be mitigated by improving on-call, it can be mitigated by having good documentation.
And I mentioned reducing the number of pages is also a great
way of improving on call. Like giving teams the agency
to do that. Right? Like giving teams the agency to say, hey, look, this alert
is not valuable, right? It's not helping us meet
our goals and so we want to delete it. And that's actually going
to make our lives better and make us more productive.
But like I said, we need to understand their goals, right? So how do we
think about holding teams accountable for on call? Like what are the goals
in a way that we can measure? Right. Well, that brings me to my next
topic, which is to talk about Slos. So service level
objectives, again, I'm just going to give kind of
a whirlwind kind of intro to these. There's a lot
more that could be said, obviously, but these are promises that service owners make
to their customers, right? And those could both be
internal customers, other people within your organization, or end users, people external to your organization. And what's important about an SLO is that it's stated
in a way that can be measured on relatively short timescales.
So to give an example of what an SLO looks like,
it might be something like 99th percentile latency should be less than 5
seconds over the last five minutes. And to kind of break this down.
So the first part is the service level indicator. That's the metric,
the thing you're measuring, right. The second part is the threshold.
That's kind of the goal in a way. And usually this is expressed as an
inequality, right. We want to keep latency down and then finally
we have the evaluation window. And I'll say a bit more about that
in a second, but that's really important for making sure that we're measuring things in
a consistent and precise way. So just to give some examples of
other sorts of indicators. So I mentioned latency. You might choose different
percentiles depending on what's important to your customers. You might
measure error rate. That's important for a lot of folks. Availability is often
something that is promised to customers as well. Depending on your business, you might also measure something like durability or throughput as well.
So I mentioned the way that you measure
these SLIs is important, and so is this idea of a window. So when we look at a dashboard that's
showing something like latency, usually what that's showing is what you might call instantaneous
latency. And that's good. That's usually the default, and that's what we want to see when we're in the middle of an
incident. Right. Because that's going to be the most responsive way of measuring this.
But if you're trying to measure an SLO, the problem with instantaneous latency is that if you look on narrower and narrower timescales, it can actually significantly change the value. And if there's one thing that's important about SLOs, it's that we all agree on what the definition is and that we're all measuring it in the same way.
And so when we look at something like latency for an
SLO, we're really going to talk about measuring it over something like the last five
minutes, or over a five-minute window. And really what that's doing is looking at all of the requests over that five-minute window and, if we're looking at p99, then looking at the fastest 99% of those requests and making sure that all of them fit under some threshold.
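As a sketch of that definition, with toy data rather than real SLO tooling: take every request in the window, keep the fastest 99%, and check that all of them fit under the threshold, which is the same as checking the 99th percentile of the window against it.

```python
import math

def p99_over_window(latencies_s):
    # "p99 < 5s over the window" means: sort the window's requests and make
    # sure the fastest 99% of them all finish under the threshold.
    ordered = sorted(latencies_s)
    cutoff = math.ceil(0.99 * len(ordered))   # how many requests have to fit
    return ordered[cutoff - 1]                # the slowest of that fastest 99%

# A toy five-minute window of request latencies, in seconds.
window = [0.2] * 950 + [1.5] * 45 + [8.0] * 5

SLO_THRESHOLD_S = 5.0
p99 = p99_over_window(window)
print(f"p99 over the window: {p99:.1f}s; "
      f"SLO {'met' if p99 < SLO_THRESHOLD_S else 'violated'}")
```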
Okay, great. So how do we determine
slos? Well, there's a bunch of questions that you need to ask yourself.
The first one is, what do your customers expect? What have you promised them already?
Right. You might be legally bound to provide a certain level
of service, or it might just be that there's an expectation, and you can measure conversion and things like that to understand that users get bored and leave if it takes too long to service requests. You should also ask what you can provide today, right? There's no reason to set an SLO that you're not going
to be able to meet or that you're not going to be able to meet
anytime soon. And so it's thinking about what the product roadmap looks like, how much time we have on the engineering team to make changes to improve performance or reliability, and making sure that these all line up so that we're doing the best we can
for our customers while providing the functionality that they need at the same time.
Okay, so how do we actually do
that? Right. Let me take a really small, simple example. So say
here's a simple microservice-based application, just three services in this case, and say that we've promised our customers that we will serve requests 99% of the time within 5 seconds, and of course under some evaluation window. For service A, the one that's labeled A at the top here, that sort of translates immediately to what they're on the hook to provide.
But what about internal services? Right? How should this map to
service B? Right, so let's look at a trace, right?
So how does a request actually flow through these things?
And you probably want to look at more than one trace,
in fact. But I just pulled out one here just as an example.
So now that we see this, we can see, it looks
like today, at least in this example, service B is actually responsible for a lot of the latency of service A. So we can also give a kind of similar bound to service B in a lot of ways. That is, it also needs to be able to serve p99 latency in less than 5 seconds. But what's interesting is that, sure,
in the kind of service diagram, there's one arrow between B and C, but in this request, there are actually two requests from B to C that happen in serial, which means that we need C to be twice as fast, right? So maybe this is what you were thinking: that p99 latency for C needs to be less than two and a half seconds. If you think about it maybe for another minute, you realize that that's not quite correct either. In fact, there are two chances for C to fail in this case as well, right? There are two chances for C to serve a request in more than two and a half seconds. So we actually need the bound to be even tighter than that: it's around 99.5th percentile latency under two and a half seconds, and that's sort of how we can pass that down to C.
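A quick back-of-the-envelope check on that, just arithmetic under an independence assumption: if each of the two serial calls to C comes in under two and a half seconds at least 99.5% of the time, then both do roughly 99.5% times 99.5%, or about 99% of the time, which is what A's promise needs; two calls at only p99 each would fall short.

```python
# Two serial calls to C, each finishing within 2.5s with probability p.
# Treating the calls as independent, both finish in time with probability p**2,
# and the end-to-end request stays under 5s.
p_995 = 0.995
print(f"two calls at p99.5 each: {p_995 ** 2:.4f}")   # ~0.990, meets the 99% promise

p_99 = 0.99
print(f"two calls at p99 each:   {p_99 ** 2:.4f}")    # ~0.980, falls short
```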
Now, what's kind of interesting in this is that it might be that in some other cases B also depends on another service, D. But at least in terms of this request, in terms of servicing a request that came from A, B doesn't depend on D at all, right? And so for thinking about D's SLOs, we actually don't have any information to do that from this case. So looking at
traces is really important. It's not just enough to look at the service diagram.
The trace is really going to tell you what's going to help you there.
Okay, so why are SLOs important? They are
really about measuring success in delivering a service, they're about measuring success for on
call, right? These teams can use them as a guide to prioritize work. So if we've established an SLO, we can now
understand how much improvement we need to make and we can use that to
trade off against, say, new feature development.
And it's a way of really holding teams accountable consistently across your organization, right? So you want to make sure that as folks move from one team to another, they're not learning new ways of doing this. And if you're going to measure teams' performance by their ability to meet their SLOs, it's really important that you do that consistently
as well. And then
finally, yeah, these are all ways of thinking about accountability. But agency is also really important too. And SLOs are really a way of giving folks a budget, for thinking about how much room they have to push more deployments out there, right? Like how close are they to hitting their SLOs? And that's really a way for them to build an error budget as well.
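For example, as a minimal sketch of the budgeting arithmetic with made-up numbers: a 99% SLO over a million requests in the evaluation period gives the team a budget of 10,000 bad requests, and how much of that budget is left is what tells them whether they can afford a risky deploy or should slow down and invest in reliability.

```python
SLO_TARGET = 0.99            # promised fraction of good requests
total_requests = 1_000_000   # requests in the evaluation period (made up)
bad_requests = 3_200         # requests that violated the SLI so far (made up)

error_budget = (1 - SLO_TARGET) * total_requests   # 10,000 allowed bad requests
remaining = error_budget - bad_requests
print(f"error budget remaining: {remaining:.0f} requests "
      f"({remaining / error_budget:.0%} of the budget)")
```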
Okay, so just to kind of review my three-piece puzzle here. So documentation obviously is
an important part of that. I think more important than just documentation for documentation's sake.
But it's a way of establishing ownership and knowing who is going to be held
accountable. But if you're going to do that, it absolutely has to
be up to date, right? You can't hold people accountable based upon documentation that's out of date.
It's also really critical in building confidence within those teams. And, along with tools that describe the dynamic state of a system, it's critical information for folks that are on call or need to understand how a system is actually behaving. On-call: obviously you can't do business without it.
Incident management is often the part that people think about
most when you say service ownership, but I want to call out that on-call has a lot of other components to it too, and a lot of those really rely on tools as well.
Finally, SLOs, right? These are really how you hold teams accountable, like I said, how you measure their success. And in all of this, I think in a system where you have a loosely coupled architecture, where you have teams that are moving independently, tracing is really critical to understanding causality in that system, to understanding who is responsible at a given moment in time and which services are actually contributing
to latency. And if you don't have that information, you're not going to be able
to keep your documentation up to date. You're not going to be able to make
good decisions while you're on call and you're not going to be able to set
slos in a way that actually reflects what your customers expect.
Okay, so I mentioned error budgets.
That's really just one kind of budget. And I think
giving folks budget to improve reliability and giving them agency
to do that will help them hit their goals and will lower their stress.
But that agency requires them having the right information and the time to do it.
And so this really comes down to: ownership doesn't come for free. You've got to
give your teams time to actually invest.
You've got to give them time to improve and to make things better.
Okay, so, sounds great. How do we get this right? Where do we start?
Well, making changes
in a DevOps organization, it's hard,
right? Rolling out new tools and new processes always has to
be a bottom up thing.
And that's true whether it's how you run your sprints, which tools you choose to do development with, or what observability tools you use.
If they're going to be adopted, they really have to provide value to
those application development teams. And ideally more than
provide value, they would be a necessary part of their day to day work.
If you don't have those things, at least in my experience, it's just
going to be a long, long uphill road to get those
things deployed and adopted.
Then, to establish and maintain service ownership, use a combination of documentation, on-call processes, and SLOs, and manufacture a need for those tools and processes where necessary. Right? So what I mean by
that is just to say make it a requirement to have service
ownership defined within the documentation before a service can
be defined, before it can be part of the deployment pipeline, like I mentioned.
So as a platform team, as part of
engineering leadership, you have the ability to actually make
these processes required in a way, and if you do that, that'll actually go a
long way towards them being adopted and becoming part of the tool set of the folks in your organization.
Okay, so just to kind of sum up, I think of ownership as really having two parts. Obviously, accountability is a big
part of it. I think that's what folks think about a lot when they think
about ownership. That's really setting the deliverables and the goals for the owners within your organization and making sure that you're evaluating their performance based upon those goals and deliverables, right? That's really how you make those things sink in. And I think a second and equally important part is to give those teams agency, agency to make change, right?
So they're going to be a lot more inclined and a lot happier on call
if they're able to control and make changes to the kinds of alerts they get,
if they're able to make changes to the architecture itself. Right. So making sure
that you're offering them the information, allowing them to build confidence, and giving them the budget to improve is really critical, I think,
in establishing service ownership. So with
that, I wanted to thank everyone for your attention. You can find me
at Dave Spoons on Twitter. You can find me at lightstep.com.
I'm always excited to talk about service ownership. I'm always excited to talk about distributed
tracing.