Conf42 Site Reliability Engineering 2020 - Online


Driving Service Ownership with Distributed Tracing


Abstract

While many organizations are rolling out Kubernetes, breaking up their monoliths, and adopting DevOps practices with the hope of increasing developer velocity and improving reliability, it’s not enough just to put these tools in the hands of developers: you’ve got to incentivize developers to use them. Service ownership provides these incentives, by holding teams accountable for metrics like the performance and reliability of their services as well as by giving them the agency to improve those metrics.

In this talk, I’ll cover how distributed tracing can serve as the backbone of service ownership. For SRE teams that are setting standards for their organizations, it can help drive things like documentation, communication, on-call processes, and SLOs by providing a single source of truth for what’s happening across the entire application. For embedded SRE teams, it can also accelerate root cause analysis and make alerts more actionable by showing developers what’s changed – even if that change was a dozen services away. Throughout the talk, I’ll use examples drawn from more than a decade of experience with SRE teams in organizations big and small.

Summary

  • Daniel Spoonhower is CTO and co-founder of Lightstep. He talks about driving service ownership with distributed tracing. Before Lightstep, he was a software engineer at Google.
  • Many engineering organizations are adopting services or other kinds of similar systems. What does that mean for how we work together? We've also changed the way that we build these systems. With this shift to more loosely coupled services, we need a new way of thinking about how we observe them.
  • Distributed tracing is really that process, that art and science, of deriving value from traces. I also want to say I wrote the book on distributed tracing, so if there's more you want to learn about it, I strongly recommend this book.
  • Service ownership is really making teams responsible for the delivery of their software and their services. It can include responsibilities like incident response. By giving teams ownership over their services, you allow them to be more independent. But there are risks that come along with this.
  • The first step is creating consistent and centralized documentation specifically around the services in your application. That documentation is also a way to share expertise. The other thing about keeping documentation up to date is focusing on which documentation should be written by humans.
  • This is where distributed tracing can come into play. Having up-to-date documentation can also be quite valuable just for reducing toil. But more than that, you can also use it to automate a lot of mundane tasks.
  • Documentation builds confidence in the developer and engineering teams. One of the most stressful moments for engineers is the on-call rotation. If you're going to establish service ownership, this is one part that you absolutely have to do right.
  • On-call is often responsible for managing changes within production. There are a lot of ways that we can do incident response well or improve incident response as it exists today: making pages more actionable, making it easier to mitigate problems. And finally, we can also just reduce the number of pages overall.
  • Improving on-call is important, not just for the obvious reasons; it has a cost internally as well. Time spent handling pages, writing postmortems, and handling interrupts is time that developers and engineers are not spending building new features or doing proactive optimization.
  • Service level objectives are promises that service owners make to their customers. What's important about an SLO is that it's stated in a way that can be measured on relatively short timescales. How do we determine SLOs?
  • Documentation is a way of establishing ownership. It's critical information for folks who are on call or need to understand how a system is behaving. Give teams agency to make changes to improve reliability. Rolling out new tools and processes must be a bottom-up thing.
  • So with that, I wanted to thank everyone for your attention. You can find me at Dave Spoons on Twitter. I'm always excited to talk about service ownership. Thanks.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Spoons and thanks for joining. I'm here to talk about driving service ownership with distributed tracing. Before I get started, just a little bit about myself. So my name is Spoons; although my full name is Daniel Spoonhower, no one calls me that. I'm CTO and a co-founder at Lightstep, where we provide simple observability for deep systems based on distributed tracing, and I spend a lot of time at Lightstep thinking and working on both service ownership and, not surprisingly, distributed tracing. Before I helped found Lightstep, I was a software engineer at Google. I worked both on Google's internal infrastructure team and as part of Google's cloud platform team. I worked closely with SRE in both cases to build processes and roll out tools to improve reliability and reduce the amount of work for teams. Both there and subsequently at Lightstep, I carried a pager for many years, although they don't let me do that anymore.

Great. So before I get too deep, just to set some context, I want to talk about this question: what changed? I think this is a really important question generally for SRE, but I want to talk about what changed in the kinds of technologies that we use, the kinds of architectures that we build, and the ways that we work together as well. Many engineering organizations are adopting services or other kinds of similar systems, often alongside DevOps practices. But what does that mean for how we work together? There's been a series of technical changes as we moved from bare metal to virtual machines, to containers, and finally to orchestration tools like Kubernetes. Each of these has provided additional abstractions: they've raised the level of the primitives that we can work with. They've also introduced additional complexity, and more ways for those systems to fail, so there's a trade-off there. And of course, even though we have those abstractions, someone still needs to understand what's happening under the hood. Partly enabled by this, we've also changed the way that we build these systems: we've moved to microservices, maybe we're using serverless. These are all ways of building more loosely coupled applications where different pieces can be deployed independently and scaled independently. And speaking of independence, we've also adopted methodologies like agile and DevOps. I'll talk a bit more about DevOps, but I really think of DevOps as a way to allow teams to work more autonomously, more independently, and to boost developer velocity. And while that's good in a lot of ways, it's also created some challenges.

So I like to put this in the form of a feedback loop. If we look at the bigger picture in any software system, and really this goes for a whole bunch of other systems as well, we have a kind of feedback loop, and on one half of the loop we've got control. Control is the systems, the levers, the tools that we use to effect change, and in a software system today that's things like Kubernetes and associated tools; it could be things like a service mesh or configuration management. These are all ways of effecting change. But there's another half to this feedback loop as well, and that's how we observe those changes. Observing is another set of tools we can use to understand what's happening, and it's really an important part of that feedback loop because it tells us what we should do next.

And I think maybe there are some good reasons for this, but the amount of investment that we've seen in the tools on the top half of this loop has really outpaced the investment, and really the innovation, on the observability side of things. I know we've had tools that allow us to observe our systems for a long time; maybe we didn't call them observability tools. But with the shift to more loosely coupled services, and with the shifts that we've made to our organizations, it really requires a new way of thinking about how we observe these systems. And if we haven't adapted those systems, if we haven't innovated on those observability systems, it's like we've built this amazing car that goes super fast, with a great gas pedal, but we don't have a speedometer, so we have no way of knowing how fast we're going. This has consequences at all kinds of different scales: from the small scale, when we think about how we're going to do autoscaling and what metrics will inform that, to how we decide when to peel off functionality into a new service, to thinking about the application architecture as a whole. And often the idea here is that we have given control, and we've built out these control systems, to allow teams to move independently, but what's happened is that we've lost the ability to understand performance or reliability of the system as a whole.

That brings me to the crux of what we've done here. Kubernetes and other control systems like that have given each team more power, more decision making, more control. But that control is now distributed. And DevOps likewise means that each team has more control over their own service and when they push code. But each of those teams depends on a lot of other teams. So say you're the team that's responsible for the service at the top of this diagram. You're beholden to end users, say, to deliver a certain amount of reliability and performance, and you have control over that service: you decide when to roll out new code, and you decide when to roll it back as well. But you depend on all of these services below you, and ultimately you're also responsible for the performance of those services. If those services are slow, they're part of the functionality that you provide, so your service is going to be slow as well. But even though you have responsibility for that performance, you don't have control. You don't have any way of rolling back those services other than reaching out to those teams and asking them to do it. And it's not just other services: the things lower on the diagram could be infrastructure, they could be managed services, things outside of your organization as well. But this gap that we've created, for better or worse (and I think for worse), between control and responsibility, is really the textbook definition of stress. And so what I want to talk about today is how we can use service ownership to lower that stress and to fill that gap between control and responsibility.

Okay, how are we going to do that? Well, I'm going to talk about a bunch of specifics, but at a very, very high level, to me ownership really has two parts, and the two parts are, I think, both really important. The first one is accountability, and maybe that's not super surprising: if we give people ownership, we have to hold them accountable for it. But another really important part of that is giving folks agency, the ability, the means, to make things better. And I'm going to touch on both of these a number of times throughout the talk. At the same time, when we're talking about a loosely coupled system, we really need to think about the way that we're observing that system, and distributed tracing, as you may have guessed, is going to play a really big role in how we do that.

Okay, so let me dive in and talk a bit about distributed tracing, and then I'll come back and we'll see how we can build better service ownership. Just to get everyone on the same page: distributed tracing was built and popularized by a bunch of larger Internet companies, web search and social media companies, as a way for them to understand their systems as they built out a microservice-like architecture, even though we didn't call it that at the time. Just to give you a picture of what distributed tracing might look like, this is a trace. Almost any distributed tracing tool will show you something like this, one of these traces. You can think of it kind of like a Gantt chart: time is moving from left to right, and each of these bars represents the work done by some service as part of handling an end-user request. As you go from top to bottom, you're going down the stack, so you can see where time is being spent in servicing those requests. As I'll explain a bit more in a minute, though, a trace is really just a building block of what we can do with distributed tracing. I'll speak a bit more about that, but first I want to dive into what that building block is, exactly. Again, we have these bars that represent the work that's being done, and you can think of the arrows going down the stack as the calls that are being made, when one service has delegated responsibility for the request to a service below it; likewise, the arrows coming back up are when those calls return. So what the trace is really doing is encoding the causal relationships between callers and callees. It lets us know who's responsible for servicing a request at a given moment in time, and therefore where to attribute failures and where to attribute things like slowness. I'm not going to say too much more about tracing here. If you want to understand more, Google put out a paper a while back, about ten years ago, on Dapper, Google's internal system. It gives a lot of detail about what was important at Google in rolling out the system and some of its use cases, especially the mechanics of how they collected and managed those traces. Even though it's a bit old, I think it's a great perspective on the history. I also want to say I wrote the book on distributed tracing, so if there's more you want to learn about it, I strongly recommend this book. I don't get any royalties from it; we donated all the royalties to a good cause. But the book can obviously cover a lot more than I can cover today, from the very basics of distributed tracing, to how to think about costs and implementation, to getting value from tracing, and even what tracing might offer in the future. Okay, I said traces are just the building block, though. What do I mean? Well, traces are really the raw material. Distributed traces, you can really just think of them as structs.
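To make that concrete, a single span record might look something like this minimal sketch; the field names here are illustrative rather than any particular tracing system's schema:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Span:
    """One bar in the trace 'Gantt chart': work done by one service for one request."""
    trace_id: str             # shared by every span in the same end-user request
    span_id: str              # unique ID for this unit of work
    parent_id: Optional[str]  # the caller's span ID; None for the root span
    service: str              # which service did the work
    operation: str            # e.g. the handler or RPC name
    start_time_us: int        # when the work started (microseconds since epoch)
    duration_us: int          # how long it took
    tags: Dict[str, str] = field(default_factory=dict)  # errors, shard IDs, etc.

# The parent/child links between spans are what encode the causal
# relationships between callers and callees described above.
```

A trace is then just the set of spans that share a trace_id, stitched together by those parent IDs.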
A trace is a data type, a collection of data, but it's not the finished product. Distributed tracing is really that process, that art and science, of deriving value from those traces. Just to give an example of that at Google, and you can read about this in the Dapper paper, some of the most valuable uses of this tracing data were not looking at individual traces but looking at aggregates. There was, at the time (maybe there still is), a weekly MapReduce that looked at the entire collection of traces for web search requests over the past week, and it detailed how each team and their service had contributed to performance for web search. Whether that service was searching web documents or news or images or video or whatever, the report would essentially say what percent of the latency was due to each of those services, and that could then be used to prioritize work done by those teams. In fact, when it wasn't used, we actually had teams that would go off and spend a month or more on optimization, which might have improved performance for their individual service but had no effect on the overall performance as observed by users. So again, distributed tracing is about thinking about what the value is, and that's really what I want to talk about in the context of service ownership today.

So, service ownership: let's talk about what I think that means and some of the benefits, and maybe the risks, associated with pushing it forward. Just to define it, service ownership is really making teams responsible for the delivery of their software and their services. And just to be concrete, it can include responsibilities like incident response, which I think is what a lot of people think of, but it also probably involves paying for the infrastructure that those services are using, for the storage that they might be using. And of course, someone needs to fix bugs, and I think service ownership is an important part of figuring out how to triage and allocate bugs. Service ownership comes up a lot in the context of DevOps, so I wanted to spend a minute comparing the two. There's obviously a lot of overlap and a tight relationship, but DevOps also means a lot of different things to different people. DevOps can mean an engineering culture, a culture that's really based on cooperation between the people that are developing and operating software; in some cases they might be the same people, but not always. DevOps can also mean a set of tools: it might mean the latest CI/CD tool, or some of the other infrastructure that you're using in order to provide a platform for the rest of your organization. I really like to think of it at a little bit higher level. Those are great definitions as well, but I like to think of DevOps, as I mentioned, as a feedback loop between developers and their users: creating a tight feedback loop so that developers see immediately the effects of the code that they're writing and deploying, and are able to use that as they continue to do product definition and software development. So in a lot of ways, I think DevOps can be a bit broader and apply to a larger set of processes. And service ownership is really, for a team that both develops and operates software, thinking about the set of responsibilities they have in order to make sure that they're doing that reliably and as their customers expect.

Okay, so what are some of the good things about service ownership? Well, one thing is that by giving teams ownership over their services, you allow them to be more independent, and hopefully that'll raise the developer velocity for your organization. They'll be able to coordinate less and focus more on the functionality that they're providing. It's also a way for organizations to hold engineering teams accountable and to tie their performance to real business metrics. I'll talk a bit more about this in a few minutes, but for application developers, this is really about what their customers are adopting and perceiving and how they're using the product. For platform engineers, it's really thinking about the organization as a whole, how other application development teams within the organization are leveraging tools and infrastructure, and delivering on the promises that they're making to their customers. Now, there are obviously risks that come along with this as well. If you make teams more independent and allow them to make independent choices, they might make different choices, and you can have divergence. Then you have more frameworks, more tools, and that can have some downsides. One of those downsides is higher vendor costs: you're not getting the same economy of scale that you might get if you were just using one tool consistently throughout your organization. It also might mean more training, not only for new team members, but as developers transfer between teams, they need to learn a new set of tools and a new set of processes. And thinking about it from the point of view of the organization as a whole, or maybe from a platform engineering team, it's harder to get a sense of the big picture of your application. So if we think about how to balance these benefits and risks, the trade-offs on both sides come from this idea of independence, that we are allowing teams to make their own choices. I think the way that we're going to manage these trade-offs is by allowing for that independence, but at the same time defining clear responsibilities and goals for those teams: we allow them to make choices, but we give them some guardrails about how they can make those choices. And at the same time it's about ensuring consistency. Maybe there are some kinds of tools that we allow teams to make independent choices about, but not other tools. And when we talk about measuring the results of their work, we want to make sure that we're doing that consistently, measuring progress towards those goals, and holding those teams accountable. Okay, so thinking about service ownership in this context: how do we allow for this independence? How do we provide for this consistency? Like I said, this is going to come back to accountability and agency, like I mentioned at the beginning. How do we drive towards service ownership? I have three pieces to this puzzle that I want to talk about, each in turn. Those are documentation, on-call (not surprising), and then service level objectives.

Okay, to start with documentation: the first step is really creating consistent and centralized documentation specifically around the services in your application. As we grew at Lightstep, and I'm sure as a lot of your organizations have, we looked to define responsibilities for services as the teams got larger; we split teams as the number of services grew. But before you can split those responsibilities, I think you need to know who the experts are. Knowing that is valuable in itself, but those experts are really going to serve as the seeds for the teams that will take over ownership of the services. That documentation is also a way to share that expertise, and a way for others to find related information, whether that comes in the form of telemetry and dashboards, alert definitions, or, when an alert fires, the playbook that helps you handle it. One of the things that we found useful was to use a template for this kind of documentation, so that you know you have some consistency, and an engineer or developer knows that when they go to it, they'll be able to find links to, say, a dashboard or to logs or traces that help them understand what's happening. The other thing about a template that's really interesting is that it allows you to run reports over the documentation and extract information from it. Now we can ask questions like: how is expertise divided? Are there certain people that are listed as experts for more of the services? How does that change over time? If we've written the documentation in a totally ad hoc and unstructured way, it's a really manual process to discover that, but if we've built a kind of template, a form, we can extract that information much more easily. The other thing is that if we put all the documentation in one place, it's really easy to audit how often it changes, and you can require periodic updates. You can ask: when was the last time the documentation for service X was updated? Okay, that was too long ago; someone on that team is going to need to be responsible for updating it. Now, centralized documentation is great; even better is if you can make it machine readable. If you do that, you can actually use the documentation as part of building and deploying those services: you can use it to generate dashboard config, to define escalation policies, and to define how deployments work. And that's great, one, because it saves a lot of time and makes it easier to define new services (there's less work to do there), but it also makes documentation a necessary part of the day-to-day work of a developer; it makes it necessary for them to get their job done. If the only way to add a service to the CI pipeline is to add it to the documentation, then you can be sure that the documentation is going to be up to date, and that's really what we want. We want the documentation to be up to date, because if it's not, people will lose trust in it, it won't be valuable, they won't go to it, and we'll be in a downward spiral. The other thing about keeping documentation up to date is really focusing on which documentation should be written by humans. Not all of it should be; some of it really should be dynamic. If we're talking about which team owns a given service, fine, humans need to update that.
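As a rough illustration, a machine-readable, templated service descriptor and the kind of report you could run over a directory of them might look something like this sketch; the file layout and field names are hypothetical, not a specific Lightstep or Google format:

```python
import glob
import json
from datetime import datetime, timedelta

# Hypothetical per-service descriptor, one JSON file per service, for example:
# {
#   "service": "checkout",
#   "team": "payments",
#   "experts": ["alice", "bob"],
#   "dashboard": "https://dashboards.example.com/checkout",
#   "playbook": "https://wiki.example.com/checkout-oncall",
#   "last_updated": "2020-04-01"
# }

STALE_AFTER = timedelta(days=90)

def audit(docs_dir):
    services_per_expert = {}
    for path in sorted(glob.glob(f"{docs_dir}/*.json")):
        with open(path) as f:
            doc = json.load(f)
        # Flag documentation that hasn't been touched recently.
        updated = datetime.strptime(doc["last_updated"], "%Y-%m-%d")
        if datetime.now() - updated > STALE_AFTER:
            print(f"{doc['service']}: docs are stale (team {doc['team']} should update them)")
        # Report how expertise is spread across people.
        for person in doc["experts"]:
            services_per_expert[person] = services_per_expert.get(person, 0) + 1
    for person, count in sorted(services_per_expert.items(), key=lambda kv: -kv[1]):
        print(f"{person} is listed as an expert for {count} service(s)")

# audit("service-docs")
```

The same descriptor could then feed generated dashboard config or escalation policies, which is what makes updating it part of a developer's day-to-day work rather than an afterthought.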
But one of the notorious problems that we tried to tackle over and over again at Google was recording service dependencies, and it was just incredibly hard to get teams to do that, because it was constantly changing. It's a function of the software itself, not of the humans involved. And so asking humans to do that... well, the conclusion we came to in the end was that it was never going to work; doing it programmatically makes a lot more sense, I think. And that's one of the ways that distributed tracing can come into play. This is a service diagram that I pulled from our own system, and Aggie is one of the internal services that we run as part of Lightstep's product. This is an automatically generated diagram, built from a set of traces, that tells us the dependencies of that service, not only the immediate ones but the transitive dependencies as well. We can discover dependencies that are two, three, four, even more hops away, and we can annotate that with other information, like which of those services is actually contributing to latency for my service. So say I just got paged for latency: even without any additional information, I can already have a guess, just based upon this kind of dynamic documentation, about where I should start looking. Like I said, when it comes to documentation, wikis are great for people processes, but don't try to record information about the software itself by hand, because it's just going to change too fast and it won't be useful.

So as a whole, why is documentation important? Well, like I said, it's a shared database of ownership: it's about recording who is accountable in a way that everyone can see. But more than that, you can also use it to automate a lot of mundane tasks, so having up-to-date documentation can also be quite valuable just for reducing toil. It can also be used to train new team members, obviously. But I think one thing that was important to us at Lightstep in really improving a lot of our internal documentation was building confidence in the developer and engineering teams. In previous roles that I've been in, when we've tried to change responsibilities, especially around production systems, developers can be pretty unsure, and this goes back to that definition of stress. They want to do a good job, they want to be delivering great service, but if they don't feel like they have the information to do that well, that can be a really stressful situation for them. Having documentation goes a long way towards building that confidence, giving them that certainty, and making them comfortable with those changes in responsibility.

And of course, there's probably no place that developers, at least many developers, feel more stressed than around on-call rotations. Like I said, it's one of the most stressful moments for a lot of engineers, maybe not all of you, but certainly a lot of folks that I've worked with. And if you're going to establish service ownership, this is one part that you absolutely have to do right. So just to lay out what I think on-call can mean, or at least what on-call has meant in organizations that I've worked in: obviously incident response is a big piece of that, but I think not the only one. And like I said, this might not apply to every organization, but at least in the organizations I've worked in, on-call has been responsible for a bunch of other things as well.

One of those is communicating status, internally within the organization and externally to customers. On-call is often responsible for managing changes within production, whether that's deploying new code themselves or being kind of a traffic cop for deployments and other infrastructure changes within the production environment. On-call is often responsible for passively monitoring dashboards and also handling low-urgency alerts, customer requests, and other kinds of interrupt-driven work. In one role, we thought of on-call as just the person who's getting interrupted all the time, and they ended up getting all the interruptions. But in addition to that, they're also responsible for handoffs between on-call shifts, so transferring information to the next on-call, and, in the case where there are incidents, writing postmortems so that we can address them. Thinking about how to improve all of these and do them well is really going to be critical to doing service ownership well.

I wanted to start with incident response, since that's certainly the biggest one, and if you think about service ownership, doing this well is really going to be important. There are a lot of ways that we can do incident response well, or improve incident response as it exists today. One of those is making pages more actionable, making it easier to mitigate those problems, or to ignore them if they're not problems. Another one is to deliver pages to the right teams. And finally, we can also just reduce the number of pages overall. So I said we can make alerts more actionable. Really, that's about understanding root causes: how do we get more quickly to what the root cause is, or root causes are, so that we can take action to address and mitigate them? One of the things that we found to be really useful at Lightstep is to annotate alerts, not only with the condition that was triggered, obviously, but with additional information that helps us understand why it happened. So if I were to receive this page, I know that latency has gone up, but if I click on this link here, I also get an example. This is evidence of latency going up, this is a slow request, and now I can look and try to understand what's happening, not only in this service but in services deeper down the stack. And maybe in this case I can see that the work being done by the service is actually sharded, divided into a bunch of pieces, and it turns out that those shards are not very equally balanced: one of them is taking a lot longer, and that's really what's driving up latency in this case. Being able to do this root cause analysis quickly, without digging through lots of information, is a way to improve the experience of on-call.

Of course, even better than having to dig through a bunch of that stuff is making sure that the right people are involved right from the beginning. I know on one of the teams I worked on at Google, we were relatively high in the stack, and when we would get paged, often the only thing we could do was to turn around and page another team to tell them that it was actually their problem. And I want to give credit: this is from a talk that Luis Monero did last year at SREcon, based upon some work that he and others did at Zalando, which is an e-commerce company based in Europe, and I think it's really cool work. So, like I said, at Google my team was often responsible for page routing, in a way, which is a horrible thing for a human to do, especially at 3:00 in the morning. What they've built at Zalando is actually doing that routing programmatically. They still alert based upon symptoms, as you should: alert based upon things that end users are observing. But what they've done is that when that happens, they actually look at traces from the application itself, and if there's an error that triggered that alert, they look at all of the immediate dependencies of the service that triggered the alert and ask: do any of those dependencies also show errors in this trace? If yes, repeat, and go look at each of their dependencies. If any of those dependencies have errors, repeat, go look at the next service down the stack, and keep going until you find a service or services that don't have any immediate dependencies with errors. And then page those teams. They found that this is the best place to start. It might not always be the right place, but it's better than starting at the top of the stack and going down one service at a time. Like I said, I would have loved to have this kind of thing on the team I worked on; it's a great way to get information to the right people. And this is really a function of the way that we've distributed ownership and the way that we've distributed the code itself: we've broken the application apart into these more loosely coupled parts, so the trace is really critical to understanding how to respond to these events.

I want to touch briefly on one other part of being on call, and that's writing, sharing, and reviewing postmortems. Postmortems, I think, are a really important part, even if they're not the same adrenaline rush that being paged is. They're really about issues that might come up again and, maybe more importantly, about improving responses, because the same issues are not always going to come up over and over again: how can we respond better to a novel issue next time? And for postmortems to really be blameless, establishing what happened in an objective way is really important. I've seen again and again that doing this through real telemetry, especially in a distributed system, using tracing, is really important. I can think of a number of times when, in the writing or the reviewing of a postmortem, there was essentially a disagreement about whose fault a latency problem was. Is it that service A is making an incorrect call or is configured incorrectly? Or is it that service B is too slow in servicing that request? If you look at aggregates, if you're just looking at something like p50 latency, those two teams can have a pretty different perspective of what's going on, especially if they're not accounting for things like the network in between, and if they're not really making sure that they're pairing up slow requests on one side with the corresponding requests on the other side. What tracing helps you do is really understand those causal relationships. It allows you to pair up a slow request into one service with the corresponding request it made on the other side, really understand whether that's where the slowness is coming from, and look at the logs and the request parameters to understand which service needs to change in order to improve things.
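Going back to the Zalando-style routing for a moment, a rough sketch of that traversal over a single trace might look something like this; the helper names and span fields are illustrative (reusing the Span fields sketched earlier), not Zalando's actual implementation:

```python
def teams_to_page(spans, alerting_service, owner_of):
    """Walk one trace downward from the alerting service and return the owners
    of the deepest erroring services: services that show an error in this trace
    but whose own immediate dependencies do not."""
    # Index the trace: caller span ID -> the spans for the calls it made (its callees).
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    def erroring_callees(span):
        return [c for c in children.get(span.span_id, []) if c.tags.get("error")]

    suspects = set()
    frontier = [s for s in spans
                if s.service == alerting_service and s.tags.get("error")]
    while frontier:
        span = frontier.pop()
        deeper = erroring_callees(span)
        if deeper:
            # Some immediate dependency also shows an error: keep walking down.
            frontier.extend(deeper)
        else:
            # Errors here, but none in its callees: this is where to start looking.
            suspects.add(span.service)
    # owner_of is assumed to map a service name to the team that owns it.
    return {owner_of(service) for service in suspects}
```

As the talk notes, this might not always land on the right team, but it's a much better starting point than paging from the top of the stack down, one service at a time.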
So yeah, obviously improving on-call is important, and not just for the obvious reasons, that it has real impact on your customers' experience and on revenue and reputation and things like that. It has a cost internally as well: time spent handling pages, writing postmortems, and handling those interrupts is time that developers and engineers are not spending building new features or doing proactive optimization. So there's a cost to that. And then the stress of being on call has a major impact, I think, on job satisfaction for a lot of developers. That stress can be mitigated by improving on-call, and it can be mitigated by having good documentation. I mentioned that reducing the number of pages is also a great way of improving on-call: giving teams the agency to do that, the agency to say, hey, look, this alert is not valuable, it's not helping us meet our goals, and so we want to delete it, and that's actually going to make our lives better and make us more productive. But like I said, we need to understand their goals. So how do we think about holding teams accountable for on-call? What are the goals, stated in a way that we can measure? Well, that brings me to my next topic, which is service level objectives.

So, service level objectives: again, I'm just going to give a whirlwind intro to these. There's a lot more that could be said, obviously, but these are promises that service owners make to their customers, and those could be internal customers, other people within your organization, or end users, people external to your organization. What's important about an SLO is that it's stated in a way that can be measured on relatively short timescales. To give an example of what an SLO looks like, it might be something like: 99th percentile latency should be less than 5 seconds over the last five minutes. To break this down: the first part is the service level indicator. That's the metric, the thing you're measuring. The second part is the threshold. That's kind of the goal, in a way, and usually this is expressed as an inequality; we want to keep latency down. And then finally we have the evaluation window, and I'll say a bit more about that in a second, but it's really important for making sure that we're measuring things in a consistent and precise way. Just to give some examples of other sorts of indicators: I mentioned latency, and you might choose different percentiles depending on what's important to your customers. You might measure error rate; that's important for a lot of folks. Availability is often something that is promised to customers as well. Depending on your business, you might also measure something like durability or throughput. So I mentioned that the way you measure these SLIs is important, and this idea of a window. When we look at a dashboard showing something like latency, usually what it's showing is what you might call instantaneous latency. And that's good; that's usually the default, and that's what we want to see when we're in the middle of an incident, because it's the most responsive way of measuring this. But if you're trying to measure an SLO, the problem with instantaneous latency is that if you look at narrower and narrower timescales, it can actually significantly change the value.
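As an illustration of the window, a "p99 latency less than 5 seconds over the last five minutes" check might look something like this sketch; the function name and data layout are hypothetical, and it assumes per-request latencies are already available from your telemetry:

```python
import math

def slo_met(request_samples, now_s, percentile=99.0, threshold_s=5.0, window_s=300.0):
    """SLI: latency percentile. Threshold: 5 seconds. Evaluation window: last 5 minutes."""
    # Keep only requests that finished inside the evaluation window.
    recent = sorted(latency for finished_at, latency in request_samples
                    if now_s - finished_at <= window_s)
    if not recent:
        return True  # no traffic in the window, nothing to violate
    # The fastest 99% of requests must all fit under the threshold.
    k = math.ceil(len(recent) * percentile / 100.0)
    return recent[k - 1] <= threshold_s

# Example: (finish_timestamp_seconds, latency_seconds) pairs.
# slo_met([(900.0, 0.8), (1020.0, 1.2), (1100.0, 4.9)], now_s=1200.0)  # -> True
```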
And if there's one thing that's important about SLOs, it's that we all agree on what the definition is and that we're all measuring it in the same way. So when we look at something like latency for an SLO, we're really going to talk about measuring it over something like the last five minutes, over a five-minute window. What that's really doing is looking at all of the requests over that five-minute window and, if we're looking at p99, looking at the fastest 99% of those requests and making sure that all of them fit under some threshold.

Okay, great. So how do we determine SLOs? Well, there are a bunch of questions that you need to ask yourself. The first one is: what do your customers expect? What have you promised them already? You might be legally bound to provide a certain level of service, or it might just be that there's an expectation, and you can measure conversion and things like that to understand that users get bored and leave if it takes too long to service requests. You should also ask what you can provide today. There's no reason to set an SLO that you're not going to be able to meet, or that you're not going to be able to meet anytime soon. So think about what the product roadmap looks like, how much time the engineering team has to make changes to improve performance or reliability, and make sure that these all line up, so that we're doing the best we can for our customers while providing the functionality that they need at the same time.

Okay, so how do we actually do that? Let me take a really small, simple example. Say here's a simple microservice-based application, just three services in this case, and say that we've promised our customers that we'll serve requests 99% of the time within 5 seconds, and of course under some evaluation window. For service A, the one labeled A at the top here, that translates immediately to what they're on the hook to provide. But what about internal services? How should this map to service B? So let's look at a trace: how does a request actually flow through these services? You probably want to look at more than one trace, in fact, but I just pulled out one here as an example. Now that we see this, we can see that, at least in this example, service B is actually responsible for a lot of the latency of service A. So we can give a similar bound to service B in a lot of ways: it also needs to be able to serve p99 latency in less than 5 seconds. But what's interesting is that, sure, in the service diagram there's one arrow between B and C, but in this request there are actually two requests from B to C that happen in serial, which means that we need C to be twice as fast. So maybe this is what you were thinking: p99 latency for C needs to be less than two and a half seconds. But if you think about it for another minute, you realize that that's not quite correct either. In fact, there are two chances for C to fail in this case as well, two chances for C to serve a request in more than two and a half seconds. So we actually need the bound to be even tighter than that: it's around the 99.5th percentile (two calls that each succeed 99.5% of the time succeed together roughly 0.995 × 0.995 ≈ 99% of the time), and that's how we can pass the objective down to C. Now, what's kind of interesting here is that it might be that in some other cases B also depends on another service, D. But at least in terms of this request, in terms of servicing a request that came from A, B doesn't depend on D at all. So when thinking about D's SLOs, we actually don't have any information from this case to go on. Looking at traces is really important; it's not enough to just look at the service diagram. The trace is really what's going to help you here.

Okay, so why are SLOs important? They're really about measuring success in delivering a service; they're about measuring success for on-call. Teams can use them as a guide to prioritize work: if we've established an SLO, we can now understand how much improvement we need to make, and we can use that to trade off against, say, new feature development. And it's a way of holding teams accountable consistently across your organization. You want to make sure that as folks move from one team to another, they're not learning new ways of doing this, and if you're going to measure teams' performance by their ability to meet their SLOs, it's really important that you do that consistently as well. And then finally, these are all ways of thinking about accountability, but agency is really important too, and SLOs are really a way of giving folks a budget for thinking about how much room they have to push more deployments out: how close are they to hitting their SLOs? That's really a way for them to build an error budget as well.

Okay, so just to review my three-piece puzzle here. Documentation obviously is an important part of it, and I think it's more important than just documentation for documentation's sake: it's a way of establishing ownership and knowing who is going to be held accountable. But if you're going to do that, it absolutely has to be up to date; you can't hold people accountable based upon documentation that's out of date. It's also really critical in building confidence within those teams, and, along with tools that describe the dynamic state of a system, it's critical information for folks that are on call or need to understand how a system is actually behaving. On-call: obviously you can't do business without it. Incident management is often the part that people think about most when you say service ownership, but I want to call out that on-call has a lot of other components too, and a lot of those really come down to tools and process. And finally, SLOs: these are really how you hold teams accountable, like I said, how you measure their success. And in all of this, in a system where you have a loosely coupled architecture, where you have teams that are moving independently, tracing is really critical to understanding causality in that system: to understanding who is responsible at a given moment in time, and which services are actually contributing to latency. If you don't have that information, you're not going to be able to keep your documentation up to date, you're not going to be able to make good decisions while you're on call, and you're not going to be able to set SLOs in a way that actually reflects what your customers expect.

Okay, so I mentioned error budgets. That's really just one kind of budget, and I think giving folks a budget to improve reliability, and giving them agency to do that, will help them hit their goals and will lower their stress. But that agency requires them having the right information and the time to do it. And so this really comes down to: ownership doesn't come for free. You've got to give your teams time to actually invest, time to improve and to make things better.

Okay, so, sounds great. How do we get this right? Where do we start? Well, making changes in a DevOps organization is hard. Rolling out new tools and new processes always has to be a bottom-up thing, and whether that's how you run your sprints, which tools you choose for developers, or what observability tools you use, if they're going to be adopted, they really have to provide value to those application development teams. And ideally, more than providing value, they would be a necessary part of their day-to-day work. If you don't have that, at least in my experience, it's just going to be a long, long uphill road to get those things deployed and adopted. Then, to establish and maintain service ownership, use a combination of documentation, on-call processes, and SLOs, and manufacture a need for those tools and processes where necessary. What I mean by that is: make it a requirement to have service ownership defined within the documentation before a service can be defined, before it can be part of the deployment pipeline, like I mentioned. As a platform team, or as part of engineering leadership, you have the ability to actually make these processes required, in a way, and if you do that, it'll go a long way towards them being adopted and becoming part of the tool set of the folks in your organization.

Okay, so just to sum up: I think of ownership as really having two parts. Obviously, accountability is a big part of it; I think that's what folks think about a lot when they think about ownership. That's really setting the deliverables and the goals for the owners within your organization and making sure that you're evaluating their performance based upon those goals and deliverables. That's really how you make those things sink in. And I think a second and equally important part is to give those teams agency, agency to make change. They're going to be a lot more inclined and a lot happier on call if they're able to control and make changes to the kinds of alerts they get, and if they're able to make changes to the architecture itself. So making sure that you're offering them the information, allowing them to build confidence, and giving them the budget to improve is really critical, I think, in establishing service ownership. So with that, I wanted to thank everyone for your attention. You can find me at Dave Spoons on Twitter, and you can find me at lightstep.com. I'm always excited to talk about service ownership. I'm always excited to talk about distributed tracing.
...

Daniel "Spoons" Spoonhower

CTO & Co-Founder @ LightStep



