Conf42 Site Reliability Engineering 2020 - Online

- premiere 5PM GMT

From application to product ownership: an SRE team's journey

Abstract

In the past, my team was responsible for specific applications/executables. Now, we are responsible for specific end-user workflows, no matter which executables they involve. I will describe the technical and social changes that necessitated and enabled this change of paradigm. At the end of 2019, our team supported on the order of 200 executables/microservices that provide functionality for Google's products in the advertising space.

This number was the result of continuous growth over more than 10 years.

Summary

  • Nikolaus Rath is a tech lead for one of the SRE teams in Google's London office. He will talk about the transition from service ownership to product ownership. His presentation has three main parts.
  • My team is 40 people in total, distributed across two locations, one in London and one on the US west coast, to cover the pager around the clock. At any given point in time we have at least two people on call. The reason for that is simply the number of services that we have.
  • The team spends about 30% of its time on interrupt work and 20% on service maintenance. The vast majority, about 50%, is spent on project work. This is not addressing an urgent need that threatened the reliability of our products; the goal is to spend our time more efficiently.
  • SRE support used to be provided for specific services. Things change over time: the easiest way to make a product scale better for a higher number of users is to split it into smaller services. The degree of automation and uniformity across all the things the team supports is really quite impressive.
  • SRE is no longer supporting these services, aka binaries, containers, executables; instead, SRE supports products. Under the old model, the overall product reliability is determined by the unsupported services, and it gets really difficult to translate issues with a service into the resulting user experience.
  • We have defined CUIs (critical user interactions) for many of our projects. We also defined and wrote up what we call a production curriculum. Our services are now continuously monitored for compliance with all the best practices. We still need to extend CUI coverage to all our products, but the changes have been welcomed.
  • The real bottleneck for the product's reliability is in the services that we currently don't support. We changed our engagement model, rescoping our support from individual services to products as a whole, and put developers in charge of running their services. Overall, the changes have been a success.
  • This brings me to the end of my presentation. Normally I'd ask for questions; this being a recording makes that impossible. Still, if there are questions, you're welcome to send them by email. I hope this presentation was interesting and worth your time.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, thank you very much for your interest. I'm really happy to be here, even if it's only virtual. My name is Nikolaus Rath, I work for Google in the London office. I'm a tech lead for one of the SRE teams, and we are supporting a number of products related to Google's advertising business. I will be talking about a transition that our team has undergone over the last one to two years, which we call the transition from service ownership to product ownership. If this doesn't tell you anything, don't worry, that is kind of the idea. Hopefully at the end of my presentation you'll know what I mean.

My presentation is going to have three main parts. I'll start by describing how we operated in the past as a team and what challenges resulted from that. In the second part I'll describe the changes that we made to address these challenges, and in the third part I'll describe the status quo: what was the effect of these changes, and is there anything left to do?

Before I go there though, let me tell you a little bit more about my team. Google has a lot of SREs and a lot of SRE teams, and there are therefore considerable differences between the teams: differences in workload, in scope, and in the way we operate. Please be aware, this is just one team among many. As a matter of fact, some of the teams even have their own logos to distinguish themselves from others. I'd really like to show you ours. I'm not allowed though, so you'll have to make do with this pixelated version. But if you do know your way around the Internet, I think you can probably find it; I believe it is available somewhere.

In any case, my team: we are 40 people in total, distributed across two locations, one in London and one on the US west coast, to cover the pager around the clock. And we have two on-call rotations. That does not refer to the two different sites, but means that at any given point in time we have at least two people on call. The reason for that is simply the number of services that we have; one person on call is not enough. My team is also a very long-standing one. It was founded more than a decade ago, and over time our scope increased a lot. Originally we were responsible just for AdWords. Now we are responsible for, not all of them, but the majority of Google's advertiser and publisher front ends. I should probably explain what that means.

If you're not familiar with the advertising business: a publisher is someone like the New York Times. You have a website, you have lots of users, and you'd like to make money off your website by showing ads. That is what a publisher does. An advertiser is someone like Coca-Cola, who wants to show ads on someone else's webpage. So the advertiser pays the publisher money for showing ads there. And both of these kinds of customers have interfaces where they sell the ad inventory, the places where ads can be put, and where they purchase this inventory to show their ads. These are the products that we support. This is different from what we call ads rendering, which is about actually showing the ad to the user and has very different constraints.

Our workload is a little unusual, even within Google. We spend about 30% of our time on interrupt work, that is, handling incidents and on-call pages. Then we spend 20% of our time on service maintenance. This is non-urgent, routine operations like scaling a service up or moving it to a different data center.
But the vast majority of our time, about 50%, is really spent on project work. This is software engineering work, where we build software to further reduce the time that we spend on interrupt work and service maintenance. As you can tell, we are already in a pretty good position there, but you can always do a little better.

This is probably also the right time to clarify: I'll often refer to problems and difficulties that we face, but this is basically complaining at a high level. Our products, all in all, are pretty reliable, have always been pretty reliable, and people have mostly been happy with them. So this is not addressing an urgent need that threatened the reliability of our products; it is a relatively high-level optimization where the goal is to spend our time more efficiently. If things had actually been burning, I don't think we would have been able to introduce the measures that I'm talking about, simply because they take a lot of time and dedication and don't pay off immediately.

All that being said, let me head into the first part of my presentation and describe to you how we operated in the past as a team, and what problems we ran into where we discovered the room for further optimization. Our engagement model in the past was pretty simple: SRE support is something that is provided for specific services. And when I say service, I mean the same thing that someone else may call a binary, or an executable, or even a container. It is a little piece of software that runs somewhere and provides some functionality, but it's not something that is user-visible. It is something that is defined by the implementation architecture of your product.

Product is the other big concept that I want to distinguish. A product is the thing that the user sees. Google Ads is a product, Google Search is a product, but all the individual pieces of software that provide the functionality are what we call services. These services were the unit at which SRE support used to be provided. A service is either fully SRE-supported or it's not SRE-supported. And support in this case meant that SRE handles all pages, and that SRE is responsible for the SLO, both reactively, meaning incident response, and proactively, meaning making sure incidents don't happen in the first place. And SRE takes care of all the operations work: scaling the service up, making sure it runs with the required redundancy, reviewing SLO compliance, all these kinds of things.

I think the best way to summarize this engagement model is with this hypothetical quote: "SRE support means that we don't have to worry about ops work anymore," said by some hypothetical developer. Now, I don't think anyone has ever expressed it like that explicitly, but I think it very well describes the feeling around this model of support.

However, things change over time. As I mentioned before, when the team was founded, we were responsible for a single product, which is now called Google Ads; back then it was AdWords. All services that provided AdWords were SRE-supported. But then Google started to launch more products like Google Ad Manager or Google AdSense, and these also needed SRE support. So the number of services increased. And then all these products, over time, of course also gained more features and more users.
This meant more and more services, because the easiest way to make a product scale better for a higher number of users is to split it into smaller services, and the easiest way to launch a new feature is to package it in a new service. So as an SRE team, we scaled up as necessary. This means we increased our automation, making sure that we can do lots of operations on all services at the same time, and we made our services more uniform. I think at this point, the degree of automation and uniformity that we have across all the things that we support is really quite impressive. I want to give a few examples.

In the past, we often spent several days moving a service from one data center to another, making sure that all dependencies are brought up and brought down in the right order. These days, it's a matter of a few minutes: you commit a simple change that says this service should run elsewhere, and automation takes care of scheduling all the things in the right order, even if there are exceptions in between. So it's not just scripts that go from beginning to end; it's basically a state machine that figures out how to get to the desired state, no matter where we currently are.

Every individual binary also includes an HTTP server that provides monitoring metrics out of the box. If you link a new library into your binary, it automatically exports its own set of metrics in addition to the ones that the binary already has. There's a monitoring system that picks up a service as soon as you bring it up and starts recording these metrics continuously. And there's even an alerting system that infers basic alerts from the SLO for a service and applies them to the monitoring metrics. So this is really quite impressive, at least in my opinion.
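A rough sketch of that out-of-the-box metrics pattern, a per-binary HTTP endpoint backed by a global registry that linked-in libraries add their own metrics to, might look like the following. This is a minimal illustration with hypothetical names, assuming Node.js, not Google's internal monitoring stack:

```typescript
// Minimal sketch of per-binary metrics export (hypothetical names,
// not Google's internal systems), assuming a Node.js runtime.
import * as http from "http";

// Global registry that any linked-in library can add metrics to.
const registry = new Map<string, () => number>();

export function registerMetric(name: string, read: () => number): void {
  registry.set(name, read);
}

// A "library" exports its own metrics simply by being loaded:
let rpcCount = 0;
export function handleRpc(): void {
  rpcCount++;
}
registerMetric("rpc_requests_total", () => rpcCount);

// Every binary serves /metrics, so a monitoring system can discover
// the endpoint as soon as the service comes up and scrape it
// continuously.
http
  .createServer((req, res) => {
    if (req.url === "/metrics") {
      const body = [...registry]
        .map(([name, read]) => `${name} ${read()}`)
        .join("\n");
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end(body + "\n");
    } else {
      res.writeHead(404);
      res.end();
    }
  })
  .listen(9100);
```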
But still, even with all this automation, the cost that we have per service is never zero. That means there is a limit on the number of services that we can support, and we reached this limit. So at that point, or actually already earlier, because we anticipated reaching it, SRE support was no longer awarded to every single service that is part of a product, but only to the most important services, the ones that provide critical functionality.

But this is where the difficulties start, because the importance of a service cannot be determined automatically. There needs to be a human who makes a judgment call: how important is this piece of functionality for the product as a whole? And this also tends to change over time. Less important services become more important because user habits change, or just because a feature develops over time and becomes more important. What also often happens is that you have interdependencies: a formerly less important service is suddenly a dependency of a more important service, so they really should have the same importance. So what this means is that service importance requires periodic human review. And it also means that we would really like to make changes in which services receive SRE support and which ones don't. But this is really expensive. At least historically, awarding SRE support, what we call onboarding, means we go through a long production readiness review where we evaluate compliance with all our best practices and bring the service into compliance. Dropping SRE support, what we call offboarding, is a little easier. Technically, you still have to adjust the permissions, but it is socially quite expensive, because it boils down to telling a specific developer team: look, your service isn't that important anymore in the big scheme of things. That is not an easy message to get across, and it requires very careful handling and lots of buy-in from lots of stakeholders. It's not something you just do regularly as a routine operation several times a year. So this is one part that has been really challenging.

The second one is that all our binaries have very homogeneous control surfaces, but obviously they are not the same. We treat them as homogeneous, but they are not: they provide very different functionality. Service A failing may result in a tooltip not being displayed; service B failing may mean that the entire page can't be loaded. What this means is that SREs no longer know what each service is really good for, that is, the user impact of something not working. We can tell it doesn't work, but we can't tell what it means for a user. And we also don't have a good way to find out. As SREs, we have way too many services to keep this knowledge. As a backend developer, you don't really know how the RPCs are used by the front end, and as a frontend developer you know what a user sees, but you don't really know how a particular RPC chain fans out once you've handed it over to the backend. So there's really no one who's in a good position to maintain that knowledge.

So you may wonder: these are all challenges, but why is that important in practice? Let me give you two illustrations. The first one is hypothetical: here we have the scenario that Google Ads is down. It's not working, but, interestingly enough, no SRE is doing anything about it, because it so happens that all SRE-supported services are within SLO. This could happen. It wouldn't actually happen that SRE does nothing; we don't work by the letter, and we would still get involved and try to fix the problem. But formally, on paper, our responsibility is, or used to be, not Google Ads as a whole, but a subset of services that provide the most important functionality.

Less hypothetically, the following scenario: we get paged because an SLO is at risk, and the specific issue that we are dealing with is that at service Paul Wilson recode, status error ratio is 0.5, which is bigger than 0.5 (I guess we have a rounding error here), for over 15 minutes, and some more blah blah. Now, you may assume that I know exactly what that means. The truth is, I don't. But most of the time I can still debug this issue, I can still mitigate it, I can even find a root cause and assign a bug to the right developer. Yet at no point in time did I ever know what this actually meant for the user.

So what I'm trying to say with all this is that supporting services means supporting components of a product, components that are defined not by features, but by implementation choices that the user doesn't actually see. And the consequence of this model is that eventually the overall product reliability is determined by the unsupported services. The services that don't receive SRE support become the bottleneck. So SRE efforts would give a much higher return if we invested them in those other services. But that's not easy. First of all, we are ill-equipped to even identify those services, because we don't have an overview of the whole product.
We are focused on the subset of things that we are formally responsible for. And even if we find another service that urgently needs our attention, we are ill-equipped to shift our focus there, because that requires the expensive onboarding procedure for the new service, and then we need to find another service to offboard, which is typically even more difficult. So we have kind of a catch-22: we can either spend our time on the services that are already supported, but that's not where the real problems are, or we can spend a lot of time onboarding and offboarding services, which of course is not an end in itself either.

Furthermore, it gets really difficult to translate issues with a service into the resulting user experience. That is the example that I just gave you with the weird error message. Translating this into something understandable is a nontrivial effort. Typically, when we have an incident, we spend at least as much time figuring out the user impact as we spend fixing the issue.

And finally, it also becomes really difficult to keep SLOs correlated with the user experience, with user happiness. Why is that? First of all, operations may be retried. So just because there's an SLO miss somewhere down the stack, it doesn't mean that the user actually sees an error, because maybe that request got retried successfully. But also the other way around: if there's a problem all the way down the stack and it propagates upwards, then every dependency upwards also reports an error. This may look like a single error to the user, while on our end it looks like basically the whole system is collapsing. So it's really difficult to be confident that if the user is unhappy, our SLO is violated, and that a violation of our SLO really corresponds to a decrease in user happiness.
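To make that divergence concrete, here is a toy calculation under the simplifying assumption of independent failures; the numbers are made up for illustration:

```typescript
// Toy model of why per-service error rates and user-visible error
// rates diverge. Assumes independent failures, a simplification;
// real systems often fail in correlated ways.

// With retries, a backend SLO miss may never reach the user: the
// user only sees an error if the first attempt AND every retry fail.
function userVisibleErrorRate(backendErrorRate: number, retries: number): number {
  return Math.pow(backendErrorRate, retries + 1);
}
console.log(userVisibleErrorRate(0.05, 1)); // 5% backend errors -> 0.25% user-visible

// In the other direction, one failure deep in the stack makes every
// dependency above it report an error too, so a single user-visible
// problem can light up error metrics on many services at once.
const failingDependencyDepth = 5;
console.log(`${failingDependencyDepth} services report errors for one user-visible failure`);
```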
So these were the issues: we're supporting lots of services, but we don't really have an overview of the product as a whole, and therefore we can't really tell where our attention is most needed and what the effect of problems is on the user. So we decided to change that, and the change is actually pretty simple to summarize. We changed our engagement model so that SRE is no longer supporting these services, aka binaries, containers, executables; instead, SRE supports products, things like Google Ads. Individual services are not the responsibility of SRE. SRE is responsible for ensuring that developers are able to run their services in an efficient and reliable manner, whether that is by providing tooling or by providing guidance. But it's not running all the services, and our attention is focused anywhere within the product, depending on where we see the most need at a given time. So in the long term we want to completely abolish this concept of onboarded, supported services versus unsupported services. What is supported is the product as a whole.

If you recall the quote that I gave earlier, it was: "SRE support means that we don't have to worry about ops work anymore." This was the old engagement model. I think this new model is much better captured by a different quote, and that is: "SRE support means that the product is operated efficiently and works well for its users." And I think from the difference between these two statements, you can tell that this is first and foremost a cultural change and a philosophical change; it's not so much a technical one. So this probably all sounds well and good, but how did we go about implementing it in practice?

Well, let me tell you what we did, and it was basically three steps. The first one was to get a better understanding of how our product worked for the user. For that we relied on this thing called a CUI, which stands for critical user interaction. A CUI is an interaction of a user with the product, and that interaction may succeed or fail. It's measured as close to the user as possible; that generally means it's measured in the browser, in some JavaScript code, and the success or failure is then sent back to the server, where it is logged. So a CUI is associated with the availability and reliability of a product feature, not a service, and that feature may be provided by multiple services working in concert. With these CUIs defined and measured, we then rewrote our SLOs so that they apply to CUIs, that is, to product features rather than services. And this is what gives us the official change of SRE responsibility: it is no longer our job to make sure that individual services stay within SLO, but our job is that product features stay within SLO.
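As a minimal sketch of what measuring a CUI in the browser could look like: the endpoint, names, and wrapper here are hypothetical, but the shape, wrap the interaction, record success or failure, and report it back to the server for logging, follows the description above:

```typescript
// Sketch of client-side CUI measurement. The endpoint and names are
// hypothetical; only the overall shape follows the talk.
type CuiOutcome = { cui: string; ok: boolean; durationMs: number };

function reportCui(outcome: CuiOutcome): void {
  // sendBeacon survives page navigation, which matters for
  // interactions that leave the page.
  navigator.sendBeacon("/log/cui", JSON.stringify(outcome));
}

// Wrap an interaction so that success or failure is measured as
// close to the user as possible, regardless of how many backend
// services are involved in serving it.
async function measureCui<T>(cui: string, interaction: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    const result = await interaction();
    reportCui({ cui, ok: true, durationMs: performance.now() - start });
    return result;
  } catch (err) {
    reportCui({ cui, ok: false, durationMs: performance.now() - start });
    throw err;
  }
}

// Usage, e.g. for a "preview a video ad" feature:
// await measureCui("preview-video-ad", async () => {
//   const r = await fetch("/ads/preview");
//   if (!r.ok) throw new Error(`HTTP ${r.status}`);
// });
```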
Now, with SRE being responsible for the product, the next question is of course: okay, so who runs the services? I already mentioned it before: we want developers to do that. So step two is that all services are owned by developers. At this point I should probably point out that Google has lots of SRE teams, and this change is not Google-wide. This change is something that we apply to the services in our remit, and in some cases we are still in the process of applying it. So SRE no longer runs services, but SRE ensures that following our best practices is the easiest way for developers to run a service, and SRE gets involved if the CUIs are at risk, so if a feature doesn't work, rather than if a service doesn't work. And we are available as an escalation point. If a developer is handling an incident and they think, okay, this is a big thing that really needs SRE engagement, then they are expected and very welcome to page us. But we are not the first line of defense; we are the second line of defense.

To give a different analogy, I think SRE takes the role of a production doctor. The doctor can tell if the patient is sick, and they can typically also make the patient healthy again. And the doctor tells the patient: okay, here's what you need to do to avoid getting sick, here are good practices for hygiene. This is also what we do as SRE. We can tell when production is unhealthy, we can typically fix it, and we can tell developers what they need to do to keep their systems healthy. But we cannot do this ourselves; just as the doctor can't make sure that the patient stays healthy, the patient needs to follow the recommendations from their doctor. This is how we see the role of SRE. We give the guidance, we provide the tools, but implementing it is not something that's feasible for SRE to do for every individual service. Instead, SRE engagements with specific services are now always time-limited: they are scoped either to fix a particular reliability issue or to teach particular operations-related skills, so that developers can take care of a particular issue themselves in the future.

This leaves us with one more issue. We now have a measure of how well the product works, we have defined when SRE gets engaged, and we have established that components are run primarily by developers. But how do we know what to look at? Well, being SREs, this is where we finally started to address things with technical solutions.

You're probably familiar with the concept of architecture diagrams, where you have nodes for individual services and arrows that tell you what talks to what. The problem that we had with these diagrams is that they were effectively useless, because they were way too big: there are so many services, and really everything seems to talk to almost everything else at any given point in time. So they wouldn't help you very much. So we decided we needed a better way to generate these diagrams, and we wanted to generate them dynamically from RPC traces for CUIs, so that we could tell, for a given CUI, for a given feature, which services are involved in serving that feature. That was our first development project.

The second one is that we looked at our source control system and said: look, we have this great feature of a CI that runs all the tests and ensures that proposed changes are good and ready for commit. Why don't we have something similar for our production setup? The idea here is that we have some CI for the production setup that continuously checks everything that runs in production against our best practices, and then we have dashboards that highlight the production setup overall and indicate to us: here are the services that need attention, this is where we should spend our time.
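Conceptually, that "CI for production" might look like this sketch; the specific checks and config fields are made-up examples of best practices, not the team's actual rules:

```typescript
// Conceptual sketch of a production-config CI. Checks and fields
// are hypothetical illustrations of "best practices".
interface ServiceConfig {
  name: string;
  replicas: number;
  sloDefined: boolean;
  alertsDerivedFromSlo: boolean;
}

type Check = (svc: ServiceConfig) => string | null; // null = compliant

const checks: Check[] = [
  (s) => (s.replicas >= 3 ? null : "runs with insufficient redundancy"),
  (s) => (s.sloDefined ? null : "has no SLO defined"),
  (s) => (s.alertsDerivedFromSlo ? null : "alerts are not derived from the SLO"),
];

// Run continuously over everything in production; the output feeds
// a dashboard highlighting where SRE time is best spent.
function audit(services: ServiceConfig[]): Map<string, string[]> {
  const findings = new Map<string, string[]>();
  for (const svc of services) {
    const issues = checks
      .map((check) => check(svc))
      .filter((msg): msg is string => msg !== null);
    if (issues.length > 0) findings.set(svc.name, issues);
  }
  return findings;
}
```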
Which brings me to the present. Where did all of this lead? And oops, that was one too far. The thing that I want to emphasize most is that the changes have been welcomed. This is both the most important point, this being primarily a social change rather than a technical one, and also the one that we were most worried about: is everyone going to share our opinion? But luckily we were able to convince pretty much everyone that this new model of engaging is a better use of everyone's time. Of course, people in the dev org, the further down you go from higher-level leadership, were a little worried that they'd have more to do. But there was pretty much universal buy-in.

Let me go into a little more detail of where we are. We have defined CUIs for many of our projects. We have redefined our SLOs based on them for some of our products as well, and we're in the process of extending coverage to all the things that we support. We also have these architecture diagrams available; they are computed on demand for particular CUIs. We concluded several engagements under the new model that included both teaching about production best practices and what we call reliability deep dives, and this really enabled us to look at individual issues in much more depth than we were able to before. We also defined and wrote up what we call a production curriculum. It is basically the set of production skills that we think developers need to have at their disposal to run their services reliably. Something that's not in there is where SRE steps in, and we're currently in the process of creating training materials for that. And finally, we put in place all the infrastructure for this production CI, so our services are now continuously monitored for compliance with all the best practices. The UI at the moment is still a little rough: we can use it to retrieve data, but we need to make it a lot neater so that we can bring it to developer leadership and show them: okay, here's an overview of your product, and here is where we are going to spend our time now, because there are things that need looking into.

Let me come back to the example that I had earlier, this hypothetical, or actually not so hypothetical, page that we got in the past: "at service Paul Wilson recode, status error ratio...", blah, blah, blah. This was an alert for a particular service. What we now have, with the introduction of CUIs, is the following: "Some publishers are not able to preview video ads. Error budget will be exhausted in 78 minutes." At this point we can basically read off what the user impact is, because our alerting is now based on the user impact. Note that this does not completely replace the old alerts. The new alerts and the CUIs define when SRE gets involved; once we are involved, we'll of course also look at per-service metrics to determine where exactly the root cause is and how to mitigate it.

I'd also like to show you an example of an architecture diagram. The service names here are fictitious, but that doesn't matter for the point I'm trying to make. On the left, in this turquoise shade, you see two CUIs: basically, loading one page of the application and loading another page of the application. The arrows indicate which services are involved in completing each CUI. You'll see that in some cases there are arrows that go back and forth. That means that, for example, the myrepper data service at the top gets called by my data API, but then calls back into that service as well. Sometimes that can't be avoided. Everything that is in those oval shapes is a service within our remit, and if you click on one of those, you get a per-service summary: what does this service look like, are there any current issues? Then there are the other, rectangular boxes; these are services within Google that are supported by other SRE teams. So this is automatically narrowed down to services that actually are our responsibility and to services that are involved in serving a particular CUI. Most importantly, however, you can select multiple CUIs, suppose they were both failing, and compute the intersection, showing only the services that are involved in all of them. This enables us to really quickly narrow down and drill down to the service that is most likely causing the issues.
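That drill-down trick, intersecting the dependency sets of several failing CUIs, is simple set arithmetic once each CUI's service graph has been derived from RPC traces. A sketch with fictitious service names:

```typescript
// Sketch: narrowing down a likely culprit by intersecting the
// services involved in each failing CUI. All names are fictitious.
const cuiDependencies: Record<string, Set<string>> = {
  "load-overview-page": new Set(["frontend", "data-api", "report-store"]),
  "load-reports-page": new Set(["frontend", "data-api", "billing"]),
};

function intersectServices(cuis: string[]): Set<string> {
  return cuis
    .map((cui) => cuiDependencies[cui])
    .reduce((acc, deps) => new Set([...acc].filter((s) => deps.has(s))));
}

// If both page loads are failing, the shared dependencies are the
// most likely root-cause candidates.
console.log(intersectServices(["load-overview-page", "load-reports-page"]));
// -> Set { "frontend", "data-api" }
```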
A few more things are planned. We still need to extend CUI coverage to all our products: some are completely covered at the moment, some are partially covered, some still need work. And the second big thing that we still need to do is completely phase out the legacy status of the supported services. Remember, we promised to do all operations work for some SRE-supported services, and we have not yet stopped doing that for the services for which we were already doing it. Basically, we wanted to be very sure that everyone is on board with the change and that everything is working exactly as planned, and this will be the next big step. One thing that we need to do before that is fix a small missing feature in our monitoring and alerting setup. Currently we simply don't have enough granularity to route alerts: we cannot distinguish whether a feature is completely broken or just a few users are affected. This is another thing we want to address before we then finally start to phase out this "SRE as a full service" model.

So let me summarize my talk. In my team, we've developed a lot of automation and we have standardized all the services that we support to a very high degree, and this enabled us to scale our support to hundreds and hundreds of different services. However, the price that we paid for this was a disconnect of SRE from the user experience. While we could identify issues with these individual services, mitigate them, and even root-cause them, we had a lot of trouble explaining the user impact of these issues, what the user actually experienced. Furthermore, products being built from more services than we can support eventually leads to an inefficient use of our time, because we spend a lot of time looking at the same services while the real bottleneck for the product's reliability is in the services that we currently don't support.

We addressed these issues by changing our engagement model and rescoping our support from individual services to products as a whole. And we put developers in charge of running their services, with SRE being responsible not for running the services individually, but for making sure that developers are empowered to do that on their own, following SRE best practices. SRE's job is to make sure that following best practices and running the service reliably is basically the easiest way to run a service. Overall, the changes have been a success. We are still implementing some parts, but the majority is there, and we now have a much better understanding of our products from the user's point of view. We can immediately tell what is broken for the user, and we therefore have a lot more confidence that we are able to maintain user happiness. We also gained a much better understanding of product health as a whole, which has enabled us to focus our efforts flexibly on the services that most need our attention, and therefore to make more efficient use of our time.

This brings me to the end of my presentation. Normally I'd ask for questions; this being a recording, of course, makes that impossible. Still, if there are questions, you're welcome to send them by email. I won't promise that I'll respond to all of them because, well, I don't know how many questions there will be, and depending on the kind of question, I may have to get approval to make sure that I can share something externally, which may not always be worth the effort. But please do feel welcome, and I will try to do what I can. I hope this presentation was interesting and worth your time. Thank you very much for your attention.
...

Nikolaus Rath

Tech Lead @ Google (SRE) London
