Transcript
Hello everybody, thank you very much for your interest. I'm really
happy to be here, even if it's only virtual. My name is
Nikolaus Rath, I work for Google in one of the London offices.
I'm a tech lead for one of the SRE teams and
we are supporting a number of products related to Google's
advertising business. I will be talking about a transition
that our team has undergone over the last one to two years, which we call the transition from service ownership to product ownership. If this doesn't tell you
anything, don't worry, that is kind of the idea. Hopefully at the end of my
presentation you'll know what I mean.
My presentation is going to have three main parts.
I'll start by describing how we have operated
in the past as a team and what challenges resulted
from that. In the second part I'll describe the changes that
we made to address these challenges, and in the third part I'll
describe the status quo. What was the effect of these changes?
Is there anything left to do?
Before I go there though, let me tell you a little bit more
about my team. Google has a lot of SREs and a lot of SRE teams, and there are therefore considerable differences between the teams. That is, differences in workload, differences in scope, differences in the way we operate. Please be aware, this is just one team among many. As a matter of fact, some of the teams even have their own logos to distinguish themselves from others. I'd really like to show you ours. I'm not allowed to, though, so you'll have to make do with this pixelated version.
But if you do know your way around the Internet,
I think you can probably find it. I believe it is available somewhere.
In any case, my team: we are 40 people in total, distributed across two locations, one in London and one on the US west coast, to cover the pager around the clock. And we have two on-call rotations. That does not refer to the two different sites, but it means that at any given point in time we have at least two people on call. The reason for that is simply the number of services that we have; one person on call is not enough. My team is also a very long-standing one. It was founded more than a decade ago, and over time our scope increased a lot. Originally, when the team was created, we were responsible just for AdWords. Now we are responsible for, not all of them, but the majority of Google's advertiser and publisher front ends. I should probably explain what that means. If you're not familiar with the advertising
business, basically a publisher is someone like the New York Times. You have a website,
you have lots of users, and you'd like to make money off your website by
showing ads. That is what a publisher does.
An advertiser is someone like Coca-Cola, who wants to show ads on somebody else's webpage. So the advertiser pays the publisher money for showing ads there. And both of these kinds of customers have interfaces where they sell the ad inventory, the places where ads can be put, or where they purchase this inventory to show their ads there. And these are the products that we support. This is different from what we call ads rendering, which is about actually showing the ad to the user; that has very different constraints.
Our workload is a little unusual, even within Google. We spend about 30% of our time on interrupt work, that is, handling incidents and other on-call work. Then we spend 20% of our time on service maintenance. This is basically non-urgent, routine operations like scaling the service up or moving it to a different data center. But the largest share of our time, about 50%, is really spent on project work.
And this is software engineering work, where we spend
our time building software to further reduce the time
that we spend on interrupt work and service maintenance. And as you can tell,
we are already in a pretty good position there, but you can always do a
little better. This is probably also the right time to clarify: I'll often refer to problems and difficulties that we face and things like that, but this is basically complaining at a high level. Our products, all in all, are pretty reliable, have always been pretty reliable, and people have mostly been happy with them. So this is not addressing an urgent need that threatened the reliability of our products. But it is a relatively high-level optimization where
the goal is to spend our time more efficiently.
If things had actually been burning, I don't think we would have been
able to introduce the measures that I'm talking about simply because they
take a lot of time and dedication and don't pay off immediately.
All that being said, let me head into the first part of my presentation and describe to you how we operated in the past as a team, what problems arose, and where we discovered room for further optimization.
Our engagement model in the past was pretty simple.
So SRE support is something that is provided for specific
services. And when I say service, then I mean the same
thing that someone else may call a binary, or an executable, or even a
container. So it is a little piece of software that
runs somewhere that provides some functionality, but it's not something
that is user visible. It is something that is
kind of defined by the implementation architecture of your product.
So product is the other big concept that I want to distinguish.
Product is the thing that the user sees. Like Google Ads is
a product, Google Search is a product, but all these individual pieces of software that provide the functionality are what we call services. So these services were the unit at which SRE support used to be provided. A service is either fully SRE supported or it's not SRE supported. And support in this case meant that SRE handles all pages, that SRE is responsible for the SLO, both reactively, meaning incident response, and proactively, meaning making sure incidents don't happen in the first place.
And SRE takes care of all the operations work,
meaning scaling the service up, making sure it runs with the required redundancy,
reviewing SLO compliance, all these kind of things.
And I think the best way to really summarize this engagement model is with this hypothetical quote, which goes: "SRE support means that we don't have to worry about ops work anymore", said by some hypothetical developer. As I said, I don't think anyone has ever expressed it quite that explicitly, but I think it describes the feeling around this model of support very well.
However, things change over time. As I mentioned before, when the team was founded, we were responsible for a single product, which is now called Google Ads; back then it was AdWords, and all services that provided AdWords were SRE supported. But then Google started to launch more products like Google Ad Manager or Google AdSense, and these also needed SRE support. So the number of services increased. And then all these products, over time, of course also gained more features and more users, which means more and more services, because the easiest way to make the product scale better for a higher number of users is to split it into smaller services, and the easiest way to launch a new feature is to package it in a new service. So as an SRE team, we scaled up as necessary. This means we increased our automation, making sure that we can do lots of operations on all services at the same time. And we made our services more uniform. And I think at this point,
the degree of automation and uniformity that we have across all
the things that we support is really quite impressive. I want to give a few
examples. In the past, we often spent several days moving
a service from one data center to another, making sure that all
dependencies are brought up and brought down in the right order. These days,
it's a matter of a few minutes: you commit a simple change that says this service should run elsewhere, and then automation takes care of scheduling all the steps in the right order, even if there are exceptions along the way.
So it's not just scripts that go from beginning to end, but it's
basically a state machine that figures out how do we go to the desired
state, no matter where we currently are.
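To illustrate the idea, here is a toy sketch, with invented states and actions rather than any real Google tooling, of what "figure out how to reach the desired state from wherever we currently are" can look like in code:

```python
# A toy sketch of the "state machine, not a script" idea: compare the
# desired state with whatever production currently looks like and derive
# only the remaining steps, so the process can resume from any
# intermediate point. States and actions are invented for illustration.
DESIRED = {"serving_location": "datacenter-b", "replicas": 6}

def plan(current: dict) -> list[str]:
    """Return the actions still needed to converge on DESIRED."""
    steps = []
    if current.get("serving_location") != DESIRED["serving_location"]:
        steps.append(f"bring up service in {DESIRED['serving_location']}")
        steps.append("shift traffic to the new location")
        steps.append(f"tear down service in {current.get('serving_location')}")
    if current.get("replicas", 0) < DESIRED["replicas"]:
        steps.append(f"scale up to {DESIRED['replicas']} replicas")
    return steps

# No matter where we were interrupted, re-running plan() with the current
# state yields only the remaining work:
print(plan({"serving_location": "datacenter-a", "replicas": 6}))
print(plan({"serving_location": "datacenter-b", "replicas": 4}))
```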
Every individual binary also includes
an HTTP server that provides monitoring metrics out
of the box. If you link a new library into your binary, it automatically exports its own set of metrics in addition to the ones that the binary already has. There's a monitoring
system that picks up as soon as you bring up a service and starts recording
these metrics continuously. And there's even an alerting system
that then infers basic alerts based on the SLO for a service
and applies them to the monitoring metrics. So this is
really quite impressive, at least in my opinion.
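To make that a bit more concrete, here is a minimal sketch, deliberately not Google's internal stack, of the general pattern of a binary serving its own metrics over HTTP for a monitoring system to scrape; the port, endpoint, and plain-text format are my own illustrative choices.

```python
# Minimal illustration of "every binary exposes its metrics over HTTP".
import threading
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = Counter()          # e.g. METRICS["rpc_errors_total"] += 1
METRICS_LOCK = threading.Lock()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        with METRICS_LOCK:
            body = "\n".join(f"{name} {value}" for name, value in METRICS.items())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

def start_metrics_server(port: int = 9100) -> None:
    """Expose /metrics in a background thread, alongside the real service."""
    server = HTTPServer(("", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```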
But still, even with all this automation, the cost that we have per service is never zero. So that means there is a limit on the number of services that we can support, and we reached this limit. So at that point, or actually already earlier, because we anticipated reaching it, SRE support was no longer awarded to every single service that is part of a product, but only to the most important services, the ones that provide critical functionality.
But this is where kind of the difficulties start,
because the importance of a service cannot be
determined automatically. Like there needs to be a human
who makes a judgment call. How important is this piece of
functionality for the product as a whole? And this also tends to
change over time. Less important services become more important because user habits change, or just because the feature develops over time and becomes more important. And what also often happens is that you have interdependencies. So a formerly less important service is suddenly a dependency of a more important service, so they really should have the same importance. What this means is that service importance requires periodic human review. And it also means that we would really like to make changes in which services receive SRE support and which ones don't.
But this is really expensive. At least historically, awarding SRE support, what we call onboarding, means we go through this long production readiness review where we evaluate compliance with all our best practices and bring the service into compliance. And dropping SRE support, what we call offboarding, is a little easier technically, you still have to adjust the permissions, but it is socially quite expensive, because it boils down to telling a specific developer team: look, your service isn't that important anymore in the big scheme of things. That is not an easy message to get across, and it requires very careful handling and lots of buy-in from lots of stakeholders. It's not something you just do regularly as a routine operation several times a year.
So this is one part that has been really challenging.
The second one is really that all our binaries have very homogeneous control surfaces, but obviously they are not the same. We treat them as homogeneous, but they are not. They provide very different functionality. Service A failing may result in a tooltip not being displayed; service B failing may mean that the entire page can't be loaded. And what this means is that SREs no longer know what each service is really good for, like the user impact of something not working. We can tell it doesn't work, but we can't tell what it means for a user. And we also don't have a good way to find out. As SREs, we have way too many services to keep this knowledge. As a backend developer, you don't really know how your RPCs are used by the front end, and as a frontend developer you know what a user sees, but you don't really know how a particular RPC chain fans out once you've handed it over to the backend. So there's really no one who's in a good position to maintain that knowledge.
So you may wonder: okay, these are all challenges, but why is that important in practice? Let me give you two illustrations. The first one is hypothetical. Here we have the scenario that Google Ads is down. It's not working, but interestingly enough, no SRE is doing anything about it, because it so happens that all SRE-supported binaries, all SRE-supported services, are within SLO. This could happen. Now, it wouldn't actually happen that SRE doesn't do anything; we don't work by the letter. We would still get involved and try to fix the problem. But formally, on paper, our responsibility is, or used to be, not Google Ads as a whole, but a subset of services that provide the most important functionality.
Less hypothetically, the following scenario: we get paged because an SLO is at risk, and the specific issue that we are dealing with is that at service Paul Wilson, the recode status error ratio is 0.5, which is bigger than 0.5 (I guess we have a rounding error here), for over 15 minutes, and some more blah blah. Now you may assume that I know exactly what that means. The truth is, I don't. But most of the time I can still debug this issue, I can still mitigate this issue, I can even find a root cause and assign a bug to the right developer. But at no point in time did I ever know what this actually meant for the user.
So what I'm trying to say with all this is really that supporting services means supporting components of a product, components that are defined not by features, but by implementation choices that the user doesn't actually see. And the consequence of this model is that eventually the overall product reliability is determined by the unsupported services. The services that don't receive SRE support become the bottleneck. So SRE efforts would give a much higher return if we invested them into those other services. But that's not easy. First of all, we are ill-equipped to even identify those services, because we don't have an overview of the whole product; we are focused on the subset of things that we are formally responsible for. And even if we find another service that urgently needs our attention, we are ill-equipped to shift our focus there, because it requires this expensive onboarding procedure for the new service, and then we need to find another service to offboard, which is typically even more difficult. So we have kind of a catch-22. We can either spend our time on the services that are already supported, but that's not where the real problems are, or we can spend a lot of time onboarding and offboarding services, which of course is not an end goal in itself either.
Furthermore, it gets really difficult to translate issues with a service into the resulting user experience. That is the example that I just gave you with this weird error message. Translating this into something understandable is a nontrivial effort. Typically, when we have an incident, we spend at least as much time figuring out the user impact as we spend fixing the issue.
And finally, it also becomes really difficult to keep SLOs correlated with the user experience, with user happiness. Why is that? First of all, operations may be retried. So just because there's an SLO miss somewhere down the stack, it doesn't mean that the user actually sees an error, because maybe that request got retried successfully. But it also goes the other way around: if there's a problem all the way down the stack and it propagates upwards, then as a consequence every dependency further up also reports an error. This may look like a single error to the user, while on our end it looks like basically the whole system is collapsing. So it's really difficult to be confident that if the user is unhappy, our SLO is violated, and that a violation of our SLO really corresponds to a decrease in user happiness.
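To put a number on the retry point, here is a tiny worked example, with made-up figures, of why a backend's SLO miss doesn't translate one-to-one into user-visible errors when the caller retries.

```python
# A toy illustration (my own numbers, not from the talk) of why a
# backend SLO miss does not map 1:1 to user-visible errors when the
# caller retries failed requests.
def user_visible_error_rate(backend_error_rate: float, attempts: int) -> float:
    """Probability that every attempt fails, assuming independent failures."""
    return backend_error_rate ** attempts

# A backend missing its SLO with 5% errors...
print(user_visible_error_rate(0.05, attempts=1))  # 0.05   -> what the service SLO sees
# ...looks almost fine to the user if the frontend retries once:
print(user_visible_error_rate(0.05, attempts=2))  # 0.0025 -> what the user sees
```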
So these were the issues: we're supporting lots of services, but we don't really have an overview of the product as a whole, and therefore we can't really tell where our attention is most needed and what the effect of problems is on the user. So we decided to change that, and the change is actually pretty simple to summarize. We changed our engagement model so that SRE is no longer supporting these services, aka binaries, containers, executables. Instead, SRE supports products, things like Google Ads, and individual services are not the responsibility of SRE. SRE is responsible for making sure that developers are able to run their services in an efficient manner and in a reliable manner, whether that is by providing tooling or by providing guidance, but not by running all the services. And our attention is focused anywhere within the product, depending on where we see the most need at a given time. So in the long term we want to completely abolish this concept of an onboarded, supported service versus a not supported service. What is supported is the product as a whole.
If you recall the quote that I gave earlier, it was: "SRE support means that we don't have to worry about ops work anymore." That was the old engagement model. I think this new model is much better captured by a different quote, and that is: "SRE support means that the product is operated efficiently and works well for its users." And I think from the difference between these two statements, you can tell that this is first and foremost a cultural change and a philosophical change; it's not so much a technical one.
So this probably all sounds well and good, but how did we go about implementing it in practice? Well, let me tell you what we did. It was basically three steps. The first one was to get a better understanding of how our product worked for the user. And for that we relied on this thing called a CUI, which stands for critical user interaction. A CUI is an interaction of a user with the product, and that interaction may succeed or fail. And it's measured as close to the user as possible. That generally means it's measured in the browser, in some JavaScript code, and then the success or failure signal is sent back to the server, where it is logged. So a CUI is associated with the availability and reliability of a product feature, not a service, and that feature may be provided by multiple services working in concert. With these CUIs being defined and measured, we then rewrote our SLOs so that they apply to CUIs, so to product features rather than to services. And this is what gives us this official change of SRE responsibility. It is no longer our job to make sure that individual services stay within SLO; our job is to make sure that product features stay within SLO.
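As a rough sketch of what that looks like on the server side, with field names and the SLO target invented for illustration, the CUI success and failure events reported from the browser are what the SLO is computed over, rather than any single service's metrics:

```python
# A minimal sketch, with made-up field names, of SLOs computed over
# critical user interaction (CUI) events instead of per-service metrics.
from dataclasses import dataclass

@dataclass
class CuiEvent:
    cui_name: str      # e.g. "publisher_previews_video_ad"
    success: bool      # reported from the browser, logged on the server

def cui_availability(events: list[CuiEvent], cui_name: str) -> float:
    """Fraction of successful interactions for one CUI."""
    relevant = [e for e in events if e.cui_name == cui_name]
    if not relevant:
        return 1.0
    return sum(e.success for e in relevant) / len(relevant)

def within_slo(events: list[CuiEvent], cui_name: str, target: float = 0.999) -> bool:
    return cui_availability(events, cui_name) >= target
```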
Now, with SRE being responsible for the product, the next question is of course: okay, so who runs the services? I already mentioned that before: we want developers to do that. So step two is that all services are owned by developers. At this point I should probably point out that Google has lots of SRE teams; this change is not a Google-wide one. It is something that we apply to the services in our remit, and in some cases we are still in the process of applying it. So SRE no longer runs services, but SRE ensures that following our best practices is the easiest way for developers to run a service, and SRE gets involved if the CUIs are at risk, so if a feature doesn't work, rather than if a service doesn't work. And we are available as an escalation point. So if a developer is handling an incident and they think, okay, this is a big thing that really needs SRE engagement, then they are expected and very welcome to page us. But we are not the first line of defense; we are the second line of defense. To give a different analogy,
I think SRE takes the role of a production doctor. The doctor can tell if the patient is sick, and they can typically also make the patient healthy again. And the doctor tells the patient: okay, here's what you need to do to avoid getting sick, here are good practices for hygiene. And this is also what we do as SRE. We can tell when production is unhealthy, we can typically fix it, and we can tell developers what they need to do to keep their systems healthy. But we cannot do this ourselves. Just like the doctor can't make sure that the patient stays healthy, the patient needs to follow the recommendations from their doctor. And this is also how we see the role of SRE: we give the guidance, we provide the tools, but implementing it is not something that's feasible for SRE to do for every individual service. Instead, SRE engagements with specific services are now always time-limited. They are scoped either to fix a particular reliability issue, or to teach particular operations-related skills so that developers can take care of a particular issue themselves in the future.
This leaves us with one more issue. We now have a measure of how
well the product works. We have defined when SRE gets engaged,
and we have established that components are run primarily by
developers. But how do we know what to look at?
Well, being SREs, this is where we finally started to address things with technical solutions.
So you're probably familiar with the concept of architecture diagrams,
where you have nodes for individual services and
arrows that tell you what talks to what. The problem that
we had with these diagrams is that they were effectively useless, because they were way too big: there are so many services, and really everything seems to talk to almost everything else at any given point in time. So they wouldn't help you very much. So what we decided is that we need a better way to generate these diagrams, and we wanted to generate them dynamically from RPC traces for CUIs, so that we could tell, for a given CUI, for a given feature, which services are involved in serving that feature. That was our first development project. The second one is that we looked at our source control system. We said: look, we have this great feature of a CI that runs all the tests and ensures that proposed changes are good and ready for commit. Why don't we have something similar for our production setup? So the idea here is that we have some CI for the production setup that continuously checks everything that runs in production against our best practices. And then we have dashboards that highlight the production setup overall and indicate to us: okay, here are the services that need attention, this is where we should spend our time.
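As a very rough sketch of that "CI for production" idea, with checks and config fields that I'm inventing purely for illustration, the core is just: evaluate every service's production configuration against a list of best-practice checks and report the violations.

```python
# A rough sketch of continuously auditing production configs against
# best-practice checks; the checks and config fields are invented.
from dataclasses import dataclass

@dataclass
class ServiceConfig:
    name: str
    replicas: int
    locations: list[str]
    has_slo: bool

def check_redundancy(cfg: ServiceConfig) -> bool:
    return cfg.replicas >= 3 and len(cfg.locations) >= 2

def check_slo_defined(cfg: ServiceConfig) -> bool:
    return cfg.has_slo

CHECKS = {"redundancy": check_redundancy, "slo_defined": check_slo_defined}

def audit(configs: list[ServiceConfig]) -> dict[str, list[str]]:
    """Return, per check, the services that currently violate it."""
    return {
        check_name: [c.name for c in configs if not check(c)]
        for check_name, check in CHECKS.items()
    }
```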
Which brings me to the present. Where did all of this lead? (And oops, that was one slide too far.) The thing that I want to emphasize most is that the changes have been welcomed. This is both the most important point, this being primarily a social change rather than a technical one, and also the one that we were most worried about: is everyone going to share our opinion? But luckily we were able to convince pretty much everyone that this new model of engagement is a better use of everyone's time. Of course, people in the dev org, the further down you go from higher-level leadership, are a little worried that they'll have more to do. But there was pretty much universal buy-in. Let me go into a little more detail of where we are. We have defined CUIs for many of our products, we have redefined our SLOs based on them for some of our products as well, and we're in the process of extending coverage to all the things that we support.
We also have these architecture diagrams available; they are computed on demand for particular CUIs. We concluded several engagements under the new model that included both teaching about production best practices and what we call reliability deep dives. And this really enabled us to take a look at individual issues in much more depth than we had been able to do before. We also defined and wrote up what we call a production curriculum. It is basically the set of production skills that we think developers need to have at their disposal to run their services reliably; anything that's not in there is where SRE steps in. And we're currently in the process of creating training materials for that. And finally, we put in place all the infrastructure for this production CI, so our services are now continuously monitored for compliance with all the best practices. The UI at the moment is still a little bit rough: we can use it to retrieve data, but we need to make it a lot neater so that we can bring it to developer leadership and show them: okay, here's an overview of your product, here is where we are going to spend our time now, because these are the things that need looking into.
Let me come back to the example that I had earlier. This was this hypothetical, or actually not so hypothetical, page that we got in the past: at service Paul Wilson, recode status error ratio, blah, blah, blah. This was an alert for a particular service. What we now have with the introduction of CUIs is the following: some publishers are not able to preview video ads; the error budget will be exhausted in 78 minutes. So at this point we can basically read off what the user impact is, because our alerting is now based on the user impact. Note that this does not completely replace the old alerts. The new alerts and the CUIs define when SRE gets involved. And once we are involved, we'll of course also look at per-service metrics to determine where exactly the root cause is and how we mitigate it. I'd also like to show you an example of such an architecture diagram. The service names here are fictitious, but that doesn't matter for the point I'm trying to make. On the left, in this kind of turquoise shade, you see two CUIs; basically it's loading one page of the application and loading another page of the application, and the arrows indicate which services are involved in completing each CUI.
You'll see that in some cases there are arrows that go back and forth. That really means that, for example, the reporting-data service at the top gets called by the data API, but then calls back into that service as well. Sometimes that can't be avoided. And everything that is in those oval shapes is services that are within our remit; if you click on those, you would get a per-service summary of what this service looks like and whether there are any current issues. Then there are the other, rectangular boxes with arrows; these are services within Google that are supported by other SRE teams. So this is automatically narrowed down to services that actually are our responsibility and to services that are involved in serving a particular CUI. Most important, however, is that you can actually select multiple CUIs, suppose they were both failing, and then compute and show only the intersection of the services that are involved. And this enables us to really quickly narrow down and drill down to the service that is most likely causing the issues.
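The narrowing-down step is simple enough to sketch in a few lines; here the trace format, CUI names, and service names are invented for illustration, but the idea is just to derive the set of services each CUI touches and intersect the sets of the CUIs that are failing.

```python
# A small sketch of the narrowing-down idea: derive, from RPC traces,
# which services each CUI touches, then intersect the sets for the CUIs
# that are currently failing to find the most likely common culprit.

def services_per_cui(traces: list[dict]) -> dict[str, set[str]]:
    """traces: [{"cui": "load_reports_page", "services": ["frontend", "report-api"]}, ...]"""
    result: dict[str, set[str]] = {}
    for trace in traces:
        result.setdefault(trace["cui"], set()).update(trace["services"])
    return result

def common_suspects(traces: list[dict], failing_cuis: list[str]) -> set[str]:
    """Services that appear in the traces of every failing CUI."""
    per_cui = services_per_cui(traces)
    sets = [per_cui.get(cui, set()) for cui in failing_cuis]
    return set.intersection(*sets) if sets else set()
```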
A few more things are planned. We still need to extend CUI coverage to all our products; some are completely covered at the moment, some are partially covered, some still need work. And the second big thing that we still need to do is to completely phase out this legacy status of the supported services. Remember, we promised to do all operations work for some SRE-supported services, and we have not yet stopped doing that for the services for which we were already doing it. Basically, we wanted to be very sure that everyone is on board with the change and that everything is working exactly as planned, and this will be the next big step. One thing that we need to do before that is fix a small bug, or rather a small missing feature, in our monitoring and alerting setup. Currently we simply don't have enough granularity to route alerts; we cannot distinguish whether a feature is completely broken or whether just a few users are affected. And this is another thing we want to address before we then finally start to phase out this SRE-as-a-full-service model. So let me summarize
my talk. In my team, we've developed
a lot of automation and we have standardized all the services
that we support to a very high degree, and this enabled
us to scale our support to really hundreds and hundreds
of different services. However, the price that we paid for this was a disconnect of SRE from the user experience. So while we could identify issues with these individual services, and mitigate them and even root cause them, we had a lot of trouble explaining what the user impact of these issues was, what the user actually experienced. Furthermore, products being built from more services than we can support eventually leads to an inefficient use of our time, because we spend a lot of time looking at the same services, while the real bottleneck for the product's reliability is in the services that we currently don't support. We addressed these issues by changing our engagement model and rescoping our support from individual services to products as a whole. And we put developers in charge of running their services, with SRE being responsible not for running the services individually, but for making sure that developers are empowered to do that on their own, following SRE best practices. SRE's job is to make sure that following best practices and running the service reliably is basically the easiest way to run a service,
and overall the changes have been a success.
We are still implementing some parts of it, but the majority
is there and we now have a much
better understanding of our products from the
user's point of view. We can immediately tell what is
broken for the user and we therefore have a lot more confidence that we
are able to maintain user happiness. And we also
gained a much better understanding of product health as a
whole, which has enabled us to focus our
efforts flexibly on the services
that most need our attention, and therefore to make more efficient
use of our time.
This brings me to the end of my presentation. Normally I'd ask for
questions. This being a recording,
of course, makes that impossible. Still, if there are questions, you're welcome to send them by email.
I won't promise that I'll respond to all of them because, well, I don't know
how many questions there are and depending on the kind
of question, I may have to get approval to make sure that I can share
something externally, which just may not always be worth the effort.
But please do feel welcome and I will try
to do what I can. I hope this presentation
was interesting and worth your time. Thank you very much
for your attention.