Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. My name is Jayaganesh Kalyanasundaram.
I am a software engineer at Google, and in this talk I'll be
speaking about product management in SRE. This talk is going to be about the
SRE principles and how product management fits into
these. And I'm speaking about it specifically because
personally, our whole team benefited
a lot by having a dedicated product manager to take
care of all the product aspects in SRE.
And this is also to make sure that people
are aware that SREs are not there just to do the operations
work, but are responsible for the overall reliability of the product.
And hence, how can we benefit from
realizing the product aspects within SRE.
So I'll be going over the generic SRE principles and
how we benefited as a team in each of these aspects by
having a dedicated product manager. So let's take the first one,
service level objectives. Pretty straightforward,
right? When service level objectives are met, when these SLOs are
met, you have happy users. When they're not met, you have
sad users. But this gets
more tricky, especially in this day and age,
because we are nowadays moving from monoliths to microservices,
and rightly so, because we have a lot of other benefits from having microservices.
The difference is that earlier the monolith was
measuring the overall service level objective,
including the user paradigms. But now, because we
have microservices, and because we have platforms and a lot of products which
by default have monitoring and so on for these microservices,
service level objectives are now confined to these microservices.
We want to be able to have a larger picture in mind.
So let's take an example. In our CI/CD platform,
we have a simple action of doing a rollback, and I'm pretty sure
many of you use this one on a day-to-day basis:
whenever your system doesn't function as intended, you hit the rollback button,
right? Now, the engineers
wanted to measure the SLO for the rollback
as a journey that starts from the user hitting the rollback button
all the way to a specific RPC being sent
to the right system to initiate the rollback.
Makes sense, right? The managers
and also tech leads were like, you know what? We wouldn't stop just at
the point where we have an RPC. We would also wait for
the workflow instantiation, which does the rollback.
But the people who are focused on the product, which is management
and our product managers, felt that rollback
is a journey. It's not something we want to be accountable for just as a
button, but as the whole journey by itself, which is: when I
click the rollback button, I want everything to function smoothly
and the rollback to actually happen. So right
here we have a total of three different interpretations of how a rollback journey
could be described and what service level we are measuring.
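To make the three interpretations concrete, here is a minimal sketch of how the same set of rollback attempts scores differently depending on where the journey is considered finished. The stage names and the sample data are made up for illustration; they are not the team's actual instrumentation.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Furthest point a rollback attempt reached (hypothetical stage names)."""
    BUTTON_CLICKED = 0
    RPC_SENT = 1           # engineers' view: the RPC reached the right system
    WORKFLOW_STARTED = 2   # managers' and tech leads' view: workflow instantiated
    ROLLBACK_FINISHED = 3  # product view: the rollback actually happened


def success_ratio(attempts, required_stage):
    """Fraction of attempts that reached at least `required_stage`."""
    if not attempts:
        raise ValueError("no rollback attempts to measure")
    good = sum(1 for reached in attempts if reached >= required_stage)
    return good / len(attempts)


# Made-up data: furthest stage reached by ten rollback attempts.
attempts = (
    [Stage.ROLLBACK_FINISHED] * 6
    + [Stage.WORKFLOW_STARTED] * 2
    + [Stage.RPC_SENT] * 1
    + [Stage.BUTTON_CLICKED] * 1
)

for stage in (Stage.RPC_SENT, Stage.WORKFLOW_STARTED, Stage.ROLLBACK_FINISHED):
    print(f"journey ends at {stage.name}: {success_ratio(attempts, stage):.0%}")
```

With this sample data the narrowest definition looks healthiest and the end-to-end definition looks worst, which is why everyone needs to agree on one written definition of the journey.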
The targets and the objectives come next. Because
of this, our product manager made a beautiful framework
for writing down the overall critical user
journeys. What I mean by writing down is having an explicit document which
states clearly what the end-to-end user journey is
for each of them. And why we benefited from it was that everyone
was on the same page when they spoke about any specific journey and
what the SLO should be for that. It also helped us
a bit more, in that we were able to realize the overall product
quality. What I mean here is, let's say you have 15 journeys
offered by the product, but we have SLOs governing
only five of them. We are just covering one third of the journeys
and the other two thirds are not actually governed. Which
means that they could be not functioning well and we wouldn't be alerted
for that. And that's not a right product, right? So we were also able to
measure the product quality with this framework. And right there,
we were able to benefit a lot from product management in just this one aspect
of SLOs. So the next topic is making tomorrow
better. And this becomes more and more important in the earlier
phase of SRE in any team, because initially SRE
is the one which holds the pager. So they're more like the pager monkey
for the initial period until they realize the pattern,
realize the major pitfalls, and try to improve
on them in the future. But we can only improve on them when
we have the error budget for that. What I mean here is, let's say that
your SLO states that your service can be down for five minutes
in one month, but let's say you are just ten days
into the month and you have already been down for four minutes, so
you just have 1 minute more for the next 20 days.
You will be forced to hold the pager and do nothing more, because with any 1
more minute of downtime you
have hit your maximum error budget. But we want to have an error
budget policy which enforces us to work on the product more.
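The arithmetic in that example can be sketched as follows. The five-minute budget and four minutes of downtime come from the example above; the 75% freeze threshold and the function names are assumptions made purely for illustration, not the actual policy.

```python
def remaining_budget_minutes(allowed_minutes, downtime_minutes):
    """Minutes of downtime still available in the current SLO window."""
    return max(allowed_minutes - downtime_minutes, 0)


def should_pause_feature_work(allowed_minutes, downtime_minutes, freeze_fraction=0.75):
    """Hypothetical policy: pause feature work once this share of the budget is spent."""
    return downtime_minutes >= freeze_fraction * allowed_minutes


allowed = 5  # SLO: at most five minutes of downtime per month
used = 4     # already down for four minutes, ten days into the month

print("remaining budget (minutes):", remaining_budget_minutes(allowed, used))
print("pause feature work:", should_pause_feature_work(allowed, used))
```

The threshold is the part that needs stakeholder agreement: it is the written rule that says when feature work stops and reliability work takes over.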
And this is yet another aspect where we benefited a lot
from the product manager, because they were able to help us navigate these discussions with
the stakeholders, where we basically were able to advocate for,
you know what, we will not be doing any more feature development. We'll
stop it and we will literally improve
the reliability of the product. These kinds of discussions or these kinds of initiatives
are pretty difficult for the product itself, because you're going to compromise
on the feature velocity and put
more focus on the reliability. Rightly so,
but it also means that you need to be
able to have these communications with the upper management and try to convince them.
And the product manager, having that interest in the product, is one of
the best persons to do this. And in
terms of product work, as I mentioned before, there are a lot of things like automating repetitive
work and looking into the postmortem action items and saying that,
you know what, this whole quarter we'll be just focusing on postmortem
action items because we haven't been doing so for the last two years.
Again, that's a very bad state to be in. But these kinds of discussions,
where you pause the feature work or the more shiny work
to focus on the overall reliability of the product,
are difficult discussions which are to be taken with the hat of the
product manager. Next, let's look at the shared
responsibility model. So as I mentioned before, the general
tendency is that all the work gets dumped on the
SREs. All the toil gets dumped on the SREs.
They are more like the pager monkeys,
whereas, as I mentioned before, it's supposed to be a shared model
where SREs are responsible for the overall
system's reliability. So in a classical system
development model, from the stage of having a business idea, to
doing the initial business modeling, to doing the development,
to the operations, where you ship the features and ship
the whole development, to actually making revenue from it,
SRE ideally needs to be involved from business to
development to operations. Traditionally, they have just been
the operations people. That's why we have the whole aspect of DevOps, which is
like a shared model between development and operations.
But if they are involved in the business and development stages as early as
possible, it helps them to have a voice and
an opinion in the way the product is being developed,
looking at the scale and the overall reliability
needs of the product. Scalability being an important
aspect, you want to be able to ensure that the product is built to
scale from the start, rather than building
it in some ad hoc fashion and then investing a lot more time
in rebuilding it for scale. Again,
these are difficult things which require a lot of
leadership buy-in, and hence a product manager's
or a product owner's influence makes a big impact here.
So here is an example from my own team about leadership
buy-in. As I mentioned before, we were looking at the rollback journey as a
whole, from the user clicking the rollback button all the way to
the rollback actually finishing successfully being the user
journey. And we wanted to measure the SLO for this.
Just like any other SLO, we started with a 99.99%
target. That is, four nines is the target
for any rollback attempt to finish successfully.
To our surprise, initially the success
rate was 40%. It's nowhere close to
four nines. It's like less than half, right?
The reason was that more than 50%
of these errors were because the user
didn't have the right access to
initiate a rollback for the microservice. To give an example,
a user or an engineer who works for Google Search
shouldn't be able to roll back software which is running
for YouTube, for that matter. Right, because you
wouldn't want to dismantle or
work on some other person's product. Again,
in this case, it wasn't a totally different product, but it was different microservices
where the user didn't have the necessary access.
Again, this is not specifically an issue of the product. The product is functioning perfectly
well. It's the issue of the user. But we
still wanted to measure the end-to-end journey, because if there are unhappy users,
that doesn't translate into a good user experience. So we want
to better the user experience. What we had
in mind was to improve the user experience by letting them know that they don't
have the access, which is what we have finally done. We now let them
know with a big red bar saying that you don't
have access to this microservice, so you can't
perform any of the emergency actions on it, and because of
that they don't attempt to do a rollback anymore on these kinds of services.
But keeping that aside, having a success rate
of 40% makes you look really bad before the upper management.
So we wanted to loosen the SLO from
four nines to literally 45% initially, until
we made this recent change. And this is a very
drastic change in what your product SLO should be.
And this requires a big leadership buy-in. We had to convince
the upper management that this is the case, that we don't want to have a
four nines target, we instead want to have a 45% target,
because most of these errors are user-caused errors.
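A rough sketch of the measurement behind that conversation: count the outcomes of rollback attempts, compute the end-to-end success rate, and see what share of the failures were caused by missing access. The outcome labels and the counts are invented to mirror the proportions mentioned above; they are not real data.

```python
from collections import Counter


def journey_stats(outcomes):
    """Return (end-to-end success rate, share of failures caused by missing access)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rollback attempts recorded")
    failures = total - counts["ok"]
    success_rate = counts["ok"] / total
    user_caused_share = counts["permission_denied"] / failures if failures else 0.0
    return success_rate, user_caused_share


# Made-up outcomes for 100 rollback attempts (labels are hypothetical).
outcomes = ["ok"] * 40 + ["permission_denied"] * 35 + ["internal_error"] * 25

success_rate, user_caused_share = journey_stats(outcomes)
print(f"end-to-end success rate: {success_rate:.0%}")
print(f"failures caused by missing access: {user_caused_share:.0%}")
```

Splitting the failures by cause is what makes the case to leadership: the number looks bad, but most of it comes from users attempting rollbacks they are not allowed to perform.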
So convincing the upper management and the stakeholders for the overall product by
itself requires a lot of leadership and a lot of stakeholder management,
while also keeping the interest of the product in mind. And this was yet
another place where our team's product manager helped
us greatly to navigate these conversations. And,
as I mentioned before, putting reliability and the consistency of the product upfront
and making them one of the major aspects
of any feature launch, for example, ensuring that your
feature launches are covered by SLOs or covered by integration
tests so that you invest as much as possible early
on to ensure that the product is reliable and working well,
are some of the things which we have developed in our team, and this
also builds a lot of resiliency within the product. So we
have recently developed a lot of feature launch requirements
for internal SRE-based feature launches as
well, to ensure that our feature launches
for any of the SRE products or SRE feature-based tools
are also governed by a lot of practices like
integration tests, SLOs, and having the proper
emrs on who is going to be the EAP customers
and what is the market for the strategies and so on,
just to ensure that the product launch is pretty smooth and we have a very
consistent product going forward. The product doesn't get brittle
after every frequent launch.
Automation is yet another place where SRE teams
specifically can benefit a lot.
I want to dwell on this slide because
a lot of these SRE practices can be put
into four major themes.
CI/CD is the major bread
and butter of every operations or DevOps team, because that's
how the whole of DevOps even started: they wanted to ensure that the CI/CD
aspects are done by an operations team. Then,
close to CI/CD, is the aspect of monitoring. You want to
ensure that your product, your services are
monitored well. The next aspect is
capacity planning, which is for when you have any inorganic
launches. Let's say people want to view
the cricket matches
and because of which there's a sudden spike in the traffic for, let's say,
YouTube and Google Ads. You want to be able to do the capacity planning
for these. And the fourth, obviously, is incident management.
When everything works well, everyone is happy,
but when there is an incident of a very big impact,
everyone rushes to see what is happening, how they can
solve it, what the right status is, and so on. So we want to
be able to have a medium to communicate to all the required stakeholders
and people who are interested what the state of the incident is,
what approach is being taken to fix it, whether it is
mitigated or not, and what impact it has right now.
So these four aspects of CI/CD, monitoring, capacity
planning, and incident management are the four major themes across
all SRE teams. So automating
these centrally helps us reduce the cost overall.
And this is yet another place where we have been very
lucky to be able to invest in horizontal products.
For example, our team focuses on the CI/CD products. This helps
us reduce the cost of maintaining CI/CD platforms and
CI/CD tools separately for every single team, because now
they can focus on their vertical, team-based strategies.
And this concludes a section of having the ability
to regulate the workload.
So we want to be able to prioritize the work, we want to be able
to push back when there are unreliable practices. As I mentioned before, we want to
be able to sometimes say that we want to be able to focus on reliability
as such and not focus on feature development anymore.
And obviously the fourth aspect is blamelessness. Blamelessness is
something which is homegrown within SRE for the most part,
right? Whenever something goes wrong, we want to be
able to capture it well and we want to be able to make sure we
don't repeat the same thing again. Having postmortems for every incident,
writing our action items, and ensuring the postmortem action items
are fixed right on time, that they solve the
problem at the root cause, and that we don't hit any other issue
after that, are some of the basic principles of SRE,
because we have paid the price. Making it
blameless ensures that people put in all their thoughts as to
what went wrong and how
the system should be resilient to prevent those
issues from happening in the future.
Because human errors are really system
problems at the end, right? So failure is an opportunity to
improve and not to blame blunt
and switchbox. So instead
of pointing fingers, we want to be able to ensure the reliability
of the service, find out what went wrong, and ensure
that it gets better over time. So we want to be able to improve the
mean time to detect and mean time to repair of any failure.
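As a small illustration of those two numbers, here is a sketch that computes them from incident timestamps. The incident data is made up, and it assumes both durations are measured from the start of the incident; some teams instead measure repair time from the moment of detection.

```python
from datetime import datetime
from statistics import mean

# Made-up incidents: (started, detected, resolved).
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 12), datetime(2024, 1, 3, 11, 0)),
    (datetime(2024, 2, 14, 22, 30), datetime(2024, 2, 14, 22, 34), datetime(2024, 2, 14, 23, 0)),
]


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


def mean_time_to_detect(incidents):
    """Average minutes between an incident starting and being detected."""
    return mean(minutes(detected - started) for started, detected, _ in incidents)


def mean_time_to_repair(incidents):
    """Average minutes between an incident starting and being resolved."""
    return mean(minutes(resolved - started) for started, _, resolved in incidents)


print("mean time to detect (minutes):", mean_time_to_detect(incidents))
print("mean time to repair (minutes):", mean_time_to_repair(incidents))
```

Tracking these per quarter gives a simple way to check whether postmortem action items are actually making failures easier to catch and fix.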
Because if something similar can happen in the future, we want
to be able to find that out as soon as possible, and in fact hopefully
not even have to find that out, because we have ensured that we have fixed it through
the postmortem action items. But if at all it happens, we catch it, we find that out,
and we fix it as soon as possible. So to
recap, these are the four principles of SRE
and how we benefited from product management in each of them.
So, to conclude my talk, I hope
I've given you a sense of how we benefited in our team with
a dedicated product manager in each of these four principles, and how these
four principles can be seen through the lens of product.
And what I've seen in a lot of teams within Google as well
is that the senior members of the team,
and also the people who are in leadership,
technical leadership and management, generally wear the hat of product management for
SRE, and any training for them in
the aspects of product management has been really beneficial, because
they understand the users and they understand what the product can
do for the users. I hope everyone has got
a sense of product management in SRE, and I hope you liked it.