Transcript
This transcript was autogenerated.
Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with chaos native. Create your free account at Chaos Native Litmus Cloud.
Hi, welcome to my talk about evangelizing the SRE mindset and building a culture of reliability and ownership. I'm Christina. I moved to the US from Colombia when I was seven, and I studied at the University of Pennsylvania, where I majored in systems engineering and computer science.
I started my career at Bridgewater Associates, a hedge fund in Connecticut, and while I was there I was building a portfolio analytics platform and an editorial news site. Afterwards I joined Cortex as a founding engineer and the second hire, and I've been partnering with engineering leads and SREs ever since. Cortex is building a tool to help teams manage their services and define best practices across their organizations.
let's actually get into the talk. Today we're going to be covering the
changing role of the SRE at leading technology companies and go through
best practices to enable engineering organizations to
deliver reliable, performant and secure systems.
So how do SREs define their role? If you look at the quote above, you'll see that the focus is efficiency, scalability, reliability, and systemically delivered solutions. And how do engineers define the SRE role? Based on this quote, you'll see that there's a focus on help with daily technical challenges, providing tools and metrics, performance and resiliency, and management visibility into team performance.
So these definitions are pretty similar. That's great. And obviously I don't want to overemphasize the findings from a sample of two quotes, but if you just bear with me for a moment, you'll see that both SREs and engineers emphasize a focus on efficiency, scalability, and reliability.
Let's unpack where these definitions actually differ.
Right. The engineer seems to have a broader definition of the
role than the SRE does in this particular example.
Again, just bear with me with the two quotes.
Both mention visibility into how the system works and fits together,
but the engineer uses the word daily, which I find particularly
interesting. They also mention that the
SRE team is a conduit between engineering teams
and management, and that's different.
And that has changed recently. Before, when we thought about SRE teams, we would think of them as the Kubernetes expert or the AWS expert. Those days are pretty much over. Today's leading technology companies are empowering SREs to actually level up their engineering organizations and have an impact on how the engineering team works and what it means to work together. So what does it mean to foster a culture of reliability and ownership?
One of the first things that I learned at my first job was who our systems were for. There, I was building systems for under-resourced state pension funds, for teachers and firefighters who relied on our system to make allocation decisions that would impact millions of retirees, and for sovereign wealth funds charged with the financial stability and security of nations. That made the expectations around my systems very real. We operated under the tried and true four nines principle, 99.99% availability, which allows for less than an hour of downtime a year: about 53 minutes, to be more precise.
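If you want to sanity check that downtime math, here's a minimal back-of-the-envelope sketch, assuming a plain 365-day year:

```python
# Back-of-the-envelope sketch: annual downtime budget for an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year for a target like 0.9999 (four nines)."""
    return MINUTES_PER_YEAR * (1 - availability)


for label, target in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label} ({target}): {downtime_budget_minutes(target):.1f} minutes of downtime per year")

# Four nines works out to about 52.6 minutes a year, just under an hour, as mentioned above.
```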
The first time that I brought down production, it was a gut-wrenching feeling. I had this in mind, and it absolutely sucked. I knew that
I let down my team. I let down myself, I let
down my manager, but also I let down the analysts at our company using
the tools, and I let down the clients, the people that actually
depend on our systems to make their decisions. One of
the analysts I worked with described poor performance in our application
as being sent into war without a sword.
And so every time I brought down production, that's kind of what I had in mind. It's that terrible feeling that, you know, you let people down, that you caused the problems. And so what do you do? How do
you handle these situations? Right after the dust actually settles on these incidents, the app is back up and running, and we've had a day or two to let our emotions cool down and let it all die down, it's super important to get together and talk about what went wrong, and that's exactly what my team did. We'd pull up the code that caused issues and ask: who wrote this? How was it tested? Should our automated testing have caught this? Who code reviewed it? Should they have code reviewed it? Were they the right people to code review it? Why wasn't the problem caught in a staging environment? Why wasn't it caught in post-production? Who released it? Who went through the checklist that validated the release? And we did
this in a way that didn't point fingers or make people feel bad.
The whole point of this process is to bring the teams together
so that you could learn, evolve, and figure out
how to prevent this from happening in the future. People should
worry about letting their customers down, not about getting yelled at by their
boss or having a bad performance review. At the end of the
day, it's all about the customers and your users.
As a junior engineer, knowing how these problems would be handled when they did occur (because we all know they're going to occur; you're never going to write perfect code) actually gave me the space to develop without letting me off the hook or relegating me to bad tasks that I maybe didn't want to do. So ask yourself,
does your engineering team know why the application matters?
Are there explicit expectations about uptime?
Do your engineers know how problems are handled in
the organization? If you're not sure about the answers
to any of these questions, or maybe you're not sure that the junior engineer who just joined your team six months ago knows them in the same way as someone who has been there for two years, it might be a good time to actually get your team together and talk about this problem
and just make sure that everyone's on the same page. It's even a good thing
to do every couple of months to make sure that expectations haven't changed and the team is operating as it should.
So how can you apply this to your delivery machine?
The first and most important thing is ownership. If there's any confusion about who's responsible for what, you have absolutely no chance. If there are frequent problems with some part of the system, someone's neck needs to be on the line to fix it, someone who is both empowered to do so and knows that they're responsible for it. The owner of each piece of the application is also responsible for making sure that documentation is up to date, runbooks are clear and easy to follow, and dependencies are well defined.
We all know that writing documentation kind of sucks, but in
a few minutes I'll go through a case study about why this is so important.
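To make that ownership piece concrete, here's a minimal sketch of the kind of per-service metadata you might track; the field names, contacts, and URLs are hypothetical examples, not any particular tool's schema:

```python
# Minimal sketch of per-service ownership metadata.
# Field names, contacts, and URLs are hypothetical examples.
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    name: str
    owner_team: str       # the team whose neck is on the line
    on_call_contact: str  # who gets paged when it breaks
    runbook_url: str      # clear, easy-to-follow recovery steps
    docs_url: str         # where the documentation lives
    dependencies: list[str] = field(default_factory=list)  # well-defined upstream services


payments_api = ServiceRecord(
    name="payments-api",
    owner_team="payments",
    on_call_contact="#payments-oncall",
    runbook_url="https://wiki.example.com/payments-api/runbook",
    docs_url="https://wiki.example.com/payments-api",
    dependencies=["postgres", "auth-service"],
)
```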
With that said, you can also build safeguards around these workflows to keep bad changes from shipping. SREs need to go toe-to-toe with product
managers to make sure that the work required to build a robust
delivery system isn't getting kicked down to the backlog.
It should be prioritized just as much as new features are prioritized.
Finally, when things go wrong, talk about it. Get your
team together, follow the principles of agile, and actually have retros. Go through what went wrong, using the questions that I went through earlier, go through what the solution was, go through how you can prevent this problem from happening again, and make sure you know who was responsible. This retro and going through these problems should be a mundane part of just doing day-to-day business. It should become as much of a non-event as prioritizing Jira tickets. In that
way, your team knows exactly what to expect when things go down.
And again, the point isn't to blame someone.
It's to figure out how to work together better as a team, how to evolve,
and how to prevent these things from happening. So let's
go through a case study.
Say you're the engineer on call. You went to bed, it's 3:00 a.m., and you get woken up. It's the middle of the night. You're super groggy, you're tired, you've been sleeping for a few hours, but it's not enough. And so you look at this message and you have no idea what it's referring to or what the service is. You grunt, you grumble a little, and you pull your laptop into bed with you and open it up. It's 3:15. It's been 15 minutes since you got the call, and you can't find the service that's down. You have no idea how to find it. You've never heard of it. You've looked everywhere you can think of. You keep looking through the documentation, and you finally find the service. But there's no documentation about it. There is no runbook. There's nothing you know of that can help you get it back up. At 3:45, because you can't find the service owner and you don't know who to contact after 45 minutes of trying, you decide you just want to go back to
bed. You call a different engineer on your team: Mr. Fixit. We all know him. He's the one that gets called every time there's an incident, every time something goes down. The one you go to every time you have questions. Mr. Fixit answers your call. Thank God. And then
you both work together. He hasn't seen this before,
but he thinks that restarting the app will help. It's happened with similar
services, so that's exactly what you do.
You figure out how to restart the app. You're both there.
You wait for it to come back up, and after
15 minutes of monitoring it, you go back to sleep.
At 10:00 a.m., you log back onto your computer for your daily stand-up. Everything's up, everything's running. No one realizes that you and Mr. Fixit were up at 3:00 a.m. actually solving this.
And in a month, this all happens again, because there was no retro, no talk about fixing it, no talk about preventing this from happening again. You're the one on call once again. And this time, when you go to call Mr. Fixit, he's no longer there. He quit to go work at a company that actually cares about reliability. He was done with the 3:00 a.m. wake-up calls.
So not a great scenario, and not one that, personally, I want to be a part of. So how do we actually take the principles that I laid out earlier and apply them to how the incident machine should work and how teams should actually handle this process?
So again, let's reset. Pretend again you're the engineer on call. You went to bed, you get woken up at 3:00 a.m. You're groggy, you're annoyed.
You open up the messages, and you look at
it. You open up your computer, and you go straight to the service
with that name. This time around, there's documentation.
There's a clear process to figure out where
the logs are, where the runbook is, what's going on. So you look
at those logs, you determine that the app needs to be restarted.
You pull up the runbook to do so, and after 45
minutes, the app is back up. Talk about a difference, right? It's already taking less time than it did in our previous scenario.
Moreover, you didn't call anyone. You didn't have to call Mr. Fixit this time around. You were able to actually fix this yourself without necessarily knowing what the service does, what it is, or having worked on it before. You go back to bed after monitoring for 15 minutes, and then at 10:00 a.m. the team responsible for the service gets together to actually prioritize a fix for the issue so that next time it won't happen again, and so that if it does happen, they can figure out why and what went wrong, knowing that they've tried to fix it before. So why does this matter?
Obviously, it's bad for morale. It's bad for the engineering team. You don't want to be woken up at 3:00 a.m. But it's also super important to the business.
So Stripe actually published
a survey in 2018 about engineering
efficiency and its $3 trillion impact on
the global GDP. And so, obviously,
your engineering team will be annoyed by incidents, morale will be down, and you should prioritize fixing it. But in terms of translating this to leadership, the study found that the average developer spends more than 17 hours a week dealing with maintenance issues: about 13 hours a week on tech debt and about 4 hours a week on bad code. And if you actually translate this into cost, you'll see that in an average work week of roughly 40 hours, spending around 4 hours on bad code is about 9.25% of productivity, and that equates to about $85 billion a year.
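As a rough sketch of that arithmetic, using the rounded hours above (the study's more precise figures are what yield the 9.25%) and some hypothetical team numbers for scale:

```python
# Back-of-the-envelope sketch of the productivity math above.
# The hours are the rounded figures from the talk; team size and cost are hypothetical.
hours_per_week = 40
hours_on_bad_code = 4

lost_share = hours_on_bad_code / hours_per_week
print(f"Share of the week lost to bad code: {lost_share:.1%}")  # ~10% with these rounded numbers

# Scale that share of developer time to a share of total developer spend.
developers = 100              # hypothetical number of engineers
cost_per_developer = 150_000  # hypothetical fully loaded annual cost, in dollars
annual_cost = developers * cost_per_developer * lost_share
print(f"Annual cost of bad code for this team: ${annual_cost:,.0f}")
```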
And so let's frame this in the exact way that leaders can understand: bad code is debt, and debt is something that all leaders, regardless of their technical abilities, understand. We all know that there are a few ways to handle debt. You can refinance, you can pay the monthly minimum, you can take out another loan, but eventually the debt collector is always going to call, and those bills need to be paid. This framing is exactly what you can use for managers who aren't technical enough to understand the problems that they're dealing with when you talk about tech debt. So, practically, as engineers, we have three approaches to solving this problem.
One, we can pay down the debt slowly over time by carving out engineering capacity for it. This can mean that you put 10% of engineering capacity into every sprint to actually handle tech debt tickets. You could also make one ticket per engineer each sprint a tech debt ticket. That's a way to slowly carve out time and chip away at that tech debt.
Option two is you can focus on it. You could dedicate a specific team to
fixing tech debt, or you can dedicate a specific sprint or
time period every quarter to actually eliminate tech debt for all teams.
Personally, I don't think this is a great option.
No team wants to work solely on tech debt. It sucks. We all
want to be working on the features that actually impact clients. We want to
see them use the tools. You want to see that aha moment when they're using
your product and you know that it works.
Moreover, the team who created the tech debt should work on
their tech debt. That's how you learn from it, that's how you improve, and that's
how you start thinking about how to eliminate it in the future as you're
building out new products. And if you choose to go the route of doing a sprint a quarter, the problem with that is that you don't want to pause feature development for, say, two weeks to actually work on tech debt. And there's always going to be a team that's working on something so critical that you just can't do that. And so at that point, they're going to be exempt. It's going to be easy for a few teams to ignore it, and that's not necessarily going to help the situation.
The third option is, obviously, you go bankrupt. And this doesn't mean literally bankrupt, right? But it could mean a variety of bad outcomes. There could be security breaches. You could have a lot of downtime: rather than one hour of downtime a year, you could have an hour of downtime a week. You could also have performance degradation, which again is almost as bad as your app being down. And so basically this is a good framing for people to actually think about how to handle bad code, how to handle tech debt, and how to get it prioritized with your engineering leaders and your organization. So remember,
the simplest and most important takeaway from all of this should be that you want to create an engineering culture that cares about the users, and behind the users, about the reliability and stability of the application, with management and teammates who embody these values. That's incredibly important. Good onboarding for new engineers that makes requirements and expectations explicit is key. New team members should understand how incidents are handled and how the on-call process works, even if they're not on call yet. You hope that in six months' time they will be, and it won't be a huge thing to get trained up; rather, it's something where they already know how it works.
Having up-to-date documentation, runbooks, and easily accessible logs is key. Obviously, the case study that I went through isn't a concrete example, but I'm sure we've all been there; we've all restarted our apps to fix an incident. And the trickiest part to navigate here, and a spot where good technology can help, is actually defining ownership, having a place to store all that documentation and information about services, and helping map out the dependencies across complex systems.
And that's exactly what Cortex is built to do. I encourage
you to take a look at our website or send me a message and we
can do a quick demo. We've been helping engineering teams and SRE teams for a while, and the tool helps you do exactly this.
Thank you. If you want to reach out,
my email is on the screen. Hope you enjoyed the talk.