Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good morning and welcome to Conf42 Site Reliability Engineering 2023.
My name is Ricardo Castro, and today we're going to talk about SRE
antipatterns, and one in particular, which is one
that I call rebranding the operations team. So what do
we have on the menu? So we're going to briefly touch on what
SRE is, how it came to be, and what it
looks like today. We're going to briefly describe what antipatterns are
and the why of SRE antipatterns. And then we're
going to touch on the SRE antipattern that we're going to discuss today,
which is rebranding the operations team. We're going to talk about the antipattern
itself and how to fix or avoid it. And before I leave
you, I'll tell you about some more SRE antipatterns that I usually give
talks about. So why SRE? Over
the years, computer systems have evolved in many different ways.
They increased in scale and complexity, and gave birth to different ways
to develop and manage such systems in production.
The advent of the Internet and modern computing platforms required a
new and different engineering approach to running these systems.
The 80s were seminal in shaping the Internet as
we know it today. 1983 marks
the official birth of the Internet. Previously,
computer networks didn't have a standard way to communicate.
TCP/IP provided that standard and allowed computers
in different networks to talk to each other. Then in
1989 at CERN, Tim Berners-Lee defined what became known as
the World Wide Web. His goal was to allow the
sharing of information using the Internet. A year later,
he wrote the first client and server, paving the way for the specification of URIs,
HTTP and HTML. Then in the 90s, building on previous
successes, the late 90s saw the emergence of the Internet as
we know it today, and they mark the beginning of the first
big Internet companies. Some of the large Internet companies
that we know today, like Amazon or Google, appeared
during this period. Soon enough, they started to
stretch the boundaries of what had been possible until then. They faced new
challenges and started producing innovative solutions to this
new set of challenges. The continued growth of the Internet
quickly exposed other organizations to similar challenges.
The emergence of cloud computing (public, private or hybrid)
as the standard way to deploy and run services made it possible for everyone
to manage their workloads and achieve massive scale. The traditional
ways of managing these systems fell short when dealing with this new
reality. And in 2009, the term DevOps was coined,
with the proposition of hiring ops who think like devs or devs who
think like ops. Software
and systems architecture got more and more complex.
Scale increased considerably. Traditional ways
of running software started to fail: again, ops versus devs
and non-scalable engineering solutions. So in 2003,
facing massive growth, Google needed a unique way to manage their operations.
Having just joined the company, Benjamin Treynor Sloss,
currently VP of Engineering, was tasked with running a production
team to face these new challenges. Treynor Sloss
defines his team's approach as what happens when you ask a software engineer
to design an operations team. This team evolved
and matured into what today is known as site reliability engineering.
So what are some of the principles of site reliability engineering?
So, embracing risk. SREs assume that operations
have risks, developing applications has risks, and putting them into
production has risks. They define SLOs. SLOs are
a framework that allows us to understand whether our users are happy or
not, and allows us to define what reliability means,
measure it and assess it. Another principle is
the principle of eliminating toil. Toil is work that needs
to be done, but doesn't offer enduring value. So SREs
take it as one of their tasks to try to eliminate
as much toil as possible. Then, of course, monitoring.
Monitoring is essential for us to understand what's happening with our systems,
be it to evolve the systems or to actually fix
some issues. Automation is also closely linked
with eliminating toil: it's the process by which we develop
software that takes care of repetitive tasks.
Of course, SREs are also very involved with release engineering,
because introducing changes to a system is seen as
a risk. So SREs are closely involved
in the process of releasing changes into production. And of
course, SREs strive for simplicity. Their goal is
to provide tooling and ways to develop
robust applications in the simplest way. So where
does SRE stand today? There was a big hype
a few years ago about SRE, and the main
culprit is the Site Reliability Engineering book, which was
released in 2016. This book was released by
Google and it gave everyone a peek into how Google
managed their operations on a daily basis.
Google also released two more books about this: Building Secure and
Reliable Systems, and The Site Reliability Workbook.
So it gave the whole world an idea of
how Google developed and managed their applications in
production, and gave them a blueprint to do that
kind of work. The problem is that these books have so much information
that it is very hard for a lot of organizations to onboard all of these
practices or even some of these practices. And sometimes it makes it even
harder to actually see how the Google reality translates
into our own reality. Because of this, antipatterns
might arise. So what is an antipattern? An
antipattern is just like a pattern, except that instead of a solution,
it gives something that looks superficially like a good solution,
but it isn't one. I actually like the definition
of antipattern by Martin Fowler, which says an antipattern
is a solution that initially looks like an attractive road lined
with flowers, but further on leads into a maze filled with
monsters. So it's this idea that we adopt something because it
looks like it could be a good solution to our problem,
but down the road it will actually be more harmful than
beneficial. So I'm pretty sure you
are familiar with this type of keyboard: it's the QWERTY keyboard.
But I don't know if you know why we have this keyboard today.
It's because of this: the typewriter.
So we might think, okay, we got so used to this layout
for keyboards on typewriters that it makes sense
that we use it for our computer keyboards. But why
did we settle on this keyboard for typewriters?
Mainly because of two things. When
we pushed a key on a typewriter, a needle just printed
a letter onto a
sheet. That means that if we typed too fast,
those needles could hit each other. So this layout was actually
intended to slow
typists down so that these needles didn't actually hit
each other. So this was a pattern back in the day, right? And we
carried it over because we got good at using this keyboard.
But actually, studies have shown that for speed,
this is not the preferred layout. As I just
said, this keyboard was actually developed to slow
writers down a bit, so that these needles
didn't hit each other. But because we got so used to it, the pattern that
we used back in the day actually transformed into an antipattern that we
have today. So, translating this into the antipattern
that we have here today: what
I call rebranding the operations team is when an operations
team is rebranded to SRE, but little
else changes. I can generalize this to rebranding
X to Y, and there are many examples across the industry.
We can think about agile or CI/CD. It's this idea that we give
a new name to something, but nothing
else really changes. Maybe you're using a slightly different tool, maybe you're using
a slightly different approach, but on a day-to-day basis things actually
stay pretty similar. Unfortunately, this is the reality
for many organizations. A lot of organizations take this kind of
approach. They essentially give a team a new
name, they adopt a new tool, maybe they start using Terraform,
or they start using configuration management, or even Kubernetes
or some other tool, but on a day-to-day basis the work
remains the same. And that's not what SRE is all about.
So how can we fix this? First and foremost,
we have to educate ourselves.
So we want to understand what this SRE thing
is. And of course here are some good books that we can start learning
from. And we also want to start to understand how
others perceive SRE,
what worked for them and what didn't work. Again,
Google is an amazing technology company,
but most of what it does will not be applicable to our own reality.
So we need to educate ourselves and start bringing those discussions into our
organizations, seeing what is happening in other organizations and adapting
it to our own reality. Another way to fix or avoid this is
to start with why. Why do we want SRE?
Why should we adopt SRE? Once
we have a reasonable understanding of what SRE
is: is it fixing our problem? What is the problem that we are
trying to fix? By starting with why, instead of just going
with a what (adopting a new tool or starting a new methodology
just because someone else is doing it), and trying to understand
why we are doing this, what problem we're trying to fix
or what we are trying to improve, we make it
a lot easier for us to actually onboard
SRE and to start picking the things within SRE
that make sense at each point in time.
Of course, some general guidance: the Pareto principle says
that 80% of the results come from 20%
of the effort. This means that we
can look at the why that we are trying to fix and
put all of our efforts there. So we're probably able to
identify a few things within our organization and then map out the SRE principles
or approaches where we
can put in some effort that will bring us 80% of the results.
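As a rough illustration of that idea, the sketch below ranks problem areas by how much pain they cause and picks the smallest set that accounts for about 80% of it. The area names, the numbers and the function name pareto_focus are all hypothetical, made up for this example.

```python
def pareto_focus(impact_by_area, threshold=0.8):
    """Return the highest-impact areas whose cumulative share reaches `threshold`."""
    total = sum(impact_by_area.values())
    if total <= 0:
        return []
    focus, running = [], 0.0
    # Walk the areas from most to least impactful.
    ranked = sorted(impact_by_area.items(), key=lambda kv: kv[1], reverse=True)
    for area, impact in ranked:
        focus.append(area)
        running += impact
        if running / total >= threshold:
            break
    return focus


# Hypothetical incident-hours per problem area over one quarter.
incident_hours = {
    "manual deployments": 55,
    "missing alerts": 30,
    "flaky tests": 8,
    "stale runbooks": 4,
    "certificate expiry": 3,
}

print(pareto_focus(incident_hours))
```

With these made-up numbers, two of the five areas cover 85% of the incident hours, so those two are where SRE practices would likely pay off first.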
Also, a good way to prioritize among all
of the SRE things that we can approach and onboard in organizations is
to use the Eisenhower matrix. The Eisenhower matrix
categorizes tasks by urgency
and importance. Urgent and important things are things
that we just need to do, right? Maybe it's some kind of security fix that
we need to do in production. So we need to get that done.
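The four quadrants of the Eisenhower matrix can be sketched as a small lookup. The function name, the action labels and the sample tasks below are assumptions for illustration, using the common do / schedule / delegate / drop wording.

```python
def eisenhower_action(urgent: bool, important: bool) -> str:
    """Map a task's urgency and importance to a suggested action."""
    if urgent and important:
        return "do now"
    if important:
        # Important but not urgent: plan it before it becomes urgent.
        return "schedule"
    if urgent:
        # Urgent but not important: hand it off.
        return "delegate"
    return "drop"


# Hypothetical backlog: (task, urgent, important).
tasks = [
    ("security fix in production", True, True),
    ("define SLOs for the checkout service", False, True),
    ("answer a routine access request", True, False),
    ("rename old dashboards", False, False),
]

for name, urgent, important in tasks:
    print(f"{name}: {eisenhower_action(urgent, important)}")
```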
If it's something that is urgent but not important, we probably
can delegate it. If it's not important and not urgent,
we probably don't need it. Where we can actually gain some time
is when things are not urgent but are important, because
not urgent but important things will eventually
make their way into urgent
and important. That means that it's the
not urgent but important part that we can plan out and where we can gain
some time. We can plan the work around those things, and that
can give us some leeway in the future. And if you
have no idea about how to fix or avoid
it, it's always a good idea to invest in observability. Observability
is critical for SREs. SREs need to understand the
states that systems have put themselves into. And of course it's
good to have a reliability framework, something like SLOs.
Observability will be essential for you
to actually define, measure and assess what reliability means.
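As a minimal sketch of what defining, measuring and assessing could look like with an SLO, the example below computes an availability SLI from request counts and the share of the error budget that is still left. The 99.9% target, the request counts and the function names are hypothetical; real numbers would come from your own observability data.

```python
def availability_sli(good: int, total: int) -> float:
    """Measure: fraction of requests that were served successfully."""
    return good / total if total else 1.0


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Assess: fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


# Define: a hypothetical 99.9% availability objective over some time window.
SLO_TARGET = 0.999
good_requests, total_requests = 999_500, 1_000_000

sli = availability_sli(good_requests, total_requests)
budget = error_budget_remaining(good_requests, total_requests, SLO_TARGET)
print(f"SLI: {sli:.4%}, SLO met: {sli >= SLO_TARGET}, budget left: {budget:.1%}")
```

A budget that is close to zero or negative suggests slowing down risky changes, while a healthy budget leaves room to evolve the system.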
And of course the data that observability provides
will then allow you to actually dig into issues that your
systems have put themselves into and will give you the confidence
that you can evolve your systems with
a lower risk of breaking them. And this
is just one of the antipatterns. There are more. The one that we
just talked about is rebranding the operations team. But there are others that
organizations struggle with. For example, the lack of observability.
They don't have enough observability to actually understand what their systems got
themselves into. Another is not assessing user satisfaction:
just trying to make systems
better, but not being really sure if we are making our users happy.
Poor incident management is a common antipattern as well: we just fix
this fire and go on to the next, and we don't actually learn from
issues. Another common antipattern is the SRE hero.
It's the person or the team that assumes that
they will fix all the issues and be the heroes again.
Another tenet of SRE is automation. So if you have poor automation,
you have probably got yourself into an antipattern.
So, a brief summary of what we discussed today. We saw the why
of SRE, saw its origins and what it looks like
today. We saw what antipatterns are and why SRE
antipatterns are a problem for our industry.
And then we discussed rebranding the operations team:
we discussed the antipattern itself and we saw how to fix and
avoid it. And at the end, we saw that SRE, just like any
other practice, can have more antipatterns. And we saw a brief summary of
what those could be. And this is all from my part. I hope this was
informative for you guys. I hope you have a great conference,
and don't hesitate to connect with me on my socials.
And let's keep this conversation going. Thank you very much and have a
great day.