Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Welcome to my
talk at Conf42 DevOps 2023 about baking in
reliability. Let me start by thanking the organization for
accepting my talk and thanking all of you for attending this
talk. So let's get to it.
So what are we going to cover today? We're basically going to cover how
you can start, and keep the ball rolling on, baking in reliability
within your organization. So we'll start by giving a
little bit of context. I'll describe the first thing I do when I
try to approach this kind of thing and then we'll go through some very useful
tips that we can use to get that ball rolling. And then
at the end we're going to do a quick recap of what we learned today.
So how did this all start, all this commotion about reliability
engineering and SRE? And it's mostly due to this book
in particular. So in 2016, Google came out
with this book about site reliability engineering and it described the way that
Google did operations. This was very interesting because
we have been dealing with the DevOps movement for
a few years now. And although the
DevOps movement started to gain form around 2007,
2008 and the term was coined around 2009,
many organizations still struggle to implement what DevOps is.
Because DevOps is a culture, it's not a set of practical things.
From its start, it can be quite hard for some companies to actually
understand what they need to do or what they could do to implement DevOps.
So it wasn't any surprise that, when Google came out with the Site
Reliability Engineering book, which describes both an
engineering practice and a position, companies started
to look at this as a new approach to doing things. Also, on
top of that, it's a book that describes how Google does things,
right? So Google is one of the most successful companies in the entire world,
and it's very appealing for companies to just
look at what these types of companies do and start to implement the
same things. The problem is that this book covers
a lot of stuff. So it can be quite hard to really understand what's
going on and it can be quite hard to adapt that
reality to our own reality. And that's the other problem,
is that Google has a set of challenges that most companies won't
have and the way that they have to deal with that might
be different. So it's sometimes hard to make that transition from the way
that Google or any other big company does things to
our reality. So we need to have a set of things that we can do
to actually approach that in that way.
So how do I start when people
want to talk about reliability or implement reliability?
So the first thing that I do is ask this set of questions.
So I approach people and ask what does reliability mean
to you and what does reliability mean to the organization?
So what am I trying to achieve with these couple
of questions? First of all, I want to understand people's expectations and what
they think that reliability means to them. And simultaneously
I want to try to find out if there is a common language
to talk about reliability across the organization and if there
is a mismatch between what people think reliability is and what the
organization wants reliability to be. By having that
sense, we can approach the way that we implement reliability engineering
in different ways. Of course, you need to ask these questions of many,
many people so that you can figure out whether different
clusters of people within your organization have a different sense of what that
means. And then that can help you approach the way
that you're going to implement a reliability culture.
So let's get to some useful tips on how to approach this baking in reliability
thing. The format for these tips is that I'm
going to describe why each tip is
important, how we can implement it, and then, in the "what" section, give
actually useful, practical things that you can do on a daily
basis. The first one might not come as a
surprise, and it will be
common to implementing whatever practice
you want: it's the concept of just asking questions.
So what are we trying to do with this asking questions tip?
So we want to understand the pain points that the company or the engineers
or whoever works in a company have. We also
want to understand the current perceptions that people have about reliability
or even just why things are the way they are. And also
at the same time you want to understand where you can contribute, where you can
put some effort to actually improve people's lives.
So how are we going to do this? We want to increase our knowledge based
on the experience of the people that are already working in the company.
So, very important, what can we do? We mustn't be afraid
to ask questions, no matter how stupid they might seem and no
matter the forum. You could be in a meeting or you
could be in an incident. Don't be afraid to ask questions.
Something that I like to do when I join an organization or I join
a new part of the organization is to use the Career
Cold Start Algorithm, which was popularized
by Meta's CTO, and is essentially
a set of three questions that you ask in 30-minute meetings.
So you book 30-minute meetings with a few people and then you ask
three questions. First of all, you usually ask,
what do I need to know to do my job? And this could be
for you, in particular a team that you're building, or a whole cluster,
or even a practice that you want to implement. Then you
can ask what they think will be the difficulties that
you will encounter doing that. And you list them out.
And the last question is, who should I talk to next? With
those three questions, when you start talking with many people,
you will start to encounter common things. So you will start to understand what
the pain points are, what people think reliability is, and what difficulties you're
going to encounter. And you will also start building a graph of people and
start understanding who does what within the company and who you should
talk with. The second
tip is to expose yourself. The why for this
part is fairly similar to asking questions.
You will also want to understand the pain points and the current
perceptions, and where you can contribute. Again, you want to gain
knowledge from the experience of others,
but you also want to build empathy. You want to put yourself in the shoes
of other people and really feel what they feel and
deal with what they deal with. So what can you do? You can participate in
many events. It could be meetings, it could be incident
bridges, incident reports, whatever makes sense.
With time you will start to filter those out. For some of them you will expose
yourself and realize, okay, it might not be the best
use of my time to be here. But in the beginning, just expose yourself,
be there and try to understand how things are. One very important
thing is that you should listen a lot more than you speak and
of course ask questions. You should be there in an observing capacity,
trying to understand things. Just listen. Listen to what people say,
observe what they are doing. When you don't understand something,
just ask questions and take a lot of notes. If you're anything
like me, you will forget most of the stuff. So I just take as many
notes as I can and then I can try to summarize.
As you might have figured out by now,
exposing yourself and asking questions go hand in hand.
So you can expose yourself and ask a lot of questions, and the other way
around is also true. That being said,
I wanted to highlight these separately because you could ask questions
without deliberately exposing yourself and you can deliberately
expose yourself and not ask questions. So these two actually go very well
hand in hand. They are two things that
will help you progress a lot, especially in the beginning when you're starting
your practice or you're starting in a new company, because you will gain knowledge
from the ones that are already in the company.
Next, also very important is that you want to educate yourself
about what this reliability thing might mean.
So you want to understand the fundamental concepts around reliability
engineering and you also want to understand what other companies are doing, right?
You want to understand how company X does this, or maybe how Google does this,
or maybe Meta, or even companies that are within
your region. So how are you going to do this? So you're going to build
your own foundational knowledge. You will have to read books,
watch videos, do courses, attend events. So here are a few tips
in the "what" section. The first one is the Site Reliability Engineering book,
the one that I described in the beginning. Although it covers a lot,
it's like the Bible for site reliability engineering and
there's a lot of information there that could be useful
for you and for anyone who's implementing reliability engineering.
The second, a very interesting book, is Implementing Service Level Objectives by
Alex Hidalgo. It talks about SLOs and how you
can implement a reliability culture and have a reliability
framework to actually measure and assess your reliability.
The third link is a course on Coursera. It's about reliability
engineering. It's taught by a few engineers
from Google and it's more practical. You'll have videos and you have assignments that
you can do and start building that reliability engineering foundation.
And of course, as important as the other ones, is just
going to events; it could be conferences, could be meetups. It's very interesting
because you can interact directly with people, see what they're doing,
ask follow-up questions, and more importantly you can see what worked
for them and what didn't. So it's very useful to just go to meetups,
find one in your geographical area, go there and talk with people.
Something that you can do in the beginning is immediately start
to attack pain points that the company has. So why would you do that? You
want to alleviate issues that are hurting engineers and the business as a whole.
And you also want to create space for project work. So you want
to alleviate some pain points, reduce some toil so that you can actually focus on
long-term, sustaining projects. And of course you want to make yourself
valuable. So from the get-go you want to increase your
value to the company. So it's good that you help attack pain
points. So again, how do you do this? You will address your organization's
pains. So what can you do in practice for this?
You have to make a deliberate effort to identify recurring
issues. So it might mean that you go to
events or meetings just to understand what's really hurting people's
lives on a daily basis. And after you identify
those, you will have to schedule work to address them. One caveat
here: if you're not careful, there's a serious risk of
that being the only thing that you do, and people get used to this new
team or this new part of the organization just dealing with this kind of stuff.
If you want to create the space for project work, this needs to be
time-bounded, for example, or very contained. So you address some pain points,
but you need to create the space for project work.
Another important aspect when starting a reliability engineering
practice is to make yourself available. So you want to build relationships
with other teams, you want to reduce friction in
communication, and you want to make people comfortable approaching
you. Whether it's just to discuss reliability in
practice or because they have some kind of trouble, they need to be at
ease just coming to you and asking questions. So you want to create
easy-to-use communication channels. How can you do that? You can
create some communication channel, like a Slack channel, for example,
where people can just ask questions and someone from the team answers them.
You can also create office hours. It would be, for example, a scheduled hour
every day or every week where someone from the team is
just there and people can just pop in. It could be physically or it could
be a Zoom call. People can just pop in and ask questions.
Another thing is to have, for example, a mailing list where people can just send
questions or raise something they want to discuss, and people from the team just
answer. Overall, what you want to create is an open-door
policy where people feel at ease just coming to you and asking questions or
discussing a theme around this part of operations and reliability engineering.
Also important is to communicate extensively. Why would
you want to do that? Everyone should contribute to the
reliability efforts in the company.
It doesn't make sense for one specific team on its own to
do reliability work. So it should be a collaborative effort.
And for something to be addressed, people need to be aware of it. You need
to keep the topic of reliability fresh. You don't want people to think
that reliability is just something that one team does or
that all the work has been done and everything is fine. So you want
to make sure that reliability is present in people's minds. So how
can you do that? You can do talks, both internal and external.
You can create a newsletter where you periodically send information about reliability
or efforts that are being done within the company. You can
create documentation that people can refer to when they are doing their
work. You can create blog posts, you can create articles
that you share within the company or outside with things about
reliability in operational work. And of course you can send periodic surveys.
For example, let's say you have a reliability
maturity model; you can periodically send surveys asking
your colleagues how they think their operational
work fits within that model and that will keep the reliability efforts
in people's minds.
Something very important is to make reliability work visible.
So reliability, as we said in the previous slide, shouldn't be something obscure
or something that just one team does. If people know what's
going on, what people are working on, and what the efforts are going to be,
they will be more comfortable about it, because it's not something obscure that
someone is doing and that will then somehow
impact their work. It will also open the door for other people to
collaborate. So you want to turn reliability work into a first-class
citizen. So the first thing is to make it visible.
People should be able to quickly find out what work is being done around
operations or reliability and understand what is
being done. So you can track it the same way you
track any other type of work. For example, if you use tools like Jira
or Trello, you could use the same thing. You could create tickets, you could have
epics, so you could use a similar set of tools that people already understand and
are used to using. And you could use similar
working models. For example, if the company is using an agile development process,
be it Scrum, Kanban, or whatever it is, you could use a similar type of
working model so that people can quickly understand what's being done, when it's going to
be delivered, et cetera.
Finding your niche is especially important when you're starting out in
a company that already has some size. So many companies
have different teams already addressing operational work and
it doesn't make sense to have competing efforts because that will be just a
waste of resources. And of course, competing efforts
create bad incentives and promote a bad culture. So we
would want to find reliability areas not being actively addressed
and tackle those. So make sure that your work doesn't
overlap too much with other teams. You want to identify
and stop competing efforts as soon as possible and you want to create
a collaborative culture. So maybe you have a cloud engineering team or
a DevOps team that is focusing on a particular area.
It doesn't really make sense to go there and try to change all their work.
It makes a lot more sense to focus on something else, for example observability,
if you don't have a team focusing on that in particular. And then
when some work overlaps with those teams, you will want to collaborate
with them and not replace the work that they're doing.
Very, very important for any effort, and for reliability engineering
in particular, is to promote independence. Teams
should be as autonomous as possible. Your team or
your practice shouldn't be the bottleneck. You need to allow people to
learn and build things on their own. You want to build
tools that allow people to progress on their own and
gain traction. You can do that by providing documentation,
both written and video. You can build tools, you can build platforms,
you can build CLIs, you can build bots, et cetera. And of course you
can do training. It could be formal training, for example certification,
or you can do your own internal workshops. The idea here
is that teams can independently
gain traction: they have the necessary tools, they have the necessary documentation
to progress on their own, and they don't have to wait for you to do
some kind of work.
Talking more specifically about reliability
itself, it's very helpful to have a reliability framework.
Why is that? By having a reliability framework,
you have a way to define reliability. Also,
you have a way to measure and assess reliability.
It will create a shared language to talk about reliability between
teams and it will facilitate prioritization. So
you can create your own or use an existing reliability framework.
One that is very popular at the moment is the use of SLOs. To
gain more knowledge about that, you can read about SLOs in the SRE book,
you can read about them in the SRE workbook, which is the second book that Google
released with more practical implementations. And of course you can
read Implementing Service Level Objectives by Alex Hidalgo, which
talks extensively about reliability. The idea here is
that different teams talk about reliability using the same language.
And this will help avoid conflict because,
for example, you won't have a development team saying that
this service needs to be reliable for
x, y or z while, for example, operations teams talk about reliability in a different
way. And that will also, of course, facilitate prioritization.
Because if you have a way to measure and assess reliability, you can invest in
what makes sense. So if you're within your reliability bounds, you can probably
release more features for your system. But if you're below
the reliability that you have defined, maybe you need to
invest in more operational work.
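
The prioritization idea above can be sketched as a small error budget calculation. This is a minimal illustration, assuming a request-based availability SLO; the function names, the 99% target, and the request counts are hypothetical and do not come from the talk or from any specific SLO tool.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the measurement window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; a negative value means overspent."""
    budget = allowed_failures(slo_target, total_requests)
    if budget <= 0:
        raise ValueError("need an SLO target below 100% and at least one request")
    return 1.0 - failed_requests / budget


def suggested_focus(slo_target: float, total_requests: int, failed_requests: int) -> str:
    """Within budget: keep releasing features. Budget spent: invest in operational work."""
    remaining = budget_remaining(slo_target, total_requests, failed_requests)
    return "release features" if remaining > 0 else "operational work"


if __name__ == "__main__":
    # Hypothetical window: 10,000 requests against a 99% availability target.
    print(suggested_focus(0.99, 10_000, 50))
    print(suggested_focus(0.99, 10_000, 150))
```

The point of the sketch is the shared language: development and operations teams look at the same number and reach the same decision, instead of arguing from different definitions of "reliable".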
As with any practice, having executive support can
be very beneficial. So executives need to understand why
reliability efforts are important and how they affect the business.
Along the way, some changes could be hard and could
require a push from the top, and it's also helpful to
align business with engineering. So you will do that by
engaging periodically with executives. You need
to interact with executives regularly. Very important is
that you need to back your claims with data. For example,
DORA metrics are a good way to translate engineering metrics into
things that the business can understand as being beneficial in terms of
operational work, and you want to connect reliability engineering efforts
directly to business outcomes. That way,
executives will understand the impact that these engineering
practices will have on the business as a whole.
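
As a rough illustration of backing claims with data, two of the DORA metrics (change failure rate and time to restore service) can be computed from simple deployment and incident records. This is a sketch under stated assumptions: the `Deployment` and `Incident` record shapes and all sample values are hypothetical, and real data would come from your own deployment and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    deployed_at: datetime
    caused_failure: bool  # hypothetical flag: did this change degrade service?


@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that led to a failure in production."""
    if not deployments:
        return 0.0
    return sum(d.caused_failure for d in deployments) / len(deployments)


def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Average time between an incident starting and service being restored."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta(0))
    return total / len(incidents)


if __name__ == "__main__":
    day = datetime(2023, 1, 9, 10, 0)
    deployments = [
        Deployment(day, False),
        Deployment(day + timedelta(days=1), True),
        Deployment(day + timedelta(days=2), False),
        Deployment(day + timedelta(days=3), False),
    ]
    incidents = [
        Incident(day, day + timedelta(minutes=30)),
        Incident(day + timedelta(days=1), day + timedelta(days=1, minutes=90)),
    ]
    print(f"Change failure rate: {change_failure_rate(deployments):.0%}")
    print(f"Mean time to restore: {mean_time_to_restore(incidents)}")
```

Numbers like these are easier to connect to business outcomes than raw engineering telemetry, for example by tracking how they move after a reliability investment.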
And last but not least, very important for reliability engineering
is to invest in observability. Why is that?
You want to understand how users are interacting with your services.
You want to understand how happy users are with your services.
For that, you need to understand how your systems are behaving.
You want to improve your mean time to detection, you want to improve
your mean time to repair. And of course you want to reduce the change
failure rate. So you can do that by investing
in observability. So you'll need to equip
your system with metrics, logs, traces, stack traces,
continuous profiling, etc. There's a good
book, Observability Engineering, released by the
team at Honeycomb, and it talks about what observability is and how you
can implement observability within your systems. And you
can also use open source tools and standards like OpenTelemetry
and even leverage automatic instrumentation to get something
off the ground very quickly and make your
lives a lot easier.
So before we go, let's quickly recap what we
talked about. First of all, we talked about context.
We talked about why there's a sudden interest in reliability engineering
and SRE in particular. That started
when Google released their book. A lot of people saw that as a
new approach to doing operations and they tried to incorporate
SRE within their organizations. But because SRE can be so
broad and so diverse, and some of the things that
work for Google might not work within your organization, it can be quite hard to
make that translation between what works at Google and what will work within your
company. So the first thing I do is ask individuals what
reliability means to them and what it
means to the organization. That will help set the stage
for you to understand whether people
have a framework in place or not,
and will help you understand if there's a mismatch between
what they think reliability is and what the company thinks that reliability is.
And then we went through some useful tips,
things that I usually do on a daily basis to help
me do this reliability work regularly.
Keep in mind that most of these tips are not done
once and then forgotten. These are things that can help you start but
also keep the ball rolling. And most of these things can be done in parallel,
so you can, for example, attack pain points and make yourself available
at the same time. It makes sense for you to do these things in parallel.
And this is all from my part. I hope this talk was informative for
you. Each of these tips that I
mentioned could be a
presentation on its own, so feel free to reach out to me.
You could meet me at events or you could send me
a message on social media and we can keep discussing these topics.
So thank you very much and have a great conference.