Transcript
Hello everyone, welcome to the Conf42 Site Reliability Engineering 2022 event.
Today I will be talking about SRE antipatterns.
We will talk about ten different antipatterns
that I have seen, and that I see every day while interacting
with people who are practicing SRE.
The first one that we are going to talk about is
renaming operations to SRE and continuing to do the
same work that you did. Please understand
that SREs exist for a specific objective. Google created SRE
because at that point in time there was a need for tremendous
scaling, and it was not possible to
use the traditional methods of maintaining operations.
And hence what Benjamin Treynor Sloss had to do was bring
in the SREs.
SREs are not there to do the regular work.
What operations does, operations will continue to do.
The main focus of SRE, first of all, is reliability,
and not only reliability as of today,
but reliability when it scales.
SRE delivers a benefit only if there is
scaling in the organization. So they are looking
at things from a future point of view. And for that they need
to learn from failures. So SRE learns
from failures. There's a lot of psychological safety that is required
to implement SRE. People should be able to break
things and learn from it, because what
Google believes is that if something breaks,
it is the system's problem. For example,
if I am not in a good mood and I come to
the office and I do something which breaks the system,
it's not my fault. It's the system
which needs to be changed. Because if I could break the
system in a particular manner, everybody else
will be able to break it in that way. So the learning has
to happen so that nobody is able to break the system in the same manner.
So we have to learn from failure.
And again, as I said, scalability is
the main objective. How can we scale
in a very fast-growing scenario? It can be
scaling in terms of the number of new
users joining, like what happened to Zoom after the lockdown
started, when so many different people started using it.
It can be in terms of instability in the
system, with more and more incidents happening.
It can be in terms of newer technology
that you are moving into. It can
be new features that are coming out very frequently.
Whichever way the scaling happens, if there
is scaling, there is a need for site reliability engineers, and
they have to focus on that, not on running the day-to-day
operations. So do not expect SREs
to do your regular operations work in terms of monitoring,
or in terms of doing some automation
for your routine tasks. Yes, they will automate,
but they will automate to reduce
toil, to free up time for people, so that they can focus
on making things better, on increasing
the reliability of the system. And in
their latest book, published in January 2020,
called Building Secure and Reliable Systems,
Google states that a
system is not reliable if it is not secure.
So they have to think from all of these perspectives.
So don't just rename your operations team and continue to
do what you used to do. That's not what SRE is. And that's why
the skill set required of an SRE is also not the same.
The second one that we see:
users should not notice an issue before you
do. A lot of times we see that problems
come up in production and users get
to know of them before we do. So the
first thing that we need to do is identify
appropriate SLOs, and
then define alerts that are
actionable. We lose ourselves
in the noise of alerts. That happens.
Not everything needs to be alerted.
Everything needs to be logged, everything needs to be traced, but not
everything needs to be alerted. We have all boarded an aircraft,
and while we are standing there
before it takes off, the cockpit door is open and we can see that
there are hundreds of different gauges in there.
Do you think that the pilot and the copilot look at each and
every gauge on their dashboard?
No, they look at only a handful,
the four or five gauges they need to look at
to ensure that everything is going fine.
You can think of the DORA metrics, you can think of Google's
golden signals. We still
need to capture everything that is happening, but what we
do is look at these few signals.
If anything is not working as per
expectation, then we go to the other data relevant to it to
diagnose the problem. So alerts have to
be on the service, not on individual metrics.
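As an illustration, here is a minimal sketch, in Python, of alerting on a service SLO rather than on individual metrics. It follows the burn-rate style of alerting described in Google's SRE Workbook; the 99.9% target, the one-hour window and the 14.4x threshold are illustrative assumptions, not values from this talk.

```python
# Illustrative sketch: page only when the service is burning its error
# budget too fast, instead of alerting on every individual metric.

SLO_TARGET = 0.999           # assumed availability SLO (99.9%)
ERROR_BUDGET = 1 - SLO_TARGET

def burn_rate(good_events: int, total_events: int) -> float:
    """How fast the error budget is being consumed in this window.
    1.0 means we are burning exactly at the budgeted rate."""
    if total_events == 0:
        return 0.0
    error_rate = 1 - (good_events / total_events)
    return error_rate / ERROR_BUDGET

def should_page(good_events: int, total_events: int,
                threshold: float = 14.4) -> bool:
    """Page a human only when the burn rate over this window exceeds
    the threshold (14.4x over 1 hour is about 2% of a 30-day budget)."""
    return burn_rate(good_events, total_events) > threshold

# Example: 1,000,000 requests in the last hour, 2,500 of them failed.
print(burn_rate(997_500, 1_000_000))    # 2.5 -> elevated, not page-worthy
print(should_page(997_500, 1_000_000))  # False
print(should_page(980_000, 1_000_000))  # True: 20x burn, wake someone up
```

In practice this is usually paired with a second, longer window so that slow burns are caught as well.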
Obviously, we need to increase observability;
we have to be able to know each and everything. And by observability
here we mean MELT: metrics,
events, logs and traces.
Everything we should be able to know, and
we should be able to detect faster, because for us the objective
is MTTR. Mean time to recover should be shorter,
and for mean time to recover to be shorter,
we need mean time to detect to be shorter. If you need
mean time to detect to be shorter, you need to have a better
understanding of the system, the end-to-end domain knowledge.
You may not be a subject matter expert, but that knowledge is required.
You need to have a better understanding of your IT
system and of the entire journey of
the different customer personas. It is not a one-size-fits-all
approach. We need to know what each one is doing and what is normal for
each of them. We need to bring
that observability into it. And finally,
we need to have better fault tolerance to achieve
the SLOs that we have promised. And all of that is from
the point of view of the customer experience,
which is what we are delivering.
So we need to augment our
monitoring, our telemetry, our application performance management,
and move towards observability, so
that we can get each and every piece of information when we
need it, but not drown
in the noise of alerts.
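To make the MTTD and MTTR relationship above concrete, here is a minimal sketch, assuming each incident record carries three timestamps (started, detected, recovered); the sample incidents are purely illustrative.

```python
from datetime import datetime
from statistics import mean

# Each incident records when the fault started, when we detected it,
# and when service was recovered. MTTD and MTTR are simple averages,
# and a shorter MTTD is a prerequisite for a shorter MTTR.
incidents = [
    {"started":   datetime(2022, 8, 1, 10, 0),
     "detected":  datetime(2022, 8, 1, 10, 12),
     "recovered": datetime(2022, 8, 1, 10, 45)},
    {"started":   datetime(2022, 8, 5, 22, 30),
     "detected":  datetime(2022, 8, 5, 22, 33),
     "recovered": datetime(2022, 8, 5, 22, 50)},
]

mttd_minutes = mean(
    (i["detected"] - i["started"]).total_seconds() / 60 for i in incidents)
mttr_minutes = mean(
    (i["recovered"] - i["started"]).total_seconds() / 60 for i in incidents)

print(f"MTTD: {mttd_minutes:.1f} min, MTTR: {mttr_minutes:.1f} min")
```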
The third one is measuring only until my edge. Now, this is a common traditional
scenario where we are bothered about 99.99%
availability of our servers. We are looking at
latency and so on, but all from within the four
walls of our organization, the things that
we are in control of. But that is
not enough. The customer is
happy only when their experience
is better. I can have 99.99%
availability on my servers, but I
am going through some last-mile network,
because the customer is using mobile data which is giving
95% availability. Now that
is not giving the customer 99.99%
availability. And that's why having the appropriate
level of SLO is very important. We should
never target 100%, because it is not possible, unless
we are a manufacturer of pacemakers or
something like aircraft, where it simply has to work.
Most probably everything
else doesn't need to be 100%.
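As a quick worked example of the point above, using the numbers from the talk: the end-to-end availability the customer experiences is roughly the product of the availabilities of every component in the path, so the weakest link dominates.

```python
# End-to-end availability across components in series is the product
# of the individual availabilities.
server_availability = 0.9999      # 99.99% inside our four walls
last_mile_availability = 0.95     # 95% on the customer's mobile network

end_to_end = server_availability * last_mile_availability
print(f"Customer-visible availability: {end_to_end:.4%}")  # ~94.99%
```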
So we need to understand what is normal, what the customer
is really looking at, what they really need,
and jointly define it. SREs work together with
the product owners and with the users to define the SLOs
based on facts and figures, on what is happening today.
So again, observability is important even to have this discussion.
SREs are responsible for better customer
experience. And today customer experience is not just
delivering some components or IT elements.
It is a journey that they are looking at, or, as I
call it, we are looking at servitized products.
Today we don't buy songs on a CD.
We buy a musical experience, and that too on a pay-as-you-go
model, from Spotify.com.
And the entire journey is important for the customer experience,
not a specific song. We look at
how easy it is to search for the different types of songs,
the different genres of songs that we want to listen to.
We look at how easy it is to play it:
whether I am in the house, whether I am in the office, whether I'm in
the car, whether I'm at a camping site, am I able to do it?
Can I use any type of device to listen to it?
What kind of recommendations is
the system making based on my
listening habits? Can I pay as I go?
Can I pay only when I'm listening, and not all the time?
All of these together give the customer experience.
If we look at Uber, Uber is not delivering a cab
service. They are giving us a travel experience.
And that's why the customer is looking at that entire journey,
starting from the time they book the cab till the time they
pay the money and get out of the cab.
That entire journey has to be good.
That is the kind of thought process that we have to start looking
at. SREs have to look at it and then find
ways to make it better. That is the job of SRE.
Once again, it is not running your day-to-day operations
work. And obviously we have
to look at end-user performance. We need to look at
end-user performance not from within our edge,
not from within what is under our control, but as it is
happening at the customer's side. There are tools
like Catchpoint which allow you to do end-user performance management.
We need to look at the web analytics,
how fast a page is opening
up, or how slowly it is opening.
We need to look at it from that point of view,
keeping in mind the customer and how they
are getting there. There is a situation I remember: when I was talking
to the founder of Catchpoint, Mehdi Daoudi,
he was mentioning a situation where they had
seen that page speed was slowing down
for an AWS-hosted system of their client in
California. The client raised a
complaint with AWS. AWS sent them the
server logs and said everything is fine.
The customer then sent the Catchpoint report, and then they looked at it,
and it took five days for AWS to find
out where the problem was. The traffic from California was supposed
to go to a North American server.
It was not going to it. Some change had been done
earlier because of which it was going to some server in Asia and back.
Now, if you look at the North American
server and pull its statistics, its details,
it will not show anything wrong. But is your customer
happy? Is your customer getting your promised service?
No. So we have to move out of our
edge and go all the way to the customer.
That is what SRE is going to look at.
The next important point is that false positives
are worse than no alerts. And today we find
that with traditional monitoring we get a lot of
false positives, which go on and ultimately affect
us in production. Customers are finding those
problems. So we cannot look at
individual host alerts and think that everything is fine.
Even, for that matter, if we are looking at HTTP requests,
looking from the server's point of view is one thing,
but what happens to those HTTP requests which are failing at the
load balancer level? So we have
to look at everything starting from the
customer experience, and then go backwards to see
what needs to be done to achieve
what the customer has been promised in
terms of SLOs. So alerts
have to be very, very specific with respect to the services, and
not with respect to individual components. Yes, we will have to
track different things on the component side,
but the ultimate objective has to be the service
that the customer is getting. Response fatigue and
information overload from time-series data are not good. So, as
I said, too many alerts is not good. If people have to
keep moving from one thing to the other because the pager
is ringing very often, it is not a good thing.
It is actually reducing productivity.
We cannot do multitasking.
Neuroscience has shown that our brains are not wired
to do multitasking. We just switch between things quickly and think we are multitasking,
but that's not the case. So we need
to look at only actionable
alerts, not look at everything and
get disturbed. Every time we move from one thing to the other,
the brain takes time to shift.
Alerts should carry good diagnostic information; that
is again very, very important. Just saying that something is wrong
is not good enough. So when we are creating the alerts, we have to make
sure that at that point in time we collect as much health-related
information about the system as we can and pass it on in the alert,
so that whoever looks at the alert is
able to diagnose faster. That means
a lower MTTD, which leads to a lower MTTR.
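Here is a minimal sketch of that idea: at the moment an alert fires, snapshot whatever health data is cheap to collect and attach it to the alert payload. The fields and the collector functions are illustrative placeholders, not a specific tool's API.

```python
import json
from datetime import datetime, timezone

# Placeholder collectors: in a real system these would query your
# observability stack (recent error logs, saturation, deploy history).
def recent_error_samples():
    return ["timeout calling payments-db", "timeout calling payments-db"]

def current_saturation():
    return {"cpu": 0.91, "db_connections_used": 0.98}

def last_deploys():
    return [{"service": "payments", "version": "1.42.0", "age_min": 17}]

def build_alert(service: str, slo_violated: str) -> str:
    """Build an alert that says what is wrong AND ships the diagnostic
    context needed to shorten MTTD for whoever gets paged."""
    payload = {
        "service": service,
        "slo_violated": slo_violated,
        "fired_at": datetime.now(timezone.utc).isoformat(),
        "diagnostics": {
            "error_samples": recent_error_samples(),
            "saturation": current_saturation(),
            "recent_deploys": last_deploys(),
        },
    }
    return json.dumps(payload, indent=2)

print(build_alert("payments", "latency p99 < 300ms"))
```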
And if MTTR is high,
that means mean time to recovery is high, which means your outage is
longer, which means you are going to eat into your error budget.
You will have less time to do better things,
like releasing new features, like applying
security patches, like doing chaos engineering. You will not
have time if you are already having outages
from something else.
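As a quick sketch of the error budget arithmetic (the 99.9% target and the 30-day window are illustrative assumptions): the budget is simply the downtime the SLO allows, and every minute of outage comes out of it.

```python
# Error budget for an availability SLO over a rolling window.
slo_target = 0.999             # assumed 99.9% availability SLO
window_minutes = 30 * 24 * 60  # 30-day window

budget_minutes = (1 - slo_target) * window_minutes
print(f"Total error budget: {budget_minutes:.1f} minutes")  # 43.2 minutes

outages_this_window = [12, 25]  # minutes of downtime from two incidents
remaining = budget_minutes - sum(outages_this_window)
print(f"Remaining budget: {remaining:.1f} minutes")         # 6.2 minutes
# With almost no budget left, risky work (big releases, chaos
# experiments) has to wait until reliability is restored.
```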
The next one is the configuration management trap.
Traditional infrastructure management is not suitable in today's
world, because there are so many moving parts, so many different things,
so much of a distributed system, that the traditional
way of doing infrastructure is not going to help. I remember long
ago, when I started working,
the desktops at that time had
motherboards with only two slots for memory. So you
could put in two 1 MB modules (MB, not GB), or two 4 MB modules,
or two 8 MB modules, or two 16 MB modules. That was the maximum that you
could do. So I had one 4 MB memory module
in the slot. I wanted to make it 8 MB.
I submitted the request and got all the approvals.
The engineer comes to me, opens my desktop,
sees there is a 4 MB slot which is empty, and one
with a 4 MB RAM module.
He opens the desktop beside me,
because nobody was sitting at that desktop; the person who sat there
had gone to a client location for a month or so.
He opens it, takes out its 4 MB memory module,
puts that RAM chip into my machine, my machine
becomes 8 MB, closes both the machines, and goes away. After
one month, when this person whose machine
was opened comes back, his machine is not booting,
his desktop is not booting. And by that time, we have forgotten
what happened. Imagine that happening
in today's world with millions of different
components out there.
It's impossible. So we have to make sure that
we get into infrastructure as code and configuration management
as code. We have to have very,
very strong configuration management, not only for
stability but also, as I said
earlier, because reliability means security; we need it even for security purposes.
We need everything to be automated.
We need to move into that immutable infrastructure scenario,
the pets versus cattle versus poultry scenario.
Pets is the older idea:
we had huge servers, and we have seen those servers
with names like John and Thomas and Paul
and so on. We are
so emotionally attached to those servers that we want to keep nursing
them as much as possible so that they keep running.
But that is not cost friendly.
That is not giving us the kind of result that
we are looking for to satisfy the current,
ever-changing needs of the customer.
So we move to cattle: less
emotional attachment, more numbers, more work
can be done; if one of the cattle is sick,
we just put it down and replace it. This is the
VM. But in today's evolving
world, that is also not good enough. So what we are
moving towards is poultry: like chickens, huge numbers
can be put in one place, a lot of
work can be done, and it is less expensive.
That is your containers. So we are moving
into an immutable scenario, with automation,
with containers. And immutable means that you cannot change it.
So in today's world, we do not change; what we do is
replace. We kill the old one
and replace it with a new one. That way it
is also much more secure, and it is much easier to
detect problems and rectify problems.
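Here is a minimal sketch of the replace-don't-patch idea as a reconciliation loop. Everything in it (the desired-state table, launch_instance, terminate_instance) is a hypothetical placeholder standing in for your real provisioning tooling, not any particular API.

```python
# Declarative desired state: each service pins an immutable image version.
desired_state = {
    "checkout": {"image": "checkout:2.4.1", "replicas": 3},
    "search":   {"image": "search:1.9.0",   "replicas": 2},
}

# Observed state as reported by the platform (illustrative data).
observed_instances = [
    {"id": "i-101", "service": "checkout", "image": "checkout:2.4.1"},
    {"id": "i-102", "service": "checkout", "image": "checkout:2.3.9"},  # drifted
    {"id": "i-103", "service": "search",   "image": "search:1.9.0"},
]

def launch_instance(service, image):     # placeholder provisioner
    print(f"launch new {service} instance from {image}")

def terminate_instance(instance):        # placeholder terminator
    print(f"terminate {instance['id']} ({instance['image']})")

def reconcile():
    """Never patch a drifted instance in place: kill it and launch fresh
    copies from the desired immutable image until the count is right."""
    for service, spec in desired_state.items():
        current = [i for i in observed_instances if i["service"] == service]
        drifted = [i for i in current if i["image"] != spec["image"]]
        healthy = [i for i in current if i["image"] == spec["image"]]
        for inst in drifted:
            terminate_instance(inst)
        for _ in range(spec["replicas"] - len(healthy)):
            launch_instance(service, spec["image"])

reconcile()
```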
So SREs don't spend much time on
changing things in place. Rather, they automate in such
a manner as to homogenize the ecosystem, to make it
much easier for people to
take care of it. A lot of it automated, a lot of it
getting into a self-healing kind of situation.
That is the work that
needs to be done. The next important
aspect is incident response.
We all know we are doing incident response. Yes,
SREs also have to be part of incident response.
A couple of reasons. First, if they are coming in
as the expert, as the advisor,
as the consultant to the entire delivery lifecycle,
bringing the wisdom of production from the right to the left,
they need to know what is happening on the ground. They cannot be an advisor
without knowing what is happening. So they have to be hands-on.
But here the incident response is different.
First of all, SREs do not
use a tiered support model. If you want
to implement SRE, you have to move away from that level one,
level two, level three, level four support. It has to be one
single team responsible for the entire system,
end to end. We have to move from a project to a product
approach: one single cross-functional,
self-sufficient, self-organized team
where all the capabilities are there.
So we are looking at a comb-shaped skill set across
the whole team.
And here, if a problem happens, if an incident
happens, you need to swarm.
So everybody that is relevant
comes together and solves the problem, because everybody
looks at it from the same point of view and
solves it together. It's not about handing
it over from one unit to the other to the other. No.
So no tiered support; we have to get into swarming.
Now, when incidents are bigger, we need to have a
proper framework for incident command.
And Google has defined an incident command framework, with an incident
commander and various other roles, which you need
to look at, and that facilitates the
smooth flow of work. A lot of it
has to be automated. So SREs
look for opportunities to automate
whatever can be automated, because on-call, anyway, is not something
we look forward to doing a lot
of. Now, you can use chatbots, you can use IVRs,
you can use many automated systems, and you can create a lot of
runbooks to take care of it.
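As a small sketch of what automating a runbook can look like (the alert names and steps here are hypothetical, not from the talk): map each alert type to an ordered list of diagnostic or remediation steps, so the on-call engineer, or a chatbot, can run them with one command.

```python
# Hypothetical automated runbook: each alert maps to ordered steps.
# In a real setup the steps would call your own tooling; here they
# just return strings so the sketch is runnable end to end.

def check_recent_deploys(service):
    return f"no deploys to {service} in the last 2 hours"

def check_dependency_health(service):
    return f"dependency 'payments-db' of {service} reports 98% connection use"

def restart_unhealthy_tasks(service):
    return f"restarted 2 unhealthy tasks of {service}"

RUNBOOKS = {
    "HighErrorRate": [check_recent_deploys, check_dependency_health,
                      restart_unhealthy_tasks],
    "HighLatency":   [check_dependency_health],
}

def run_runbook(alert_name: str, service: str) -> list:
    """Execute every step of the runbook for this alert and collect the
    results, so the responder sees the whole picture in one place."""
    steps = RUNBOOKS.get(alert_name, [])
    return [step(service) for step in steps]

for line in run_runbook("HighErrorRate", "checkout"):
    print(line)
```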
And what is also very important is the learning
from it. We talked about learning from failure. So if
an incident has happened, it's an opportunity
to learn. And how do we learn?
Through blameless postmortems. So SREs are the ones who are going
to facilitate and conduct the blameless postmortem with the people
who were actually involved at the time, because they have the information
as to what happened at that point of time: what was the sequence
of activities, what were their expectations, what were their
assumptions, what were the things that they did?
That's the learning. And this blameless postmortem
has to be documented, and it has to be circulated
to everybody in the organization, not just within the
team. We have seen cases where
organizations, even banks, have not only shared the
postmortem within themselves, they have shared it with
the outside world on social media, because there is still somebody who is
going to benefit from it, number one. Number two, in a
specific case, as you can see in a video by Monzo
bank, one of the contributing factors
of that incident was
an open source product: a certain aspect
of it had changed in a new version, which
created the problem. Now, among the people on social media
were also the people who create and maintain that open
source product.
They got back and said: great,
we have got this information, sorry to hear this,
we will take care of it in the next change.
So everybody gains from it.
SREs are involved in all of this.
As SREs, we have to start thinking beyond point
fixing.
Point fixing means that we are looking at the
problem only from the immediate point of view and solving it.
But SREs don't look at it that way. SREs look at it in
a much bigger context, in a much
longer-term kind of scenario. So minimize
outages with automated alerts and solid paging mechanisms, with
quick workarounds, faster rollback,
failover and fix-forward. So,
when you are creating anything, you have to think through all of these: how you
can do it, how you can automate it. If there
is any new release, that new release will not go out unless
the script for rollback
is also ready.
And all of these have to be tested.
Analyze and eliminate whole classes of design errors.
As SREs, we have to design
and analyze what is happening and, based
on that, automate things:
short-term fixes, followed by preventive, long-term fixes, leading
to predictive methods. So, as I said, SREs are
not looking only at the situation as it is today.
SREs are looking at the situation in the
future. They are looking to be ready for the unknown
unknowns, not only the known unknowns.
So they have to look at the observability, they have to look at everything that
is happening, and also analyze and do
the what-if analysis for the scaling that might
happen in the future, and be ready for that.
For example, something may work in your current
situation, let's say a small latency delay,
but that same small latency delay, when it is happening with
millions of users, can crash the system.
In your environment, there can be an automated
process where you are transferring some
files from one point to another and then working
on those files, processing them. But if
there is a delay when you are doing it over a distributed system,
the whole process gets stuck.
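One way to quantify the latency example above is Little's law (concurrency = arrival rate x latency). This is a sketch with purely illustrative numbers, but it shows how a latency increase that is harmless at low traffic can exhaust a fixed capacity once millions of users arrive.

```python
def concurrent_requests(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x latency."""
    return arrival_rate_per_s * latency_s

CAPACITY = 10_000  # e.g. a fixed worker/connection pool (illustrative)

scenarios = [
    ("today, normal latency",       100,    0.050),
    ("today, degraded latency",     100,    0.500),
    ("millions of users, normal",   50_000, 0.050),
    ("millions of users, degraded", 50_000, 0.500),
]

for name, rate, latency in scenarios:
    in_flight = concurrent_requests(rate, latency)
    status = "OK" if in_flight < CAPACITY else "SATURATED"
    print(f"{name:32s} -> {in_flight:8.0f} in flight ({status})")
```

The same 10x latency degradation that costs nothing today saturates the pool at scale, which is exactly the kind of what-if analysis described here.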
So those are the kinds of things that SREs are going to
look at. So aim for auto-remediation and closed-loop remediation without human intervention.
What is repetitive, what can be done by a machine, should
be automated.
What machines cannot do, like refactoring
technical debt, re-architecting,
bringing in new features: those are things
which human beings have to do, and people will focus
on that; the rest of it will be automated. That is what toil
is about. The next one is the production readiness
gatekeeper. SREs are not the gatekeepers. In
DevOps and DevSecOps, we want things to move faster.
SRE complements DevOps, so SRE cannot be a gatekeeper
that stops faster releases.
Any process that increases the length of time between the creation
of a change and its production release, without adding definitive value,
is a gatekeeper that functions as a choke point or a speed
bump. So we have to make sure that
the whole activity is such that it helps improve the
flow rather than stopping the flow. So if
you are looking at release and deployment, as an SRE you are
going to put that release and deployment automation in place not only
in production, but also at each and every other
stage, in every other environment, so that
the release automation is tested throughout.
And it helps the users, it helps
the developers. So SREs will enable and enhance
velocity: they will use the error budgets,
build platforms, and provide dev teams with development
frameworks and templatized configurations to speed up reviews.
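A minimal sketch of a templatized configuration, with all names and values hypothetical: a shared base that encodes the platform team's defaults, plus a small per-environment overlay, so a product team spins up a compliant environment without re-reviewing everything from scratch.

```python
# Base template maintained by the SRE/platform team; product teams only
# supply the small per-environment overlay.
BASE_TEMPLATE = {
    "replicas": 2,
    "cpu_limit": "500m",
    "memory_limit": "512Mi",
    "readiness_probe": "/healthz",
    "log_level": "info",
}

ENV_OVERLAYS = {
    "dev":  {"replicas": 1, "log_level": "debug"},
    "prod": {"replicas": 6, "cpu_limit": "2000m", "memory_limit": "2Gi"},
}

def render_config(service: str, env: str) -> dict:
    """Merge the reviewed base template with the environment overlay."""
    config = {"service": service, "environment": env, **BASE_TEMPLATE}
    config.update(ENV_OVERLAYS.get(env, {}))
    return config

print(render_config("checkout", "dev"))
print(render_config("checkout", "prod"))
```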
So they will create the infrastructure as code and give
it to the developers and to the testers, so that with the push of a button
they can create those environments and test on
them, using the same release and deployment automation
that is used to deploy into each environment.
So SREs
shift left to build resilience by design into the development
lifecycle. They are involved in bringing
the wisdom of production to the entire delivery lifecycle,
starting from ideation.
They will help and guide the developers and
the testers, so that when they complete the work in
a given stage, it is something which is deployable
to production, and it is not going to affect
production in a negative manner.
As I said, everything is a systems problem,
not a human error. So SREs will strive
not to have the cause of an outage repeated; a repeat means there was
no learning. The desire to prevent
such recurrent failures is a very powerful incentive to identify causes.
So SREs are responsible for reliability.
That means the same failure should not happen a
second time. That's the learning we are talking about.
One of the challenges we see with root cause analysis
is that the root cause is just the place where we decide we
know enough to stop analyzing and trying to learn more.
SREs don't stop there. SREs try to see: okay, fine,
now we have understood the problem in the current
scenario; what about the future?
What about a different scenario?
So we have to see what will
happen in the future, even with our
root cause, and continue to find
ways to make the system more reliable.
So we have to move to thinking about the contributing
factors. If we know what happened and where things
went wrong, let's explore the system as a whole and
all the events and conditions leading to the outage. Again,
this is related to point fixing. We are not looking
at only point-fixing the current problem, but at all the things
that may lead to it and all the things it can cause
in the future. And as I
said, it's always a problem of the system, not a human
problem.
problem. SRE is not
about only automation. This is again another kind
of thing that I see. Many organizations are taking more and more
software developers, because they have heard that SREs are
developers. So they are taking
in a lot of developers, and the only thing they sre doing is automating
a lot of things. The point is we have to be
very, very clear as to why we are
automating.
And that has to tie to the measurement.
Do we really measure what we are automating?
Our project is over successful the moment we have implemented
something. No, that is the starting point. The value
creation starts only when the users SRE using
the product or service that you have created.
If that is the case, then the
value creation starts only when the implementation of
automation is done. I've seen scenarios where people have
automated and then when they start measuring, they have found
that the automation has now created more problem
and it is taking more time for them to do the work opposed to what
it was earlier. We have to also understand
that there is a constraint of resources, constraint of
fund, constraints of time. So we need to
prioritize what is most important and automate
that SRE needs to do that prioritization based
on facts and data, not on
the basis of what we feel.
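A minimal sketch of prioritizing automation candidates with data rather than gut feel; all of the tasks and numbers are hypothetical. Ranking by the toil hours a task consumes each month against the estimated effort to automate it gives a simple payback ordering.

```python
# Hypothetical toil measurements gathered over the last month.
candidates = [
    {"task": "certificate renewals",    "toil_hours_per_month": 6,
     "automation_effort_hours": 12},
    {"task": "restarting stuck jobs",   "toil_hours_per_month": 20,
     "automation_effort_hours": 16},
    {"task": "quarterly access review", "toil_hours_per_month": 2,
     "automation_effort_hours": 40},
]

def payback_months(c: dict) -> float:
    """Months until the automation effort pays for itself."""
    return c["automation_effort_hours"] / c["toil_hours_per_month"]

for c in sorted(candidates, key=payback_months):
    print(f"{c['task']:28s} pays back in {payback_months(c):.1f} months")
```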
It is also important that we allow the people who
are doing the work to decide on the automation and to decide
on the tools that are going to be used. And
for scaling, along with the value stream
teams, along with the product teams,
SREs can take ownership of the platform.
The entire CI/CD pipeline and the release and deployment
up to production can be built as a platform,
and SRE can provide that platform as a service to the
product teams, to the value stream teams. That is
the best combination for scaling.
And why do we talk about SRE taking
responsibility for the platform? Because today
the developers are using the same tools that
operations is maintaining.
If developers are using Kubernetes,
production is also running Kubernetes. If you are using Docker
here, you are also using Docker there.
The entire definition
of done has changed. The definition of done has extended to production.
As per DevOps, the work is not complete until and
unless it is tested in production by actual
users. So you use things like A/B testing, blue-green deployment and
canary testing, where it is done
in production with actual
users. That is what SREs are facilitating.
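Here is a minimal sketch of the kind of decision a canary rollout automates; the traffic split, thresholds and error counts are all illustrative, and promote/rollback would be calls into your real deployment tooling.

```python
# Compare the canary's error rate against the stable version before
# shifting the rest of the traffic to it.
def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def canary_decision(stable_errors, stable_requests,
                    canary_errors, canary_requests,
                    max_relative_degradation=1.5):
    """Promote only if the canary is not significantly worse than stable."""
    stable = error_rate(stable_errors, stable_requests)
    canary = error_rate(canary_errors, canary_requests)
    if canary <= stable * max_relative_degradation:
        return "promote"
    return "rollback"

# 5% of traffic went to the canary in this window (illustrative numbers).
print(canary_decision(stable_errors=40, stable_requests=95_000,
                      canary_errors=3,  canary_requests=5_000))   # promote
print(canary_decision(stable_errors=40, stable_requests=95_000,
                      canary_errors=60, canary_requests=5_000))   # rollback
```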
So SREs have to look at things in their entirety.
So these are the kinds of antipatterns
that I have seen, and I hope this will help
you to chart your SRE journey in
a much better way. Thank you.
And if you have any questions, you can always reach me
by email, on my Twitter handle and on LinkedIn.
And we, as DevOps India Summit
2022, have partnered with Conf42, and that
global event is coming up on 26th August 2022,
from 08:00 a.m. to 08:00 p.m. India time. It's a free registration.
Join us there as well, with more speakers
speaking on various topics. Thank you.