Abstract
In chaos engineering, we introduce intentional chaos to find blind spots where the products may fail in a production environment. We then use the obtained knowledge about those blind spots to make the products more resilient should actual chaos hit production.
But what about the overall product development lifecycle (PDLC) where unintentional chaos takes a DevOps team’s bandwidth away from making the product what it could be? Such unintentional chaos seeps into teams as silently as a cat walks into the house and creates chaos (yes, you may recall those ‘Tom & Jerry’ episodes).
That’s what I refer to as ‘chaotic cat’.
Examples of unintentional chaos: obsession with speed (time to market); conflicting priorities within a DevOps team; manual repetitive tasks (toil); lack of monitoring and observability leading to high mean time to resolve (MTTR); vendor products becoming a bottleneck in end-to-end observability; company acquisitions creating complex application dependency hierarchies; reactive firefighting; and so on.
But how does it all relate to Site Reliability Engineering (SRE)? Isn’t SRE just about SLIs, SLOs, and error budgets? Well, think again; or join this power-packed session where I explain how SRE helps you bell such chaotic cats.
You may argue that such chaos in PDLC is ‘business as usual’ and we really can’t eliminate that, and you’d be right. That’s why it is about ‘belling the cat’; not killing it! It is about becoming aware of what can create havoc if left unattended and then taking proactive actions.
We are not going to talk about SRE concepts much. We are going to focus on HOW to implement specific SRE practices that help teams grow out of routine chaos in the PDLC, and as a result enable focus on improving the products’ user experience on business-critical factors: availability, performance, and overall reliability.
In a way, we are going to talk about shifting SRE to the left! We will look at the top 3 themes that induce unintentional chaos: speed, toil, and lack of monitoring.
Transcript
Welcome, everyone. I'm going to talk about how to use SRE to address unintentional chaos in your development lifecycle. I refer to this unintentional chaos as the chaotic cat. So let's dive in, and let's start with chaos engineering. As we all understand, chaos engineering is the practice of introducing intentional chaos in production environments to identify areas and opportunities to improve resiliency. While authoring my book on chaos engineering, I wondered about unintentional chaos: the issues that seep into teams' work day in, day out, the scenarios and workflows that cause issues and take the team's bandwidth away.
What am I talking about? Let's dive into that. I'm basically talking about three things, or three themes in general. Let's go into those themes one by one. The first one is really about speed: speed to market, or time to market, or the number of deployments per day, per week, per month. What is it that we're trying to achieve with that? What number are we trying to achieve? How many deployments are good enough? Where does the buck stop?
Without a defined target number of deployments, teams are constantly chasing an undefined target. No matter how many deployments you are doing per day or per week, it is not good enough, because we have not defined what is good enough. In the absence of that, teams are under constant pressure to elevate the game to the next level. If they have been deploying, let's say, ten times a day, maybe they want to take that number to the next level and achieve fifteen or twenty deployments a day. And things will not stop there; they'll just continue with, as we call it, continuous improvement, right? They'll continuously try to improve the number of deployments per day. But is it really healthy? Is it really an efficient way of looking at continuous improvement?
Maybe not. So the next theme of unintentional chaos is dependencies. When I say dependencies, I refer to dependencies within the organization across multiple teams, or dependencies on vendor products or vendor APIs. Whether it's about launching a new feature, rolling out a new feature, or fixing a production incident, teams spend, or invest, a lot of time trying to identify the right team to get approvals from, or the right team to engage for the rollout, or trying to fix a production incident that really seems to come from the vendor product, not from their own service. It's a lot of back and forth that teams go through to ensure that the right team is engaged, or that everyone is on the same page, to fix the incident or to roll out the new feature.
And it really does take a lot of time, and it also creates unintentional chaos. It's like teams are trying to roll out a feature, but they have not received the approval, or, let's say, the upgraded API version from the vendor, and the feature is kind of stuck even though their work on the feature is done. Because of the dependencies on the vendor product, the feature cannot be rolled out to the production environment. Issues like that: that's what I'm referring to.
And the next one is measuring everything. Given the technical capabilities that we have, with a lot of tools and platforms available to us, we can measure almost everything. But the question is, should we? And if not, how do we know what not to measure? The dashboards and monitoring systems can get really complicated, really complex. But the question then is: all the panels, all the dashboards that you have put together, are they really helping, or are they just there because they need to be there? How do you know which panels are really helping, which dashboards are really helping, which data is really helping, and which is not? How do you know? Where do you draw the line between which data to capture and which data to discard? That's essentially a big question. And situations where teams are capturing almost everything that they can, just because their platforms and tools allow them to, can actually be counterproductive and can create more chaos than they solve.
So let's talk about how we get into these kinds of situations. What are the precursors, what are the triggers that land teams in these kinds of situations, and what can they do about it? One of the reasons, in my experience, is the differentiation between product and service, or the lack of it. We don't generally talk about product reliability or product reliability engineering; we talk about site reliability or service reliability. So that is the difference.
Going by the as-a-service paradigm, a service is really a running instance of a product. Traditionally, we used to buy computers; now we use computers on the cloud, and we use them as a service, meaning that we don't really own those computers, we don't really own the infrastructure on the cloud. We just use it as long as we want it, we pay for as long as we use it, and when we don't need it, we stop using it and we stop paying for it. From that perspective, a service really becomes a running instance of a product. Teams at the cloud provider are just building that product, but when we use it as customers, we use it as a service. What that means is that when a product is running in the production environment, that's where the business-critical factors like reliability, availability, and performance come into play. Now, how do we focus on those aspects while we're building the product, during the development lifecycle? So let's talk about that.
So where does SRE fit, and how does it help? SRE doesn't really have to be an operations thing. The idea really is: how do we align the core fundamentals of SRE with the product development lifecycle? How do we integrate SRE practices right into the design and development lifecycle, so that SRE doesn't have to be an afterthought?
Talking of development lifecycles, let's talk about DevOps for a minute. One of the core tenets of SRE is service level objectives, or SLOs. What this image is showing is where the SLO fits in the DevOps pipeline. If you look at the top, we have the development team working on enhancements, bug fixes, and new features, and pushing all the updates through a continuous delivery pipeline to the production environment. On the other side, we have the operations team monitoring the production environment and the IT systems, and the platform teams working with the cloud infrastructure and whatnot. Now, the service level objective fits right in the middle of the pipeline. It provides a common objective, or a common goal, to the development team as well as to the operations team.
So let's understand what the SLO is actually doing here, how it helps, and what it is all about in the first place. The SLO is really a balancing lever, or a common language, that connects the product and the service paradigms. I showed the product and the service paradigms a few slides ago. The product paradigm is all about innovation and speed: how quickly product teams can launch new features and how innovative they can be. The service paradigm is all about stability: how reliable the service is, how available the service is. So the SLO provides a common objective, a common language, that brings the product and the service paradigms together. The product team can be as innovative as they want to be, as long as they meet the service level objective. And from the service perspective, the service needs to be as reliable as required by the SLO. So it basically bridges the gap and brings a common language between the product and the service paradigms. In other terms, without going into too much detail about SLIs and SLOs, defining an SLO is really a journey that translates user expectations, or the customer journeys, or the critical business transactions, into something that can be implemented technically in the monitoring systems.
Going by the example on the slide, let's say a product team is building a login module, or the login module already exists and they are maybe trying to add, let's say, multifactor authentication to the login. If they're trying to roll out that feature, the end-user expectation really is that the login should complete successfully. And while rolling out that feature, the multifactor authentication feature, the product team really needs to ensure that at least 99% of the login requests continue to be successful. So they have some sort of margin of failure; in other terms, we call that the error budget. The product team knows that when rolling out the feature, they need to ensure that the login requests continue to work: 99% of the login requests should be successful even with the new feature rollout. And from the service perspective, the team has a target: they need to ensure that 99% of the login requests are successful. So they have a reliability target; they know how available, how reliable the service needs to be. It's really bringing a common language for the product as well as the service teams to meet, basically.
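To make the login example a bit more concrete, here is a minimal sketch of how the 99% target and its error budget could be expressed in code. The 28-day window and the traffic numbers are illustrative assumptions, not something from the talk; in practice the counts would come from your monitoring system.

```python
# Minimal sketch of the 99%-login-success SLO from the example above.
# The counts would normally come from a monitoring system; the numbers
# and the 28-day window here are illustrative assumptions.

SLO_TARGET = 0.99          # 99% of login requests should succeed
WINDOW_DAYS = 28           # assumed rolling SLO window

def login_sli(successful_logins: int, total_logins: int) -> float:
    """SLI: fraction of login requests that completed successfully."""
    return successful_logins / total_logins if total_logins else 1.0

def error_budget_remaining(successful_logins: int, total_logins: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    allowed_failures = (1 - SLO_TARGET) * total_logins
    actual_failures = total_logins - successful_logins
    return max(0.0, 1 - actual_failures / allowed_failures) if allowed_failures else 1.0

# Example: 1,000,000 logins in the window, 996,500 succeeded.
sli = login_sli(996_500, 1_000_000)                  # 0.9965 -> SLO met
budget = error_budget_remaining(996_500, 1_000_000)  # 0.65 -> 65% of budget left
print(f"SLI={sli:.4f}, error budget remaining={budget:.0%}")
```

Both the product team rolling out the multifactor feature and the service team read the same two numbers, which is the common language described above.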
So the process to define and implement an SLO is really a journey: to translate the critical business transactions, or to identify the risks to the critical business transactions, and translate that into objective numbers that can be implemented in the monitoring systems. I will not be deep diving into the process itself; that's not in the scope of this talk.
So let's continue with the next theme of unintentional chaos. The time that teams spend working out dependencies and waiting on other teams basically falls into a bigger bucket; in SRE terms, we call that bucket toil. So let's talk about toil for a minute. Toil is work, but it's a kind of work that has certain characteristics to it: it tends to be manual, repetitive, and automatable; it's tactical; it doesn't carry any long-term enduring value; and it scales linearly with service growth. Those are the characteristics mentioned under toil. Any work that teams are doing that has most of these characteristics, we call that work toil in SRE terms. And waiting on other teams, waiting on vendor updates, the inherent dependencies: in my experience, that kind of work basically falls under toil; in most cases it can be automated, and it doesn't really carry any long-term enduring value. Some examples, like I mentioned on the slide: setting up environments to reproduce a production issue, upgrading API versions manually, or basically work about work, like identifying the latency of an application or how fast users can log in, and creating quick one-pagers capturing some critical metrics. All that kind of work, in my experience, falls under toil.
So let's talk about how we can address it and release the bandwidth that teams are spending doing this work. First things first: toil cannot be eliminated, it can only be reduced. And how do we reduce it? It's basically a cultural change over a period of time, as teams start focusing more and more on automation and engineering; not just automation, more engineering. Engineering involves all parts of the SDLC, or the development lifecycle; automation may not necessarily involve all those steps, but engineering definitely does. So over a period of time, when we start focusing more and more on the engineering
efforts, toil tends to reduce.
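As a purely hypothetical illustration of that engineering mindset: the "quick one-pager of critical metrics" mentioned earlier as toil could be generated by a small script on a schedule instead of being assembled by hand. The endpoint, field names, and output format below are all made up for the sketch.

```python
# Hypothetical sketch: auto-generate the "one-pager" of critical metrics
# instead of assembling it manually each time. The endpoint and field
# names are made up for illustration.
import json
import urllib.request
from datetime import datetime, timezone

METRICS_URL = "https://metrics.example.internal/api/summary"  # assumed endpoint

def fetch_summary() -> dict:
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        return json.load(resp)

def write_one_pager(summary: dict, path: str = "one_pager.md") -> None:
    lines = [
        f"# Service health one-pager ({datetime.now(timezone.utc):%Y-%m-%d %H:%M} UTC)",
        f"- Login success rate: {summary.get('login_success_rate', 'n/a')}",
        f"- p95 login latency (ms): {summary.get('login_latency_p95_ms', 'n/a')}",
        f"- Error budget remaining: {summary.get('error_budget_remaining', 'n/a')}",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_one_pager(fetch_summary())
```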
And in the next few slides I'll show you an example where we can apply some sort of engineering mindset to address the dependency issues that I talked about at the beginning of this talk. In terms of measuring everything, on monitoring: in my experience, measuring everything is as bad as measuring nothing. So you really need to be strategic about what it is that you want to measure. The characteristic of a good monitoring system is that it is very strategic and very focused on things that really matter from the user experience perspective or from the SLO perspective.
Fine-tuning the monitoring and measurement strategies to be aligned a bit more towards the SLOs is definitely a good step. And the other thing is that we look at application monitoring and infrastructure monitoring, but I think there is an often ignored or missing aspect, which is dependency monitoring. We need to understand how an application or a service connects to other applications and services in the workflow, and identifying the downstream or upstream dependencies is definitely a good strategy to have in the monitoring space, to be able to use monitoring for production incidents, or to ensure that the time taken to resolve production incidents is minimized. So let's connect all these dots. I talked about a lot of things, so let's try to connect all those dots together and then see if things really make sense.
again, if you talk about speed now,
so with the slos being defined so we can
now define what
is a good enough number of deployments per day. So as long as
the number of deployments is not impacting,
the SLO teams can continue to deploy. So they now have a
defined target, they know where the bug stops,
right? So with the slos we
get a balancing lever, like I talked about. And with that we
can kind of put some numbers into how
fast we want to be in terms of
number of deployments and time to market. So in terms
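One possible way to make "deploy as long as the SLO is not impacted" enforceable is a small error-budget gate in the delivery pipeline. This is only a sketch under assumptions: the 10% threshold is an arbitrary example policy, and the budget value would really come from your SLO tooling rather than a hard-coded function.

```python
# Sketch of an error-budget gate a CI/CD pipeline could call before a deploy.
# Where the budget number comes from (and the 10% threshold) are assumptions.
import sys

MIN_BUDGET_REMAINING = 0.10  # assumed policy: pause feature deploys below 10% budget left

def get_error_budget_remaining() -> float:
    # In practice this would query the monitoring/SLO system;
    # hard-coded here to keep the sketch self-contained.
    return 0.42

def main() -> int:
    remaining = get_error_budget_remaining()
    if remaining < MIN_BUDGET_REMAINING:
        print(f"Error budget at {remaining:.0%}: pausing feature deploys, ship reliability fixes.")
        return 1  # non-zero exit fails the pipeline step
    print(f"Error budget at {remaining:.0%}: OK to deploy.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```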
In terms of resolving dependencies with data, this is one example I often quote. Say you were developing a gaming application and you wanted to integrate a Discord channel into that application. If you were to define the SLO, or the user journey, as users being able to connect with their friends on the Discord channel, the service would have a dependency on a vendor: on the vendor's availability, their APIs, and their performance. You can look at Discord's status data, which is publicly available on the Discord status page; this is really just an example. You can create probes to the downstream dependencies and get data from the downstream applications, and when defining the dependencies, the availability, and the SLOs for your own services, you can actually go with that data. So you don't really have to depend on, or wait on, the downstream applications; you can create some sort of probe, collect some trending data from the downstream applications, and make your decisions accordingly. Again, this is only one of the ways; it has its own pros and cons, and there are certain aspects associated with this approach, but you get the idea. The idea is really that instead of just waiting without any data, and depending on manual collaboration and things like that, we can go with data if it's available. This is one of the approaches we can consider to resolve the dependency hell.
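A minimal sketch of such a probe might look like the following. It assumes the vendor publishes a Statuspage-style JSON endpoint (the URL and field names here should be verified against the actual Discord status page before relying on them); the idea is just to collect trend data you can later use when setting SLOs for features that depend on the vendor.

```python
# Hedged sketch of a probe that records a downstream dependency's availability
# over time. The URL assumes the dependency exposes a Statuspage-style JSON
# status endpoint; verify the actual endpoint and fields before relying on it.
import csv
import json
import urllib.request
from datetime import datetime, timezone

STATUS_URL = "https://discordstatus.com/api/v2/status.json"  # assumed endpoint
HISTORY_FILE = "discord_status_history.csv"

def probe_status() -> str:
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
            payload = json.load(resp)
        # Statuspage-style indicator, e.g. "none", "minor", "major" (assumed field)
        return payload.get("status", {}).get("indicator", "unknown")
    except Exception:
        return "unreachable"

def record(indicator: str) -> None:
    with open(HISTORY_FILE, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), indicator])

if __name__ == "__main__":
    # Run this on a schedule (cron, a CI job, etc.) to build trend data you can
    # use when defining SLOs for features that depend on this vendor.
    record(probe_status())
```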
Finally, on measuring and monitoring everything: like I discussed, we can fine-tune the monitoring and measuring strategies to be more focused on SLOs. And in fact, even for new features, we can define how we measure the reliability of the new feature, define the SLOs upfront, and then bake that into the development, bake that into the coding, while the feature is being developed. That can really help us achieve a very high signal-to-noise ratio.
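As one example of SLO-focused monitoring, many teams alert on how fast the error budget is burning rather than on every raw metric. The sketch below assumes the 99% login SLO from earlier; the 14.4x threshold (roughly a 30-day budget gone in about two days) is a commonly used default, not something prescribed in the talk.

```python
# Minimal sketch of an SLO burn-rate check: alert on how fast the error
# budget is being consumed rather than on every raw metric. The 14.4x
# threshold is a commonly cited default, used here as an assumption.
SLO_TARGET = 0.99

def burn_rate(failed: int, total: int) -> float:
    """How many times faster than 'allowed' the error budget is burning."""
    if total == 0:
        return 0.0
    observed_error_rate = failed / total
    allowed_error_rate = 1 - SLO_TARGET
    return observed_error_rate / allowed_error_rate

def should_page(failed_last_hour: int, total_last_hour: int, threshold: float = 14.4) -> bool:
    return burn_rate(failed_last_hour, total_last_hour) >= threshold

# Example: 300 failed out of 2,000 logins in the last hour -> burn rate 15x -> page.
print(should_page(300, 2_000))  # True
```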
Monitoring can then be a very powerful and useful tool to ensure that production incidents take minimum time to resolve, and metrics like MTTR and MTBF really improve over a period of time with these kinds of strategies. By shifting SRE practices to the left, we can minimize the unintentional chaos in the product development lifecycle and ensure that SRE practices are baked into the design and the development, and SRE doesn't have to be an afterthought. By doing that, we can ensure that the products we build, when we deploy them into production, are reliable by design. So that's that.
Thank you so much for listening to me, and feel free to connect; we can have follow-up discussions if you need to. Thank you so much.