Transcript
Hey, everyone, I'm Leonid from StackPulse. And today I'd like to
talk to you about how chaos engineering, paired with a method
called generic mitigations, can serve as this really strong
foundation for building resilient systems, maybe even leading us
to this dream of self-healing systems. So if any of this is of
interest to you, stick around. It's going to be an interesting talk.
So, in this conversation, we will talk about what chaos engineering
is and what it usually does for us. We'll focus a bit on the resilience
of our systems and maybe define how much of it we really want, or how
much of it we really need. Then we'll also try to focus on what it
means to be proficient, to be really good and effective, at building
resilient systems. And then we'll dive into the world of generic
mitigations, figure out what they are, how they can help us, and how
this combination actually serves as a very strong foundation for what
everybody wants, or what everybody delivering services wants, which is
resilient services.
So, yeah, we are at a chaos engineering conference, so I bet everyone
here has an idea about what chaos engineering is and why we need it.
But still, to make a sort of structurally well-delivered point in this
presentation, let's remember what chaos engineering is: it is defined
as the discipline of experimenting on a certain system, where the goal
of that experimentation is to build confidence in the system's
capability to withstand turbulent conditions in production. What does
it actually mean to withstand turbulent conditions, and why is the
experimentation so important? Why can't I plan my system ahead of time
to be very resilient? So turbulence, as a word, is defined, let's say
by the Cambridge dictionary, as something that involves a lot of sudden
changes. The Merriam-Webster definition talks about irregular
commotion, right? Things that are both very sudden and unexpected in
their occurrence and can also be very radical in their impact.
And this is sort of the foundation of chaos engineering, because this
turbulence means that even if we plan for it, even if we build certain
things planning for certain variations, production will still surprise
us, right? Surprise us, again, with changes being sudden and challenges
being not exactly how we expected them to be. Let's maybe look at chaos
engineering from a different perspective and try to understand what it
does for us.
Let's say we invested the effort, built a foundation, and we start
running these experiments. Why are we doing all that? So when we are
building a software system, especially when we are building a software
service, its fault tolerance, or its resilience, is a very important
piece, because for those who consume this service, the quality of
service is a very important goal, in many cases even protected by
service level agreements between the consumer of a service and the
provider of a service.
Now, this fault tolerance or resilience gets constantly tested by
changes in infrastructure, changes in networking, changes in the
application, changes in user consumption patterns, right? All of these
can destabilize our system, and chaos engineering experiments can
simulate these things, so that we can look at the outcome of the
experiments and build our systems to be resilient to similar things,
right? Not exactly the same things we've seen in the experiments, but
similar things that will happen in production. So, a very short summary
for now: chaos engineering experiments help us surface how our system
behaves in various unexpected conditions that can occur in production.
But the end goal is not to run experiments, right? The end goal is to
achieve resilience. And in order to do that, we need to take the
deliverables, whatever these experiments surfaced, prioritize those
findings, and then invest in modifying our systems, sometimes their
architecture, sometimes the infrastructure they run on, and so on, in
order to become more resilient to the particular types of interference
that were surfaced. So the goal is resilience; chaos engineering helps
us sort of shine a spotlight on where we are not resilient enough and
prioritize our work.
How much resilience do we actually want in our products? Now, at first
this may sound like a very strange question. Imagine a situation where
you are sitting in a restaurant, ordering a dish, and the waiter asks
you: okay, dear madam or sir, how tasty, how delicious would you like
us to make this dish for you? Would you like a really delicious,
once-in-a-lifetime culinary experience? Or would you like it to be,
like, passable? I mean, decent, right? But nothing to write home about.
Naturally, each and every one of us would answer: what do you mean? I
came to this restaurant to enjoy myself. Of course I want you to cook
it to the absolute best of your culinary ability, right? Similarly,
you're about to consume a service, any kind of service: an email
service, a monitoring service, an HR service, whatever, right? How
resilient would you like it to be? Well, you want it to be very
resilient, right? Because you need it every time, and if it's not there
for you, why would you compromise for anything less than the absolute
best resilience?
Now, in theory, that sounds great. In practice, it costs those of us
who operate services to make them more resilient. And if I had to plot
improvements in resilience as a function of investment cost, the graph
would look like this. What it means is that at first, introducing an
improvement in resilience, even quite a significant one, can be done at
a bearable cost. But as we become more and more resilient, bumping the
resilience up yet again costs us more and more every time. And maybe
adding this extra bit, however you manage and measure your resilience,
in the end you may add this little resilience, but it will cost you
twice as much. And here's the deal: at the end of the day, not every
investment makes sense.
If we're building a commercial service, it has consumers, and these
people are willing to pay us something for this service, right? And
then we have our cost of operating and delivering the service. And
everything in the middle is our margin, right? That is, if we are a
commercial service provider. Still, even if we're talking about an
internal service, there is a certain point of acceptable cost that we
are willing to invest. And in reality, the amount of resilience that we
want is actually the intersection of these two, right? It's what is
acceptable both as resilience and as cost. Let's maybe look at it from
a slightly more complex but still very similar perspective. So we have
our resilience. We will just turn it into the customer-facing aspect
and call it service level. So we have a low service level, and then we
invest additional cost in improving it, and then we have a high service
level. And now there are consumers for our service. For them, the
acceptable service level may sort of vary depending on cost. But still,
on this line there are two very important points that we should all
consider. There is this point we call the minimal acceptable service
level. Anything below this point just doesn't make sense; your
consumers would not want it. And then there is a second very important
point, and it's the point where the costs of delivering a more
resilient service are actually higher than what the service consumers
would be willing to pay. Naturally, cost sensitivity may differ from
one prospective consumer to another, so there is a range here. That's
what gives rise to phenomena such as premium products, right? Where
people or organizations are willing to pay more for something that is
more resilient, more luxurious. But still, no matter how the model
looks, the conclusions will be the same. So there is this point:
naturally, for us to operate with a margin, we need to be somewhere
within this area, right? Because this way, a certain number of
consumers, customers, are willing to acquire a service with this
service level, we are operating it for costs that are lower than that,
and the middle is what we make. So in essence, this is the area where
we would end up operating our service, depending on different
properties. This is, in essence, how much resilience we want.
What does it mean to be proficient in building resilience? Because we
would like to be efficient, we would like to be proficient in whatever
we are building. Here's a very important point that may differentiate
between a successful and proficient service provider and a less
successful one. When talking about resilience, we define proficiency as
the ability to deliver more resilient services for less cost. Now, how
do we do that? As a matter of fact, there are many ways to do that. In
fact, usually it would not be a single strategy that would
differentiate, let's say in two identical conditions, between someone
delivering a certain resilience level or service level for cost A and
somebody else delivering exactly the same resilience level for a cost
that is higher. Usually it needs to be a combination of different tools
and different methodologies, chosen in a smart way, each one applied to
a specific problem. And the combination would actually deliver a very
proficient solution, where achieving a certain resilience level is done
at a reduced cost. If I'm proficient, I can build a solution that is
very resilient for this much; if somebody is less proficient, the same
level will cost them this much. To sum up this part so far: as we
already established, our end goal is to improve the resilience of our
services, right? To keep our customers happy, to keep ourselves within
our service level agreements and service level objectives, and to
enable our digital services. In order to achieve better results in
this, we need to be familiar with different methods, different
technologies, different methodologies, right? And this way we will be
able to combine and build the right combination of different approaches
to resilience, one that will be right for us and for our particular
use case.
So far about resilience. Now, generic mitigations: what are they, and
how are they even related? So, a mitigation is defined as any action
taken to prevent a certain impact of an outage or breakage on our
service, on our system, in production. Applying an emergency hotfix to
production because something has broken down, that's a mitigation.
Connecting to a production machine to clean something up, to restart
something, as bad as it may sound, is also a mitigation, right? A
generic mitigation is a mitigation action, any action, not just the two
examples I previously mentioned, that can be applied to a wide variety
of outages. As a matter of fact, sometimes it is applied when an outage
happens, before the source of the outage is fully analyzed and
identified. Wait, whoa, whoa, whoa. Am I talking about bad software
engineering? Am I trying to convince you to apply band-aids, to sweep
the real problems in your software under the rug? I don't know,
something bloats in memory? Well, let's restart it once a day and
nobody will notice. Something fills up some partition? Well, let's
clean it once a day or once a week, and again, maybe nobody will
notice. No, this is definitely not what I'm talking about. Generic
mitigations is a concept that was born in organizations that are the
world's most advanced in terms of software architecture and operations
quality, organizations such as Google and Facebook.
So actually, in order to explain the generic mitigation concept, let's
look at a typical timeline of an outage, something where a certain
problem happens to your system or service. How does it go? Naturally,
it begins with a source, right? Something bad happens. Then, hopefully,
a monitoring system identifies a certain symptom of the bad thing that
has happened and raises an alert, maybe more than one. Then something
or someone looks at these alerts and tries to understand the context,
to figure out the exact impact and boundaries. Then potentially a
triage occurs, right? Because remember, I mentioned that alerts
sometimes show us symptoms of a problem. So triage would try to figure
out: okay, these are the symptoms, but what is the actual problem? Then
we would usually perform root cause analysis to understand what caused
the problem. We implement, test, and review the fix, and we deploy it
to production. This box has a different color, because this is where
the impact on the users of our system is; this is where it ends, right?
This whole thing takes time, right? How much time? Well, you know what,
it really varies. We've seen outages being resolved within mere
minutes. And unfortunately, we've all seen outages, again even at the
world's leading digital services, that take hours, I'm afraid to say,
sometimes even days. How about considering an alternative timeline
where, sometime at the triage stage, when we start understanding where
the real problem is, instead of diving into analyzing it further, we
apply a mitigation strategy that restores the service to operation, and
only then we perform root cause analysis, develop a fix, and deploy it
to production. As you can see, there is a time difference here in the
outage, right? This is our gain in our service level objectives. This
is the gain of our service users, because the service becomes
operational earlier.
How much earlier? Well, you know what, let's take just a small piece of
this whole chain: the time it takes for a certain fix, once it's
implemented and verified, to be deployed to production. As recently as
a couple of weeks ago, there was a discussion thread on Twitter between
leading practitioners: how quickly should our code be delivered to
production in very modern, continuous deployment, progressive delivery
environments for it to be considered good enough? And the consensus
there was that anything sort of around or within the boundary of 15
minutes, one-five, is considered to be good. So just that small piece
is 15 minutes of an outage, maybe complete, maybe partial, of your
service to its customers. Let's not even talk about how much time it
can take you to understand the root cause, to implement the fix, to
verify and review the fix, to make sure that it's the correct fix, and
that's if the right people to do the analysis and the fix and the
review are currently available. Or maybe they're not. Or maybe it's the
middle of the night, right? This gain is absolutely critical, and as
you can clearly see, it doesn't come at the expense of understanding
what went wrong, fixing it, and making sure that it never ever happens
again, not only that exact thing, but anything like it. It is all about
having a set of tools that allows us to return our service to
production earlier. That's what the purpose of generic mitigations is.
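To make the arithmetic of that gain concrete, here is a tiny sketch. The stage durations are purely hypothetical numbers I'm making up for illustration, not measurements; the only real anchor is the 15-minute deploy figure mentioned above.

```python
# Purely illustrative numbers (minutes) -- just to show the shape of the gain.
fix_first = {
    "detect_and_alert": 5,
    "triage": 15,
    "root_cause_analysis": 60,
    "implement_review_fix": 90,
    "deploy_to_production": 15,   # the "good enough" deploy time mentioned above
}

mitigate_first = {
    "detect_and_alert": 5,
    "triage": 15,
    "apply_generic_mitigation": 10,  # e.g. rollback / drain / quarantine
}

outage_fix_first = sum(fix_first.values())
outage_mitigated = sum(mitigate_first.values())
print(f"user-visible outage, fix-first:      {outage_fix_first} min")
print(f"user-visible outage, mitigate-first: {outage_mitigated} min")
print(f"gain: {outage_fix_first - outage_mitigated} min")
# RCA, fix, review and deploy still happen afterwards -- just outside the outage window.
```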
Interested? Now you would like to ask me for examples of generic
mitigations. Well, I'm glad you asked. Let's look at a couple of
patterns to sort of explain what I'm talking about here. Rollback.
Rollback of any kind: of business logic, of the binary executing it, of
a configuration change, of data state, rollback to the last known
working state. Some people may say, well yeah, rollback is very simple,
of course we support rollback, I mean we supported rolling out the
update in the first place. But unfortunately it is not as simple as it
may sound. In a multi-component system with dependencies, with data
schema changes, being confident enough to perform a multi-component
rollback, testing it from time to time, being able to run it in
production without hesitation to return the system to a solid state,
that's not simple. That requires preparation, that requires thought,
that requires testing.
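As a sketch of what "prepared and rehearsed" could look like, here is a minimal example. It assumes Kubernetes Deployments and kubectl on the PATH; the component names and the dependency order are hypothetical, and the hard parts (data and schema rollback) are deliberately left out.

```python
# A minimal sketch of a multi-component rollback "playbook", assuming Kubernetes
# Deployments and kubectl. Component names and rollback order are hypothetical.
import subprocess

# Roll back dependents before their dependencies (reverse of the rollout order).
ROLLBACK_ORDER = ["frontend", "orders-api", "billing-worker"]

def rollback(deployment: str, namespace: str = "prod") -> None:
    # Revert to the previous ReplicaSet revision (last known working state).
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    # Block until the rollback has converged before touching the next component.
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}", "-n", namespace,
         "--timeout=120s"],
        check=True,
    )

if __name__ == "__main__":
    for component in ROLLBACK_ORDER:
        rollback(component)
    # Data/schema rollback is intentionally not shown -- that is the hard part the
    # talk is pointing at, and it needs its own rehearsed procedure.
```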
Let's look at a different generic mitigation pattern.
Upsizing or downsizing. More and more systems that are built on
cloud-native architectures support horizontal or vertical auto scaling,
usually within certain boundaries. Sometimes, especially when we're
talking about scaling things down, human intervention is very much
desired. And again, scaling up a single stateless component, if you're
using a modern orchestrator, probably sounds not that complicated. But
scaling out without taking into consideration the relationships between
different components may actually introduce more noise and just shift
the problem elsewhere in your architecture. This is where a strategy
for how you scale things up and how you scale things down should be
well thought out, well rehearsed, and, of course, applied when needed.
So again, not as simple as just launching a couple more pod replicas.
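To illustrate, here is a minimal sketch of a scale-up mitigation with an explicit ceiling. It assumes Kubernetes and kubectl; the deployment name, namespace, scaling factor, and replica ceiling are hypothetical placeholders.

```python
# A minimal sketch of an "upsize" mitigation with guardrails, assuming Kubernetes
# and kubectl. Names and limits are hypothetical.
import json
import subprocess

MAX_REPLICAS = 20  # hard ceiling so the mitigation cannot scale without bound

def current_replicas(deployment: str, namespace: str) -> int:
    out = subprocess.run(
        ["kubectl", "get", f"deployment/{deployment}", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(out.stdout)["spec"]["replicas"]

def scale_up(deployment: str, namespace: str = "prod", factor: float = 1.5) -> int:
    replicas = current_replicas(deployment, namespace)
    target = min(MAX_REPLICAS, max(replicas + 1, int(replicas * factor)))
    subprocess.run(
        ["kubectl", "scale", f"deployment/{deployment}", "-n", namespace,
         f"--replicas={target}"],
        check=True,
    )
    return target

# Scaling one stateless tier is the easy part; the strategy also has to say which
# dependencies (databases, queues, downstream quotas) must be checked before doing this.
```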
Draining traffic from certain instances and then flipping it over to a
different cluster, a different member, a different region: again,
something that is a great tool when delivering multi-region,
multi-location systems. This flip of traffic needs to be managed so
that there is as little impact as possible on the service users. The
smallest amount of impact possible, of course, is zero, and that's the
desired one; that's what they would want you to do. Let's hope that we
can do it. Again, rehearsing that, making sure that it is operational,
so that the least technical person in your organization, in the middle
of the night, will be able to execute it. A strategy that really needs
to be thought through.
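One common way to implement such a flip is with weighted DNS. The sketch below is only illustrative: it assumes Route 53 weighted records already exist for each region (using boto3), and the hosted zone ID, record name, and addresses are placeholders.

```python
# A minimal sketch of a traffic-flip mitigation using weighted DNS records.
# Zone ID, record name and IPs are hypothetical placeholders.
import boto3

route53 = boto3.client("route53")

def set_region_weight(region_id: str, ip: str, weight: int) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="ZHYPOTHETICAL",
        ChangeBatch={
            "Comment": f"generic mitigation: weight {region_id} -> {weight}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "A",
                    "SetIdentifier": region_id,   # one weighted record per region
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                },
            }],
        },
    )

# Drain the troubled region, send everything to the healthy one.
set_region_weight("eu-west-1", "203.0.113.10", 0)
set_region_weight("us-east-1", "198.51.100.20", 100)
# DNS TTLs mean the drain is gradual -- which is exactly why this needs rehearsing.
```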
Is that all? No, of course not. There are many, many more. Let's give a
couple more examples, just to open our eyes to the various
possibilities. Quarantining a certain instance, a certain binary, a
certain cluster member, in a way that removes it from the rotation so
it stops handling production traffic, which gets rebalanced among the
other healthy peers, and then investigating the root cause of the
problem on this particular instance.
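As one possible sketch of quarantining, assuming Kubernetes: removing the label that the Service selects on takes the pod out of rotation while leaving it running for investigation. The pod name, namespace, and label key below are hypothetical.

```python
# A minimal sketch of quarantining one instance by taking it out of Service rotation.
import subprocess

def quarantine_pod(pod: str, namespace: str = "prod", selector_label: str = "app") -> None:
    # Removing the selector label ("app-") drops the pod from the Service endpoints;
    # traffic is rebalanced to the remaining healthy peers. The owning ReplicaSet will
    # typically spin up a fresh replacement, leaving this pod orphaned for analysis.
    subprocess.run(
        ["kubectl", "label", "pod", pod, f"{selector_label}-", "-n", namespace],
        check=True,
    )
    # Mark it so humans (and cleanup jobs) know why this pod is still running.
    subprocess.run(
        ["kubectl", "label", "pod", pod, "quarantine=true", "-n", namespace],
        check=True,
    )

quarantine_pod("orders-api-7d9f8b-x2k4q")  # hypothetical pod name
```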
A block list: being able to block a specific user, a specific account,
or a session that creates a challenging, problematic series of
requests, queries, et cetera. Being able to do this in real time. Being
able to do this granularly. Being able not just to block or unblock it,
black and white, but maybe actually introduce guardrails or quotas on
it. Preparing a strategy for that could again be a lifesaver in
production situations, if it is mature, if it is well tested, if it is,
again, applicable by the least technical member of your staff in the
middle of the night if needed.
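Here is a minimal, framework-agnostic sketch of what "not just black and white" can mean: a block list combined with per-account quotas. The account IDs and limits are made up, and a real implementation would keep this state in shared storage rather than in process memory.

```python
# A minimal sketch of a block list with quotas instead of a pure on/off block.
import time
from collections import defaultdict

BLOCKED: set[str] = set()                    # hard block, to be used sparingly
QUOTAS: dict[str, int] = {"acct-42": 10}     # requests per minute for throttled accounts
_window: dict[str, list[float]] = defaultdict(list)

def allow_request(account_id: str) -> bool:
    if account_id in BLOCKED:
        return False
    limit = QUOTAS.get(account_id)
    if limit is None:
        return True                          # neither blocked nor throttled
    now = time.time()
    recent = [t for t in _window[account_id] if now - t < 60]
    _window[account_id] = recent
    if len(recent) >= limit:
        return False                         # over quota for this minute
    recent.append(now)
    return True

# During an incident an operator (or a runbook) flips these in real time:
QUOTAS["acct-noisy"] = 5      # throttle first...
# BLOCKED.add("acct-noisy")   # ...block only if throttling is not enough
```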
Disabling a noisy neighbor: imagine a shared resource, a database, for
example. Extreme pressure from one set of components may impact the
ability to operate another set of components. Identifying the source of
the noise and, again, imposing guardrails or quotas on it, or maybe
pausing it for a certain period of time to relieve more critical
processes, that's a strategy. Thinking about it, thinking about how to
make it repeatable, how to make it usable in real time. This is
definitely a generic mitigation pattern that many world-leading
organizations use today.
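As an illustration, here is a minimal sketch of throttling and pausing a noisy workload on a shared PostgreSQL database using psycopg2; the role name, timeout, and connection string are hypothetical.

```python
# A minimal sketch of the "noisy neighbor" mitigation against a shared PostgreSQL
# database. The goal is to throttle or pause the noisy workload, not to fix it here.
import psycopg2

NOISY_ROLE = "batch_reports"   # hypothetical: the component generating the pressure

conn = psycopg2.connect("dbname=prod")  # connection string is a placeholder
conn.autocommit = True
with conn.cursor() as cur:
    # Guardrail: cap how long any single statement from the noisy role may run.
    # (Role names cannot be bound as parameters, hence the f-string.)
    cur.execute(f"ALTER ROLE {NOISY_ROLE} SET statement_timeout = '5s'")

    # Pause: cancel the queries it is currently running so critical workloads recover.
    cur.execute(
        "SELECT pg_cancel_backend(pid) FROM pg_stat_activity "
        "WHERE usename = %s AND state = 'active'",
        (NOISY_ROLE,),
    )
conn.close()
# Root cause analysis of *why* this workload misbehaved still happens afterwards.
```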
So, to sum up this part: generic mitigations are not a practice of
applying patches and band-aids to production and sweeping the real
problems in your products under the rug. No, it is a practice of
building strategies, and then tools, for improving your ability to meet
your own service level objectives and to get your service back to an
operational state faster, without compromising on root cause analysis,
the quality of your code, or the management of your technical debt. It
definitely means building them, testing them, and keeping them in a
warm state so that you're not afraid to use them. This is a very
important tool in the toolbox of being proficient at building resilient
systems. A very, very important notion.
So how do the two connect: chaos engineering on one end, generic
mitigations on the other? And when they are used together, does the end
result become greater than just the impact of each of them on its own,
or not? So, chaos engineering can recreate unexpected, irregular,
turbulent conditions similar to those that we will encounter in
production, and sort of prepare us for the production challenges.
Generic mitigations are a very important tool for meeting those
production challenges and keeping our service level objectives, while
keeping our expenses on reliability at bay. How does the combination of
the two actually provide greater value than each of them alone? Here's
how. So, generic mitigations:
using them, we prepare ourselves for keeping the service operational
under unexpected conditions. And how do we know that the investment
here was made prudently? Because we test with chaos engineering
experiments, and we indeed see that now different kinds of experiments
get remediated by the generic mitigations. Furthermore, the chaos
engineering experiments surface points where we are still not ready for
production fault tolerance. And then it's a continuous cycle of us
strengthening those areas, again potentially with generic mitigations.
And together, it really helps us manage our investment.
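To make that cycle concrete, here is a minimal sketch of verifying a mitigation with a chaos experiment. Every function in it is a hypothetical placeholder for whatever chaos, mitigation, and monitoring tooling you use; the point is the shape of the loop, not any specific API.

```python
# A minimal sketch of the feedback loop: inject a fault, check that a generic
# mitigation restored the SLO in time. All functions are placeholders.
import time

def run_chaos_experiment(name: str) -> None: print(f"injecting fault: {name}")
def stop_chaos_experiment(name: str) -> None: print(f"stopping fault: {name}")
def mitigation_fired() -> bool: return True    # placeholder: query your mitigation runner
def slo_healthy() -> bool: return True         # placeholder: query your SLO monitoring

def verify_mitigation(experiment: str, timeout_s: int = 300) -> bool:
    """Inject the fault, then check that a mitigation restored the SLO in time."""
    run_chaos_experiment(experiment)
    deadline = time.time() + timeout_s
    try:
        while time.time() < deadline:
            if mitigation_fired() and slo_healthy():
                return True          # investment verified: this failure mode is covered
            time.sleep(10)
        return False                 # gap found: prioritize this area next
    finally:
        stop_chaos_experiment(experiment)

# Running this for each experiment in the suite is what turns "we wrote mitigations"
# into "we have evidence they keep the service within its objectives".
print(verify_mitigation("kill-one-zone"))
```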
Right? Remember this cost of delivering a very resilient service: how
do we do it proficiently? Indeed, by surfacing where our biggest
problems are, by providing cost-effective tools to resolve these
problems, by immediately verifying that these tools have indeed
resolved these problems, and furthermore, by building a cycle that
surfaces, in prioritized order, our next investment, and so on and so
forth. This is where the combination of the two is extremely powerful.
Now, what would happen if we just invested in chaos engineering? Well,
we would still need to resolve the problems that the experiments
surfaced. If we don't have a rich tool set for dealing with them, we
will end up investing quite a lot in re-architecture and many, many
other expensive things, where that may not be the only or even the best
solution for the problem. If we invest just in generic mitigations, our
ability to test them in real life is very restricted. I mean, sure, we
can create surrogate scenarios where they would be called into action,
of course, but then again, our confidence in the ability to withstand
real production turbulence will be much, much lower, and it will be
more difficult, not impossible, but still much more difficult, to
figure out where we start: which particular mechanism will give us the
highest return on investment in terms of raising our level of
resilience for the cost, and then the second highest, and so on and so
forth. This is actually where the combination of the two is really
helpful, as a continuous improvement process for the resilience of our
software services.
Well, this is pretty much what I wanted to cover in this talk. Some
afterthoughts, and maybe some suggested action items if you're
considering doing something about it. So what do you need in order to
develop and use generic mitigations? First, you need a platform for
developing these logical flows. You may ask: wait, what's wrong with
just using, I don't know, the programming languages I use today, and so
on? And there is nothing wrong. Still, you need to verify that you're
using the right tool for the right purpose. And these mitigations, you
should be able to make them modular. You should be able to write them
once and then reuse them in different conditions, with different
strategies, inside different environments. You should be able to share
them between users of similar components that face similar challenges.
Just like any software, they need to undergo a software development
lifecycle: you need to version them, you need to review them, you need
to test them, et cetera. So thinking about the right way to develop
these flows is very important.
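As a purely illustrative sketch, with no real product API implied, here is one way "modular, reusable flows" might be expressed: small steps written once, then composed, versioned, and shared across flows. All names here are hypothetical.

```python
# A minimal sketch of modular mitigation flows: reusable steps composed per scenario.
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class MitigationFlow:
    name: str
    version: str    # flows go through a normal SDLC: versioned, reviewed, tested
    steps: list[Callable[[Dict], None]] = field(default_factory=list)

    def run(self, context: Dict) -> None:
        for step in self.steps:
            step(context)            # each step is a small, shareable building block

# Reusable steps (placeholders -- real ones would call your infra APIs).
def drain_traffic(ctx: Dict) -> None: print(f"draining {ctx['region']}")
def scale_up(ctx: Dict) -> None: print(f"scaling {ctx['service']} up")
def notify_oncall(ctx: Dict) -> None: print("paging on-call with context")

# The same steps composed differently per environment / per failure mode.
region_flip = MitigationFlow("region-flip", "1.2.0", [drain_traffic, notify_oncall])
load_spike = MitigationFlow("load-spike", "0.4.1", [scale_up, notify_oncall])

region_flip.run({"region": "eu-west-1"})
```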
Secondly, you also need a platform that will trigger, or, in a broader
sense, orchestrate these generic mitigations, monitor how successfully
they are applied, involve humans in the loop if required, et cetera. It
is very important that the infrastructure for orchestrating these
mitigations is separate from your production infrastructure. Otherwise,
when that production infrastructure is impacted, you will not be able
to use it to run the mitigations themselves, right? This gives rise to
things such as the observer cluster pattern; that may be a good
discussion for another time.
Visibility: every time these mitigations are invoked, it is of utmost
importance to be able to analyze what exactly is happening, how
successful their application was, and so on and so forth, and, again,
to collect and process data for learning. So if you're seriously
considering going into generic mitigations, these are the main fields I
would recommend looking into. Similarly, when talking about chaos
engineering experiments: what do I need to be able to perform them
successfully and have them run efficiently in my environment? Very
similar things. So I need a platform for injecting these chaos
variations into various layers of my architecture. And a good set of
experiments should actually target different layers: on the
infrastructure level, on the network level, on the user simulation
level. Again, because this is the set of different directions from
which the variance in my real production environment would come, and
therefore my experiments need to be, hopefully, close to that.
Secondly, a platform for conducting experiments in a responsible manner
in my environments, something that comes with a lot of guardrails and
an ability to contain, maybe to compartmentalize, the experiment,
right? Extremely important: data collection, learning, and the ability
to stop the experiment at any given point where things have too much of
an impact, if containment has failed.
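As a rough sketch of what "guardrails and the ability to stop" can mean, here is an illustrative experiment runner with a contained blast radius and an automatic abort; the injection and metric functions are placeholders for whatever chaos and observability tooling you actually use.

```python
# A minimal sketch of running an experiment "responsibly": contained blast radius,
# an error-budget guardrail, and an automatic abort. All hooks are placeholders.
import time

BLAST_RADIUS = {"namespace": "prod", "service": "orders-api", "percent_of_pods": 10}
MAX_ERROR_RATE = 0.02          # abort threshold: 2% of requests failing

def inject_latency(target: dict, ms: int) -> None:
    print(f"injecting {ms}ms latency into {target['service']}")

def remove_injection(target: dict) -> None:
    print("removing injection")

def current_error_rate(service: str) -> float:
    return 0.0                 # placeholder: wire this to your metrics backend

def run_experiment(duration_s: int = 600) -> str:
    inject_latency(BLAST_RADIUS, ms=300)
    deadline = time.time() + duration_s
    try:
        while time.time() < deadline:
            if current_error_rate(BLAST_RADIUS["service"]) > MAX_ERROR_RATE:
                return "aborted"   # containment failed; stop before users really feel it
            time.sleep(5)
        return "completed"
    finally:
        remove_injection(BLAST_RADIUS)  # always clean up, even on abort or crash

print(run_experiment(duration_s=30))
```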
Picking the right platform here is, again, extremely critical to
success and to getting a return on your investment in chaos
engineering. As sort of a final afterthought, I'd like to leave you
with this: if what I've been telling you makes sense, and if you are
thinking of applying the combination of these two methodologies, how
would a platform, or a methodology, for continuously applying that
combination look? Just a thought.
So, to sum up, I'd like to thank you so much for listening in. I would
like to hope that what we've seen here makes sense. Maybe you have
picked up certain ideas; maybe you'll be able to implement some of them
in your environment. For further reading, there is a growing amount of
information about generic mitigations, and there is a fair amount of
information about chaos engineering and various ways to apply it. I'm
Leonid from StackPulse. Thank you so much for tuning in.