Transcript
This transcript was autogenerated.
Hello, everybody. My name is Mikolaj
Pawlikowski, and I would like to thank you for coming
to this talk. I'm going to talk to you about something that
really excites me. I'm going to talk about chaos engineering,
and in particular how it overlaps with sres
and why sres really should love chaos
engineering to begin with. Here's the
plan. For the next half an hour or so, like any good
scientist, we're going to start with defining our terms.
We're going to see what SRE is, we're going to see what
chaos engineering is, and then we're going to focus
on the actual overlap where the two come together
to create an extra value. I'm also going to
use this opportunity to talk about my latest book, Chaos Engineering: Crash Test Your Applications, and I'm going to finish with two demos to kind of illustrate what I really mean and kind of show you in practice where chaos engineering can really help SREs. Sounds like
a plan. Well, hopefully. So let's start with SRE, the site
reliability engineering. I'm pretty sure
that some of you, at least, are familiar with that. It's a
concept, a term that was coined at Google.
And in one sentence, it's basically what you end up with if you give the operations work to software engineers. Right? So the way that I like to think about it is that
if you, on a Friday evening, go to an SRE
pub, assuming that kind of place exists,
imagine what you're going to hear about. You're probably going
to hear things like reliability, like performance, like latency,
like availability. You might hear someone in the corner ranting about on call and alerting and monitoring. You're probably going to hear someone talking about business objectives and SLOs and SLAs and stuff like that. You might even hear someone mention latency, and that's all good. That basically gives you a good idea of what these people actually care about. If this is what you talk about on a Friday evening, that means you must deeply care about it, and this is kind of the kernel of what SRE really is. These are people who deeply care about making sure that things
run smoothly, right? So it's one
thing to kind of write software and features,
and it's kind of like a different
problem to actually run that and run it well
and at scale and without hiccups,
right? And this is where we get to the core
of what SRE is, right? So all these
things that I mentioned in the virtual pub, they all kind
of have to do with reliability. Reliability is
kind of this encompassing term that is
a little bit vague and kind of can involve
pretty much all of those words here, right? So it
might be that your system is performance based. It might be that, for example, the value that your system provides depends on how many operations per second it can produce, in which case reliability will mean making sure that that performance can be sustained, and sustained long term.
Or it might be that it's much more important
for your system to be highly available because if
users can't use it for even a small fraction of time,
they're going to be much more upset than if
the performance is degraded. So that might be your
reality of reliability. It might be latency.
It might be that people are browsing funny videos and they really care about latency, and if the video is too slow, they're going to go and spend their money elsewhere, or watch the funny videos elsewhere. So you can measure that latency, and this is your reality of reliability: to make sure that that latency is maintained.
Right. Then you have all the monitoring, alerting, on call.
These are the tools that give us visibility into what's actually going on under the hood in the systems and alert us when our attention is needed. The dreaded on call and the paging, where people sometimes have to wake up to fix something: these are the people who do that. These are the people who set up the alerting and set up the monitoring so that they know that something's wrong. Right. The capacity planning,
that is also part of your reliability. It's not
the most glamorous. Hardly anyone likes to talk about a massive Excel file when they try to predict how many resources they need. But it's essential. Someone needs to do that. Someone needs to think, okay, if we're going to grow by this many users in the next quarter, we're going to need more disks and more CPUs and more RAM and more X and Y, right? This is part of reliability. Then there are the SLOs, right? You've probably heard of SLAs. You might not have heard of SLIs and SLOs. SLI stands for service
level indicator, and it can be
any quantity that's measurable that you
care about in terms of verifying that
your system runs well. So let's say that, for example,
you run some kind of API. It's an API where
people can send a request and they get the
right meme. Right, your meme API. And the SLI that you might care about is, for example, the speed of the response; let's say that the 99th percentile of the response time is something that you care about, right? With that you can design an SLO, which is an objective, a service level objective, which is basically a range of a particular SLI that you care about, like the mentioned 99th percentile of response time. You might want it to stay between 100 and 200 milliseconds, for example. Right. That's an objective. And then an SLA
is an agreement. It's basically a contract between
two parties, where one of the parties provides some kind of
service and they promise that this SLO
is going to be satisfied or else,
and typically the or else is some kind of financial
compensation or something like that. It can be a cool t-shirt, it can be anything really. But the idea is that we're
going to do whatever we can to make sure
that that SLO is satisfied. And if it's not, we're going to
make it up to you somehow. Right. So this really
is kind of an example, a sample of what SRE is. And some of you will be like, okay, well, that kind of sounds like operations. And yeah, bingo. That's basically what it is. It's operations, just with more software engineering applied to it. The software engineering
being used to remove the bad bits, to kind
of automate the bad bits away.
And the bad bits have a really funky name, a nice name
of toil. Right? We typically speak about
toil when we talk about SRE. So the deal
for the SREs is that they're going to spend
less of their time dealing with the actual operations,
the on call, the crisis management, and spend most of their time using their software engineering skills to roll out automation, to automate the toil out of the equation. Right.
So they're going to write software to automate things
so that there is less ops and on call to
do to begin with. Right.
That said, there is a lot of hype
about that, but at the core of it, this is what it is.
A lot of the systems are pretty big and pretty amazing and pretty
everything. But you don't have to be at Google to be an SRE.
And there are plenty of systems that need the same kind of treatment to verify that they run smooth as silk. Right.
Okay, so we've kind of defined SRE. What about chaos engineering? I'm guessing that some of you probably first came into contact with chaos engineering in the context of Chaos Monkey. And that's great, that's fantastic, but it also kind of gives chaos engineering a bad rep. Chaos Monkey was randomly taking down VMs for Netflix, so that they could detect things that they didn't detect with other testing techniques. But if you google that, you're probably going to end up with something along the lines of "let's break things in production" and "breaking things on purpose". And that's great. But the kind of
breaking things is not really what we're after here.
What we're after is experimenting
to verify hypotheses about our system's behavior in the presence of failure. So yes, we want to inject failure, introduce the kind of failure that we reasonably expect to happen. But we don't really try to break the system. We actually try to learn that either the system behaves the way that we expect, that's the hypothesis, or it doesn't, and we learn that it needs to be fixed. Okay, so at the core of it, it's much more scientific than just going and randomly smashing things in production. We actually want to very finely control the amount of failure and the type of failure that we inject, most of the time to verify that what we think is going to happen is actually happening, right. That we are right in thinking about the properties of the system. And this is really where the value comes from. And sure, there is the
aspect, kind of on the verge of chaos engineering and fuzzing techniques, where you want to create these kind of half random, pseudorandom situations where you can end up with combinations that you didn't think about yourself to test out, but this is just a part of it, right? So next time you see "oh, let's randomly smash things in production", that's not really what it is about, or at least it's not really what it's about for everybody.
If this is a good idea for you, if your
production system is of a nature
that allows you to do things like that, that's great.
That's absolutely stunning. If you are at the maturity level where you can actually run this kind of thing in production, that's great. Because if you think about it, you can never really test anything 100% before it hits production, right? Because you can try very hard to
reproduce the kind of failure that you expect. You can try
very hard to reproduce the same environment in
some kind of test stage, dev stage,
pre-production stage. But at the end of the day, there will be things that are different. It might vary very slightly, it might just be a user pattern, but it's technically either impossible or prohibitively expensive to actually do something like that, right. So if you can, this is like the holy grail: when you run these kinds of things in production, you can verify and uncover real problems on a real production system. But it also gives it a bad rep, because when people read about that, they stop taking it seriously. It's like, well, yeah, okay, cool, we would never do that here. So I just kind of want to remind you that this is the case.
So now we have these two concepts.
We have SRE on the left hand side and we have chaos engineering on the right hand side. And I would like to argue, and spend the rest of this time that we have together to show you, that where the real magic happens, where the love happens, is when you start using chaos engineering for your SRE purposes. Okay, I'm not just saying that, I deeply believe that. In fact, I believe so deeply in it that I wrote a book about it. It's called Chaos Engineering: Crash Test Your Applications. It's available right now in early access from Manning; if you go to manning.com, look for chaos engineering. And it's trying to show you that
you don't need a massive distributed system. And if
you have one, that's great. But you can apply chaos engineering techniques to pretty much any system. It can be as simple as a single process running on a single computer. I have a chapter there that shows you how you can treat that single process and that single computer as a system and verify things: for example, block some system calls and verify that this process as a system might not behave the way that you expect it to behave. It might not have the error handling or the retries that you expected it to have.
It might actually work differently. And then it builds up from the small examples. It looks into how to introduce failures between components, how to introduce slowness between components at the networking level. It talks about introducing failure through modifying code on the fly: if you happen to be running something that executes in a JVM, there is a chapter where you can learn how to inject bytecode into your classes without actually modifying the source code, so you can take someone else's code, inject the type of failure that you expect, and then verify that the system as a whole behaves in the manner that you expected. And then it builds up all the way to things like Docker, where you test out things running in Docker, or you test out Docker itself, and Kubernetes, if you have larger systems that run distributed, and kind of anything in between. Or even chaos engineering testing the kind of failure that you expect in your front-end JavaScript. It's really a great tool, and I really want to show people that this is something that you can use in many situations, and it's not just for Netflix and for Google. This is much more broad than that. Okay,
so these are the things that we just discussed.
And as you can see, apart from the snarky t-shirts, which don't really need an improvement, typically SREs have them on point, all of these things can be helped with the use of chaos engineering. All of these things can have experiments designed for them and can be verified through those experiments, verified in terms of the hypotheses and assumptions that we have about them. Okay, so just to show you that the overlap is big, I kind of felt like the previous slide was showing the overlap as a little bit tiny, so I kind of zoomed in here. And I would like to now show you a little bit more in practice what I actually mean when I say that chaos engineering can be leveraged for SRE. So let's jump
to my first demo. What you're seeing here is the
VM that comes with my book.
It's more or less vanilla Ubuntu with all the things that you need for chaos engineering preinstalled. I'm going to use one of the examples that I have here. This is coming from a chapter on Docker, and it's a descriptor for Docker Swarm, or docker stack, that describes two services. One of them is called ghost and it basically just runs the image for Ghost and then provides some configuration for the database, and the database itself, which is MySQL 5.7 with a not particularly safe password. So what it does is basically start these two containers with the configuration to point to each other, so that we can run Ghost.
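A minimal stack file along these lines would do the trick; the exact file in the book differs, and the stack name, password, and port mapping here are just illustrative. The Ghost image is configured through its database__* environment variables and listens on port 2368 internally:

    version: "3"
    services:
      ghost:
        image: ghost:3
        environment:
          database__client: mysql
          database__connection__host: db
          database__connection__user: root
          database__connection__password: notverysafe
          database__connection__database: ghost
        ports:
          - "8080:2368"
      db:
        image: mysql:5.7
        environment:
          MYSQL_ROOT_PASSWORD: notverysafe

Deployed with something like docker stack deploy -c meower.yml meower, it produces the meower_ghost and meower_db names used below.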
And if you take a look, I actually already have it running. I've already run the docker stack, and you can see my Ghost container and my MySQL container. If you're not familiar with Ghost, it's a blogging engine. It's a little bit like WordPress, but just a little bit more modern. Also note the names of the containers: one of them starts with meower_ghost and the other starts with meower_db, because we're going to use these names later on. So the first
thing we should do is actually verify that this thing is working. So we should be able to go to 127.0.0.1 on port 8080 and we should be seeing the application. And boom, looks like it's actually working.
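If you'd rather check from a terminal than a browser, something like this works, assuming the same 8080 port mapping:

    curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080/

and it should print 200.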
Okay, so this is great. We get something,
but what we actually care about is some kind of SLI, some kind of metric that we care about and that we want to make sure we satisfy, because that's what we do as SREs.
Okay, so one of the most basic things that we can do is use something like Apache Benchmark to generate a lot of requests and verify how quickly those requests return. This is just running ab for ten seconds with a concurrency of one against 127.0.0.1 on port 8080, so that we get an idea.
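Concretely, the command is essentially this, where -t sets the time limit in seconds and -c the concurrency:

    ab -t 10 -c 1 http://127.0.0.1:8080/

Note that ab wants a full URL, including the trailing slash.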
So you can see that during these ten seconds we got 105 requests completed and no failed requests, which is also great, which translates into 95 milliseconds per request. Which is, let's be honest here, running on localhost, not a world record, but it's not too bad either. Okay, so this thing,
what we just did, in the chaos engineering parlance, would be called the steady state. This is the metric that we have, which is time per request: an SLI, time per request, of roughly 95 milliseconds. If we run it again, we're probably going to get a slightly different number, but I would expect it not to be very different, because we didn't change anything. Okay, so let's just finish the ten seconds. Actually, now it's 50 milliseconds. I guess it was warming up a little bit, in which case I'm going to run it again so that we can actually verify that the steady state is okay. Ten seconds is not particularly long. So it looks like there was some cache warm-up, and now we have about 50 milliseconds time per request. Brilliant. All right,
so it would be a shame now if someone
went ahead and introduced some failure, right? So we have two
components. We have the database and the
engine, the blogging engine. And what
we might want to do is just see what happens if we
introduce some slowness, some delay
between the two at runtime.
Right? And as it turns out, it's actually pretty simple.
We can do it pretty easily. One of the things that makes it easy is a tool called Pumba. It's an open source resilience tool that essentially, apart from doing things like killing containers, wraps the tc (traffic control) Linux utility in a really cool way. So I'm not going to go too much into the detail, but I just wanted to note that what you can do is ask Pumba to run tc from inside of a container that's attached to the container that you care about, to introduce slowness. So what we want to do is introduce slowness on this container, and we're going to go ahead and introduce that to all of its networking.
And with Pumba, that's pretty simple, actually. I already have a command here that I use. I use the netem subcommand. We're going to run the entire experiment for 120 seconds. We're going to specify a tc image, because Pumba allows you to either rely on tc being available in the container that you're targeting, or, if you're using someone else's container or you just don't want to have tc there, you can start another container like I just described, for example one that has tc built in, connect it to that other container, and execute tc in there. And then we're going to add a time delay of 100 milliseconds, and we're going to ignore jitter and correlation for now, both set to zero, so that we can just see the results more easily. And then the nice feature is that you can specify the container name, or, if you prepend it with re2 and a colon, you can use a regular expression. So, for example, meower_db is going to match everything that starts with or includes that, and in particular the meower_db container that we were looking at before is going to be matched. All right.
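Put together, the command looks roughly like this; the tc image named here is just one example of an image that ships tc, and the regex targets the meower_db container from the stack above:

    pumba netem --duration 120s --tc-image gaiadocker/iproute2 \
      delay --time 100 --jitter 0 --correlation 0 "re2:meower_db"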
So I'm going to go ahead and run it. And what
it should do is start that other container,
execute the stuff and end.
So I'm going to start another tab so
that we can see what's going on. And the
funny thing, an interesting thing, is that you can see the actual container being created and exiting 19 seconds ago. And that container executes tc. And when this command is done, it's actually going to go ahead and execute another container that's also going to be visible here. So I'm going to go ahead and rerun the same ab on port 8080 to verify our state right now, with the 100 milliseconds added.
And boom, look, we went from roughly 50 milliseconds to 500 milliseconds. So from roughly 200 requests in ten seconds to just 18. So what happened here? We can also rerun it just to make sure that we get consistent results. But what happened is that you might have expected the 100 milliseconds between the database and the Ghost container to translate into an extra 100 milliseconds of delay to the user. But what actually happened is that we got almost 500 milliseconds of delay. So we got a multiple of that. And the reason for that is that Ghost probably talks to the database more than once, even for the index page that we are querying. And that means that if that container's networking is slowed down by 100 milliseconds, what we're actually going to see is closer to 500 milliseconds of delay. So just to confirm that, I'm going to run it again so that we can see that we're back to 50 milliseconds, because the Pumba setup is done.
So about 60 milliseconds, which is good enough. And then if you look at docker ps -a, you can see that we have the other one: the first one was a tc qdisc add, and then the next one was a tc qdisc delete that exited 39 seconds ago. Okay. This is how Pumba was able to affect the networking of a container that was running an image that we didn't instrument in any way.
So this is like a really very short version
of this demo. But my goal here was to just kind of show
you how easy it can be with the right tooling and the right knowledge
to verify things like that. And now we know that
if we can reasonably expect the database networking to have delays of around 100 milliseconds, that will affect our overall delay for the Ghost setup by much more than 100 milliseconds, and in particular by something closer to 500 milliseconds. So if the delay rose to a second, we could probably expect to actually see something closer to five seconds rather than just one. Okay, and this is my second demo, where I would like to show you a little bit more on the Kubernetes side of things, for all the people who are using Kubernetes to deploy their software. So let's take a quick look at SLOs and Kubernetes. The purpose of the second demo is to show you how useful chaos engineering can be for SREs to verify their SLOs and to detect breaches in their SLOs.
So I've got Minikube set up here, just a basic one with a single master, and I've got a bunch of pods running. Also, nothing out of the ordinary. This is just the stuff that Minikube starts with.
And what I'm going to do is use a tool called PowerfulSeal. It's something that I wrote a while back, and it's a tool for chaos engineering for Kubernetes specifically. If you've not used it before, I recommend going to GitHub and giving it a try. But basically what it does is allow you to write these YAML descriptions of scenarios. And then for each of these scenarios you can configure a bunch of things that you can do to verify that your assumptions are correct, and if they're not correct, it's going to error and you can alert on that. So if you want to get started with that, there's a quick little getting started tutorial. But the most important stuff is about writing policies, in the section here where you can see examples of the different policies that you can implement using PowerfulSeal. And if you are wondering what the syntax looks like, there is up to date, automatically generated documentation that shows you what kind of things you can do. So if you look at scenarios, you can see the kind of things that are available to you: probeHTTP, kubectl, podAction, nodeAction, wait, and stuff like that. But this is for another day. I just wanted to give you a quick insight into where to look for those kinds of things. But let's take a look at what it actually looks like in action. So back to our little Minikube setup. I have the seal already available here, preinstalled, and I also prepared two little examples of a policy. So I'm going to start with the hello world one. And this is what it looks like.
It's a simple YAML with a single scenario called "count pods not in the running state". And what it does is that it has a single step with a pod action. And inside of the pod action there are always these things that you can do: you match a certain initial set of pods, and you can filter them by whatever property or whatever filters you feel like. So in our case, I'm going to match all the pods from all the namespaces, and then I'm going to pick the ones whose state property is Running, negated, so the ones that are not running. And then I'm going to count these and verify that the count is always zero.
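Roughly, the policy reads like this; the field names are approximate, so check the generated PowerfulSeal documentation mentioned above for the exact schema:

    scenarios:
      - name: Count pods not in the Running state
        steps:
          - podAction:
              matches:
                - namespace: "*"
              filters:
                - property:
                    name: state
                    value: Running
                    negative: true
              actions:
                - count:
                    count: 0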
So what is it going to do for me? When we did the get pods, all of them were running. So this kind of verification, very simplistic here, shows how you can continuously verify that the assumption that you make, the assumption being that all pods are running, is actually true, and you can do that with twenty lines of YAML. So in order to run this, we're just going to do seal autonomous.
That invokes the autonomous mode, and then we need to specify the policy file, which is simply done with the policy-file flag.
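Assuming the hello world policy above is saved as hello-world.yml and the tool is installed as powerfulseal (the seal used here is presumably just an alias for it), the invocation is roughly:

    powerfulseal autonomous --policy-file hello-world.yml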
If we run that, PowerfulSeal is going to connect to our cluster. And what you can see here is that it matched the namespace star, so it matched all the namespaces, matched eleven pods in them, corresponding to the pods that we found here, and it found an empty set after applying the "Running, negative: true" filter. So the filtered set length is zero and the scenario finished successfully. By default, it's also going to go ahead and sleep for some time and retry that later on, which is also configurable.
So if we just wanted to verify that this is actually working, what we could do is remove the negative and verify that it's failing: if we count the pods that are in the Running state and keep the expected count of zero, the actual count is not zero. So if we run it now, this should fail, complaining that we got eleven pods instead of zero, which is exactly what we saw here. And you can configure PowerfulSeal to either fail quickly if this happens, or, if you want this kind of ongoing verification, it can produce metrics that you can later scrape. So with just a few lines of YAML we're able to verify an SLO, which is kind of silly here, all pods running, but it gives you an example of what you can do with this kind of thing. So let's do another example,
a little bit more complex than that. I prepared another one, called policy one, for you here, and let's take a look. So this time we actually specify the run strategy: we want to just run it once, and we've got an exit strategy of fail fast. And the scenario is a little bit more complex this time. So what I'm trying to verify here is the deployment SLO, which is that after a new deployment and a service are scheduled, it can be called within 30 seconds. So let's say that I designed my Kubernetes setup, and I designed all of the bricks, in a way that I am fairly confident that at any given time, when I schedule a new deployment and a corresponding service, within 30 seconds everything will be up and running and I'll be able to actually call it. So the way that I implement that is through the kubectl action. The kubectl action lets you more or less specify the payload and the action, so apply or delete. It's an equivalent of kubectl apply -f on the standard input.
It also allows you to
automatically delete at the end so that you can clean up and
so that you don't leave some kind of artifact
after you're done with that. So the payload here is a little bit more complex. It's actually deploying another application that I wrote that is very useful for Kubernetes. It's called Goldpinger. And Goldpinger allows you to basically deploy an instance of Goldpinger per node, typically by using a DaemonSet. Or you can deploy it more or less wherever you like, whichever way you like, but the default use case is to use a DaemonSet so that you run an instance of Goldpinger per node, and then they continuously create this full graph of connectivity between the nodes, which you can use to verify whether there are any issues connecting, or whether your networking is slower between certain nodes, and stuff like that. So this is like a drop-in that you can run on your cluster and use to verify these kinds of things. It also produces metrics and things like heat maps and stuff like that. But going back to our example: in order for that to work, there is a service account so that it can list pods, so that every Goldpinger instance can actually see what other Goldpinger instances are there to send pings to. And then we've got the deployment, and the deployment
is fairly standard. Right now I only have a single node, so I'm just going to deploy a single replica. It has a selector, it uses the service account that we just set up, and a bunch of variables here that are not particularly relevant to us right now; this is just to make sure that things are working. It also comes with a liveness probe and a readiness probe, so that we know that if we can ping it, that means it was able to satisfy the probe initially. And finally we've got a Goldpinger service, a service that we're going to use to actually issue a request. And then after that, this is where our SLO kicks in. We verify that after 30 seconds, we expect that. So the magic number here, our magic range, is between zero and 30 seconds. And finally this is where the verification happens. We have an HTTP probe that calls the healthz endpoint of the Goldpinger service in the default namespace, which is the one that we defined just
here. So all in all, what it's going to do is go and create the thing, wait 30 seconds, and then issue the HTTP request to verify that it gets a response on the particular port.
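In condensed form, and with approximate field names (the exact schema is in the generated documentation mentioned earlier), the policy is shaped like this, with the full Goldpinger payload elided:

    config:
      runStrategy:
        runs: 1
    scenarios:
      - name: Verify a new deployment can be called within 30 seconds
        steps:
          - kubectl:
              action: apply
              autoDelete: true
              payload: |
                # ... service account, Goldpinger deployment and service go here ...
          - wait:
              seconds: 30
          - probeHTTP:
              target:
                service:
                  name: goldpinger
                  namespace: default
                  port: 8080
              endpoint: /healthz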
Okay, so with that we can go ahead, and this time, instead of hello world, we're just going to run the policy one. But before we do that, actually, just to show you, we're going to run kubectl get pods -w in the background, so that we see all the new pods that come up and all the pods that are being terminated. So that it's actually visible to you too.
So again, our seal, and we have the policy one YAML; I'm just going to go ahead and run it. So it starts, it reads the scenario. You can see that it started, created the deployment, and here our kubectl in the background is actually displaying the new pod that is already running after four seconds, and now we've got about 25 seconds to wait. So if there was some kind of elevator music, that would be good. It's running for 25 seconds, so we're not that far off. And there it is, making a call: PowerfulSeal tried to make the call, it got a response, and you can see the response generated by Goldpinger. Scenario finished, cleanup started. As you can see, that's the thing that I was describing before, the auto delete: it deletes all the things, the pod gets terminated, and PowerfulSeal carries on.
So if we list our pods again, we can see that the Goldpinger is actually terminated already. And if you run this continuously, you'll be able to verify whether your SLO of 30 seconds for a new pod coming up is actually being satisfied or not, depending on what's going on.
So I don't want this to be too deep of a dive, but if you want to dive deeper, that's absolutely great. I would recommend going to the PowerfulSeal documentation, back in the browser here, and at least going through the different examples. Here we have, for example, the new pod startup, and the pod reschedule, where we actually go ahead and kill a pod, then wait a certain amount of time, and then verify that the pod is running. PowerfulSeal can also integrate with cloud providers, so things like AWS, Azure, OpenStack, Google Cloud; there are drivers for that. So you can say things like a node action like this: you can say, for example, pick all the masters, pick the masters that are up, take a random sample of size one to just take a single master that is up, and stop that thing. And then we can verify that things continue working the way that we want. And if you want, you can put them back up explicitly like that, or in the stop action you can do auto restart, et cetera, et cetera.
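As a rough sketch, again with approximate field names taken from memory rather than the official examples, that kind of node action is shaped something like this:

    - nodeAction:
        matches:
          - property:
              name: group
              value: master
        filters:
          - property:
              name: state
              value: up
          - randomSample:
              size: 1
        actions:
          - stop:
              autoRestart: true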
And that's all I had for you today. Once again, go grab my book. If you want to reach out, my contact details are available there. If you have any questions, I'm happy to chat. And hopefully I'm going to just leave you with this new tool that you can use. And if you are an SRE, you should be using it. If you're not an SRE and you would like to become one, this is something that's going to help you with that. Thank you very much
and see you next time.