Transcript
This transcript was autogenerated. To make changes, submit a PR.
So we want to turn these surprises into
non surprises or something. How are we going to do that?
We've got a software tool for you. It's
Chaos Monkey for Spring Boot.
Fletcher, I'm Manuel Wessner. We work for codecentric AG
in Germany and we are maintainers
of this piece of software. So we want to kick straight
off with a live demo and show you a few things about how you can
use it. This is just a really simple application we
wrote. If you want to use a Spring Boot plugin,
then you're going to need a Spring Boot application to test it with. So we've
written a very simple application which Manuel will explain.
Yeah, it's just a very basic spring boot application.
We've got an endpoint, which is just movies.
The app is about recommending a movie for a user.
Just take any streaming service, like, do
you want to watch Inception or some other movie?
So it just has a normal REST controller with one
endpoint. This endpoint calls a
service method, get movie. This in turn calls
get recommended movie, which is just a
plain list of predefined movies, just very basic.
And for simplicity's sake we just take a random
movie off that list and the movies are provided by
a database. Yeah. So in theory it's getting it
based on user preferences or something. It's a recommender.
Okay, so now if you could just show us with Postman.
We're using Postman, if you're familiar. Yeah,
we want to start the application. Exactly. And we're
using Postman; if you're not familiar with it, it just helps you to make REST
calls and see the JSON output that comes
back. So we just make a request to
this endpoint. Okay, so the application is functioning normally.
Now let's see what we can do. Chaos Monkey for
Spring Boot: how do we get started with that? Yeah. The simplest way is
to add a dependency to your existing application.
So it doesn't matter if you have Gradle
or Maven, just add it. So we have
one dependency, chaos-monkey-spring-boot.
You add that. So now you've got the library.
It still needs some config. I prepared a little bit of
config. I will explain it to you. Right on.
So that's our application properties.
So just having the dependency in there is not going to change anything initially. You
need to activate the Spring profile chaos-monkey first of all.
And then you need to enable Chaos Monkey in the
second line; you need to enable it actively.
So then we have a kind of watcher. This means what
you're going to attack. You have different annotations.
For example, in this case we attack service.
So we attack this class, which is
annotated with service. We have
an assault level. John can. Yeah,
better than mine. The level means how many of the incoming requests
will be attacked. So one means every single request. If you put
five there, it'll be one out of every five requests.
Approximately. It's random: not exactly every fifth,
but roundabout every fifth will be attacked.
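
A minimal sketch of how such a level could behave, assuming each request is attacked independently with probability 1/level. This is an illustration only, not the library's actual code; the class and method names are hypothetical.

```java
import java.util.Random;

class AssaultLevelSketch {
    // Hypothetical helper, not the library's implementation:
    // attack a request with probability 1/level.
    static boolean shouldAttack(int level, Random random) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        // nextInt(level) is uniform over 0..level-1, so it hits 0 once in `level` draws on average.
        return random.nextInt(level) == 0;
    }

    // Counts how many of `requests` simulated requests would be attacked.
    static int countAttacks(int level, int requests, Random random) {
        int attacked = 0;
        for (int i = 0; i < requests; i++) {
            if (shouldAttack(level, random)) {
                attacked++;
            }
        }
        return attacked;
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println("level 1: " + countAttacks(1, 1000, random) + " of 1000 attacked");
        System.out.println("level 5: " + countAttacks(5, 1000, random) + " of 1000 attacked");
    }
}
```

With level 1 every simulated request is attacked; with level 5 the count varies from run to run around one fifth, which matches the "roundabout every fifth" description.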
So yeah, and the last line says we
attack using latency, so we inject latency
into our service. That's one of the different kinds of assaults
we have. So let's see how it goes. We move to Postman
again and do the same request, where before we
had a few milliseconds. Now it turns out to be
2 seconds. I do another one.
Yeah. And you see the requests will actually take slightly different times,
because by default a random amount of latency will be added,
between 1 and 3 seconds.
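
The configuration walked through above might look roughly like the properties embedded in this sketch. The property keys are recalled from the project's documentation and are an assumption here; they may differ between versions, so check them against the release you use. The Java wrapper only parses and prints them and is not part of the demo.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ChaosMonkeyConfigSketch {
    // Assumed application.properties content. The keys are recalled from the
    // project's documentation and may differ between versions.
    static final String APPLICATION_PROPERTIES = String.join("\n",
            "spring.profiles.active=chaos-monkey",       // activate the Spring profile
            "chaos.monkey.enabled=true",                 // enable Chaos Monkey explicitly
            "chaos.monkey.watcher.service=true",         // attack classes annotated as services
            "chaos.monkey.assaults.level=1",             // 1 = every request
            "chaos.monkey.assaults.latencyActive=true"); // inject latency

    // Parses the text above the same way a .properties file would be parsed.
    static Properties load() {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(APPLICATION_PROPERTIES));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        load().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```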
Yeah, so we can change that. Yeah,
we can change that. We can
introduce a fixed delay
by just defining a couple of additional properties.
So basically for range start and range end we just
define the same amount of time. So we assume
it's going to take 3 seconds to get a
movie out of our endpoint.
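
A small sketch of the range idea, assuming the delay is drawn uniformly between a start and an end value in milliseconds. The helper name is hypothetical and this is not the library's code; it only shows why setting start and end to the same value gives a fixed delay.

```java
import java.util.concurrent.ThreadLocalRandom;

class LatencyRangeSketch {
    // Hypothetical helper: picks a delay in [startMillis, endMillis], both inclusive.
    static long pickLatencyMillis(long startMillis, long endMillis) {
        if (startMillis > endMillis) {
            throw new IllegalArgumentException("start must not exceed end");
        }
        if (startMillis == endMillis) {
            return startMillis; // equal bounds give a fixed delay
        }
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(startMillis, endMillis + 1);
    }

    public static void main(String[] args) {
        // Default behaviour described in the talk: somewhere between 1 and 3 seconds.
        System.out.println("random delay: " + pickLatencyMillis(1000, 3000) + " ms");
        // Same start and end: always 3 seconds.
        System.out.println("fixed delay:  " + pickLatencyMillis(3000, 3000) + " ms");
    }
}
```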
So our application started,
first request takes a little bit longer. Yeah, this one will take a
bit more than 3 seconds. That's four,
but the next one.
So yeah, 3 seconds.
All right. Yeah, well, obviously you've seen
we're restarting the application every time, but if we want to do real
experiments, it is a bit inconvenient to have to restart your application
to start the experiment and stop the experiment. So it would be nice
if we could just have the application running and at runtime we could
tell, okay, now we want to introduce, we want to have this assault running
and then we want to turn it off again. So what we can do
is enable Actuator, which is like
some endpoints where you can, via a REST endpoint,
enable and disable certain things like Chaos Monkey, for
example. And we can configure our attacks
and our watchers using Postman, for
example. Yeah, by REST call. Yeah. So Actuator is just a built-in part of
Spring Boot. So there are lots of different functions of Spring Boot which use Actuator.
Chaos Monkey for Spring Boot plugs into
that. We just need to add the settings which expose
this REST endpoint so it can be configured by any
REST client. All right,
so let's see now we
have it enabled so
we can configure now this stuff using postman.
Yeah, so now we're going to configure an exception assault. So we saw the latency
assault just before, and another assault which we can do is exception,
which means basically the requests are
going to throw an exception. When the request comes in, the application
will throw an exception. Yeah. So we have this assaults
endpoint, to which we pass a JSON
config. Every second request gets
attacked. We don't want latency,
we just want exceptions to be thrown. So we activate that
and we want to have it on REST controllers.
So we have defined it here, we just want to have
it on REST controllers, no other type of annotation.
So we now don't need... maybe we could just take a second to look at
what we're seeing there. So those are the different options for what
you can configure. So you can attack controllers,
REST controllers, services, repositories
and components.
Obviously, technically speaking, every service,
for example, is a component in Spring, but it will attack the
things that actually have that annotation on them when you enable that particular thing.
So let's see how it affects
our endpoint. As we see, first request we
got an internal server error, we got a chaos monkey runtime exception.
So second request goes well. So we
have level two. So about every second request gets
attacked. Yeah. Cool.
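
The Postman calls from this part of the demo could be sketched with the JDK's HTTP client as below. The base URL, the endpoint paths and the JSON field names are assumptions recalled from the project's documentation, not taken from the talk, so verify them against your version. The sketch only prints the requests unless it is started with the argument "send", because sending needs the demo application running.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ActuatorAssaultSketch {
    // Assumed base URL of a locally running app with the chaosmonkey actuator endpoint exposed.
    static final String BASE = "http://localhost:8080/actuator/chaosmonkey";

    // Builds a JSON POST request. Paths and JSON field names used below are assumptions.
    static HttpRequest post(String path, String json) {
        return HttpRequest.newBuilder(URI.create(BASE + path))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Switches Chaos Monkey on at runtime.
    static HttpRequest enable() {
        return post("/enable", "");
    }

    // Level 2, no latency, exceptions only: roughly every second request throws.
    static HttpRequest exceptionAssault() {
        return post("/assaults",
                "{\"level\": 2, \"latencyActive\": false, \"exceptionsActive\": true}");
    }

    // Only REST controllers are watched, no other annotation type.
    static HttpRequest restControllerWatcher() {
        return post("/watchers",
                "{\"controller\": false, \"restController\": true, \"service\": false,"
                        + " \"repository\": false, \"component\": false}");
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest[] requests = {enable(), exceptionAssault(), restControllerWatcher()};
        for (HttpRequest request : requests) {
            System.out.println(request.method() + " " + request.uri());
            if (args.length > 0 && args[0].equals("send")) { // only contact the app when asked to
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("  HTTP " + response.statusCode());
            }
        }
    }
}
```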
All right, then let's
say our application is talking to a database.
So maybe a theoretical
thing could be that we can't access the database,
right? We don't get a connection to the database when we want to get
the user's preferences. Now, because we can attack repository
beans, we can run an attack for that. Let's do that.
So we now configure only the repository, not the
REST controller, just the repository, because it's
actually our database, or the access to our
database. And look how it goes.
We assume the same result.
Yeah. Okay, so now
we wanted to put that into an experiment. Okay, so
let's say it's all very fun. We've been clicking around
here and it seems to be working but in
reality it would be nice, we don't want to be doing that much clicking.
It would be nice if we had a kind of a script which could
just execute it for us. So let's enable that
attack, that assault, let's see how the application responds
and then let's disable that and return to normal functionality.
Yeah, we prepared a little experiment.
It's using Chaos Toolkit, which Russ Miles just
talked about. It's kind of a
JSON description of how you want to have your experiment. It's got
a title: movie recommendation when the database is
down. Quite important. A steady
state means, if everything is all right, we suppose a
movie is recommended. So how can we check that?
We request a movie on this endpoint; we should
get a result in less than a second and
it should return HTTP status code 200.
That's that. So we're basically saying that at any
point in time, no matter what's happening, our system should be
responding to that request with 200 and should be giving some movie.
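
The tolerance described here can be written down as a small check. This is a sketch in plain Java of what the steady-state probe asserts, not the Chaos Toolkit experiment file from the demo (which is JSON); the class and method names are hypothetical.

```java
import java.time.Duration;

class SteadyStateSketch {
    // Tolerance as described in the talk: HTTP 200, answered in less than one second.
    static final int EXPECTED_STATUS = 200;
    static final Duration TIMEOUT = Duration.ofSeconds(1);

    // True when a probe result is within tolerance.
    static boolean withinTolerance(int status, Duration elapsed) {
        return status == EXPECTED_STATUS && elapsed.compareTo(TIMEOUT) < 0;
    }

    public static void main(String[] args) {
        // Normal operation: fast and successful.
        System.out.println(withinTolerance(200, Duration.ofMillis(120)));
        // Exception assault active: the endpoint answers with an internal server error.
        System.out.println(withinTolerance(500, Duration.ofMillis(50)));
        // Latency assault active: correct answer, but too slow.
        System.out.println(withinTolerance(200, Duration.ofSeconds(3)));
    }
}
```

Only the first case counts as steady state; both an error status and a slow answer fall outside the tolerance.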
So it's kind of asserting, like testing.
And now we're actually testing.
What we do is, first we enable Chaos Monkey.
The good thing is Chaos Toolkit has an integration
for Chaos Monkey, so you don't need to actually make
the POST request we just saw in Postman.
You can just pass the actuator URL
and use the enable chaos monkey function.
After that you configure the assaults. So what
we do now: we attack every request.
We want to have exceptions. And to make it a little bit more realistic,
we have a connect exception, and the
arguments are the parameters of the exception we pass here.
So we pass a string which says connection timed
out. So to make it more like the exception
you would actually get if you can't connect to the database. And
that part we now have as a regular POST,
because that feature isn't implemented. It would probably be a great
pull request. We need to
be able to update the watchers from the Chaos
Toolkit. It's a
more recent update: previously in Chaos Monkey for Spring Boot, after
you started the application, you weren't able to change the watchers.
You weren't able to say that now the service watcher is active or now
the REST one; you had to do it at startup. And that's why I think the
Chaos Toolkit doesn't support changing that in the experiments.
But now, oh yeah, if you call the changing of
those values, as long as you can change them at some point
via the API call, then you should be able to change them inside the method
if you want to change them during the execution. Really? Okay.
I don't know if the underlying API has that in there yet. Yeah, I think
that's the point. I think it's still possible with Chaos Toolkit,
but not with the function change
assaults configuration, because it's quite a new
feature, I guess, to change the watchers at runtime.
Before, it wasn't possible, and that's probably the reason.
So here we pass the same arguments:
we just want to have it on repository, and it's good to go.
All right. So I will just make a first request
to see how it goes. Now I will run the experiment.
Right. So at the moment the application is running
with no chaos attacks, no assaults configured.
Is that right? I think it's still from the old one.
Like the old exceptions. Yeah.
Okay so maybe
do I need to reset it? Let's see. I think this is going to fail.
Let's see, let's see. It's live coding so we
assume it can fail.
So yeah, how you read this is:
in the beginning you look at the steady state, the movie is recommended
probe. We actually try it. It says
it's not in the given tolerance, which means it took longer than 1 second
to answer. Then in the end we
roll back. I haven't shown you that there is a rollback function,
which means at the end of each
run you disable Chaos Monkey.
Yeah. So this one didn't actually even start the experiment, because the steady state wasn't
even valid. Exactly.
Now it's disabled Chaos Monkey. So now it will run. Yeah. So we
disabled Chaos Monkey. We don't have any attacks. So I suppose
now the steady state works.
And what do we have? Now it looks
good. We have steady state. The steady state is met.
So actually it can request a movie in less than a second.
Then we enable Chaos Monkey, configure the attacks
and the watchers. So we do that
again. Steady state is not in the given tolerance,
because we now get exceptions. The repository is throwing exceptions
because you can't connect to the database. Exactly. So we
roll back everything. We disable Chaos Monkey. The result from Chaos Toolkit:
steady state has deviated, a weakness may have been discovered. And we
have a weakness in this case. Yes. We cannot recommend movies.
We can show that in Postman too. It's now crashing
with... no, because you've disabled chaos. But it doesn't matter. It's not
requesting movies. Correct. It's not delivering movies, but we want to.
We want to deliver movies all the time. Even when the database
is down and we can't get the movie which might be recommended to
the user based on his viewing preferences, we still
want to recommend him some kind of a movie that everybody likes.
So your favorite movie probably. I know what it would be.
Titanic. Titanic.
So we prepared a little bit of fallback logic.
It's on our master branch.
So in our movie service,
we now have Titanic as our fabulous
fallback movie.
So the get movie method, we just adjusted.
So how do you read that? It's actually using a
library called Vavr. It's part of a resilience
library, Resilience4j. Basically it's
just a simpler way of writing try, catch
exception in just one line. So what
it does is, we call get recommended movie. And if there
is any exception thrown, we just return
our fallback movie, Titanic.
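
The fallback described here, sketched with a plain try/catch so it runs without extra libraries. The demo itself uses Vavr; there the same idea is typically a one-liner along the lines of Try.of(...).getOrElse(...). The class below and its supplier parameter are hypothetical stand-ins for the movie service and its repository call.

```java
import java.util.function.Supplier;

class FallbackSketch {
    static final String FALLBACK_MOVIE = "Titanic";

    // Returns the recommended movie, or the fallback if the lookup throws.
    static String getMovie(Supplier<String> recommendedMovie) {
        try {
            return recommendedMovie.get();
        } catch (RuntimeException e) {
            // Any failure, e.g. the repository cannot reach the database.
            return FALLBACK_MOVIE;
        }
    }

    public static void main(String[] args) {
        // Normal case: the recommendation lookup works.
        System.out.println(getMovie(() -> "Inception"));
        // Simulated assault: the lookup throws, so the fallback is returned.
        System.out.println(getMovie(() -> {
            throw new IllegalStateException("Connection timed out");
        }));
    }
}
```

Catching every runtime exception keeps the endpoint answering during the assault, at the cost of hiding the failure from the caller; in a real service one would probably at least log it.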
All right, so I
think we still have the attacks configured,
right? So we can... we can show
it in Postman first. Oh, let's see,
let's see. You have to restart the app first.
I did restart it, but it's still... You would need to reconfigure the
attack. Yeah, true. So let's run the
experiment right away.
So let's run it again.
So now we see steady
state is met. We enable Chaos Monkey. We configure our assaults again,
the repository watcher, and it's still going.
So we now return Titanic, and it's a movie
recommendation somehow, at least. And we disable
Chaos Monkey again. And now we have an experiment completed.
Cool. So if you want to run that more frequently,
if you want to run that inside of a script or build chain or something,
the Chaos Toolkit won't give you sort of like a zero or
one kind of response, like a return code. But you
can grep for certain text. At least that's what I got to.
You can grep for text, because it prints certain text if
your experiment fails or passes. And you can
grep for that. And grep will give you a return code to tell you whether
that worked. Unless there's been an update from Russ's
side. Not yet. So we
talked about it before. Yeah. So that's one way you can do that.
Yeah. So is that the end of what we wanted to demo? Yeah,
I think so, as far as I know.
Okay, so we've got a few slides that we want to share with you as
well, about basically where the project is,
Chaos Monkey for Spring Boot. You can find our demo
case at this GitHub URL. Is this thing plugged in?
Yeah, it's supposed to work.
No, not anymore. Well, we can use the... I can...
Hold on. Yeah, you can use them. That's fine.
I know why: there's this on/off switch on the side you
can put to on. Okay, here we go. So a bit of history: this
project was started by a colleague of ours called Benjamin Wilms.
And he was, a few years ago, he was building
fallbacks and things. He was building like circuit breakers and
other resilience patterns into his applications, but he wanted to
test whether that stuff actually works. I've been, by the way, in various places
where they're building this stuff, and nobody ever tested whether any of these things actually
worked. So maybe you've experienced that too. Well, Benjamin wanted
to know whether the stuff he was building actually worked, and he had no easy
way to do it. And he found out there are a few tools out there
which he had to install, and there were various complications in using them.
So he said, well, why don't I just write something for spring boot? And so
that's why he started this application. And with the help
of this plugin, he was able to test his resilience patterns.
He didn't need to install anything on the servers.
That's one of the things which is great about using this plugin. And he
didn't need any permissions from anyone. He could just get up and get running with
that. So it was successful for him. He actually
has now gone on to start work on
a startup. So our company, codecentric, said to him,
management said to him, why don't you go and start a startup about
chaos engineering? So that's what he did. That's kind of cool when management says
that to you and sends you off. And so he
actually passed the project on to us. He said, can someone else maintain it?
So there's a few of us from codecentric AG that are maintaining the
project. And we're pretty responsive
to issues and basically we're pretty active
there. Yeah, that's our little advertising. Please get involved
if you want. If you're interested in chaos engineering and don't know where to get started,
well, one thing you could do is have a look at our project and
maybe commit something. As I said, there's a few of us involved and
we're pretty active, so any help is appreciated.
Yeah. All right. Talking about recent changes,
we made a Chaos Monkey for
Spring Boot release. And we got a new feature. So we want to introduce
a couple of things. So we have two different types
of assaults. One is
the request assaults we just saw. On every request we
do something like latency or other stuff like
exceptions. And we also have runtime assaults,
which means, for example, with this config, you see,
you can kill our application every hour.
So that's
the config, and our new feature was
having cron expressions to schedule attacks.
It's like the original Chaos Monkey
bit from Netflix, which just takes down
your application at some random time.
Yeah, it's like that because it doesn't really make sense to have this kind of
assault based on incoming requests,
but rather based on timing. And the next one as well: it wouldn't be good
if on each request you kill your app. The memory
assault, where we consume
memory up to a certain amount you configured. So, for example,
with this config, we fill 5% of memory
every second until we reach 95% of memory
and hold it for like 40 seconds. You can
test things like out of memory exceptions. How does your application behave
with that? This assault is a little bit flaky because
different JVMs and different versions of Java and things
act differently with garbage collection and stuff. So we had some pain with it.
So definitely play around with it. But just be warned, it doesn't always work
perfectly. Quite hard.
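
A rough back-of-the-envelope sketch of the timing implied by those numbers. It assumes the increments are percentages of the total heap and ignores garbage collection entirely, which is a simplification and exactly the part the speakers describe as flaky. The class and method names are hypothetical and this is not how the library computes anything.

```java
class MemoryFillSketch {
    // Number of increments needed to get from startPercent to targetPercent.
    static int incrementsNeeded(int startPercent, int targetPercent, int incrementPercent) {
        if (incrementPercent <= 0) {
            throw new IllegalArgumentException("increment must be positive");
        }
        int gap = Math.max(0, targetPercent - startPercent);
        return (gap + incrementPercent - 1) / incrementPercent; // ceiling division
    }

    // Rough duration: one wait per increment, then the hold time.
    static int approxSeconds(int increments, int secondsBetweenIncrements, int holdSeconds) {
        return increments * secondsBetweenIncrements + holdSeconds;
    }

    public static void main(String[] args) {
        // Values from the talk: 5% per second up to 95%, then hold for 40 seconds.
        int steps = incrementsNeeded(0, 95, 5);
        System.out.println("increments from an empty heap: " + steps);
        System.out.println("approximate seconds in total: " + approxSeconds(steps, 1, 40));
        // A heap that is already 30% full needs fewer increments.
        System.out.println("increments from 30% used: " + incrementsNeeded(30, 95, 5));
    }
}
```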
About the roadmap. Yeah. So what's upcoming? Well, one of the things which,
I mean, we showed you a few demos. Now,
to keep it simple, we just kept it all inside one application. But obviously one
of the main things you're interested in is applications talking to each
other over the network. And often maybe you'll be attacking
a system which is a back end for a client and you want to see
how that client reacts or something. Well, in this case we
thought about outgoing HTTP calls. Now, in Spring you
use RestTemplate or WebClient. They're the two classes that
Spring gives you to make HTTP calls going out. So we want to attack
those calls going out over the network and introduce latency or problems and
things onto those calls. And the other thing is reactive.
I've done a bit of work with reactive applications in spring boot, and so
I think some of the things we're doing there will work directly with reactive as
well. Makes sense, but there are a couple of things where we need to rethink how
we're doing that, and what makes sense if you're doing chaos engineering
with reactive applications. So that's also what's coming up
soon.
Yeah. So that was
Chaos Monkey for Spring Boot. We think that it's a great way
to get started, because if you're using Spring
and you want to get started with chaos engineering, you don't need to install any
tools, so you don't need any special permissions. You can just get
up and running in a few minutes like we just showed you. This is a
fantastic way to get running, to get started with chaos engineering.
Yeah, even the dependency step is optional. You can even
include it just on the command line. If you run java -jar,
you can put it on the class loader. So you don't
even need to have this Chaos Monkey dependency in your production
environment, in your POM. Yeah, so those
are some of the advantages of Chaos Monkey for Spring
Boot. Obviously, if you're not using Java, or
there might be other reasons, you're going to need some other tools. So we
thought we'd just point out a few other things that are out there.
Traffic control. If you want to do some low-level Linux kernel
stuff, you can use traffic control. Look at that crazy thing. We showed
you the nice way to introduce latency with our
application, but this is at an infrastructure level.
This crazy command here is going to introduce latency
on the eth0 interface,
basically. So, yeah, maybe here a letter off will
take some problems too with it.
All right, we have another tool, which is called
stress CPU. It's also a command line tool.
It's included in most Linux
distributions. Here again, we are
producing a high CPU load. It's quite
a pain to get the hang of all
these options. So here, for 10
seconds, I think, on two CPUs we introduce 128
megabytes of load.
Another, cooler tool is
Pumba. Pumba attacks Docker
containers mostly. It has these commands
available. So you can do the same, like
emulate network delays, you can
pause containers, you can stop and kill them and even remove them.
Yeah, for comparison, there's the latency thing
again, but with Pumba. So introducing latency onto connections
for that particular Docker container. You can even install it on Kubernetes
as a DaemonSet. So you have Pumba available on your whole cluster,
if that's what you want. It's kind of dangerous sometimes.
Then, it's kind of cool if you look out there for chaos engineering tools.
There's some crazy stuff out there. Like a lot of people have just sort of
mucked around a bit and started some hobby project. And it
sort of half works or doesn't work. There's better stuff and not so
good stuff. There are not a lot of tools in general in chaos
engineering, but my personal favorite is
KubeInvaders. So it's kind of playing
Space Invaders, connected
to your Kubernetes platform. And the aliens represent pods,
which are containers. So with your
spaceship, as you kill the aliens, your Kubernetes
pods will get killed. So don't
do it in production, I guess.
Yeah. And then Netflix was
the one that kicked this stuff off. And then
all the kind of big cloud people like Amazon and everyone there, they're all
doing chaos engineering. Well, there's another big cloud
provider called Alibaba, and these guys are
doing chaos engineering, too. So recently they published ChaosBlade.
We haven't investigated it; we just found out about it recently, and I
don't even know if you want to use it. It looks pretty good from what
it says in the documentation. Yeah, it has different things.
It can even attack C++ applications, it can attack Java
applications. It can attack Docker, like the
Pumba thing, and it can even attack some cloud stuff.
Sadly, we're not so fluent in Chinese,
to be honest. The documentation is mostly in English, so we just picked
it up. I like this diagram. See these things here?
Either it says it works with this, or it says it installs
a Bitcoin miner, we're not sure which.
Yeah, there we go. We've got someone that speaks
Mandarin. Excellent. So there you go.
Allegedly. We don't know whether it works at all. We never tried
it, but at least from videos and
screenshots, it looks quite good, to be honest.
Yeah, we don't know. We need to try it soon.
Then you can do chaos as a service. Okay.
Russ just mentioned his ChaosIQ
stuff. So we're
talking about platforms, so you don't have to muck
around with that, like tc minus minus,
network minus, root minus, I don't know what else, and getting
one thing wrong and destroying your entire production cluster
or whatever it is. Okay. We're talking about platforms which help you, via kind of
a web interface, to schedule.
Otherwise you need to have knowledge of a lot of different tools, how you
call them, how you use them. So sometimes it's
easier to do it that way, because in the end, you're not trying to become an expert on
low-level tools. You're trying to actually create resilient systems.
So you can talk to Russ about his platform. There's chaos mesh,
which is the startup we just mentioned from our former colleague
that he's just kicked off. I mean, that's literally, the website went up properly
a few weeks ago, but he's already got a couple of people
on the platform. And that gives you a
screenshot. There's a couple of screenshots on his website. You can point and click
a few things, and that's what I want to do, go see
the history of what I've run before and everything else. You can integrate it into
your build chain. There's a hook so that your build chain
will trigger stuff on the platform.
So that's all out there, too. Pretty cool stuff. Yeah, it's cool.
So you don't need to know how the stress CPU command works, or
stress memory. You can just click it on your platform and
it does it for you. That's it? That's it.
Thanks for watching.