Transcript
This transcript was autogenerated. To make changes, submit a PR.
Good afternoon, everyone. It's nice to be here.
I know it's, it's been a
long day. There's been a lot of good
information, a lot of talking, a lot of testing, a lot of things.
Why I'm here is a good question that you haven't asked yet, but I'm going to answer it, because that's why this talk is here.
I've been working with AWS technology for about twelve years.
I joined AWS four years ago, and I worked as a solutions architect, actually, the first year with some of the biggest gaming companies in the Nordics. We were working a lot on scalability and some of their resiliency issues, trying to find solutions, and also trying to get some of those chaos engineering teams actually started.
And I realized that there are patterns in what's blocking people from actually adopting some of those cool practices, and one of them is chaos engineering. So this talk is really a collection of tips and tricks to help you get started on your journey. If you've already started, it doesn't mean this talk is going to be useless; hopefully it might actually enable you even further, so you can accelerate the adoption on your journey. So how many of you have active chaos engineering practices currently?
Right, and we're talking chaos engineering, right, not just chaos practices, because at those we're all pretty good. And how many of you have a hard time actually accelerating the adoption within the company? All right, good. So there are three people here. This talk is for you.
All right, so in November, actually, before re:Invent, I asked a question on Twitter, because I was having a lot of conversations around chaos engineering, and I was really trying to confirm some of those private conversations I was having with the rest of the world and see what's happening. And the question was: what's preventing you from having chaos engineering widely adopted in your organization?
And quite interestingly, the first answer was: enough chaos in prod already, right? Which I think everyone can agree with. The joke of "is it chaos engineering or engineering chaos" is so funny because we all really have chaos in production, or we tend to call production chaos, because it's never really a stable system. A follow-up question was: why do you have chaos in production? So what do you think is the biggest reason for having chaos?
I guess that might be a factor, but quite interestingly, it's mostly fighting fires. So people are pretty much reacting to outages and trying to fix things rather than improving things. And chaos engineering is often seen as an extra burden to implement on top of what we have. And I think this is probably one of the first reasons why it's actually very hard to implement chaos engineering: because people call it something extra to do on top of what you already do, which is not necessarily bad, but it's actually slowing down your adoption. So this is basically
what people think about chaos engineering. Actually, the first problem
with chaos engineering is in the title
itself. Don't call it chaos engineering.
And as has been said today, actually, call it engineering. And I'll give you a simple reason for that. Every single customer that has trouble with chaos engineering, and with having it widely adopted inside the company, has that trouble because the leadership team, the C-level team, sees chaos engineering as chaos. If you just say the word chaos to your leadership team, you'll never get their trust. So it's a bit similar to one of the talks we had today, around fixing your company from the shadows. Chaos engineering kind of works in the shadows. You're doing engineering. It's actually engineering. So don't call it chaos engineering to start with. And this is not really a tip, it's really a requirement. So that's why it's numbered zero.
The second really important thing, in my opinion, is to change the focus of chaos engineering and not focus on the discipline itself. Actually, you have to have a wider view to start with, on what you're trying to do and how chaos engineering works. I'll give you an example. This is one of my favorite stories, an internal Amazon story, which really, really reflects that statement about having a wider view.
The thing is, good intentions don't work. The intention is there: people want to do good. So how many of you go to work every morning
trying to say, today I'm going to do shitty work?
How many? Right, so you all want to do good.
Actually, as people, and as engineers, we often want to do good. So good intentions don't work. But let me also show you why they don't work.
It's because people already want to do good. There's a story behind all that. At Amazon, there's something which we call the customer service training. Every Amazonian that joins the company has the opportunity, after maybe three or four months, to go and shadow customer service reps and see what customers are really dealing with, what problems they have. It's kind of a way not to lose touch with our customers and get immediate feedback.
So Jeff Bezos actually did that, and does that regularly. On one of those occasions, he was sitting next to a rep, a very experienced rep, and she pulled up one of the tickets. She looks at it very quickly, turns it to Jeff Bezos and says, I'm sure she's going to return that table. Jeff was like, what? You're like a magician or something. I just know. Okay, so they go through the call, the order, and kid you not, the person actually wanted to return the table. It so happened that the table had scratches and she wanted to return it.
So after the call, Jeff Bezos asked the rep, how did you know that she wanted to return the table? And she said, well, we've just had thousands of those in the last two weeks. And he looks at her like, what? How come we can have thousands of similar cases and not act on it, right? It doesn't make sense. So he did what every manager does. You go to the team, to the leadership, and you say, come on guys, we need to do better. We need to care more about our customers. We need to try harder. Well, guess what happened?
Well, nothing happened, actually. It just didn't work, because good intentions don't work, because she, and every other rep, was already trying to do their best, because they go to work trying to do their best. So if good intentions don't work, what does?
Before answering the question, I want to tell you a little story, and that's the story of Toyota. Do you know what an andon cord is? Right. The andon cord was invented by Toyota in the early 1900s, on their silk weaving looms. So basically, Toyota, before making cars, was making cloth for geishas. And geishas have high quality standards. If there's any defect in the silk on the production line, it has to be stopped immediately so that geishas get the top quality clothing.
So what they did is they invented a little button on the side. You see the little cord on the left, which anybody on the production line could pull if they saw a problem with the silk, and everyone could pull it. And they took this practice a little further, onto manufacturing lines.
And this is actually a manufacturing line for Toyota.
And you see those cords up. These are
cords that you see all along the manufacturing line.
And anyone on the factory floor,
if they see a defect, can pull the
cord. Now, if you do this in Europe, you get your leadership coming to yell at you: you're stopping the production line, it had better be good, because we're losing millions and millions of dollars. Well,
Toyota has a different way of doing that. When anyone pulls the cord, the leadership comes and says, thank you very much for pulling the cord, that means you care about our customers. So they empower anyone on the production line to care even more. It's a cultural thing. So that's called the andon cord. And we took that principle and tried to apply it to customer service, because of course our rep had seen that case a thousand times. She could have pulled the cord and said, oh, that's the tenth time I'm seeing this item, let's pull the cord and take this item out of the catalog. And so we did. For some items, and this is actually one of the items that was put in the catalog with a mistake, contacts fell from 33% to 3% within a couple of days. Right.
And these are kind of interesting ideas. So if good intentions don't work, what does? Here's another andon cord mechanism that worked, actually, on Prime Video, which is quite funny. You can sometimes receive emails telling you, we are refunding you because we noticed that the quality of the movie was not the best it could be. And these are all automatic emails, right? There's no one looking at the logs and saying, oh, the quality of the delivery for that person was bad. These are systems that analyze logs in real time and say, hey, if you are experiencing issues, we're refunding the movie. All of those are mechanisms to actually improve customer service.
And when you think about chaos
engineering, it's actually very similar.
All your company, all your developers,
they want to do good, right? So if you have an engineering practice that is stalling, the first thing you need to think of is not necessarily the people, and not asking them to try harder at testing or to do better. It's that you're missing a mechanism. And this is very, very important, because it changes the focal point of where the problem is: from people, who already want to do good, to a lack of mechanism.
And if you think about mechanism, there are
three things in a mechanism. The tool. Obviously you need the
tool to implement a mechanism, but then you need to have an
adoption framework.
How are people going to use your tool? And actually how do you enforce
people to use your tool? And even better, how do you not enforce it, but make it so subconscious that they can't do without it, or it's automatic? And then you need to audit, of course: you create a tool, you have adoption, and then you need to audit it. So when you think about chaos engineering,
don't only think about the tools, because very
often when you think about that, it means you're looking
at the wrong problem. And we'll talk about mechanism a
little bit later. Another tip is, well, change begins with understanding. And that works at the personal level too: if you want to change as a human being, you need to understand what you're doing, and it's the same for your system.
And we talked about monitoring a lot. But I could ask you this: what are the top five most painful outages that you've had in the last two years? Are you actually able to give me data that backs up your claim, or do you have a gut feeling? Well, you'd be surprised. For pretty much nine companies out of ten, it's going to be gut feeling. And very often it's time-biased, so you're going to remember the last two or three outages that were really painful. But if you look over a longer period, you can see a very, very different set of things. So very often people will tell you what they think it is, but it's not really what it is.
So have a way to really measure and look at past outages. A tool that analyzes your COEs, for example. We call them COEs, Corrections of Error; they're our postmortems, and I'll talk about them a little bit. Your postmortem, it's good to write it, but also do some analysis on it. Try to find patterns. And a pattern is not "Bob deleted a database in production" or "Adrian deleted a database in production."
I did it twice. For real. Never got fired.
You know why I never got fired? Because I was not blamed. Because I had been set free to be able to do it; there were no guardrails. I could run any command without a confirmation prompt asking: do you really want to delete the database in production? Yes. It was not like that. And my terminal looked exactly the same for the test and production environments. There were no different colors, no such thing. So it's not really my fault, actually, because even though I deleted the database, I was being pulled away every three seconds to answer questions. I was not left alone to concentrate. There were fires everywhere. So yeah, the consequence is that my stress level and my attention made me delete the database, but I should not have been able to do that. So, some of those painful things. I mean,
how many of you have ever had an outage because of SSL certificates? I love that one. This has nothing to do with coding skills, right? It's about process. You need a mechanism to enforce that your SSL certificates are either rotated or have alarms on them. So it's again a lack of process.
right? We all have infrastructure as code,
right? But how
many of you are allowed to log into a machine on SSH
and do configuration challenges? Why do you use
infrastructure as code?
Why? Because if you do it, you're thinking immutable
infrastructure is good, but then you let people
mutate it.
Right? So all of this is about good intentions. And you see where I'm going here, right? A lot of it has nothing to do with humans. It's about processes detecting things. And it's the same with third-party providers. Very often you have a dependency that's going to fail. Well, again, it's about having a process that monitors it. Maybe it's a circuit breaker, maybe it's a higher-level process, whatever it is. Or it's having multiple providers and not relying on one.
Imagine if Tesco was relying on only one provider of Twix bars or Snickers, and that provider got bought by a competitor. What would it mean? That for the next six months they can't stock that particular product? No, of course not; they have a diversity of providers. So this is the same for you. You have to have a mechanism to identify those problems.
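On the circuit breaker idea mentioned above, here is a minimal sketch in Python, just to show the shape of the mechanism; the thresholds and the fallback are assumptions for illustration, not a recommendation:

```python
import time

class CircuitBreaker:
    """Fail fast on a flaky dependency instead of hammering it while it is down."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback                # circuit open: skip the provider entirely
            self.opened_at = None              # half-open: give the provider another try
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0                  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()   # too many failures: trip the breaker
            return fallback
```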
Every time we have an outage at Amazon or AWS or in the business, we go through what is called the postmortem. We call this internally the Correction of Error, or COE. It's basically, and I'm going to piss people off here, the five whys, but not the five whys, because within our culture we don't stop at five whys; it's just the name the process started with. But basically it's an analysis of what happened, what the impact on the customer was, how many of your customers were impacted, basically the blast radius, really understanding the blast radius, and what the contributing factors were. And this is very, very important. And this is one of the problems with the five whys, and why we have to stop calling it the five whys, and I'll explain this at the end. Technology is a lot more complicated than just tools.
It's actually culture, processes, and tools. When you do a root cause analysis, very often people stop at the tools or the processes, but actually you have to look at the entire picture. So you have to ask a lot of whys in many different directions, in many different universes: the culture universe, the tool universe, the process universe. Right. If you find one contributing factor, it's not enough. You have to dig deeper, you have to ask more questions, more complicated questions. Also, there's the discussion about being
blameless, right? It's actually not blameless, because you do want to know who made the mistake; you want to understand. It means it's consequence-free: you're not going to fire someone because of a mistake, but you want to understand what that person did and how they were even allowed to cause that problem, if it's an operator problem. So never, ever stop at an operator problem. It's a false sense of responsibility. It's not there. Right. The responsibility is usually on the mechanism, on the culture, or on the wrong set of tools, never really on the people. Then you have to have data to support all of that, right? If you don't have data, you're basically navigating in the world of assumptions, and assuming is death.
What were the lessons learned? Usually there are a lot of them. Again, it's the realm of culture, tools, and processes; always think about that, because this is where technology lives. And what are the corrective actions? For corrective actions, we always assign a date. By default it's two weeks, but you can have much shorter dates or a bit longer, depending on the task at hand. And this is very important, because it defines the auditing. So then we can actually have weekly reviews, and those weekly reviews, interestingly enough, are with the upper leadership team. Actually, Andy Jassy regularly goes into those technology reviews. Andy Jassy is the CEO of AWS.
And now there are a lot of service teams, right? So we use a wheel of fortune: every team, if they get selected, has to go on stage and present their metrics, their operations, what they have done to fix the particular problems that were outlined in the COE, or something like that. And it's really meant to share things. So all the service teams are actually around the table like this, the management is here, and everyone has to present. It's random; the wheel of fortune is a weighted random wheel of fortune, so if you get called two or three times, eventually it's going to go to someone else, because your weight is going to be higher. It really is a process to spread knowledge across the different teams and to identify what could be the next experiment to run, or the continuous experiment, the continuous verification, to avoid having this problem again.
And on the adoption part: so we have the tools, we have the auditing, and then the adoption part. We use something called the policy engine, and basically the policy engine is a tool to amplify social pressure. It's quite an interesting tool. It collects all the data of an environment, based on the best practices that we've understood. We codify those best practices into scripts that scrape the environments within AWS, for every team, and then return a score for a particular service team, architecture, or software practice. And then, if a team doesn't implement those best practices, which are continuously monitored, its score goes down, and those dashboards appear in the leadership team's weekly meeting. Right. So then you have to justify why you are not following the best practices.
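To make the pattern concrete, here is a minimal sketch of what one codified best-practice check and a score could look like, assuming Python and boto3; the actual policy engine is internal, so this is only an illustration of the idea:

```python
import boto3

def rds_multi_az_check():
    """Codified best practice: production databases should be Multi-AZ.
    Returns the fraction of RDS instances that comply."""
    dbs = boto3.client("rds").describe_db_instances()["DBInstances"]
    if not dbs:
        return 1.0
    return sum(1 for db in dbs if db["MultiAZ"]) / len(dbs)

# Add more codified best practices here as patterns emerge from your COEs.
CHECKS = {"rds_multi_az": rds_multi_az_check}

def team_score():
    results = {name: check() for name, check in CHECKS.items()}
    score = 100 * sum(results.values()) / len(results)
    return score, results

if __name__ == "__main__":
    score, results = team_score()
    # Feed this into the dashboard that shows up in the weekly review.
    print(f"Policy score: {score:.0f}/100", results)
```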
And this is not because people don't want to do it; it's because, as people, we tend to push things off, and if we have pressure from our peers, we always want to come out as the good kid, right? Whereas if you actually don't have those reviews and you don't have that social pressure, we tend to take a shortcut and just say,
maybe later, maybe later. And I'm sure you've all been in
that situation: when someone is going to verify what you do, and that someone is a peer at your level, it's something which pushes you to do a little bit better. Not that you don't want to do good, but you're still
human, right? So these are some of the mechanisms that we use. Another tip that is super important: when you adopt chaos engineering practices, you're never, ever going to go from zero to 100 in your company. Actually, you have to find the right Trojan horse, and that Trojan horse is going to be the team that spreads the good virus inside the company.
Never choose the best team. You know why? Because they're already doing great, and when they're doing great, it's hard to justify the work that you're going to do with chaos engineering. So very often I see companies say, oh, this is the best team, they have the best practices, let's start chaos engineering with them, and then there's very little noticeable difference, because they're already doing great. And then the other teams are not really inspired, because nothing magical happened. And that's one thing: people expect change to be magical, a wow, something that is going to transcend their developer experience. If you take the worst team, they have way bigger problems to deal with than chaos engineering, like getting infrastructure as code in place, or shutting down port 22 so you can't log into your instances, things like that. So choose a team in the middle, whichever team is going to serve your interest best. And when you have that team, you have to find the right metric.
And I'm often asked, okay, which metric is best? If I start with a team and I have to choose one metric, it's going to be MTTR, mean time to recovery. That means how fast the team is able to deal with an outage and recover from it. And that's all about confidence. For me, the essence of chaos engineering is training the team to be confident: confidence in the application, but also in the tools and in their processes to actually deal with outages.
How many of you have never had an outage in production? You've either never had outages in production or you're lying, right? Okay. Now, how many of you have had outages in production? Right. How many of you, during those outages, felt that you'd lost all your capacity to think? You started sweating, you started swearing, you basically became really stupid. But that's what outages in production do. Chaos engineering kind of limits that, right? You're still going to be scared, because it's still an outage in production. There are still real customers and you care; actually, you're scared because you care for the customer and you care for your work, which is great.
So MTTR is a great metric, because the less scared you are, the faster you get into action, the better you're thinking, and the faster you're able to recover. And chaos engineering definitely helps a lot with that. So if there's one metric to follow at the beginning, before you understand or pick up others, like availability or things like that, MTTR is always, always a good default. And you don't have to have 20 metrics; have one which is solid, and that one is actually quite solid because it directly impacts your availability: when you're down, availability is only improved by your mean time to recovery. And that's super important to understand.
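A rough way to see that relationship is the classic steady-state availability formula, written in terms of mean time between failures (MTBF) and mean time to recovery (MTTR):

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
```

For a given failure rate, the term you can actually push down during an incident is the MTTR, which is why shaving it directly raises availability.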
There's another thing which is quite interesting as well. During the lifecycle of chaos engineering, you basically start from the steady state, you make a hypothesis, you run your experiment, you verify your experiment, and then you improve. And that's the continuous learning cycle we talked about all day.
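As a minimal sketch of that loop, with function names that are mine rather than from any particular tool, the shape of an experiment runner could look like this:

```python
def run_experiment(check_steady_state, hypothesis, inject_failure, rollback):
    """Sketch of the chaos engineering loop: steady state, hypothesis,
    experiment, verification, and always a rollback."""
    # 1. Verify the steady state before touching anything.
    if not check_steady_state():
        raise RuntimeError("System is not in a steady state, aborting experiment")
    try:
        inject_failure()                 # 2. Run the experiment (smallest blast radius possible).
        held = check_steady_state()      # 3. Verify: does the hypothesis hold under failure?
    finally:
        rollback()                       # 4. Always clean up, even if verification blows up.
    print(f"Hypothesis '{hypothesis}' {'held' if held else 'was disproved'}")
    return held
```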
When you start, you actually don't need anything more than the hypothesis.
I'll tell you how I run hypothesis meetings in my company, or in companies that I work with. I put people in a room, and that's not just the engineers: it's the project manager, the CTO, the CEO, everyone that is somehow related to the project, to the thing we want to test, the service, the product. Everyone in the room, from designers to... I love your helmet. It's not distracting at all. It's quite funny, it's good. So I put everyone in the room, right? Not just the engineers.
And what is interesting to do here is to ask them to write on paper what they think the result of the hypothesis is going to be. So, for example, my hypothesis is: what happens if my database goes down? You write it on paper within the time limit, you don't talk to anyone else, and you write down what you think happens. And you know why it's super important to write it on paper? Because when people talk to each other, there's something called convergence of ideas, and that's related to diversity, to the diversity of people. Right. There are strong-minded people that will push their ideas, and introverted people will naturally hold back, like me. You'll be surprised: I'm really an introvert. I need to step back and think a lot before I can say anything in a meeting.
Whoever is extroverted will start talking a lot. And if they are really convincing, you get convergence of ideas, and everyone tends to think, oh yeah, that sounds about right, yeah, that's what's going to happen. This is zero information for you.
It's useless. What you want is for people to write it down, because then you have a divergence of ideas, and you realize that everyone has a different idea of what happens if something goes down. And that's 100% of the time: I have yet to run a meeting where I've done that and everyone knew exactly what was going to happen. And you can just stop there. This is the beauty of getting started with chaos engineering, because you have to understand: how on earth is it possible that everyone thinks differently? It probably means that the specifications were not complete, the documentation was not right, or things have changed and the changes haven't propagated. So if you have a product that takes months to build, some developers might have gotten stuck with the specification from the beginning and haven't necessarily caught up with the new one. Or you're new there and you have assumptions, right?
So put a lot of emphasis on the hypothesis. I kid you not, this is probably going to be the wow moment in your company. You're going to fix most of these issues there; you don't really have to run any experiments, because this alone is going to fix a lot of issues, because it's going to trigger questions which you're then going to investigate, which is actually quite beautiful. The fifth tip is to introduce chaos engineering very early in the process. I see companies having beautiful processes, focusing on CI/CD pipelines, and only then thinking about chaos engineering.
It's the same thing as creating an application and then saying, oh, maybe I should make it secure. So chaos engineering is actually job one. It's not job zero, because security is, but it should be job one or two; it should be really high in the process. And that means when you hire people, teach them how to break stuff. Actually create dev environments where you can let people kill Docker stuff and see what happens. Just running your system in your local environment, killing Docker stuff and seeing what happens, will teach you a lot. Actually, the first thing I did when I was hiring teams in my previous company was, in the first week, have them run a small program with three APIs. It was a product API, so GET, POST, and DELETE, plus a health check, and then they had to try to make it as resilient as possible. And that was the only guideline. They could use all the tricks within the dev environment; the only constraint was that it was a Docker environment, things like that. It changes the way developers think,
because it triggers something in the mind: there's more than just making things work, there's also breaking them. And it goes back to when you tried to understand how a radio or a television worked: you often had to break it to see how it works. Right? And this is the same. There are a few commands I love to get developers to start with. docker kill, docker stop, it's just beautiful.
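For instance, a tiny sketch of that kind of local experiment, assuming the Docker SDK for Python (pip install docker) and some containers already running:

```python
import random

import docker  # Docker SDK for Python: pip install docker

client = docker.from_env()
containers = client.containers.list()  # running containers only
if containers:
    victim = random.choice(containers)
    print(f"Killing container {victim.name} ({victim.short_id})")
    victim.kill()   # same effect as `docker kill <id>`; use victim.stop() for a graceful stop
else:
    print("Nothing to kill; start your stack first")
```

Then watch what the rest of your application does while that container is gone.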
Another one is to DDoS yourself. It's great to run tons of requests, for example, against your health check API and see how it behaves, or against your API and see: does my health check still answer? Because if you have an API and a health check API, and they are both being hammered, and your CPU is at 100%, which one do you favor? Do you want to answer the API or the health check? Well, you should actually think about it, because one of the big problems in distributed systems is the prioritization of requests when systems are congested. If you don't answer the health check and you answer your API, well, the load balancer is going to take the machine out of the auto scaling group, for example, and then you have nothing left.
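Here is a minimal sketch of that prioritization idea, assuming Python's standard library on a POSIX box; the load-average threshold and paths are made up for illustration. The point is: always answer the health check, shed ordinary traffic first when the box is congested.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

CPU_COUNT = os.cpu_count() or 1

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        overloaded = os.getloadavg()[0] > CPU_COUNT  # crude congestion signal
        if self.path == "/health":
            # Always answer the health check, even under load, so the load
            # balancer does not pull an otherwise healthy instance.
            self._reply(200, b"ok")
        elif overloaded:
            # Shed regular traffic first: tell clients to back off and retry.
            self._reply(503, b"busy, retry later")
        else:
            self._reply(200, b"products")

    def _reply(self, code, body):
        self.send_response(code)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```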
So degradation is a good one to try. Burning CPUs, for example: this is stress-ng, which was talked about today, an evolution of the stress tool. tc is definitely a good one for adding latency. So there's a bunch of tools you can start playing with locally in your environment. You don't have to do this in prod. This will teach you a whole lot of things, especially how to treat your system.
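For example, a small sketch that drives those two tools from Python, assuming stress-ng and iproute2 are installed, root access for tc, and that your interface really is called eth0:

```python
import subprocess

# Burn two CPU cores for 60 seconds and watch how your service degrades.
subprocess.run(["stress-ng", "--cpu", "2", "--timeout", "60s"], check=True)

# Add 200 ms of latency to everything leaving eth0, run your checks, then clean up.
subprocess.run(["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", "200ms"], check=True)
try:
    pass  # ...run your tests against the now-slow network here...
finally:
    subprocess.run(["tc", "qdisc", "del", "dev", "eth0", "root", "netem"], check=True)
```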
Another very important thing: when things go down, have a reflex to look at the blast radius and understand it. That way, when people start building, they will start to have this mindset of thinking about the potential blast radius. That means architecture, that means APIs.
So it's really about trying to create a culture where everything revolves around blast radius reduction. Because if you do things with the smallest blast radius possible, it means fewer customers are going to be affected, and chaos engineering is exactly the same. When you do your experiment, think: what is the smallest blast radius possible that I can use to actually prove or disprove my hypothesis?
We've seen this today: never assume, right? And if you assume something, it's probably broken. If you haven't verified it, assume it's broken. That's basically the language. And there are a few things in the cloud that I continuously see people assume just work, and that goes with managed services. Actually, I think that's a problem with managed services: even though they're super important for you to innovate faster, they give you a sense that everything is going to be okay, because the burden of managing that service is somewhere else. And when it's AWS, yeah, we manage it pretty well most of the time. But failures happen, and they will happen.
And S3 did fail a couple of times since 2006. And the effect was dramatic, because people had never experienced an S3 outage before, so they discovered new failure modes that they were not familiar with. And these are some of the failure modes that I've seen the most, at least on AWS, because this is the audience I mostly talk to. Assuming,
for example, Multi-AZ. You have a region, you have three AZs. How many of you use AWS? This is just to get a sense, right. So on AWS you have a region with three AZs so that your application can be fault tolerant. And people assume an AZ just never has problems. But it does. Even though they're isolated, sometimes they do. So test that you're resilient to one AZ failure. We had a discussion today about how to do this. Well, for example, you can push the subnets in that AZ to have zero network traffic.
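A minimal sketch of that idea with boto3, not the actual script from my repos: create a fresh network ACL, which denies all traffic by default, and swap it onto every subnet of a VPC in the target AZ, keeping the association IDs so you can roll back afterwards.

```python
import boto3

ec2 = boto3.client("ec2")

def fail_az(vpc_id: str, az: str):
    """Simulate an AZ failure by cutting network traffic to its subnets."""
    # A newly created custom network ACL denies all inbound and outbound traffic.
    deny_all_acl = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]
    subnets = ec2.describe_subnets(Filters=[
        {"Name": "vpc-id", "Values": [vpc_id]},
        {"Name": "availability-zone", "Values": [az]},
    ])["Subnets"]
    rollback = []
    for subnet in subnets:
        acl = ec2.describe_network_acls(Filters=[
            {"Name": "association.subnet-id", "Values": [subnet["SubnetId"]]},
        ])["NetworkAcls"][0]
        assoc = next(a for a in acl["Associations"] if a["SubnetId"] == subnet["SubnetId"])
        new_assoc = ec2.replace_network_acl_association(
            AssociationId=assoc["NetworkAclAssociationId"], NetworkAclId=deny_all_acl)
        # Remember the new association and the original ACL so it can be restored later.
        rollback.append((new_assoc["NewAssociationId"], acl["NetworkAclId"]))
    return deny_all_acl, rollback
```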
Another one: you saw people up there. I know, usually people are like, what the hell are you doing, experimenting on people? I'm not breaking people's necks, that's not what I do. I identify the top people in the teams, then I take them out of the equation, because people over-rely on the ten-x developer, right?
And you'll be surprised. I do this experiment very, very often. The last one was last year, around October, November. I took the guy, and it was great: I took his laptop and sent him home. Six hours later, we had to bring him back urgently. He was the only one who knew how to do a particular magic trick to get the database back up and running, or he had a key that no one else had. So don't stop at infrastructure or compute experiments; think about people, because this will highlight weaknesses in your processes and mechanisms. The problem here was just that they didn't have a mechanism to ensure that everyone in the team had the same level of knowledge, and they couldn't share it, because there was one guy who was just doing it all himself, and, well, even though he was great, he was never really telling anyone what he was doing, and that was all right for other people.
And this is just to get started with all those things. Actually, I realized that even though there is Gremlin, and I use Gremlin a lot, and I use the Chaos Toolkit a lot, and Pumba, all those things, there was some stuff missing. So I went and wrote a bunch of open source stuff to help people do failure injection. This talk is not about failure injection, because failure injection, in my opinion, is just a tiny part of chaos engineering. But these are the tools I wrote to be able to do the verification I was talking about. So there's AZ failure in a chaos script. You can randomly kill database instances, ElastiCache clusters, all on AWS. If you're using serverless and you do Python, I wrote an error injection library for Lambda that can return different HTTP codes and stuff like that.
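The shape of that kind of error injection, as a rough sketch rather than the actual library's API: a decorator that, with some probability, short-circuits the Lambda handler and returns an error response instead.

```python
import random

def inject_http_error(rate: float = 0.2, status_code: int = 503):
    """Hypothetical sketch: with probability `rate`, return an error instead of
    calling the real handler, so callers can be tested against failures."""
    def decorator(handler):
        def wrapper(event, context):
            if random.random() < rate:
                return {"statusCode": status_code, "body": "chaos: injected failure"}
            return handler(event, context)
        return wrapper
    return decorator

@inject_http_error(rate=0.2, status_code=503)
def handler(event, context):
    return {"statusCode": 200, "body": "ok"}
```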
I can do demos later because I'm already a
bit out of time. So if you're interested in any of that, I'll be here
hanging out. So I can show you some demos, but you'll
find everything on my GitHub. I just didn't want to
push too much AWS stuff, because in the audience not everyone is on AWS, so it doesn't make sense to focus too much on that.
So, in summary, my biggest suggestion, let's say, for getting out of the starting blocks with chaos engineering is to take a step back and realize what chaos engineering really is: the intersection of culture, tools, and processes. It really sits in the middle. And if you have problems with the adoption of it, take a step back and try to analyze; maybe there's something in your culture that is missing. Chaos engineering needs very strong ownership, really strong deep-dive characteristics, very high standards, and bias for action, or whatever you call them; that's how we name our leadership principles inside Amazon, for example. But that will define who you're going to hire, right? So maybe you're just not hiring the right people. Maybe your culture is not set right yet, and maybe you can transform it to match a little bit more what chaos engineering in your company needs. And I think there's no blueprint. I can't tell you exactly what sort of
people you need to hire to run chaos engineering, because it's going to depend on your company culture. Then you're going to have to define tools. And there are plenty of tools, with new tools coming up every day. Which ones you use really depends on your environment and what you're doing, but there's never going to be just one; it's going to be a set. The only thing I can say is: try to make those tools uniform across the entire company. Most of the time when I've seen failures, it's because people use so many different sets of tools, and then the adoption of those tools is very hard because there are so many of them. So choose the right ones, whatever works for the particular verification you want to set up, and start with that. Then at the end, once you're really established in the chaos engineering practice, you can add more. But at least to start with, focus on a few, and those
are the ones that are going to give you the most gains. So understand your past outages, figure out the patterns. And then it's not about the low-hanging fruit; try to fix the things that have the biggest blast radius, and get the tools right for that, and the right culture. But don't forget processes, because this is super, super important, and we talked about those mechanisms, right? It's a complete mechanism: you have the tools, the adoption, and the auditing part of it. And that's pretty much it. Thank you very much. I write a lot on Medium, you can follow me on Twitter, and I'm going to be here hanging out.
Remember, I'm a bit of an introvert, so even though I speak on stage, I'm actually an introvert.
But I'm happy to talk with you. Thank you very much.