Transcript
This transcript was autogenerated. To make changes, submit a PR.
It's. It's pretty early, right?
Feels early to me. I come to London once a year at the moment.
This is the only time in 2020.
And I had the issue of living on the south coast of England,
which, if anyone's familiar with that, takes about as long as it takes
to get to the east or west of London from here.
About 2 hours door to door. So it's been an early start, which means
I'm ready and very awake, and I imagine you may not be quite so. So I'm going to try and make this as easy a way of getting into the day as possible, because there's a lot of fantastic
talks coming. Okay, I'm going to do that because
I can't see anything unless I do that. Has anyone seen me speak before?
Raise your hand if you have. Excellent. So some of you have seen me speak
before. That's great. I can't completely reinvent myself then.
Damn it. So I'll actually have to keep to the story.
I can't see my slides yet. Is there a reason they're not up on here
yet? I plugged everything in. They are now.
They are brilliant. You're a star. Thank you. Right. So this
talk is a slight adjustment on a talk I've been doing for a little while
now. And it's pretty useful to sort of box
chaos engineering and to talk about two things, really. One, what chaos engineering
is and how we should approach it, and two, how we might
go beyond it. That's the new part today. Now, I'm not
saying that you've come to a chaos engineering conference and somebody's
already saying, go beyond it. And you're probably sitting there going,
I don't know what it is yet. Or if you do know what it is,
you're probably going, I hope he says it's what I'm doing,
but that's all good. I'm actually not going to say, oh, you do
this. And then here's the really good part. At the end, I'm going to be
saying, okay, everything you're doing is right, but there is
a different way of packaging it that just might
make it easier for your company to endorse and explore
and be happy to put a budget on. So,
okay, let's get going. Me, for those that don't know me: I am quite happy with risk, as a
motorcyclist. That's what I intend to embrace. I embrace risk on roads.
Anyone here a motorcyclist? Excellent, brother.
How are you doing? Okay, so I ride a Harley Davidson. So it's sort of a
motorcyclist. It's not really. It's a couch on two
wheels, but it's a lovely bike. And this was actually me
on a much smaller bike. I was over in Tibet riding to Mount
Everest. You have to do it on the north side, you have to do it
in Tibet because on the south side there is no road.
So I rode a motorcycle to Mount Everest. And on my way to Mount Everest,
I got some wonderful opportunities, some beautiful pictures. And all of them
led me to a greater understanding of this troublesome activity
that we're involved in called system delivery and system development.
Someone's died next door.
It's not as bad as. I've had some great analogies in my talks where something
goes on in the background and we actually had a fire alarm go off once.
And I thought, that's perfect for chaos engineering. It's a fire alarm and you just
watch everyone's reaction. Quick detour in a fire alarm,
what do you think everyone's first reaction is? Check their
watches. Is it 11:00 a.m.? If it is, it's a drill. Very few fires have spontaneously started on the dot of 11:00 a.m. So usually people go, oh, yeah, it's a drill. Even if it's not, even if
they see smoke, it's a drill. So, yeah, I've had that. That's actually
a good little starting point. Fire drills are a terrible analogy for chaos
engineering. It's often used, you practice these
things because they matter, because when it actually happens, you'll be prepared. Does anyone know
the basic problem with fire drills? They make you
complacent. They make you think you know how it's
all going to go. I'll tell you a very, very quick, tragic story.
Very much. Very much, sir. I've got the most interesting heckler in the world.
Brilliant. Don't normally get heckling in tech talks.
So this is all the story. This is about a lady who had absolutely no
ability in her brain to become complacent. Whatever the wiring is in the brain that lets it become complacent, she didn't have it. And she happened to work
very, very high up on one of the twin towers. And obviously
the fire alarm went off. The other tower had got a problem. Everyone was told
to leave. She did. She got up,
she walked the stairs, and she walked from the very top of the tower to
the bottom. And she survived. While she was going there,
everyone else on her floor didn't have this different wiring
in their brain. What were they doing? They were finishing their phone calls.
They were looking at their screens and wondering if it was for real. They were
picking up their photos of the family. One person walked with her down a few flights of stairs and then came back because they'd forgotten something at their desk. They'd forgotten their laptop. None of
those people, none of those people made
it out of that building. So complacency is
dangerous. What I talk about today is how chaos engineering should be viewed. And I'm very careful, because I call it a practice. You're a practitioner of this, but be very sensitive to complacency when you practice something, because that's what happens in fire drills: we get so used to them going off that all those things you're told, absolutely told, don't pick up your laptop, just leave it all and go, you won't do, because you're complacent. So, yeah, that's the downside. That's the flip side, the nasty side of chaos engineering. Right. Back to the story. Back in Tibet, I saw
some wonderful things. I saw this.
Now, that looks like it's going along a plain, but actually it's going down a
very, very steep hill. It's just one of those photos that's a bit confusing.
The beautiful thing about this is it gives me several analogies for software
delivery and system delivery. Number one, that's the
best analogy for a software roadmap I've ever seen.
And actually, I'd go further. Most software roadmaps I've seen have got lots more
dead ends, like side roads that go off a cliff.
A lot of those, but they're not on this one. So it's not perfect,
but it's not bad. The other thing that you can't see so well on this beauty is that some of those corners don't have tarmac on them. Now, I'm a motorcyclist.
Your biggest fear about going around a corner, leaning over is
lack of grip. And they've taken the tarmac
off on the corners. Okay, so why have they done that?
It's because in the rainy season, when there's lots of water, it tunnels its way down and actually takes off that top layer. And so what they've done, to make it safer, is remove it completely.
Not sure about that thinking, but it was there.
The other thing is, you go around those corners and what they've done is the
Tibetans, in their absolute wisdom, have looked at this and gone,
it's a bit dangerous. We need to let people know
we've taken the tarmac off. What should we do? Should we create a sign
maybe 100 yards before it? No, what we should do is put little rocks, about that big, little boulders, around it, like, it's dangerous here.
They've turned a mildly dangerous chunk of road into a
lethal hazard. Brilliance.
And we do that in software all the time.
We go, oh, it's a bit difficult here, you mustn't change it. So what do we do? We put a readme in a repo somewhere. No one will see that. We're constantly leaving traps in our world. And interestingly enough,
there's lots of funny things with safety when it comes to
chaos engineering and how it relates to safety and how we're building safety in
our systems. And this theory of making something safer
means we can often make it a lot more dangerous
through complacency, or as in this case,
rocks around a hazard. Anyway, that was part of it.
I learned that. But what I learned even better is I learned what people think
production looks like. Serene,
beautiful. I took this photo from the foothills opposite Mount
Everest. So lucky it was a clear day.
Now, you could argue if this was an analogy for production. There's a lot of
cloud there. Hiding a lot of sins. That's also true,
but that is what we think we've got. You must have seen the PowerPoint slides
that someone puts in front of you and says, look, here's production.
Three boxes, straight arrows.
Gorgeous. Where's the complexity? It's not
even there. It's perfect. And then I got
the real analogy for production, though. So that was in front of me. And admittedly not immediately behind me, because when you get to see the picture, you'll realize that would be disgusting, but behind me, about 20 or 30 meters behind, a good distance from it, by the way. Again, it makes sense in a minute.
There was the real production because this is what was behind
me, some distance behind me. This is the
toilet at Mount Everest,
not to be gotten near for about 23
and a half hours of the day. There's one half hour period you
can go near production. I mean, the toilet. And I will not
tell you now why that half hour period exists,
but needless to say, it has a lot to do with burnt yak poo.
And it becomes safe. You can actually see it. It's completely
and utterly. You could own that toilet all day long.
And then the moment the yak poo goes on, the queue forms.
The queue can be up to half a mile long. So, yeah,
you don't get the end. Okay, so that's the truth, right? Most people on
stage don't explain what production is like. They say,
oh yes, look what we did. I remember going to talks over
and over again at big conferences where people say,
this is production. We've done this, we've learned this, we've learned that.
It's been wonderful, it's been great. They're lying.
It's been hell and they need to tell you that.
And so I'm hoping that more do, and chaos engineering helps with that, because it
makes our hell much more shareable and much more comparable.
So let's get into it a bit more. Okay, so what
happens usually, and let's talk a little bit about
the trends at the moment because everyone in this industry absolutely loves new terminology for
old stuff. Okay, we're going to do microservices.
No, it's not, no, no, it's not CORBA. It's RPC, but it's not CORBA. We're going to call it Google something. Those sorts
of things happen. So we're going to do something new. It's going to be different.
We're following new trend. New trend at the moment is cloud native. Let's go cloud
native. Because then it's someone else's problem, right?
It's someone else's. If we've got people from AWS here, they'll tell you it's
their problem and then they'll explain why much of it is yours.
I'm glad he's laughing and not looking at me with evils, although I couldn't
tell. Okay, so you're going to go and do
something new. You're going to build something new perhaps. Or maybe you've got
an existing system that you need to improve the reliability of, either way you've
got something and you're going to try and do something cool. And chaos engineering is
in the same bracket. Right? Quick message. Can we stop
inventing terminology? We're terrible
at it. Okay, let's talk about some
of this stuff. Services,
I'm not going to ask you what you think of microservices because I don't know
and I teach a class on it. Very first thing I do is
say, I don't care how big they are. So what does
that leave us with? Services? We're doing SOA? No, we're not. We can't do that because that got really screwed up. So now we're doing something different.
We know it's something different, but we don't really know what it is, so we
call it services because that's a good idea.
That is not the worst, though. My personal nemesis
is we took a beautiful renaissance in data access
and persistence technologies, and we did a beautiful
number on it. We turned it from being a renaissance,
a wonderful realization that we had all of this pleasure of different ways of
storing and retrieving data. And we called it by what
it isn't: NoSQL. Worse than that, we didn't even call it by what it isn't. We called it by what it isn't some of the time,
and SQL was never the problem in the first place.
So, yes, what a wonderful group of namers we
are. Chaos engineering. Let's talk about that one.
Okay. How many people here have been in conversation and thought, for the 80th
time, I've now got to say, no, I'm not engineering
chaos. Because you're not.
That's easy. It's called delivery.
In some companies, it's called release
day. I work with one company, they said, we do continuous delivery.
We deliver continuously. Brilliant.
My job is done. This is easy. I'm here to help you do great software
development practices, and you're doing continuous delivery already. You must be very good at this.
He said, yes, we are. We continuously deliver
once a year,
continuously. Oh, man.
Anyway, so, naming things: chaos engineering. We're not engineering chaos. Let me explain what we are doing.
Okay. The problem with anything you're doing at the moment, what's the problem with any
of the systems we're working on, is if you go in something like cloud native,
there's a lot of moving parts. Now, this is a diagram that I have to
admit I haven't read all of, but it has an awful lot of stuff in
it that says, this is exactly how you approach cloud native. And there are other
things that you could approach. Waterfall's got plenty in there. If you're doing waterfall,
if you're doing continuous waterfall, a waterfall
is basically Arkansas. Anyway, it doesn't matter. So you've got all
those things that you could be doing, and there's a lot of moving parts in
anything you adopt in our industry. So there's lots of movement
and lots of possibilities for interesting surprises.
Okay. And even when you do everything right, because I imagine everyone
here does everything right all the time. I'm sure they
do. I do. No, I don't. Even when
you think you've done everything right, this is production.
Nothing like a good game of Thrones meme for production. It's dark,
it's full of terrors. Anyone who worked in operations can tell
you this. I can tell you exactly what that's like at 6:00 a.m. when it goes wrong. 6:00 a.m. on a Sunday that happens to be their birthday. That sort of level of 6:00 a.m. It's dark and it's full of terrors.
Okay. Why? If we do everything right,
surely we should be avoiding this. I get this a lot from business owners.
Look, I've spent a lot of money on really great engineering talent, and you're telling me we're still going to end up with that?
Yes, you are. It's inherent in what you're going to have. So why is that?
This, this is my analogy for production. This is my preferred one.
In the foreground is what you designed. You have a field; you can imagine bunnies running around that field. You can imagine
it being a calm, beautiful day. And the background there is
your users. In the background
is your cloud provider. In the background
is your administrators. It's all
of the factors that can be surprising in your world.
And I'm very careful with that word surprising because people use the word incident
a lot. And incident is a dangerous word
because it doesn't recognize the unpredictable natural occurrence
that it can be. Let me talk about that briefly. Right. Has anyone here
had that moment where a boss comes up to you and says, right,
you are responsible for this incident? Why did this
incident happen? Now if you're pretty professional,
your reaction to that would be, I screwed up.
We must have done something wrong. Okay, I must have an answer for why this incident happened. Therefore I must be able to tell you what we did wrong. Okay, don't use the word incident ever.
It's gone from your minds.
Use surprise.
Imagine that boss turning around to you this next time going,
why did that surprise happen?
It's a surprise. They do happen.
We could call it shit, but we're not allowed to. Sorry for the camera.
Okay, so we call them surprises. Refer to these things as surprise because they're surprising.
No one sits there and goes, yeah, I knew that incident was going to happen.
If they do, they're a sadist. Eject them from your team,
okay? Or talk to them about it. So this is what production
looks like. You've got turbulence in the background. That's your users, it's your admins,
it's you, it's everyone involved in your system. And when I refer to system,
I mean the whole sociotechnical system, I mean the people, the practices,
the processes, the infrastructure, the applications, the platforms,
the whole merry mess.
When you add all that up, I don't care if you're doing the simplest piece
of software in the world. Well, maybe if you're doing hello world, you're probably okay.
But if you're doing the simplest piece of production software that makes money, then you've
probably got a complex world and turbulence is part
of the game. And the beautiful thing about turbulence
and the chaos that goes with it is you can't predict it. You have to
react and get good at reacting to it.
Okay, so I'm going to completely abuse Cynefin now to
explain what sort of systems we're dealing with, just to make sure we all
understand the problem, the challenge we're facing. Number one,
you do not have an obvious system unless you are
writing hello world for a living. And if you are, fabulous, well done, keep doing it. Someone's paying you for that. You've got it
right somehow. Your career path has been brilliant.
Most of us don't get away with that. But it's really attractive to be down in obvious, because it's best practices. I can turn around to you and just say, just do this and you'll be better.
How wonderful is that? How many agile coaches
have you had? Turn around and go, just do this, you'll be better. By the
way, they shouldn't have the label agile coach at that point because they're not
really helping you or coaching you. But set that aside,
okay? We see that a lot in our industry. Just do this. It worked for
me, therefore it will work for you. I don't need to know
your context. Just do it. You'll be better.
So few examples of that in our world,
let alone in software. Okay, so forget that. Leave that behind.
Maybe you'd be forgiven, at a dinner party, to describe what you do for a living as complicated. What do you do? I work on complicated stuff, software. And you watch their eyes begin to glaze over, and they say, do you work on Netflix? No. Do you work at Apple?
No. Or do you work on bank trading systems
and you know they're gone. Right. And it's complicated what you do. You don't want
to explain it. Dinner party explanations are usually gross simplifications. So you'd be forgiven for thinking you're there. And maybe good practices would be a
good thing. Okay, you've got good practices. What a good practice
is for me is something you can apply. You could try
it and it will probably work. Experiment with
it, play with it, see if it works for your world, it's got a bit
of context sprinkled on like salt. Okay,
bad news for you though. If you're building any sort of distributed
system, and frankly you are, this is one of the few things I don't need
to see. You're running a production system. I don't care if it's a monolith,
parts of it are distributed. Unless you're actually embedding the database into your
code, which is relatively unlikely in production, but can
be done, then you have a distributed system of some sort.
And if you have any distribution that makes things more complex.
If you've got external dependencies, well, if you're running on the cloud,
you do. It's called the cloud. Everything there is an
external dependency. Well done. You've increased your
surface area for failure. God, I love Siri.
You've increased the surface area of failure.
Okay, so you're in complex, probably,
and that's okay. When you've got the sociotechnical system, all the people, the practices,
the processes, the distribution, the external dependencies, you're in complex. The difficulty with complex, we've all seen it. When someone comes in and says,
just do this to the system and it will be better. And it isn't,
you know, that better has to emerge over time.
You have to try something in there for probably an extended period
to see if it really does have the impact you want. You have to apply
such thought processes as systems thinking.
It's a harder thing to work with, and you know you can't just go in, tinker, and say it's better.
You have to assess it over time, verify it over time.
But I got bad news for you. You see, if we were just in
complex, we could still logically rationalize about it.
We could still look at it and go, with enough time, we could look at all the different pathways through it and prove it works under quite a lot of, if not all, conditions. The difficulty
we have is that usually if you're agile,
you have a system that should be evolving quickly,
frequently changing. Not those people I mentioned earlier who do continuous delivery
once a year. You usually are delivering
more frequently than that, which means your system is changing more frequently than
that. And again, when I say system, I don't just mean the software or
the technical aspects, I mean the people, the practices and processes that surround
it. All of that is changing frequently or can
be changing frequently. And when you've got that,
you are in chaos where the smallest change can have
a big impact. And even then you may not be able to assess all
of the impacts a change might have.
So with that level of complexity and chaos present in your
world, and I can almost guarantee if you're running production systems,
you have this, because if you have incidents, sorry,
surprises. If you have those, then you're
seeing the outcomes of it. Okay.
The difficulty here in chaos is novel change. You can do something and be surprised by how much better it is,
or worse, you can make that small, simple change to how people
operate and suddenly see a massive and dramatic impact on everything you've
done. So it's a difficult place to
be. So wouldn't it be great, right? Wouldn't it be great if we had a
way of engineering ourselves out of chaos?
Wouldn't it be great? Although we shouldn't call
it chaos engineering because that sounds like we're creating it,
right? So chaos engineering is engineering ourselves
out of chaos through practice.
So let's talk about that. It's about learning. Chaos engineering doesn't exist, shouldn't exist, unless you emphasize the need to learn. We're doing these
things for reasons. Actually,
recently, I've started to call chaos engineering by a slightly different
name. I've started to call it system verification.
I want to verify how my system reacts
to certain conditions because I don't know yet,
and I want some proof of it, and I want to be able to turn
that into actions where I might be able to experiment with improvements
on my world. So it's a learning loop. It's a double loop learning loop
because you're actually assessing your assumptions to begin with.
And this is my favorite. You've got to watch Good Omens. Anyone here watch Good Omens? Yes, absolutely. Everyone should watch it. It's brilliant. But, yeah,
the idea here, it's the old idea with chaos that the smallest change can cause
the most dramatic and surprising impacts. It's the stuff we deal
with, and it's in our systems all the time. Okay, you are working,
if you want a technical phrase, you're working with nonlinear dynamic systems.
That's the sort of stuff that gets you a PhD.
So it's useful to have that in mind, because if you say we're dealing with
a chaotic system, most people look at it and go, it's not really.
But if you say it's got properties of being nonlinear and dynamic,
then that makes more sense. Yes, it does. Pretty much inarguable
in most systems. The only systems I know of that don't follow these sorts of properties and are making people money tend to be systems that are no longer changing at all. They're very few,
and they're usually being retired rather quickly. But even when you're retiring a
system, you end up changing it. So that goes back here too.
Okay. And as engineers, one of the things we
love to know is that the system will work. We're not here to create
systems that don't work. Usually, maybe,
but that'd be an odd part of your degree.
And it's the same with physicists. We like to think under
these conditions we know what's going to happen next. But if you have a nonlinear,
dynamic system, you don't know what's going to happen next.
You cannot predict its behavior in a
variety of conditions. You have to apply the conditions
to see what happens.
So, incidents or surprises. Let's go back to the theme: being wrong. Now, it's obviously a bit tongue in cheek, because no one trains to
be wrong, do they? Interestingly enough,
I do a short module at Oxford University on this, and I call it the How to Be Wrong as a Software Developer course. Very few people come on it, and I try not to depress them in 10 seconds. But I essentially say everything
you know, everything you've been taught, you're going to hit the real world, it's going
to be wrong. You're going to find it's very different when you get out
there and things are icky, things will not work, people will not
listen. Doesn't matter how many times you turn around and say, this is a good
way of doing this, they won't listen. I still
have to now tell people to do TDD. I don't know why,
and sometimes I lose the will to live, but this is the way the real world is. Now, in that module, when I proposed it to Oxford, I said, I'm going to teach people how to be wrong. I was looked at like I was a madman.
Maybe I am, but I thought it was a great skill.
To me, it's a superpower. I am regularly wrong.
At university, I was supposed to never be wrong. You're rewarded for being
right all the time. I teach people how
to be wrong for a living and so let's talk about that.
So what is it to be wrong and why are we scared of
it? If you look at the dictionary definitions, it's about not being
correct or true. Already I'm not liking it.
You're being incorrect. Most of the students on my course, they sit
there and go, this is horrible. Now I don't want it. This is like a
cross versus a tick, okay?
Gets worse. It could be injurious, unfair or unjust.
Now borderline. You're being very naughty.
Okay, go one step further. Inflicting harm
without due provocation or just cause. Now you're hurting
somebody. This is what being wrong means to us in the english
language. And it gets one step worse because now the last
one is a violation or invasion of legal rights of another. Now you're
going to jail for it. So we've really packaged
being wrong as a terrible, terrible, terrible, terrible thing,
and yet it's a superpower. So let me explain why that is.
First of all, when are we ever wrong, right? I told you, you're all geniuses.
You're all here early doors for a chaos engineering day.
So clearly you're passionate about what you do for a living, and therefore
you're never wrong, ever.
It's a state of being mistaken or incorrect. So we know how it relates
back to the word wrong. And I've got some really bad news for you.
Dr. Russ has got some bad news for you.
Everyone is wrong all the time.
All the time. Every system built is wrong. I used to tell people, as a developer, I want you to have a mantra. I want you to have this in your head at all times: we don't know what we're doing, they don't know what they want, and that's normal. If you could just have that rather than
I know exactly what I'm building today, then you might
build things a little differently and you'll certainly embrace production slightly differently.
And that's what I want you to do. It's a mindset switch to be working
in chaos engineering, and it starts with recognizing we're always wrong.
Okay, don't take my word for it. Just go out there and
look for incidents. Yeah, you have to search for incidents. If you search for surprises
online, you get different answers. But if you search for incidents,
you might find them and you find a whole plethora of different things out
there. Some great incident reports, some of them are terrible. Some of them
read like, we knew what was happening. We knew
exactly how this was going to go from the moment it happened.
Those are called lies. They're not called incident reports. The best ones
are the ones that involve some sort of swearing and
a window on the people involved, the incredibly
complex cognitive processes that go around, oh, my goodness,
something's happened and we didn't see that coming. Those are the ones to look for.
Okay, but why is wrong scary? If we know it happens all the
time and we're going to be wrong, we are always wrong. That's why we have agile. Agility is about being able to
adjust. Why would you need to adjust? Because you're not going
to go straight to the answer straight away. So why
is wrong scary? Okay, risk, maybe it's risk.
We feel at risk. Risk is a terrible term, though. I mean, most people understand
risk to a certain degree, but risk has got a whole lot of baggage with
it. What we really care about is consequences.
When someone says you're worried about the risk, you're actually saying, well, yeah, that sounds like a nice phrasing of I don't want to go to prison. Yeah, I'm worried
about consequences for things I've done. And that's fair because
wrong is attached to consequences. We believe,
okay, and why are we so susceptible to it?
Two factors. You want to move quick.
Most of the companies I go to, I say, do you want to deliver things
faster? They say, yes, most of them, some don't. I have one
that said, no, we don't want to at all. That's a story for a conversation
over coffee later. But most of us want to move quickly, want feature velocity,
and we also want reliability because feature velocity without
reliability is nothing. I don't
know how quickly you can ship a product and say, look at this amazing thing,
and if it doesn't ever work for them, do you reckon you've got customers?
No, they always want both and that's fair.
Okay. And we think there's a conflict relationship. I've lost
count of the number of times I've heard we don't have time to do
testing. We're working on features.
What? Really? How do you know you've
got any? Oh, well, no one's complained
yet. Actually, I heard that once for someone telling me how they knew a system
was working. I said, how do you know your system is working right now?
And they said, our customers are not calling us.
I thought, what an impressive metric that is.
It's a real one, unless the phone systems are down.
But equally, don't you think you'd be a bit more proactive about this?
But anyway, we end up thinking these two things are in conflict now. It's a
great body of work out there now on this that proves this is not the
case. The good news is that there is no, absolutely no conflict
between these two factors. The faster you move,
the more reliable you can be.
In fact, the people that move the fastest are the ones that are also paying
down the reliability as they go. So this is,
don't take my word for it. Go and read a book called Accelerate, please.
Read it. If you haven't read it already. It's not really a bedtime book.
It's a scientific book. A lot of data, a lot of graphs.
Yes. So it depends. If you have trouble falling asleep, then it's
perfect. But no. Generally speaking, if you wanted an exciting
change of mind sort of thing before you go to bed, I won't recommend a
novel, but this is pretty good. You can always read Gene Kim's novels, of course, The Unicorn Project. That might send you to sleep as well, but with better dreams.
Okay, so it's actually the two things
working in tandem. And these are some of the charts that come from Accelerate.
I won't go over them in too much detail, but the point here
to make is that the faster you go. Let's look at one that really matters.
Yeah. The change failure rate. That one there. Okay. The faster
you go, which is the high performers, which is down the bottom.
The fact is that they put fewer failures into production as they go. So they've already got less to deal with than anybody else.
They'll move quicker. Okay,
so it's not versus. It's plus. Let's go through.
Okay, but I hear this a lot. But we're doing microservices. We've just invested in
microservices. And you could change services to anything you like. You could change it to.
But we're doing agile now, or the, one of the ones I love at the
moment. But we're digitally transformed now.
What does that even mean? We've transformed things
digitally. Yeah. Right. Okay.
But we're doing that. So we've got tests, gates,
pipelines, isolation. Oh, my. We've got the
lot. We are absolutely covered.
I've had this meeting, board level meetings
in banks. You'd think they knew, but chaos.
But no, we're covered. Believe me. We are doing everything.
And I'm sitting there going, you could still do everything and it will still have
failures. I actually asked a room full of bankers. I don't
know what that collective noun is. A room full of bankers.
I asked them, how many here have had a systems failure
in the last two months? Almost all the hands went up. Okay,
so we agree it happens. Then I asked, okay,
who here has got a disaster recovery program
that they've exercised in the last month? And a
lot of hands went up and said, yes, absolutely. And I said, how many of
you did it on purpose? Most of the hands went
down. So all of those disaster recovery plans are not exercised, apart from in the moment when you don't want to be exercising something new: the moment when you're surprised.
That's like getting on a plane and then saying, look, in the event of a
crash or something, wing it.
You know where the windows are, right? Get out. Or something like
that. It's just help yourself and hope.
Okay? There's a strong relationship between disasters,
incidents, surprises, and chaos engineering.
Okay, so here's a quick story for you, and I hope
this isn't a story you've heard many, many times before. Does anyone know who
that is? Margaret Hamilton.
This is her featured in Wired not long ago. She has a
great story, the beginning of the SRE book. There's a great story right in the
preface, which no one reads, by the way. When I wrote one of my books,
I put in the preface, it was a book on UML. Don't judge me. Everyone does things in their youth, right? I wrote a book on UML. In the preface, I said, UML is for sketching.
No one reads that. I still get emails from people saying, I can't make Rational Rose, or whatever it is, do the diagram you're showing me. I'm like, that's okay, because it's a tool that's not great at that.
So, yeah, people don't read prefaces, but there's a great thing in this preface of
the SRE book where it talks about Margaret. And that's the story I'm going to
share with you now, the story here is quite straightforward. I'm going to paraphrase quite
a bit of it. So Margaret worked as the lead,
essentially, on the system software for the Apollo spacecraft.
Fabulous job. But this story isn't about her. She is not the world's
first chaos engineer. The world's first chaos engineer was
Lauren Hamilton. Lauren Hamilton used to get taken to work sometimes,
and like any responsible parent, when your kid's there and you're trying to work, and I do this now, what do you do? You sit them there. You sit with them. You go, I've got to do some work now, can you go away and not bother me? And that's what Margaret did. She gave
Lauren a toy to play with: the Apollo mission simulator. What a lucky kid.
She had no idea how lucky she was. Okay, so she's playing
with this thing, and she does what any kid does with a new toy.
She breaks it. Breaks it almost immediately,
she finds a key combo, manages to flush the mission navigation information
out of it, and Margaret does
what anyone would do when a chaos engineer has done that, and says,
okay, we should learn from this. She goes to the NASA bigwigs. The NASA bigwigs turn around and say what every manager says when you say,
we found potentially a flaw in the system, maybe some technical debt, we want to
pay it down, what do they say? That'll never
happen. I'm glad you found that. That's fine.
But in their words, they said, our astronauts are made of the right stuff.
They've been trained for months, sometimes years,
to be perfect. They will never cause this failure.
Just because your daughter has created it and found it, surfaced it, it doesn't mean it'll ever happen, because we are perfect. The very next mission,
Murphy's law comes into effect. Murphy's law is a law of
software development and delivery. Don't ever kid yourselves. It's not
just a law of life. It's firmly embedded in our psyche.
So Murphy's law kicked into action, and this picture would never have happened, although technically, I suppose it could have happened, as the spacecraft shot past the moon, maybe. Because those well-trained, right-stuff astronauts, not long after takeoff, hit exactly that key combo and managed to flush the information out of the system. Fortunately, there was
a backup strategy because Margaret had learned from the situation,
she'd put something in place. She hadn't completely ignored the big wigs,
but she also had a compensating strategy in there
for it. So the world's first chaos engineer, at least I think it is, is Lauren Hamilton. The world's first resilience engineer, perhaps not the first, is Margaret, in this story at least.
Okay, but it gets worse for us, because you'd be forgiven for thinking you don't work on spacecraft. Anyone here
do? Oh, damn it. Okay. I've got a friend who
used to work on spacecraft, and I used to love that. I used to be
able to say, no one works on spacecraft. And you go, actually,
fair enough. But it gets worse for us because actually, our systems
are vastly more complex than a spacecraft,
bizarrely enough. And we have, because of that complexity,
something present. And I love the phrase dark debt.
The first time I used this phrase was on Halloween, and I can't tell you
how much fun I have with it. So dark debt is different than
technical debt. Technical debt is something, you know,
youre accruing usually. Okay, you can recognize it
and go, yeah, we knew we were doing that. We knew it has a shortcut.
That's technical debt. If you find that you've made a mistake, that's called surprise
debt. And that also is a factor. You can
then call it technical debt when you've recognized it. But generally speaking,
technical debt is something you plan to do.
Dark debt is the evil, evil, evil cousin of
technical debt, because dark debt is there. No matter
what you do, no matter how right you think you are,
it's in that system lurking,
and you don't see it. It's the unknown.
Unknowns. Okay,
so I got asked once, at this point in the talk,
how do I measure how much dark debt there is in my system?
Right. Let me explain again. It's unknown.
Unknown. It's not like dark matter
or dark energy that could be measured by the fact we don't know why the
measurements don't work. And so we can estimate there's an error and say there's
something there. No, dark debt is
completely unknown. It's completely hidden. It's camouflaged.
It looks like working software so
you can't measure it, but it's there. And there's enough
people that have done enough papers now to prove it's there.
Okay. Those sort of people are John Osborne.
Aaron was going to be here today, so I was going to point out that
Aaron got in early and liked that particular tweet, but unfortunately,
he couldn't make it today. So there's a whole group of
people that have found what this manifestation looks like. And it's the surprising
stuff. The bad news is you can't design it out because you don't know you're
designing it in. You can't sit there in a design
review meeting and go, aha, you're doing that wrong, I see some dark debt. That's not how it works. You can do what you think is absolutely everything right, and you will still be wrong.
So get used to it. We have a plan for that.
And over to the business in this regard. This is why they
care. 1 hour of downtime costs about $100,000 for 95%
of enterprise systems. Not even the jazzy stuff. This isn't your
Netflix. This isn't your Disney pluses or anything like that.
This is your Jira going down. This is your
email systems going down. This is your back office systems going down. This is nothing
sexy. This is how much it costs to a business if
things go down. And they don't like that, understandably.
And that's just in lost revenue and end user productivity.
It's got nothing to do with being sued. That's on top of
that. So there you go. That's what threatens the business,
is this feeling that there's a lot of risk.
And you've just told me there's dark debt, which is
more risk. And you've told me you actually can't see
it till it happens. Okay,
so the bad news is, forget Donald Trump. You're not covered.
We're not covered. You can't be covered for this,
but you can be better at dealing with it because
these systems tend to look like, even if you try and fully describe your systems,
there's lots of moving parts, there's lots of details. Rate of change is high.
Components, functions can change. This is just reinforcing
the fact that what you've got is hard to rationalize about.
Okay. Reactions to this risk avoidance.
Well, okay, we were going to do a cloud native transformation, but actually
now we're going to go back to the mainframe. Bad news,
mainframes are full of dark debt. They are. They wouldn't break if they weren't. So that's all there,
too. Doesn't matter what technologies you use. Don't be mistaken into thinking that this is just for the new stuff. This is
there. In any sufficiently complex or chaotic system,
blame is another beautiful one. Right? So I
was on stage a couple of years ago now, and I love this story because
this is really cool. I asked the entire room, and I
won't ask you now because it's an intimate question, but I asked,
who here would like to share a post mortem? Who here would like
to share an incident they were part of? And this
person, it was like I had offered a glass of water to someone
wandering across the sahara. They wanted to confess.
He shot his hand up. He then shot up himself. He ran to the stage. Now, I'm a very short individual, and this was in Norway, where there are not many short
individuals, he ran to the stage, and the stage
was about five foot off the ground. So I'm looking down,
thinking I could die. And he leapt onto the stage,
and he's up there, so I'm not going to get in the way.
I step right back and I go, after you. And he turns
around to the entire room, about 300 people, and says, it was
me.
At that moment, you ask yourself, what could
possibly go wrong next? Do his colleagues know?
Has he just done it? Is he going to get
arrested? Is he going to be sued? What's going to
happen now? And anyway, fortunately, his colleagues started laughing,
so I hope they knew. Anyway, he turned around to say, it was
me. I destroyed the master node of a grid computing cluster
and lost about three weeks of processing and research work for a university.
I did it. It was me. Okay,
step back from that. Your first reaction to that story,
to that person, would have gone something like this, I think it
would have gone well. There's an easy answer to this. He doesn't
get to touch the keyboard anymore. Done.
Problem solved. How do we make this not happen again? He doesn't
have a job. Okay, that's one reaction.
Not the best. The best reaction, though, is to step back
and go, hang on a second. Blame. He's just
blamed himself, which is why we shortcut to. He doesn't
want to need to touch the keyboard anymore. He's not allowed blame
shortcuts to. The solution is that person.
So whether you're blaming someone else or they're blaming themselves, the shortcut
is too sweet not to take. So that's why.
One of the reasons, just one of the reasons, we don't like blame.
The way the story went, though, is I asked him, okay, what did you do?
Show me the command you wrote. So he wrote it up on the screen,
and I showed him, said, show me the command you wanted to write.
He showed me the one he wanted to write.
One character difference between these two
commands. He didn't do it any wrong.
It was a trap. This was a grid.
How many people here check a grid to make sure it's right?
Was there any dialogue that said, are you sure that's
going to take down the whole cluster? No, nothing like that.
It was just a grid. Nonsensical grid. Kill grid.
Wow. Okay, you did. And it took the
whole thing down. And we build traps into our systems all
the time, and we don't know they're traps until we trip them.
So blame, though, leads us in
completely the wrong direction. The truth here is there's many,
many changes we could make to the systems to help avoid
the human, natural human error that always occurs.
We're human. We make errors. That's how we work.
And in this case, just a little bit of a dialogue that might have said, are you sure that's going to take down the world? would perhaps have helped him do the right thing.
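To make that idea a bit more concrete, here's a tiny illustrative sketch, nothing the team in that story actually had, of what such a dialogue could look like: a guard that forces a confirmation before a destructive cluster command runs. The command names are made up for illustration.

```python
import sys

# Hypothetical command names, invented for illustration: anything in this set
# can destroy shared state, so it must be confirmed before it runs.
DESTRUCTIVE = {"kill-grid", "drain-master", "wipe-queue"}


def run(command: str, args: list[str]) -> None:
    if command in DESTRUCTIVE:
        # The missing dialogue from the story: make the blast radius explicit
        # before a one-character typo becomes three weeks of lost work.
        answer = input(
            f"'{command}' will take down the whole cluster. "
            "Type the command name again to confirm: "
        )
        if answer.strip() != command:
            print("Aborted.", file=sys.stderr)
            return
    # Stand-in for dispatching to the real tooling.
    print(f"running: {command} {' '.join(args)}")


if __name__ == "__main__":
    if len(sys.argv) > 1:
        run(sys.argv[1], sys.argv[2:])
```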
Okay. There's a whole pantheon
of papers on blame, guilt, and everything that goes with it psychologically.
So please go out there and look at those. Okay. A bit of
reaction, though. Let's turn it around. Let's say being wrong is a
super skill. So this person has written a command and gone, oh, I've just killed
the entire cluster. Right. Say they were a chaos engineer.
Now switch to the mindset of chaos engineering. The mindset is different. The mindset
is the mindset of a motorcyclist.
Okay, so who here drives a car now, in London.
This can be iffy. All right. Most people drive cars legally,
I hope when you legally learn to drive
a car in this country, there are other countries where you don't have to do
this, but in this country usually they teach you to drive defensively.
What they say is when you get out on the roads, treat the world
as something that can't really see you. So drive
like everyone's blind and then you'll be okay. Give them the
space because you don't think they can see you. Even if they're looking at you
with their eyes, they're probably thinking about something else. So drive defensively
and you'll be better. And we all forget that about six weeks after
we passed. But that's what we're told usually is. Drive defensively.
Assume that everyone can't see you.
As a motorcyclist, you're taught something else. You are
taught that they can see you and they want you
dead.
It's not paranoia if it's true and
it keeps you alive longer. You literally ride like everything's
out to get you. That's one of the reasons that riding a motorcycle
is so calming. I know, sounds wrong, right?
But it's calming because you can't focus on anything else.
You can't think about what's happening at home, you can't think about what's happening at
work. You just have to stay alive and everything's
out to get you. And that's why it's so wonderfully zen to be
on a motorcycle. Okay, so you ride
along and everything, the weather, the little old lady with the little
dog, they're out to get you. And that's how you live longer.
That's the mindset of a chaos engineer. Every system is
out to get you. It's not passive. Production hates you.
It knows where you live, it knows when you're on a date,
knows when you're trying to sleep. It's a bit like Santa Claus,
but really evil. It's watching you at all times and
you're going to be wrong with it. So it's a threatening environment. You're going to
get things wrong. So you're going to make sure you have as many compensating practices
involved so that you can at least survive longer. Production can survive longer, you can survive longer, as a participant
in it. So being wrong is actually a key software
skill. And I'm going to invent a new methodology right here today.
Forget agile, we're going to do get better at being wrong.
Hang on, is that a rude acronym? No. Get better at being wrong. Okay. No, it's never good. I'm a big hater, I suppose, of all certified programs. So if anyone here is inclined to create a certified chaos engineering program, I will find you, and I know where I can hide the bodies. So,
yes, we don't need a certification program. So get better at being wrong is obviously
a joke, but we can make it safer to be wrong.
And that's what we're going to do with chaos engineering. We're making it safer for
us to get things wrong with our system. We know we care about
our system. We know we care what it can do, but we also know there
are conditions under which that might not happen.
So we're going to make it safer to be in those
situations. When that does happen,
we inject technical robustness. So I'm careful with my terminology here.
I hear a lot of people saying I'm going to make my system more resilient,
and they mean they're going to make their software more robust.
It's a slight technical difference. Robustness is for the stuff you know
can happen. Resilience is for the stuff you don't know is going
to happen. Resilience is the learning loop that says that when
that occurs, we know how to learn from it and get better at it.
We practice zero blame, because, as I say, blame is a shortcut to the
wrong answer in many, many cases.
So we go way beyond blame. We look at dark debt.
Dark debt is what we care about as technical chaos engineers.
We look for dark debt, we surface it. We surface evidence
of it. I had this recently. I was writing this book on chaos engineering, and one of the things I kept coming back to is,
why do we do chaos engineering? And I get this from users
as well. I've created this thing called the Chaos Toolkit with my best friend, an open source project. It helps you do experiments in chaos engineering,
but I get loads of companies saying, okay, we kind of get that.
What experiment should we run first? What should we do? We've got Kubernetes. What should we do, apart from run to the hills? Now, what should we do? I'm not a Kubernetes hater. I actually quite like Kubernetes. But, yeah, we've got Kubernetes. A lot of moving parts,
generally. How do we know that our system will survive
certain conditions? What experiments should we run?
And it's a fair question. And our usual response
is, what do you care about? Do you care about availability?
Do you care about durability? Do you care about the fact your home page is there or not, at all times, at most times? If you care about one of those things, you have an objective.
If you have an objective, you can start to measure that objective.
Okay, so we're starting there, right? If you can measure your
objective and you have a target, an objective for it being up
perhaps 99% of the time. Great. We've got the ability
to have a bit of leeway there. Okay, now we
can say, what conditions do you want to put the system under? What do you
think could happen? Could it be a pod dying? Well,
yeah, probably. Can your network degrade? If it can't,
I want that network. What sort of conditions do you want to get knowledge of, such that your objective will stay within its parameters of being okay, 99%, if that happens? Right, let's do that. Now, you notice there,
not once have I said chaos, not once have I said experiment. I haven't
really said test, although it is one. What I'm doing is verifying an objective. That's the language I'm now using when I help people define chaos experiments.
What is the overall business objective that you care about? Make sure you
can measure it. Right now we can look at introducing
conditions to the system, and we can see what impact on the
objective that might have.
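To make that loop concrete, here is a minimal, self-contained sketch of what verifying an objective can look like. It is not the Chaos Toolkit API, just the shape of the idea: measure the objective, apply a turbulent condition, measure again, and report whether the target held. The probe, the injected condition, and the 99% target are placeholders you would swap for your own.

```python
import random

TARGET_AVAILABILITY = 0.99  # the objective: home page up 99% of the time


def probe_home_page() -> bool:
    """Placeholder probe. In a real verification this would hit the home page URL."""
    return random.random() > 0.002


def measure_availability(samples: int = 500) -> float:
    """Measure the objective by sampling the probe repeatedly."""
    return sum(probe_home_page() for _ in range(samples)) / samples


def inject_condition(name: str) -> None:
    """Placeholder for the turbulent condition, e.g. killing a pod or degrading the network."""
    print(f"injecting condition: {name}")


def verify_objective(condition: str) -> bool:
    baseline = measure_availability()      # do we even meet the objective normally?
    inject_condition(condition)            # introduce the condition we care about
    under_stress = measure_availability()  # measure the objective again
    held = under_stress >= TARGET_AVAILABILITY
    print(
        f"baseline={baseline:.3f} under_stress={under_stress:.3f} "
        f"target={TARGET_AVAILABILITY} held={held}"
    )
    return held


if __name__ == "__main__":
    verify_objective("terminate one front-end pod")
```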
No one cares or wants chaos.
Everyone wants something else and wants to verify
it. So that's the language I tend to use for that, and that helps us with dark debt. In that book, what I say is chaos engineering provides evidence of system weaknesses.
Verification says what you found is useful because
it helps me with an objective.
Okay, so chaos engineering could be better phrased
as deliberately practicing being wrong. We're going to get
better at it and do it more proactively. We're going to verify
something all the time. We might do it continuously. There's that word again, overused in our industry. We're going to continuously verify our system with experiments to see how our system
responds, because there's something we care about about that system,
and we can measure it.
We can prepare for the undesirable circumstances by, in small ways,
making them happen all the time.
And that's what chaos engineering really is,
is deliberately practicing being wrong,
and it's an investment in resilience. Now, there's that other r word.
So, robustness, resilience, they often get confused.
You're resilient if when something unexpected happens, you can
respond to it, learn from it, and improve the system.
If you have an objective that is measurable and you do chaos
experiments against the system, and you can see what that impact on that objective
would be. Then you have all of the ammunition to analyze
and produce actions out of what you're doing, prioritized actions.
This is important because if you just broke a system, so say you did chaos
engineering in the smallest possible capacity, where you said, actually,
I'm just going to go in and I'm going to screw around with the pods,
see what happens. All right, you do that and
you go, okay, well, things didn't seem to go very
well, but we don't know what the impact was on what we care about,
and we don't know how to prioritize any work that we might do out the
back of it. Without the objective, you don't know
what to improve. You need to be able to
say, we did this, it affected this,
we care about this. Therefore, we can prioritize
these improvements on the back of that. This is why I tend to use
the phrase verification of a system rather than merely
turbulence and chaos,
because it enables the learning loop, it enables
us to be able to join things up and go, we know why we're doing
this stuff in the first place, and we know what we're actually going to do
in response to it. It tends to look like this.
So this is my very rudimentary how things happen. You have a normal
system, something goes kablooie.
Now, I love this detection, diagnosis and fix. Sounds so
calm, doesn't it? It's easy. In there is hell.
In there is someone being woken up at the wrong time of day. In there
is someone going, we don't know how to do this.
In there is someone staring out the window, wondering why they are still in this
job. Eventually you get round to something you can learn
from if you avoid the blame shortcut, for example,
and then you can improve the robustness of the system to something that is
now known. Now, in this case, it was an outage.
It was a surprise. It was a big deal. Wouldn't it
be great if we could do that in a predictive way? We can
convert it into pre mortem learning rather
than post mortem learning, which is chaos engineering.
And it looks like this, you do a game day or an automated chaos
experiment. There are only two things in chaos engineering, another reason we don't need a certification program. Two things we can do: game days, which are great fun. I've got so many stories about game
days, I don't know if I've got enough time to share them, but they are
great stories about game days. Okay. Automated chaos experiments.
Again, you're doing this in a small way. Small fires.
Small fires require fewer fire hoses,
detection, diagnosis, and fix. You can figure out how we react to it before it actually is a massive problem. Turn it into learning. Turn it
into robustness. You do it across the entire sociotechnical
system. And this way,
we can learn before outages happen, because we can't
design them out, but we can learn about them before they
become the big catastrophe that we're trying to avoid.
So this is the truth. Being wrong is a superpower. Well done.
As software engineers, we're wrong all the time. It's a good thing because
we can turn it into something great.
And you can have a platform for deliberate practice of being
wrong, which is a platform of chaos engineering, or verification, as I tend to call it. And you will.
Okay? So if you want to have a play, have a play with the Chaos Toolkit. That's a nice way to start. Lots of companies start there. You can do amazing stuff with it. People will talk about other tools later on. There's truckloads of tools out there in chaos engineering. It's fabulous.
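If you do have a play with the Chaos Toolkit, an experiment is just a declarative file. Here's a rough sketch of the shape one takes, written as a Python dict so it can be dumped to JSON; the URL, the label selector, and the chaosk8s action are illustrative assumptions, so check the Chaos Toolkit and chaostoolkit-kubernetes docs for the exact, current schema before running anything.

```python
import json

# A sketch of a Chaos Toolkit-style experiment: a steady-state hypothesis
# (the objective we verify), a method (the turbulent condition), and rollbacks.
experiment = {
    "version": "1.0.0",
    "title": "Home page stays available when a front-end pod dies",
    "description": "Verify the availability objective under a pod failure.",
    "steady-state-hypothesis": {
        "title": "Home page responds",
        "probes": [
            {
                "type": "probe",
                "name": "home-page-responds",
                "tolerance": 200,  # expect an HTTP 200
                "provider": {
                    "type": "http",
                    "url": "https://shop.example.com/",  # hypothetical URL
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "terminate-a-front-end-pod",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",  # from chaostoolkit-kubernetes
                "func": "terminate_pods",
                "arguments": {"label_selector": "app=front-end", "ns": "default"},
            },
        }
    ],
    "rollbacks": [],
}

if __name__ == "__main__":
    # Write the file that `chaos run experiment.json` would pick up.
    with open("experiment.json", "w") as f:
        json.dump(experiment, f, indent=2)
```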
The richness of open source and free chaos engineering tools is
a brilliant thing, and we're happy to be part of it. Frankly,
Chaos IQ is the thing I work on. If anyone wants to use something like
that, it's a verification tool. So it's a software as a
service verification tool. If anyone wants to give that a spin, come and give me
a shout. I'll be around till lunch. And then, unfortunately, I have to get on
a jet. That sounds really, really posh, doesn't it?
I have to get on with about 300 others on a jet. Let's point this
out. And I'm not at the front. Okay,
so these are the books out there, if you want to grab them. One of
them. I wrote one of them, actually, I wrote both of them. Both of those.
Great. There are other ones. There's a chaos engineering book by the
Netflix team, which is fabulous. There's also another chaos engineering book coming out later
on this year that I've contributed a chapter to, as well as a few others
speaking today. And out there in the wild, Learning Chaos Engineering, I'm told, is a fabulous place to get started in chaos engineering.
I'm totally unbiased in that. And, yeah, if anyone does actually
buy that, or if anyone asks me very, very nicely, I'll send you a copy.
Anyway, it's a 40 quid book.
I'll send it to you. And if you want a signature, you can.
Otherwise, thank you so much for being here today. Chaos engineering
is a superpower. I mentioned earlier, very quickly, that there are no best practices in our world where it's complex and chaotic. I lied. Chaos engineering is the only one
I know of. I can tell you. Just do it and
you'll probably be better.