Conf42 Chaos Engineering 2020 - Online

- premiere 5PM GMT

How to be WRONG


Abstract

Being wrong is often seen as the WORST THING THAT CAN HAPPEN™, especially when you’re building business-critical applications and services. ​ But the increased velocity of modern software development, plus the increased need for our systems to be resilient, reliable, and RIGHT has increased the pressure on developers exponentially. Never before have software owners had such an opportunity, or the power, to BE WRONG!

Summary

  • This talk is a slight adjustment on a talk I've been doing for a little while now. It's useful to sort of box chaos engineering and to talk about how we might go beyond it. At the end, I'm going to be saying, okay, everything you're doing is right, but there is a different way of packaging it.
  • I am quite happy with risk as a motorcyclist. I embrace risk on roads. Fire drills are a terrible analogy for chaos engineering: they make you complacent, making you think you'll be prepared when it actually happens.
  • A lady who had absolutely no ability in her brain to become complacent survived 9/11. When I talk today about how chaos engineering should be viewed, I call it a practice. Be very sensitive to complacency when you practice something.
  • Back in Tibet, I saw some wonderful things. It gives me several analogies for software delivery and system delivery. Trying to make something safer can often make it a lot more dangerous through complacency. And interestingly enough, there's lots of funny things with safety when it comes to chaos engineering.
  • Everyone in this industry absolutely loves new terminology for old stuff. The new trend at the moment is cloud native. Can we stop inventing terminology? We're terrible at it.
  • My personal nemesis is that we took a beautiful renaissance in data access and persistence technologies and did a beautiful number on it: we called it by what it isn't, NoSQL, and only by what it isn't some of the time. Now we're doing something different; we know it's something different, but we don't really know what it is, so we call it services.
  • Chaos engineering. How many people here have been in a conversation and thought, for the 80th time, I'm not engineering chaos? Because you're not. In some companies, it's called release day. I work with one company; they said, we do continuous delivery. Brilliant.
  • If you go into something like cloud native, there's a lot of moving parts. Even when you think you've done everything right, this is production. Don't use the word incident ever. Use surprise.
  • You do not have an obvious system unless you are writing hello world for a living. If you're building any sort of distributed system, and frankly you are, you're in complex territory; that's one of the few things I don't need to see your system to know. You have to apply thought processes such as systems thinking.
  • The smallest change can cause the most dramatic and surprising impacts. The only systems I know of that don't follow these sorts of properties and are still making people money tend to be systems that are no longer changing at all. We're not here to create systems that don't work.
  • I teach people how to be wrong for a living. To me, it's a superpower. I am regularly wrong. At university, I was supposed to never be wrong. I still have to tell people to do TDD.
  • Being wrong is a state of being mistaken or incorrect. Everyone is wrong all the time. Every system we build is wrong. It's a mindset switch to be working in chaos engineering. It starts with recognizing we're always wrong.
  • Most of us want to move quickly, want feature velocity, and we also want reliability. The faster you move, the more reliable you can be. There is no conflict between these two factors. Go and read a book called Accelerate.
  • There's a strong relationship between disasters, incidents, surprises, and chaos engineering. Companies say, believe me, we are doing everything; yet all of those disaster recovery plans are only exercised in the moment when you least want to be exercising something new.
  • Margaret Hamilton was featured in Wired not long ago. There's a great story about her right in the preface of the SRE book, but no one reads prefaces; I still get emails about my own UML book from people saying they can't make Rational Rose draw the diagrams in it.
  • The world's first chaos engineer was Lauren Hamilton, daughter of Margaret Hamilton, who worked on the system software for the Apollo spacecraft. Murphy's law is a law of software development and delivery. Because our systems are vastly more complex than a spacecraft, there is dark debt.
  • 1 hour of downtime costs about $100,000 for 95% of enterprise systems. Dark debt is there in any sufficiently complex or chaotic system, and blame is another common reaction. You can't be covered for this, but you can be better at dealing with it.
  • Blame leads us in completely the wrong direction. The best reaction is to step back and go, hang on a second. The mindset of chaos engineering is the mindset of a motorcyclist: everything is out to get you. There are many changes we could make to our systems to help avoid natural human error.
  • Every system is out to get you. Production hates you. It knows when you're on a date. It's watching you at all times and you're going to be wrong with it. Chaos engineering makes it safer for us to get things wrong with our system. We practice zero blame.
  • What is the overall business objective that you care about? Make sure you can measure it. Chaos engineering could be better phrased as deliberately practicing being wrong. You're resilient if when something unexpected happens, you can respond to it, learn from it, and improve the system.
  • There's truckloads of tools out there in chaos engineering. The richness of open source and free chaos engineering tools is a brilliant thing. Out there in the wild, the book Learning Chaos Engineering is a fabulous place to get started.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
It's pretty early, right? Feels early to me. I come to London once a year at the moment. This is the only time in 2020. And I had the issue of living on the south coast of England, which, if anyone's familiar with that, takes about as long as it takes to get to the east or west of London from here. About 2 hours door to door. So it's been an early start, which means I'm ready and very awake, and I imagine you may not be quite so. So I'm going to try and make this as easy a way of getting into the day as possible, because there's a lot of fantastic talks coming. Okay, I'm going to do that because I can't see anything unless I do that. Has anyone seen me speak before? Raise your hand if you have. Excellent. So some of you have seen me speak before. That's great. I can't completely reinvent myself then. Damn it. So I'll actually have to keep to the story. I can't see my slides yet. Is there a reason they're not up on here yet? I plugged everything in. They are now. They are brilliant. You're a star. Thank you. Right. So this talk is a slight adjustment on a talk I've been doing for a little while now. And it's pretty useful to sort of box chaos engineering and to talk about two things, really. One, what chaos engineering is and how we should approach it, and two, how we might go beyond it. That's the new part today. Now, I'm not saying that you've come to a chaos engineering conference and somebody's already saying, go beyond it. And you're probably sitting there going, I don't know what it is yet. Or if you do know what it is, you're probably going, I hope he says it's what I'm doing. But that's all good. I'm actually not going to say, oh, you do this. And then here's the really good part. At the end, I'm going to be saying, okay, everything you're doing is right, but there is a different way of packaging it that just might make it easier for your company to endorse and explore and be happy to put a budget on. So, okay, let's get going. Me, for those that don't know me: I am quite happy with risk as a motorcyclist. That's what I intend to embrace. I embrace risk on roads. Anyone here a motorcyclist? Excellent, brother. How you doing? Okay, so I ride a Harley Davidson. So I'm sort of a motorcyclist. It's not really a motorcycle, it's a couch on two wheels, but it's a lovely bike. And this was actually me on a much smaller bike. I was over in Tibet riding to Mount Everest. You have to do it on the north side, you have to do it in Tibet, because on the south side there is no road. So I rode a motorcycle to Mount Everest. And on my way to Mount Everest, I got some wonderful opportunities, some beautiful pictures. And all of them led me to a greater understanding of this troublesome activity that we're involved in called system delivery and system development. Someone's died next door. It's not as bad as that. I've had some great analogies in my talks where something goes on in the background, and we actually had a fire alarm go off once. And I thought, that's perfect for chaos engineering. It's a fire alarm and you just watch everyone's reaction. Quick detour: in a fire alarm, what do you think everyone's first reaction is? Check their watches. Is it 11:00 a.m.? If it is, it's a drill. Very few fires have spontaneously started on the dot of 11:00 a.m. So people usually go, oh, yeah, it's a drill. Even if it's not, even if they see smoke, it's a drill. So, yeah, I've had that. That's actually a good little starting point. Fire drills are a terrible analogy for chaos engineering.
It's often used: you practice these things because they matter, because when it actually happens, you'll be prepared. Does anyone know the basic problem with fire drills? They make you complacent. They make you think you know how it's all going to go. I'll tell you a very, very quick, tragic story. Very much. Very much, sir. I've got the most interesting heckler in the world. Brilliant. Don't normally get heckling in tech talks. So this is the story. This is about a lady who had absolutely no ability in her brain to become complacent. Whatever the wiring was in her brain, it could not become complacent. And she happened to work very, very high up in one of the twin towers. And obviously the fire alarm went off. The other tower had got a problem. Everyone was told to leave. She did. She got up, she walked the stairs, and she walked from the very top of the tower to the bottom. And she survived. While she was going there, everyone else on her floor didn't have this different wiring in their brain. What were they doing? They were finishing their phone calls. They were looking at their screens and wondering if it was for real. They were picking up their photos of the family. One person walked with her down a few flights of stairs and then came back because they'd forgotten something at their desk. They'd forgotten their laptop. None of those people, none of those people made it out of that building. So complacency is dangerous. When I talk today, I talk about how chaos engineering should be viewed. And I'm very careful, because I call it a practice. You're a practitioner of this, but be very sensitive to complacency when you practice something, because that's what happens in fire drills: we get so used to them going off, we literally go, right. All those things you're told absolutely to do or not do, don't pick up your laptop, just leave it all and go, and you won't, because you're complacent. So, yeah, that's the downside. That's the flip side, the nasty side of chaos engineering. Right. Back to the story. Back in Tibet, I saw some wonderful things. I saw this. Now, that looks like it's going along a plain, but actually it's going down a very, very steep hill. It's just one of those photos that's a bit confusing. The beautiful thing about this is it gives me several analogies for software delivery and system delivery. Number one, that's the best analogy for a software roadmap I've ever seen. And actually, I'd go further. Most software roadmaps I've got have got lots more dead ends, like side roads that go off a cliff. A lot of those, but they're not on this one. So it's not perfect, but it's not bad. The other thing that you can't see so well on this beauty is that some of those corners don't have tarmac on them. Now, I'm a motorcyclist. Your biggest fear about going around a corner, leaning over, is lack of grip. And they've taken the tarmac off on the corners. Okay, so why have they done that? It's because the water, in the rainy season, when there's lots of water, tunnels its way down and actually takes off that top layer. And so, to make it safer, they've removed it completely. Not sure about that thinking, but it was there. The other thing is, you go around those corners, and what they've done is, the Tibetans, in their absolute wisdom, have looked at this and gone, it's a bit dangerous. We need to let people know we've taken the tarmac off. What should we do? Should we create a sign maybe 100 yards before it? No, what we should do is
put little rocks, about that big, little boulders, around it: it's dangerous here. They've turned a mildly dangerous chunk of road into a lethal hazard. Brilliance. And we do that in software all the time. We go, oh, it's a bit difficult here, you mustn't change it. So what will we do? We're putting a readme in a repo somewhere. No one will see that. We're constantly leaving traps in our world. And interestingly enough, there's lots of funny things with safety when it comes to chaos engineering and how it relates to safety and how we're building safety in our systems. And this theory of making something safer means we can often make it a lot more dangerous through complacency, or, as in this case, rocks around a hazard. Anyway, that was part of it. I learned that. But what I learned even better is I learned what people think production looks like. Serene, beautiful. I took this photo from the foothills opposite Mount Everest. So lucky it was a clear day. Now, you could argue, if this was an analogy for production, there's a lot of cloud there. Hiding a lot of sins. That's also true, but that is what we think we've got. You must have seen the PowerPoint slides that someone puts in front of you and says, look, here's production. Three boxes, straight arrows. Gorgeous. Where's the complexity? It's not even there. It's perfect. And then I got the real analogy for production, though. So that was in front of me, and admittedly not immediately behind me, because when you get to see the picture, you'll realize that would be disgusting. But there was, behind me, about 20 or 30 meters behind, a good distance from it, by the way. Again, it makes sense in a minute. There was the real production, because this is what was behind me, some distance behind me. This is the toilet at Mount Everest, not to be gotten near for about 23 and a half hours of the day. There's one half hour period you can go near production. I mean, the toilet. And I will not tell you now why that half hour period exists, but needless to say, it has a lot to do with burnt yak poo. And it becomes safe. You can actually see it. It's completely and utterly: you could own that toilet all day long. And then the moment the yak poo goes on, the queue forms. The queue can be up to half a mile long. So, yeah, you don't get to the end. Okay, so that's the truth, right? Most people on stage don't explain what production is like. They say, oh yes, look what we did. I remember going to talks over and over again at big conferences where people say, this is production. We've done this, we've learned this, we've learned that. It's been wonderful, it's been great. They're lying. It's been hell, and they need to tell you that. And so I'm hoping that more do, and chaos engineering helps with that, because it makes our hell much more shareable and much more comparable. So let's get into it a bit more. Okay, so what happens usually, and let's talk a little bit about the trends at the moment, because everyone in this industry absolutely loves new terminology for old stuff. Okay, we're going to do microservices. No, it's not, no, no, it's not CORBA. It's RPC, but it's not CORBA. We're going to call it Google something. Those sorts of things happen. So we're going to do something new. It's going to be different. We're following a new trend. The new trend at the moment is cloud native. Let's go cloud native. Because then it's someone else's problem, right? It's someone else's.
If we've got people from AWS here, they'll tell you it's their problem, and then they'll explain why much of it is yours. I'm glad he's laughing and not looking at me with evils, although I couldn't tell. Okay, so you're going to go and do something new. You're going to build something new, perhaps. Or maybe you've got an existing system that you need to improve the reliability of. Either way, you've got something and you're going to try and do something cool. And chaos engineering is in the same bracket. Right? Quick message. Can we stop inventing terminology? We're terrible at it. Okay, let's talk about some of this stuff. Services. I'm not going to ask you what you think of microservices, because I don't know, and I teach a class on it. Very first thing I do is say, I don't care how big they are. So what does that leave us with? Services? We're doing SOA? No, we're not. We can't do that because that really got screwed up. So now we're doing something different. We know it's something different, but we don't really know what it is, so we call it services, because that's a good idea. That is not the worst, though. My personal nemesis is we took a beautiful renaissance in data access and persistence technologies, and we did a beautiful number on it. We turned it from being a renaissance, a wonderful realization that we had all of this pleasure of different ways of storing and retrieving data. And we called it by what it isn't: NoSQL. Worse than that, we didn't even call it by what it isn't. We call it by what it isn't some of the time, and SQL was never the problem in the first place. So, yes, what a wonderful group of namers we are. Chaos engineering. Let's talk about that one. Okay. How many people here have been in a conversation and thought, for the 80th time, I've now got to say, no, I'm not engineering chaos. Because you're not. That's easy. It's called delivery. In some companies, it's called release day. I work with one company, they said, we do continuous delivery. We deliver continuously. Brilliant. My job is done. This is easy. I'm here to help you do great software development practices, and you're doing continuous delivery already. You must be very good at this. He said, yes, we are. We continuously deliver once a year, continuously. Oh, man. Anyway, so naming things: chaos engineering. We're not engineering chaos. Let me explain what we are doing. Okay. The problem with anything you're doing at the moment, the problem with any of the systems we're working on, is that if you go into something like cloud native, there's a lot of moving parts. Now, this is a diagram that I have to admit I haven't read all of, but it has an awful lot of stuff in it that says, this is exactly how you approach cloud native. And there are other things that you could approach. Waterfall's got plenty in there. If you're doing waterfall, if you're doing continuous waterfall, a waterfall is basically Arkansas. Anyway, it doesn't matter. So you've got all those things that you could be doing, and there's a lot of moving parts in anything you adopt in our industry. So there's lots of movement and lots of possibilities for interesting surprises. Okay. And even when you do everything right, because I imagine everyone here does everything right all the time. I'm sure they do. I do. No, I don't. Even when you think you've done everything right, this is production. Nothing like a good Game of Thrones meme for production. It's dark, it's full of terrors. Anyone who's worked in operations can tell you this.
I can tell you exactly what that's like at 06:00 a.m. when it goes wrong. 06:00 a.m. on a Sunday that happens to be their birthday. That sort of level of 06:00 a.m. It's dark and it's full of terrors. Okay. Why? If we do everything right, surely we should be avoiding this. I get this a lot from business owners. Look, I've spent a lot of money on really great engineering talent, and you're telling me we're still going to end up with that? Yes, you are. It's essential in what you're going to have. So why is that? This, this is my analogy for production. This is my preferred one. In the foreground is what you designed. You have a field; you can imagine bunnies running around that field. You can imagine it being a calm, beautiful day. And in the background there is your users. In the background is your cloud provider. In the background is your administrators. It's all of the factors that can be surprising in your world. And I'm very careful with that word surprising, because people use the word incident a lot. And incident is a dangerous word, because it doesn't recognize the unpredictable natural occurrence that it can be. Let me talk about that briefly. Right. Has anyone here had that moment where a boss comes up to you and says, right, you are responsible for this incident? Why did this incident happen? Now, if you're pretty professional, your reaction to that would be: I screwed up. We must have done something wrong. Okay, I must have an answer for why this incident happened. Therefore I must be able to tell you what we did wrong. Okay, don't use the word incident ever. It's gone from your minds. Use surprise. Imagine that boss turning around to you the next time going, why did that surprise happen? It's a surprise. They do happen. We could call it shit, but we're not allowed to. Sorry for the camera. Okay, so we call them surprises. Refer to these things as surprises, because they're surprising. No one sits there and goes, yeah, I knew that incident was going to happen. If they do, they're a sadist. Eject them from your team, okay? Or talk to them about it. So this is what production looks like. You've got turbulence in the background. That's your users, it's your admins, it's you, it's everyone involved in your system. And when I refer to system, I mean the whole sociotechnical system. I mean the people, the practices, the processes, the infrastructure, the applications, the platforms, the whole merry mess. When you add all that up, I don't care if you're doing the simplest piece of software in the world. Well, maybe if you're doing hello world, you're probably okay. But if you're doing the simplest piece of production software that makes money, then you've probably got a complex world, and turbulence is part of the game. And the beautiful thing about turbulence and the chaos that goes with it is you can't predict it. You have to react and get good at reacting to it. Okay, so I'm going to completely abuse Cynefin now to explain what sort of systems we're dealing with, just to make sure we all understand the problem, the challenge we're facing. Number one, you do not have an obvious system unless you are writing hello world for a living. And if you are, fabulous, well done, keep doing it. Someone's paying you for that. You've got it right somehow. Your career path has been brilliant. Most of us don't get away with that. But it's really attractive to be down in obvious, because it's best practices. I can turn around to you and just say, just do this and you'll be better.
How wonderful is that? How many agile coaches have you had turn around and go, just do this, you'll be better? By the way, they shouldn't have the label agile coach at that point, because they're not really helping you or coaching you. But set that aside, okay? We see that a lot in our industry. Just do this. It worked for me, therefore it will work for you. I don't need to know your context. Just do it. You'll be better. So few examples of that in our world, let alone in software. Okay, so forget that. Leave that behind. Maybe you'd be forgiven, at a dinner party, for describing what you do for a living as complicated. What do you do? I work on complicated stuff, software. And you watch their eyes, and they say, do you work on Netflix? No. Do you work at Apple? No. Or do you work on bank trading systems? And you know they're gone. Right. And it's complicated, what you do. You don't want to explain it. Dinner party explanations are usually gross simplifications. So you'd be forgiven for thinking you're there. And maybe good practices would be a good thing. Okay, you've got good practices. What a good practice is, for me, is something you can apply. You could try it and it will probably work. Experiment with it, play with it, see if it works for your world. It's got a bit of context sprinkled on, like salt. Okay, bad news for you, though. If you're building any sort of distributed system, and frankly you are, this is one of the few things I don't need to see your system to know. You're running a production system. I don't care if it's a monolith, parts of it are distributed. Unless you're actually embedding the database into your code, which is relatively unlikely in production, but can be done, then you have a distributed system of some sort. And if you have any distribution, that makes things more complex. If you've got external dependencies, well, if you're running on the cloud, you do. It's called the cloud. Everything there is an external dependency. Well done. You've increased your surface area for failure. God, I love Siri. You've increased the surface area of failure. Okay, so you're in complex, probably, and that's okay. When you've got the sociotechnical system, all the people, the practices, the processes, the distribution, the external dependencies, you're in complex. The difficulty with complex: we've all seen it. When someone comes in and says, just do this to the system and it will be better. And it isn't. You know that better has to emerge over time. You have to try something in there for probably an extended period to see if it really does have the impact you want. You have to apply such thought processes as systems thinking. It's a harder thing to work with, and you know you can't just go in, tinker, and say it's better. You have to assess it over time, verify it over time. But I've got bad news for you. You see, if we were just in complex, we could still logically rationalize about it. We could still look at it and go, with enough time, we could look at all the different pathways through it and prove it works under quite a lot, if not all, conditions. The difficulty we have is that usually, if you're agile, you have a system that should be evolving quickly, frequently changing. Not those people I mentioned earlier who do continuous delivery once a year. You usually are delivering more frequently than that, which means your system is changing more frequently than that.
And again, when I say system, I don't just mean the software or the technical aspects, I mean the people, the practices and processes that surround it. All of that is changing frequently, or can be changing frequently. And when you've got that, you are in chaos, where the smallest change can have a big impact. And even then you may not be able to assess all of the impacts a change might have. So with that level of complexity and chaos present in your world, and I can almost guarantee if you're running production systems, you have this, because if you have incidents, sorry, surprises, if you have those, then you're seeing the outcomes of it. Okay. The difficulty here in chaos is that changes are novel. You can do something and be surprised by how much better it is, or, worse, you can make that small, simple change to how people operate and suddenly see a massive and dramatic impact on everything you've done. So it's a difficult place to be. So wouldn't it be great, right? Wouldn't it be great if we had a way of engineering ourselves out of chaos? Wouldn't it be great? Although we shouldn't call it chaos engineering, because that sounds like we're creating it, right? So chaos engineering is engineering ourselves out of chaos through practice. So let's talk about that. It's about learning. Chaos engineering doesn't exist, shouldn't exist, unless you emphasize the need to learn. We're doing these things for reasons. Actually, recently, I've started to call chaos engineering by a slightly different name. I've started to call it system verification. I want to verify how my system reacts to certain conditions, because I don't know yet, and I want some proof of it, and I want to be able to turn that into actions where I might be able to experiment with improvements on my world. So it's a learning loop. It's a double loop learning loop, because you're actually assessing your assumptions to begin with. And this is my favorite. You've got to watch Good Omens. Anyone here watch Good Omens? Yes, absolutely. Everyone should watch it. It's brilliant. But, yeah, the idea here, it's the old idea with chaos that the smallest change can cause the most dramatic and surprising impacts. It's the stuff we deal with, and it's in our systems all the time. Okay, you are working, if you want a technical phrase, you're working with nonlinear dynamic systems. That's the sort of stuff that gets you a PhD. So it's useful to have that in mind, because if you say we're dealing with a chaotic system, most people look at it and go, it's not really. But if you say it's got properties of being nonlinear and dynamic, then that makes more sense. Yes, it does. Pretty much inarguable in most systems. The only systems I know of that don't follow these sorts of properties and are still making people money tend to be systems that are no longer changing at all. They're very few, and they're usually being retired rather quickly. But even when you're retiring a system, you end up changing it. So that goes back here too. Okay. And as engineers, one of the things we love to know is that the system will work. We're not here to create systems that don't work. Usually, maybe, but that'd be an odd part of your degree. And it's the same with physicists. We like to think under these conditions we know what's going to happen next. But if you have a nonlinear, dynamic system, you don't know what's going to happen next. You cannot predict its behavior in a variety of conditions. You have to apply the conditions to see what happens. So, incidents, or surprises.
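To make that "nonlinear and dynamic" point concrete, here is a minimal sketch, not from the talk itself, using the logistic map, a textbook nonlinear system: two starting states that differ by one part in a million quickly end up in completely different places, which is exactly why you cannot predict behaviour and instead have to apply the conditions and observe.

    # Minimal sketch, not from the talk: the logistic map x -> r * x * (1 - x)
    # is a classic nonlinear dynamic system. Two starting points that differ by
    # one part in a million diverge completely within a few dozen steps.
    def logistic_trajectory(x, r=4.0, steps=30):
        values = []
        for _ in range(steps):
            x = r * x * (1 - x)
            values.append(x)
        return values

    a = logistic_trajectory(0.400000)
    b = logistic_trajectory(0.400001)  # the "smallest change" to the starting state
    for step in (5, 10, 20, 29):
        print(f"step {step}: {a[step]:.4f} vs {b[step]:.4f}")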
Let's go back to the theme: being wrong. Now, it's obviously a bit tongue in cheek, because no one trains to be wrong, do they? Interestingly enough, I do a short module at Oxford University on this, and I call it the How to be Wrong as a Software Developer course. Very few people come on it, and I try not to depress them in 10 seconds. But I essentially say: everything you know, everything you've been taught, you're going to hit the real world, and it's going to be wrong. You're going to find it's very different when you get out there and things are icky. Things will not work, people will not listen. Doesn't matter how many times you turn around and say, this is a good way of doing this, they won't listen. I still have to tell people, now, to do TDD. I don't know why, and sometimes I lose the will to live, but this is the way the real world is. Now, in that module, when I proposed it to Oxford, I said, I'm going to teach people how to be wrong. I was looked at like I was a madman. Maybe I am, but I thought it was a great skill. To me, it's a superpower. I am regularly wrong. At university, I was supposed to never be wrong. You're rewarded for being right all the time. I teach people how to be wrong for a living, and so let's talk about that. So what is it to be wrong, and why are we scared of it? If you look at the dictionary definitions, it's about not being correct or true. Already I'm not liking it. You're being incorrect. Most of the students on my course, they sit there and go, this is horrible. Now I don't want it. This is like a cross versus a tick, okay? Gets worse. It could be injurious, unfair or unjust. Now, borderline, you're being very naughty. Okay, go one step further. Inflicting harm without due provocation or just cause. Now you're hurting somebody. This is what being wrong means to us in the English language. And it gets one step worse, because now the last one is a violation or invasion of the legal rights of another. Now you're going to jail for it. So we've really packaged being wrong as a terrible, terrible, terrible, terrible thing, and yet it's a superpower. So let me explain why that is. First of all, when are we ever wrong, right? I told you, you're all geniuses. You're all here early doors for a chaos engineering day. So clearly you're passionate about what you do for a living, and therefore you're never wrong, ever. It's a state of being mistaken or incorrect. So we know how it relates back to the word wrong. And I've got some really bad news for you. Dr. Russ has got some bad news for you. Everyone is wrong all the time. All the time. Every system we build is wrong. I used to tell people, as a developer, I want you to have a mantra. I want you to have this in your head at all times: we don't know what we're doing, they don't know what they want, and that's normal. If you could just have that, rather than, I know exactly what I'm building today, then you might build things a little differently, and you'll certainly embrace production slightly differently. And that's what I want you to do. It's a mindset switch to be working in chaos engineering, and it starts with recognizing we're always wrong. Okay, don't take my word for it. Just go out there and look for incidents. Yeah, you have to search for incidents. If you search for surprises online, you get different answers. But if you search for incidents, you might find them, and you find a whole plethora of different things out there. Some great incident reports; some of them are terrible.
Some of them read like: we knew what was happening. We knew exactly how this was going to go from the moment it happened. Those are called lies. They're not called incident reports. The best ones are the ones that involve some sort of swearing and a window on the people involved, the incredibly complex cognitive processes that go around, oh, my goodness, something's happened and we didn't see that coming. Those are the ones to look for. Okay, but why is wrong scary? If we know it happens all the time and we're going to be wrong, we are always wrong. That's why we have agile: agility is about being able to adjust. Why would you need to adjust? Because you're not going to go straight to the answer straight away. So why is wrong scary? Okay, risk. Maybe it's risk. We feel at risk. Risk is a terrible term, though. I mean, most people understand risk to a certain degree, but risk has got a whole lot of baggage with it. What we really care about is consequences. When someone says you're worried about the risk, you're actually saying, well, yeah, that sounds like a nice phrasing of: I don't want to go to prison. Yeah, I'm worried about consequences for things I've done. And that's fair, because wrong is attached to consequences, we believe. Okay, and why are we so susceptible to it? Two factors. You want to move quick. Most of the companies I go to, I say, do you want to deliver things faster? They say yes. Most of them; some don't. I have one that said, no, we don't want to at all. That's a story for a conversation over coffee later. But most of us want to move quickly, want feature velocity, and we also want reliability, because feature velocity without reliability is nothing. I don't know how quickly you can ship a product and say, look at this amazing thing, but if it doesn't ever work for them, do you reckon you've got customers? No, they always want both, and that's fair. Okay. And we think there's a conflict relationship. I've lost count of the number of times I've heard: we don't have time to do testing. We're working on features. What? Really? How do you know you've got any? Oh, well, no one's complained yet. Actually, I heard that once from someone telling me how they knew a system was working. I said, how do you know your system is working right now? And they said, our customers are not calling us. I thought, what an impressive metric that is. It's a real one, unless the phone systems are down. But equally, don't you think you'd be a bit more proactive about this? But anyway, we end up thinking these two things are in conflict. Now, there's a great body of work out there now on this that proves this is not the case. The good news is that there is no, absolutely no conflict between these two factors. The faster you move, the more reliable you can be. In fact, the people that move the fastest are the ones that are also paying down the reliability as they go. So don't take my word for it. Go and read a book called Accelerate, please. Read it, if you haven't read it already. It's not really a bedtime book. It's a scientific book. A lot of data, a lot of graphs. Yes. So it depends. If you have trouble falling asleep, then it's perfect. But no. Generally speaking, if you wanted an exciting change-of-mind sort of thing before you go to bed, I won't recommend a novel, but this is pretty good. You can always read Gene Kim's novels, of course, The Unicorn Project. That might send you to sleep as well, but with better dreams. Okay, so it's actually the two things working in tandem.
And these are some of the charts that come from Accelerate. I won't go over them in too much detail, but the point here to make is that the faster you go. Let's look at one that really matters. Yeah. The change failure rate. That one there. Okay. The faster you go, which is the high performers, which is down the bottom, the fewer failures they put into production as they go. So they've already got less to deal with than anybody else. They'll move quicker. Okay, so it's not versus. It's plus. Let's go through. Okay, but I hear this a lot. But we're doing microservices. We've just invested in microservices. And you could change services to anything you like. You could change it to: but we're doing agile now. Or, one of the ones I love at the moment: but we're digitally transformed now. What does that even mean? We've transformed things digitally. Yeah. Right. Okay. But we're doing that. So we've got tests, gates, pipelines, isolation. Oh, my. We've got the lot. We are absolutely covered. I've had this meeting, board level meetings in banks. You'd think they knew, but no, we're covered. Believe me, we are doing everything. And I'm sitting there going, you could still do everything and it will still have failures. I actually asked a room full of bankers, I don't know what that collective noun is, a room full of bankers. I asked them, how many here have had a systems failure in the last two months? Almost all the hands went up. Okay, so we agree it happens. Then I asked, okay, who here has got a disaster recovery program that they've exercised in the last month? And a lot of hands went up and said, yes, absolutely. And I said, how many of you did it on purpose? Most of the hands went down. So all of those disaster recovery plans are not exercised, apart from in the moment when you don't want to be exercising something new. The moment when you're surprised. That's like getting on a plane and then saying, look, in the event of a crash or something, wing it. You know where the windows are, right? Get out. Or something like that. It's just help yourself and hope. Okay? There's a strong relationship between disasters, incidents, surprises, and chaos engineering. Okay, so here's a quick story for you, and I hope this isn't a story you've heard many, many times before. Does anyone know who that is? Margaret Hamilton. This is her featured in Wired not long ago. There's a great story about her at the beginning of the SRE book, right in the preface, which no one reads, by the way. When I wrote one of my books, I put it right in the preface. It was a book on UML. Don't judge me. Everyone does things in their youth, right? I wrote a book on UML. In the preface, I said, UML is for sketching. No one reads that. I still get emails from people saying, I can't make Rational Rose, or whatever it is, do the diagram you're showing me. I'm like, that's okay, because it's a tool that's not great at that. So, yeah, people don't read prefaces. But there's a great thing in this preface of the SRE book where it talks about Margaret. And that's the story I'm going to share with you now. The story here is quite straightforward. I'm going to paraphrase quite a bit of it. So Margaret worked as the lead, essentially, on the system software for the Apollo spacecraft. Fabulous job. But this story isn't about her. She is not the world's first chaos engineer. The world's first chaos engineer was Lauren Hamilton.
Lauren Hamilton used to get taken to work sometimes, and like any responsible parent, when your kid's there and you're trying to work, and I do this now, what do you do? You sit them there. You go, I've got to do some work now, can you go away and not bother me? And that's what Margaret did. She gave Lauren a toy to play with: the Apollo mission simulator. What a lucky kid. She had no idea how lucky she was. Okay, so she's playing with this thing, and she does what any kid does with a new toy. She breaks it. Breaks it almost immediately. She finds a key combo, manages to flush the mission navigation information out of it, and Margaret does what anyone would do when a chaos engineer has done that: she says, okay, we should learn from this. She goes to the NASA bigwigs. The NASA bigwigs turn around and say what every manager says when you say, we found potentially a flaw in the system, maybe some technical debt, we want to pay it down. What do they say? That'll never happen. I'm glad you found that. That's fine. But in their words, they said, our astronauts are made of the right stuff. They've been trained for months, sometimes years, to be perfect. They will never cause this failure. Just because your daughter has created it and surfaced it, it doesn't mean it'll ever happen, because we are perfect. The very next mission, Murphy's law comes into effect. Murphy's law is a law of software development and delivery. Don't ever kid yourselves. It's not just a law of life. It's firmly embedded in our psyche. So Murphy's law kicked into action, and this picture would never have happened, although technically, I suppose it could have happened, but with the spacecraft shooting past the moon, maybe, because those well trained, right-stuffed astronauts, not long after takeoff, hit exactly that key combo and managed to flush the information out of the system. Fortunately, there was a backup strategy, because Margaret had learned from the situation. She'd put something in place. She hadn't completely ignored the bigwigs, but she also had a compensating strategy in there for it. So the world's first chaos engineer, at least I think it is, is Lauren Hamilton. The world's first resilience engineer? Perhaps not the first, but it is Margaret, in this story at least. Okay, but it gets worse for us, because you'd be forgiven for thinking you don't work on spacecraft. Anyone here do? Oh, damn it. Okay. I've got a friend who used to work on spacecraft, and I used to love that. I used to be able to say, no one works on spacecraft. And you go, actually, fair enough. But it gets worse for us, because actually, our systems are vastly more complex than a spacecraft, bizarrely enough. And we have, because of that complexity, something present. And I love the phrase dark debt. The first time I used this phrase was on Halloween, and I can't tell you how much fun I have with it. So dark debt is different than technical debt. Technical debt is something you know you're accruing, usually. Okay, you can recognize it and go, yeah, we knew we were doing that. We knew it was a shortcut. That's technical debt. If you find that you've made a mistake, that's called surprise debt. And that also is a factor. You can then call it technical debt when you've recognized it. But generally speaking, technical debt is something you plan to do. Dark debt is the evil, evil, evil cousin of technical debt, because dark debt is there.
No matter what you do, no matter how right you think you are, it's in that system, lurking, and you don't see it. It's the unknown unknowns. Okay, so I got asked once, at this point in the talk, how do I measure how much dark debt there is in my system? Right. Let me explain again. It's unknown. Unknown. It's not like dark matter or dark energy, which can be measured by the fact we don't know why the measurements don't work, and so we can estimate there's an error and say there's something there. No, dark debt is completely unknown. It's completely hidden. It's camouflaged. It looks like working software, so you can't measure it, but it's there. And there's enough people that have done enough papers now to prove it's there. Okay. Those sorts of people are John Osborne. Aaron was going to be here today, so I was going to point out that Aaron got in early and liked that particular tweet, but unfortunately, he couldn't make it today. So there's a whole group of people that have found what this manifestation looks like. And it's the surprising stuff. The bad news is you can't design it out, because you don't know you're designing it in. You can't sit there in a design review meeting and go, aha, you're doing that wrong. I see some dark debt. That's not how it works. You can do what you think is absolutely everything right, and you will still be wrong. So get used to it. We have a plan for that. And over to the business in this regard: this is why they care. 1 hour of downtime costs about $100,000 for 95% of enterprise systems. Not even the jazzy stuff. This isn't your Netflix. This isn't your Disney Plus or anything like that. This is your Jira going down. This is your email systems going down. This is your back office systems going down. This is nothing sexy. This is how much it costs a business if things go down. And they don't like that, understandably. And that's just in lost revenue and end user productivity. It's got nothing to do with being sued. That's on top of that. So there you go. That's what threatens the business: this feeling that there's a lot of risk. And you've just told me there's dark debt, which is more risk. And you've told me you actually can't see it till it happens. Okay, so the bad news is, forget Donald Trump. You're not covered. We're not covered. You can't be covered for this, but you can be better at dealing with it. Because these systems tend to look like this: even if you try and fully describe your systems, there's lots of moving parts, there's lots of details. Rate of change is high. Components, functions can change. This is just reinforcing the fact that what you've got is hard to rationalize about. Okay. Reactions to this: risk avoidance. Well, okay, we were going to do a cloud native transformation, but actually now we're going to go back to the mainframe. Bad news: mainframes are full of dark debt. They are. They wouldn't break if they weren't. So that's all there, too. Doesn't matter what technologies you use. Don't be mistaken into thinking that this is just for the new stuff. This is there in any sufficiently complex or chaotic system. Blame is another beautiful one. Right? So I was on stage a couple of years ago now, and I love this story because this is really cool. I asked the entire room, and I won't ask you now because it's an intimate question, but I asked, who here would like to share a post mortem? Who here would like to share an incident they were part of?
And this person, it was like I had offered a glass of water to someone wandering across the Sahara. They wanted to confess. He shot his hand up. He then shot up himself. He ran to the stage. Now, I'm a very short individual, and this was in Norway, where there are not many short individuals. He ran to the stage, and the stage was about five foot off the ground. So I'm looking down, thinking, I could die. And he leapt onto the stage, and he's up there, so I'm not going to get in the way. I step right back and I go, after you. And he turns around to the entire room, about 300 people, and says: it was me. At that moment, you ask yourself, what could possibly go wrong next? Do his colleagues know? Has he just done it? Is he going to get arrested? Is he going to be sued? What's going to happen now? And anyway, fortunately, his colleagues started laughing, so I hope they knew. Anyway, he turned around to say, it was me. I destroyed the master node of a grid computing cluster and lost about three weeks of processing and research work for a university. I did it. It was me. Okay, step back from that. Your first reaction to that story, to that person, would have gone something like this, I think. It would have gone: well, there's an easy answer to this. He doesn't get to touch the keyboard anymore. Done. Problem solved. How do we make this not happen again? He doesn't have a job. Okay, that's one reaction. Not the best. The best reaction, though, is to step back and go, hang on a second. Blame. He's just blamed himself, which is why we shortcut to: he doesn't need to touch the keyboard anymore, he's not allowed. Blame shortcuts to: the solution is that person. So whether you're blaming someone else or they're blaming themselves, the shortcut is too sweet not to take. So that's why, one of the reasons, just one of the reasons, we don't like blame. The way the story went, though, is I asked him, okay, what did you do? Show me the command you wrote. So he wrote it up on the screen, and I said, show me the command you wanted to write. He showed me the one he wanted to write. One character difference between these two commands. He didn't do anything wrong. It was a trap. This was a grid. How many people here check a grid command to make sure it's right? Was there any dialogue that said, are you sure that's going to take down the whole cluster? No, nothing like that. It was just a grid command. Nonsensical grid. Kill grid. Wow. Okay, you did. And it took the whole thing down. And we build traps into our systems all the time, and we don't know they're traps until we trip them. So blame, though, leads us in completely the wrong direction. The truth here is there's many, many changes we could make to the systems to help avoid the natural human error that always occurs. We're human. We make errors. That's how we work. And in this case, just a little bit of a dialogue that might have said, are you sure that's going to take down the world, would have perhaps helped him do the right thing. Okay. There's a whole pantheon of papers on blame, guilt, and everything that goes with it psychologically. So please go out there and look at those. Okay. A bit of a reaction, though. Let's turn it around. Let's say being wrong is a super skill. So this person has written a command and gone, oh, I've just killed the entire cluster. Right. Say they were a chaos engineer. Now switch to the mindset of chaos engineering. The mindset is different. The mindset is the mindset of a motorcyclist.
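Going back to the grid command for a moment: the actual tooling in that story isn't named, so the following is a purely hypothetical sketch of the kind of "are you sure?" dialogue being described, where a destructive operation refuses to run on a one-character slip and demands deliberate confirmation first.

    # Hypothetical sketch only; the real grid tooling in the story isn't named.
    # The point is that a destructive command should not succeed on a one-character
    # slip: it should demand deliberate confirmation before doing anything fatal.
    import sys

    def confirm_destructive(action: str, target: str) -> bool:
        """Ask the operator to retype the target name before a destructive action."""
        print(f"About to {action} '{target}'. This cannot be undone.")
        typed = input("Type the target name to continue: ")
        return typed.strip() == target

    def kill_cluster(target: str) -> None:
        if not confirm_destructive("kill the entire cluster", target):
            print("Aborted: confirmation did not match.")
            sys.exit(1)
        print(f"(this is where the real shutdown of {target} would happen)")

    if __name__ == "__main__":
        kill_cluster("research-grid-master")  # hypothetical cluster name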
Okay, so who here drives a car? Now, in London, this can be iffy. All right. Most people drive cars legally, I hope. When you legally learn to drive a car in this country, and there are other countries where you don't have to do this, but in this country, usually they teach you to drive defensively. What they say is, when you get out on the roads, treat the world as something that can't really see you. So drive like everyone's blind and then you'll be okay. Give them the space, because you don't think they can see you. Even if they're looking at you with their eyes, they're probably thinking about something else. So drive defensively and you'll be better. And we all forget that about six weeks after we've passed. But that's what we're told, usually: drive defensively. Assume that everyone can't see you. As a motorcyclist, you're taught something else. You are taught that they can see you and they want you dead. It's not paranoia if it's true, and it keeps you alive longer. You literally ride like everything's out to get you. That's one of the reasons that riding a motorcycle is so calming. I know, sounds wrong, right? But it's calming because you can't focus on anything else. You can't think about what's happening at home, you can't think about what's happening at work. You just have to stay alive, and everything's out to get you. And that's why it's so wonderfully zen to be on a motorcycle. Okay, so you ride along and everything, the weather, the little old lady with the little dog, they're out to get you. And that's how you live longer. That's the mindset of a chaos engineer. Every system is out to get you. It's not passive. Production hates you. It knows where you live, it knows when you're on a date, knows when you're trying to sleep. It's a bit like Santa Claus, but really evil. It's watching you at all times, and you're going to be wrong with it. So it's a threatening environment. You're going to get things wrong. So you're going to make sure you have as many compensating practices involved so that you can at least survive longer. Production can survive longer; you can survive longer, as a participant in it. So being wrong is actually a key software skill. And I'm going to invent a new methodology right here today. Forget agile, we're going to do get better at being wrong. Hang on, is that a rude acronym? No. Get better at being wrong. Okay. No, it's never good. I'm a big hater, I suppose, of all certified programs. So if anyone here is inclined to create a certified chaos engineering program, I will find you, and I know where I can hide the bodies. So, yes, we don't need a certification program. So get better at being wrong is obviously a joke, but we can make it safer to be wrong. And that's what we're going to do with chaos engineering. We're making it safer for us to get things wrong with our system. We know we care about our system. We know we care what it can do, but we also know there are conditions under which that might not happen. So we're going to make it safer to be in those situations when that does happen. We inject technical robustness. So I'm careful with my terminology here. I hear a lot of people saying, I'm going to make my system more resilient, and they mean they're going to make their software more robust. It's a slight technical difference. Robustness is for the stuff you know can happen. Resilience is for the stuff you don't know is going to happen. Resilience is the learning loop that says that when that occurs, we know how to learn from it and get better at it.
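As a minimal, illustrative sketch of what injecting technical robustness for a known failure mode can look like (the talk doesn't prescribe any particular mechanism), a flaky dependency call can be retried with backoff; resilience, the learning loop around surprises, remains a separate, human activity.

    # Minimal, illustrative sketch of robustness for a failure you *know* can happen:
    # a flaky dependency call retried with exponential backoff. This is not the
    # talk's prescription, just one common shape of "technical robustness".
    import random
    import time

    def call_flaky_dependency() -> str:
        # Stand-in for a real network call that sometimes times out.
        if random.random() < 0.5:
            raise ConnectionError("dependency timed out")
        return "ok"

    def call_with_retries(attempts: int = 3, base_delay: float = 0.2) -> str:
        for attempt in range(1, attempts + 1):
            try:
                return call_flaky_dependency()
            except ConnectionError:
                if attempt == attempts:
                    raise  # the known recovery budget is exhausted
                time.sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying

    try:
        print(call_with_retries())
    except ConnectionError as err:
        print(f"gave up after retries: {err}")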
We practice zero blame because, as I say, blame is a shortcut to the wrong answer in many, many cases. So we go way beyond blame. We look at dark debt. Dark debt is what we care about as technical chaos engineers. We look for dark debt, we surface it. We surface evidence of it. I had this recently. I was writing this book on chaos engineering, and one of the things I kept coming back to is, why do we do chaos engineering? And I get this from users as well. I've created this thing called the Chaos Toolkit with my best friend, and this open source project helps you do experiments in chaos engineering. But I get loads of companies saying, okay, we kind of get that. What experiment should we run first? What should we do? We've got Kubernetes. What should we do, apart from run to the hills? Now, what should we do? I'm not a Kubernetes hater. I actually quite like Kubernetes. But, yeah, we've got Kubernetes. A lot of moving parts, generally. How do we know that our system will survive certain conditions? What experiments should we run? And it's a fair question. And our usual response is: what do you care about? Do you care about availability? Do you care about durability? Do you care about the fact your home page is there at all times? At most times? If you care about one of those things, you have an objective. If you have an objective, you can start to measure that objective. Okay, so we're starting there, right? If you can measure your objective and you have a target, an objective for it being up, perhaps, 99% of the time, great. We've got the ability to have a bit of leeway there. Okay, now we can say, what conditions do you want to put the system under? What do you think could happen? Could it be a pod dying? Well, yeah, probably. Can your network degrade? If it can't, I want that network. What sort of conditions do you want to get knowledge of, such that your objective will stay within its parameters of being okay, 99%, if that happens? Right, let's do that. Now, you notice there, not once have I said chaos, not once have I said experiment. I haven't really said test, although it is one. What I'm doing is verifying an objective. That's the language I'm now using when I help people define chaos experiments. What is the overall business objective that you care about? Make sure you can measure it. Right, now we can look at introducing conditions to the system, and we can see what impact on the objective that might have. No one cares about or wants chaos. Everyone wants something else and wants to verify it. So that's the language I tend to use for that, and that helps us with dark debt. What I say in the book is chaos engineering provides evidence of system weaknesses. Verification says what you found is useful because it helps me with an objective. Okay, so chaos engineering could be better phrased as deliberately practicing being wrong. We're going to get better at it and do it more proactively. We're going to verify something all the time. We might say continuously; is that word again overused in our industry? We're going to continuously verify your system with experiments to see how our system responds, because there's something we care about about that system, and we can measure it. We can prepare for the undesirable circumstances by, in small ways, making them happen all the time. And that's what chaos engineering really is: deliberately practicing being wrong. And it's an investment in resilience. Now, there's that other R word.
Now, there's that other R word. Robustness and resilience often get confused. You're resilient if, when something unexpected happens, you can respond to it, learn from it, and improve the system. If you have an objective that is measurable, and you do chaos experiments against the system, and you can see what the impact on that objective would be, then you have all the ammunition to analyze and produce actions, prioritized actions, out of what you're doing. This is important, because if you just broke a system, say you did chaos engineering in the smallest possible capacity, where you said, actually, I'm just going to go in and screw around with the pods and see what happens. All right, you do that and you go, okay, well, things didn't seem to go very well, but we don't know what the impact was on what we care about, and we don't know how to prioritize any work we might do off the back of it. Without the objective, you don't know what to improve. You need to be able to say: we did this, it affected this, we care about this, therefore we can prioritize these improvements on the back of that. This is why I tend to use the phrase verification of a system rather than merely turbulence and chaos, because it enables the learning loop. It enables us to join things up and go: we know why we're doing this stuff in the first place, and we know what we're actually going to do in response to it.

It tends to look like this. This is my very rudimentary picture of how things happen. You have a normal system, something goes kablooie. Now, I love this: detection, diagnosis and fix. Sounds so calm, doesn't it? It's easy. In there is hell. In there is someone being woken up at the wrong time of day. In there is someone going, we don't know how to do this. In there is someone staring out the window, wondering why they are still in this job. Eventually you get round to something you can learn from, if you avoid the blame shortcut, for example, and then you can improve the robustness of the system to something that is now known. Now, in this case, it was an outage. It was a surprise. It was a big deal.

Wouldn't it be great if we could do that in a predictive way? We can convert it into pre-mortem learning rather than post-mortem learning, which is chaos engineering. And it looks like this: you do a game day or an automated chaos experiment. There are only two things in chaos engineering, which is another reason we don't need a certification program. Two things we can do: game days, which are great fun (I've got so many stories about game days, I don't know if I've got enough time to share them, but they are great stories), and automated chaos experiments. Again, you're doing this in a small way. Small fires. Small fires require fewer fire hoses. Detection, diagnosis and fix: you can figure out how you'd react before it actually is a massive problem. Turn it into learning. Turn it into robustness. You do it across the entire sociotechnical system. And this way, we can learn before outages happen, because we can't design them out, but we can learn about them before they become the big catastrophe we're trying to avoid.

So this is the truth: being wrong is a superpower. Well done. As software engineers, we're wrong all the time. It's a good thing, because we can turn it into something great. And you can have a platform for the deliberate practice of being wrong, which is a platform of chaos engineering, or verification, as I tend to call it, and you will be wrong. Okay?
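And if you want to automate that pre-mortem loop, a small wrapper around the Chaos Toolkit CLI is enough to turn a deviated hypothesis into a prioritized follow-up rather than a shrug. This is only a sketch: it assumes the journal the toolkit writes to journal.json carries a top-level deviated flag, and the ticket-raising function is a hypothetical stand-in for whatever tracker you actually use.

```python
import json
import subprocess


def raise_follow_up(summary: str) -> None:
    """Hypothetical stand-in: in real life this would open a ticket in your tracker."""
    print(f"[ACTION NEEDED] {summary}")


def run_verification(experiment_file: str = "experiment.json") -> None:
    # The Chaos Toolkit CLI writes its journal to ./journal.json by default.
    # A deviated hypothesis is a finding, not a crash, so we don't raise on exit code.
    subprocess.run(["chaos", "run", experiment_file], check=False)

    with open("journal.json") as f:
        journal = json.load(f)

    if journal.get("deviated"):
        raise_follow_up(
            "Availability objective was not met while a frontend pod was terminated; "
            "prioritise robustness work on the frontend service."
        )
    else:
        print("Objective held under the injected condition; nothing to action.")


if __name__ == "__main__":
    run_verification()
```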
So if you want to have a play, have a play with the Chaos Toolkit. That's a nice way to start. Lots of companies start there, and you can do amazing stuff with it. Other people will talk about other tools later on; there are truckloads of tools out there in chaos engineering. It's fabulous. The richness of open source and free chaos engineering tools is a brilliant thing, and we're happy to be part of it. Frankly, ChaosIQ is the thing I work on. It's a verification tool, a software-as-a-service verification tool, and if anyone wants to give that a spin, come and give me a shout. I'll be around till lunch, and then, unfortunately, I have to get on a jet. That sounds really, really posh, doesn't it? I have to get on a jet with about 300 others, let's point that out, and I'm not at the front.

Okay, so these are the books out there, if you want to grab one of them. I wrote one of them. Actually, I wrote both of them. Both of those. Great. There are other ones. There's a chaos engineering book by the Netflix team, which is fabulous. There's also another chaos engineering book coming out later this year that I've contributed a chapter to, as have a few others speaking today. And out there in the wild, Learning Chaos Engineering, I'm told, is a fabulous place to get started in chaos engineering. I'm totally unbiased about that. And, yeah, if anyone does actually buy that, or if anyone asks me very, very nicely, I'll send you a copy. Anyway, it's a 40 quid book; I'll send you it, and if you want a signature, you can have one.

Otherwise, thank you so much for being here today. Chaos engineering is a superpower. I mentioned earlier, very quickly, that there are no best practices in our world, where it's complex and chaotic. I lied. Chaos engineering is the only one I know of, I can tell you. Just do it and you'll probably be better.
Russ Miles

CEO @ ChaosIQ
