Transcript
Thanks for joining my talk, Implementing Chaos Engineering in a Risk-Averse Culture. My name is Kyle Shelton and I'm a senior DevOps consultant for Toyota Racing Development.
A little bit about myself. I have a beautiful wife and three girls.
I'm an avid outdoorsman. I love to be outside, love to
fish, love to hunt, love to hike. I also train for
triathlons and do triathlons recreationally.
I'm an avid outdoorsman, but I'm also an avid iRacing and simulation fan. I love racing simulators. I just built a new PC recently and I've been doing a lot of iRacing. I also enjoy flight simulators and farming simulators, and one of my other hobbies is writing. I have a blog, chaoskyle.com, where I talk about platform
engineering, DevOps, SRE, mental health,
and things of that nature. So if you're interested in that, check it out.
In today's talk, I'm going to talk about the evolution of distributed
systems and what apps look like today versus what they looked like 10 to 15 years ago. I'm going to kind of go
over the principles of chaos engineering and how to
deal with that conservative mindset. A lot of places
you might work at maybe will laugh at
you if you ask to break things. So I'm
going to kind of go over what that mindset looks like
and how to sell newer technologies
and practices like chaos engineering. Also talk
about the art of persuasion and gaining buy-in. And then
I have a case study from my days as an SRE at Splunk that
you can use as a reference point on how to ask for
money to do a lot of tests. And then we'll close out with tools
and resources, and then I'll get into Q&A.
So let's start by talking about the evolution of systems architecture. When I first got started in
my career, I was working for Verizon and we were deploying
VoLTE, which is voice over LTE. It was the world's second, and the nation's first, IP-based voice network, and our deployments looked like this. We had what we called network equipment centers, our data centers, and we had our core data centers all throughout the country: New Jersey, California, Dallas, Colorado. And we would schedule six-week deployments where we would spend four weeks on site,
racking and stacking the servers, getting everything put together,
installing the operating system, installing all of the application software, or as we called them, guest systems, and then bringing these platforms onto the network and eventually getting subscribers making phone calls
through them. And the process was long, and we had to build out our capacity in that 40/40 model, where we would only build out to 40% of our capacity. Everything was HA, obviously, so we always had two of everything. That way, if we ever had to fail over from one side to the other, the surviving side would never be at 100%, it would only be at 80%. So there's a lot of overhead, a lot of costs.
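As a rough sketch of that capacity math (the 40% figure is from the model described above; everything else is illustrative):

```python
# 40/40 capacity model: two HA sites, each built out to only 40% of capacity,
# so a full failover never pushes the surviving site past about 80%.
def post_failover_utilization(per_site_utilization: float, sites: int = 2) -> float:
    """Utilization of the one surviving site after all other sites fail over to it."""
    return per_site_utilization * sites

survivor = post_failover_utilization(0.40, sites=2)
print(f"post-failover utilization: {survivor:.0%}")  # -> 80%, never 100%
```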
We were really good about knowing what type of traffic
we would have. But this was early iPhone days,
so iPhone releases could really mean a huge spike
in traffic, and there was a lot that went into it. And now if you're deploying
an app, with the cloud native revolution, in order to get an MVP into your customers' hands, you can do that within 45 minutes, basically, for simple use cases. Obviously, you're not going to deploy a complicated Rails app
platform in 45 minutes, but you
have the tools available, using public clouds
or private clouds or whatever, to deploy virtualized systems with
all their dependencies quickly. And so with the new way of doing things, the new way of being able to deploy things fast, being able to scale and have elasticity, you only pay for what you use, right? If you've ever heard one of the AWS pitches, the big thing is they had all that extra capacity and they started selling it. And so you pay for what you use in these scenarios, with these new technologies. And this new way
of doing things brings a lot more aspects
of failure and a lot more things that can go wrong
based on abstractions and dependencies into
the system. And so understanding your failure modes, understanding what happens when things break, is crucial for modern day applications.
As Werner Vogels said, everything fails all the
time, and you have to be prepared for that. And another thing,
too, is you are abstracting some of the physical
responsibilities through the shared responsibility model. So something
can happen in a data center that you don't own that could affect your resilience.
And so that's another thing that you have to be prepared for. And so
those are some of the challenges of our modern day systems. And that's
why chaos engineering kind of came about.
When I think about what is chaos engineering in my mind,
you define your steady state first. If you are at a
point where you don't really know what a good day looks like, I would first figure that out. Get to your steady state. Get to where you know that your failure rate or your 4xx/5xx error count or whatever is at X, whatever a good day looks like for your distributed system. That's what's defined as steady state. So once you have that, then you can start to build hypotheses
around disrupting that steady state.
So you create a control group and an experiment group, and you say, okay,
if I unplug wire X, then Y will turn off, right? That's just a basic example of a hypothesis, but that's what it is: if I do something, X will happen. And then once you have that hypothesis written down, or put down in notes or whatever you use to document, that's when you come in and you introduce chaos in
a controlled manner. That's when you come in and you go unplug the cable or
you go pour water on your keyboard to see what happens. And after
you do that, this is a key point, then you're going to observe what happens,
and you're going to look at both groups and then evaluate your hypothesis.
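A minimal sketch of that loop, with a hypothetical error-rate metric and placeholder fault-injection and metrics-query functions (this isn't any particular tool): define the steady state, write down the hypothesis, inject the fault into the experiment group only, then compare it against the control group.

```python
# Minimal chaos-experiment skeleton: steady state -> hypothesis -> inject -> observe.
# Group names, the metric, and the threshold are hypothetical placeholders.
import random
import time
from dataclasses import dataclass

@dataclass
class Observation:
    error_rate: float  # fraction of requests returning 4xx/5xx

def measure_error_rate(group: str) -> Observation:
    # Placeholder: in reality you would query your metrics backend for this group.
    return Observation(error_rate=random.uniform(0.0, 0.02))

def inject_fault(group: str) -> None:
    # Placeholder: e.g. kill a replica or add latency, for the experiment group ONLY.
    print(f"injecting fault into {group} ...")

STEADY_STATE_MAX_ERROR_RATE = 0.01  # "a good day", defined before the experiment

# Hypothesis: if we kill one replica in the experiment group, the error rate
# stays within steady state because traffic re-routes automatically.
control = measure_error_rate("control")
inject_fault("experiment")
time.sleep(1)  # let the system react (much longer in practice)
experiment = measure_error_rate("experiment")

print(f"control={control.error_rate:.2%}  experiment={experiment.error_rate:.2%}")
if experiment.error_rate <= STEADY_STATE_MAX_ERROR_RATE:
    print("hypothesis held: steady state survived the fault")
else:
    print("hypothesis falsified: figure out what broke and fix it")
```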
And another mistake that I've seen people make is that they want to get into
chaos engineering without having a good observability
foundation. And if you can't put metrics
on your system, and you don't know how to
put a value or put a score on your system, then you're probably not
going to be very successful with chaos engineering. So I would also focus on that first. Defining your steady state and being able to observe that steady state is one of the prerequisites for chaos engineering.
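A sketch of what that prerequisite might look like, with made-up status codes and a made-up threshold: if you can't compute a number like this from your telemetry today, that's the place to start before any chaos.

```python
# Steady-state check to have in place BEFORE running chaos experiments.
from collections import Counter

def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses that were 4xx or 5xx."""
    if not status_codes:
        return 0.0
    classes = Counter(code // 100 for code in status_codes)
    return (classes[4] + classes[5]) / len(status_codes)

# Pretend these came from your access logs or metrics pipeline for the last hour.
last_hour = [200] * 970 + [404] * 20 + [500] * 10

GOOD_DAY_THRESHOLD = 0.05  # whatever "a good day" means for your system
rate = error_rate(last_hour)
print(f"4xx/5xx error rate: {rate:.1%}")
print("at steady state" if rate <= GOOD_DAY_THRESHOLD else "not at steady state yet")
```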
There's a website, a really cool website, that the helpful folks at Netflix put out called principlesofchaos.org. And this kind of goes into detail on the principles of chaos engineering, what their north star was when they created this type of engineering. The big thing is, you know, you build your hypothesis around the steady state, like I talked about, and you're going to want to vary real-world events. So if
you search AWS service events or Google service events, they'll give
you a long list of the things that have happened, from the big S3 outage way back in the day to the December 7 outage. You can see what has actually happened in the cloud. So you know, okay, at minimum, this could happen, because it's happened before. You're never going to be able to plan for the unknown, obviously, but those are the biggest events your provider has had. Start there and work backwards. Use those published events as a starting point for what to
test on. Run your experiments in prod. You want to get
to where you're running continuous experiments in prod randomly,
automatically, without them having an impact on your customers.
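One way that continuous, automated experimentation might look, as a sketch (the target names, the time window, and the injection step are placeholders, not a real pipeline):

```python
# Automated, scheduled experiment instead of a human pulling the plug.
import datetime
import random

CANARY_TARGETS = ["payments-canary", "search-canary", "checkout-canary"]  # hypothetical

def in_experiment_window(now: datetime.datetime) -> bool:
    # Only run while engineers are awake and watching the dashboards.
    return now.weekday() < 5 and 9 <= now.hour < 16

def run_random_experiment() -> None:
    target = random.choice(CANARY_TARGETS)
    # In practice this would call your chaos tool or a pipeline job to, say,
    # terminate one instance in the target group, then verify steady state held.
    print(f"{datetime.datetime.now().isoformat()} injecting failure into {target}")

if __name__ == "__main__":
    # Wire this up to cron or a CI schedule so it runs continuously on its own.
    if in_experiment_window(datetime.datetime.now()):
        run_random_experiment()
    else:
        print("outside the experiment window; skipping this run")
```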
We have automation pipelines, and we have jobs that can run and test for failure within unit tests. There are also ways to just let it go in the wild, and when it runs, it runs. There's a lot of ways you can do this, but keep it automated. Don't have someone sitting there that has to go and pull the plug; make sure that it's automated and that you can run it continuously. And the last
thing is you're going to want to minimize blast radius.
You don't want your experiments to bring down production. You don't want your monitoring and your testing tools to cause large-scale events. So isolate your tests. I'll talk about this when we're talking about buy-in, but start small. Don't go in trying to just say, all right, we want to fail over a whole region. And don't go in with the mindset that you're going to break everything all at once, all the time. Minimize your blast radius, keep it small, and work from there. Iterate off the small wins so that you can get confidence from your leadership to run bigger failures and have bigger events, do red team events, things like that.
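A small sketch of what limiting the blast radius can look like (the fleet, the 1% cap, and the abort threshold are all illustrative): target only a tiny, capped slice of the fleet, and wire in a kill switch that stops the experiment the moment the steady-state metric degrades.

```python
# Keep the blast radius small: few targets, and an automatic abort.
import random

def pick_targets(fleet: list[str], fraction: float = 0.01, cap: int = 2) -> list[str]:
    """Choose at most `cap` hosts and never more than `fraction` of the fleet."""
    budget = max(1, min(cap, int(len(fleet) * fraction)))
    return random.sample(fleet, budget)

def should_abort(current_error_rate: float, steady_state_max: float) -> bool:
    """Kill switch: stop the experiment as soon as steady state is threatened."""
    return current_error_rate > steady_state_max

fleet = [f"web-{i:03d}" for i in range(400)]
print("experiment group:", pick_targets(fleet))             # e.g. 2 hosts out of 400
print("abort?", should_abort(0.03, steady_state_max=0.01))  # True -> stop immediately
```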
Let's get into the conservative mindset.
And we've all worked with those leaders or those
managers or principal engineers who have
that very conservative mindset of, oh, we don't want to do
the new stuff yet, or, oh, we don't want to
bring failure in. I thought our goal was to keep the app up
as much as possible. Why would we want to come and start breaking things?
And one of the key things that I've seen when
working with these types of leaders and individuals is that it's a risk-averse, bottom-line-first type of mentality. And I'm not trying to downplay the bottom line, right. I understand business. I understand that sales generate the revenue which keeps everybody employed, and that the
bottom line is very important. I want to stress
that right now. But I also know
that sometimes you have to make investments, investments in engineering time, tools, systems, platforms, et cetera, that initially you might not see a short-term return on, but their long-term returns are substantial. So be on the lookout for that type of behavior and understand, hey, everything basically revolves around the bottom line.
But if you notice that leadership has that bottom-line-first mindset of, well, we can't fail because it's going to cost us money, well, if you don't understand your failure modes, it's going to cost you a lot more money. So recognize that. Recognize that mindset and be able to counter those arguments when you're going to sell your case. Another environment like this was my early days at Splunk, where we had a NOC transition. We were transitioning from an overseas NOC to building out an internal team. And during that transition,
we got paged frequently. There were a lot of outages, there were a lot of things going on, and it sucked. I'm not going to lie, it sucked, getting paged at 2:00 a.m. every other night or constantly fighting fires. There was a lot of fatigue, and you could see that fatigue in some of the older engineers. And that was the whole basis of us bringing that NOC back in-house, because we weren't getting the support from our overseas team and we needed to get better. So notice that PTSD. If you're coming in and everybody's just kind of worn out from getting
paged all the time, that's something to look out for.
And then the third thing that I've noticed in these
conservative mindsets is slow. Everything moves slow. It takes
two weeks to get your change in front of a change board. Then it takes
another two weeks to get a maintenance window approved, and then you're six weeks in before you run your three lines of code, and if it breaks, then you have to wait another six weeks to get things in. Not being able to move releases fast and not being able to push your code fast is normally due to that conservative mindset. There are a lot of reasons it could be there, but it's slow and
nobody likes going slow. I personally hate going slow,
but slow is steady sometimes. And that might
be what works best for that team at that
moment. But your job is to speed things up.
That's what to look out for when you think you might be in that conservative mindset or conservative environment. Now let's talk about selling chaos engineering. And if there's a skill set that I'm
constantly trying to grow, it's my sales skill set.
I don't work in sales, right. But there are situations where you have to sell something, whether it's yourself for a job interview, whether it's a tool that you want because it's the next best thing, a big monitoring tool or something, or whether it's just that you want money for a soccer team. I had to ask an executive for money to start a soccer team for our office. And one of the things I mentioned was that support, sales, and SRE just don't get along. We all operate in our own little pods.
There's not a lot of communication and camaraderie.
And so I was like, hey, if we had a soccer team and
we could get people from all of these groups together on one mission to go have some fun, it'll help with the bonding and the office morale, and it'll help us work better together. And it worked. I got the money, and sure enough, it really helped the teams get along better and work together better, because there's just that team bond. When you're on a soccer team, you're all trying to beat somebody. We were playing the other tech companies in Dallas, and there's a little bit of pride, and you don't want to lose. Yeah, it was fun. I got to be friends with folks I would have never been friends with because I didn't work with them, and I actually met my
wife through that, too. And so being able
to sell is very important. Now, how do we sell
chaos engineering? Whenever I've asked for a new technology or something new from a leader, one of the biggest things I've been asked about is total cost of ownership, and how short-term investments turn into long-term returns.
Right. Let's say I
want this tool that's going to cost us $100,000
a year. Okay, well, what value
are you going to get from that $100,000?
How do I explain? Okay, well, this tool helps
me see things before the customers do.
It helps me fix things before they break and page out our on-call overnight, so our engineers get more sleep, so they come in and they're not having to flex their schedule because they were paged the night before, and there's not going to be that falloff in productivity. And we can start to create automation so that we get to the point where the system is just healing itself. And so how do you put a number on that? Well, you define total
cost of ownership and explain that, hey, we might be saving money by going open source and hosting things ourselves on the initial software spend, but if we look at the spend overall from our SREs and architects and developers and everybody that has to put more of their time into that platform, okay, that $100,000 investment in the managed tool is going to be less than the $400,000 you're spending on engineering time. Understanding total cost of ownership, and understanding how to explain total cost of ownership, has been the biggest way I've been able to sell things, to ask for more money, to sell a tool or a system or something like that. So understanding your total cost of ownership and how to explain it is crucial.
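As a back-of-the-envelope version of that comparison, using the rough $100,000 and $400,000 figures from the talk (the hourly rate and weekly hours below are assumptions for illustration only):

```python
# Rough TCO comparison: managed tool vs. "free" self-hosted open source.
ENGINEER_COST_PER_HOUR = 100        # fully loaded hourly cost (assumed)
HOURS_PER_WEEK_SELF_HOSTED = 80     # SRE/dev time spent running the DIY stack (assumed)

managed_tool_per_year = 100_000
self_hosted_software = 0            # nothing on the initial software spend
self_hosted_engineering = ENGINEER_COST_PER_HOUR * HOURS_PER_WEEK_SELF_HOSTED * 52

print(f"managed tool TCO: ${managed_tool_per_year:,}/year")
print(f"self-hosted TCO:  ${self_hosted_software + self_hosted_engineering:,}/year")
# The $100k line item looks expensive until the engineering time shows up on the sheet.
```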
Being able to put a number on reliability.
Happy customers equals happy engineers, which equals happy
bosses, which makes everybody happy.
The cost of fragility can break your company, especially if you're a web SaaS or you provide a service over the web. Paying customers expect you to deliver on your part. And if
you are not able to, because your systems are not up and
you're not able to fulfill your service
level agreements, it costs you money, it costs engineering
talent, and it could cost your business.
So having those points and understanding how to talk about those points is key in selling chaos engineering. So now
we understand what we have to do. How do we do it? Well,
I'll start at the top. The benefit and cost analysis, the total cost of ownership, is where you're going to start. Define that. Get all your data, all your charts and graphs. I had an old VP who used to say, charts and graphs or it didn't happen. I even made stickers. And working at an observability software company, that kind of makes sense, right? But it's true. Be able to prove your points, do your homework, bring your case files, act like a lawyer does when they're trying to persuade a jury of their case, and have all of your detailed documents and notes. Do surveys with your developers. Do the things that you need to do to make the sale.
Another way to gain buy-in is to just ask for small pilot programs. Just say, hey, I want to try this new way of doing things in our test sandbox environment, is that okay? Start with a win.
Iterate on that win,
demonstrate that success early, and demonstrate
that you can control the blast radius
and can control the failure. And demonstrate
the ability to say, oh, okay, well, I'm doing this in a
scientific manner to gather more data
to make our system better.
Be able to clearly articulate those
things and speak their language. Right. If you're having to
sell this to a higher exec that's a business leader,
maybe lean less on the technical jargon and more on
the business value jargon. Or if you're having to pitch this to your CTO, he's going to want to know, okay, what types of failures are you going to bring? Are you going to do latency? How is this going to affect us long term? Are you going to have the ability to completely break everything and not have it be recoverable? What's your backout plan? Things of that nature. Speak the language of the leader that you are pitching to. And then the last thing is just iterate and improve.
Iteration is the key to this.
Being a DevOps engineer, we build pipelines, and I get
really excited when I get new errors because new errors mean
new improvement. So iterate,
improve, take baby steps, don't do
big things all at once. Start small and work up to
the larger events and the larger scenarios. Don't just go off
blasting right away. Like I said, this is one of the hardest things to ask for, because you
are literally asking to break things, but you're
asking to break things to make them better. So be
sure to include that last part.
Now, how do you
go about doing this? So I want to talk about a time where I was
at Splunk and we were going through a very large migration. We were basically transitioning from Ansible and Chef to Puppet and Terraform, and so we were rebuilding our whole provisioning system. And in that migration, we were also moving to the Graviton instances, moving from our D series to our
I series. And so we had to
basically move our whole fleet, 15,000 instances.
I think we did it in like eight and a half months. It was a
crazy big project, and I was tasked with the largest customer,
the biggest, most snowflaked customer we had, with all
the custom configurations. They had
a lot of things that most customers did not because they were the
first customer to give us a million dollars. So they
got what they wanted, right. And it was imperative
that that migration went absolutely perfect.
Half of their stack was already moving off; their security side was already moving off of our managed team. So if
we failed on this, we could have lost a lot
of revenue. And so our chief cloud officer was
very interested in making this a success. And so I
had to basically ask for $5 million to make
sure it was a success. I asked to duplicate
their environment in another environment so that I could fully test
and fully break things and see how I would recover if I did break things
during the migration and have my backout plan ready
to a T. So I asked, I put all the necessary
data in front of him, told him it's going to take me three months.
This is our plan. We're going to replicate four petabytes of data.
We're going to execute the migration probably two or
three times. And then once we get that migration executed,
we're going to document everything that we think
will happen, and then everything that happened during that migration,
because when you're dealing with big data systems,
it's very difficult to predict things that are going to
happen, especially in the cloud. So I did that. I did all the selling, and we ended up executing the migration flawlessly through chaos engineering, because I was asking to test and asking to break things on purpose in a controlled environment that would not impact the customer.
And that investment of $5 million ended up returning 10x, because now we're using Graviton. Instead of 400 D-series instances, we're now only using 80 I-series instances. So we got better performance and better cost. We're saving way more money using the newer instances,
and it's a better customer experience. And I don't think that
migration would have happened had we not done those tests.
So that's a $20 million swing, right? Like,
had we failed and had we not spent that $5 million up
front, we would have not been successful. And so
I did experience some catastrophic failures during my first test when we were doing that. And so we figured that out in the practice environment, we learned from our mistakes, and we grew. And so it's hard to ask for money and to ask for failure, but if you do it in the right way, and you do your homework, and you present the data at hand and show the total cost and what happens in the long run with resilience, it goes a
long way. So I highly suggest asking
for this and asking to be able to
do things like this and to be able to test and to run your red
team exercises and to practice your incidents.
Practice, practice, practice. Firefighters, the military, Navy SEALs spend more time training for their missions
than they do actually executing the missions. Right, because they want it to
be perfect. When you're doing your technical operations,
you want them to be perfect. And the only way to do that is practice.
Some of the tools that I've used in the past: Chaos Monkey; Harness and their chaos engineering platform, they have the open source Litmus and they also have an enterprise level; there's Gremlin, which is another great tool; and then Chaos Mesh is a cloud native tool that I really like as well.
Check these out. And that's
the show. Thanks for taking the time to jump
into this talk. I really enjoy talking about chaos
engineering and having influence.
And yeah, check out my blog, chaoskyle.com. I talk about mental
health, I talk about platform engineering, resilience,
chaos engineering, all that. So thanks
for having me.