Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi there, my name's Tammy and I'm here to speak to you about
Chaos Engineering: when the network breaks. You can reach me via email at tammy@gremlin.com. At Gremlin, I work as a principal site reliability engineer. I've been here for
three and a half years. I get to do a lot of chaos engineering as
a way to proactively avoid outages and downtime.
Previously I worked at Dropbox as a site reliability engineering
manager for databases, block storage, and code
workflows. One of my really proud achievements
was being able to get a 10x reduction in incidents in three
months, which was awesome. And it was through doing chaos engineering.
You can also find me on Twitter: Tammy X. Bryant.
One of the biggest problems these days is that every system
is becoming a distributed system, and that makes things really complicated.
You see a lot of folks with faces like this whenever they join a new
company and there's a distributed system, or if they're
just trying to add something new. So what is chaos engineering?
Thoughtful, planned experiments designed to reveal the weaknesses in our
systems. We inject chaos proactively instead
of dreading the unknown. This is a really cool example from Netflix.
You can see here on the top. This is when things are all
going great. Everything looks awesome. There's a nice hero banner to promote something in particular so that
folks could easily click play on it. You can see there's that big red play
button there, and then you can see some other items that you could continue watching.
And then if you look below, this is what it looks like if
there's something wrong with that hero that you'd like
to be able to showcase. So say, for example, maybe there's a problem with the
file. There was like file corruption. Or maybe you need to take it down because
there was some type of issue in the back end; it could be business related: legal, compliance, sign-off. Sometimes you need to
be able to remove things and roll them back. So instead of having a
big X button or having some type of really bad user experience making things not load,
showing like a loading screen, instead what we're doing is we're gracefully
failing and we're showing like a nice,
clean, simple look on the bottom screen where we just hide the hero
completely. So the user doesn't even know that anything's broken.
So how do you create a chaos engineering scenario? First off,
form a hypothesis, then baseline your metrics if you
can. I really recommend getting started by having some SLOs and SLIs per service. Consider the blast radius.
Determine what chaos you'd like to inject, for example,
latency. Or maybe you'll make something unavailable using a black hole.
Run your chaos engineering scenario. Measure the results of your
chaos engineering scenario, then find and fix issues
or scale the blast radius of the scenario.
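To make the "baseline your metrics" step concrete, here's a minimal sketch of recording a p99 latency SLI before you inject anything. It assumes a hypothetical Prometheus server and a hypothetical histogram metric name; swap in whatever your monitoring stack actually exposes.

```python
# Minimal sketch: capture a baseline SLI before running an experiment.
# PROM_URL and the metric name are assumptions for illustration only.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def baseline_p99_latency(service: str) -> float:
    """Return the current p99 request latency (seconds) for a service."""
    query = (
        "histogram_quantile(0.99, sum(rate("
        f'http_request_duration_seconds_bucket{{service="{service}"}}[5m])) by (le))'
    )
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    # Record this number, run the scenario, then compare afterwards.
    print("baseline p99 latency:", baseline_p99_latency("cartservice"))
```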
Often folks ask me, how do I pick where to start? I really recommend identifying
your top five critical services first. I don't really like the idea
of going for low hanging fruit because in the world of SRE,
that's not impactful. You need to focus on fixing the issues
that impact your customers. If you've got every single customer
impacted by one issue, you've got to fix that rather than just fixing
a small, tiny issue that only impacts, say, for example,
one customer. And maybe that customer is a lower
cost customer, for example, or a customer that you're
not trying to target with a new campaign. Something like that.
So what you really need to do is figure out what your top five
critical services are. I like to ask folks: right now, tell me, what are your top five critical services? Often folks aren't really sure, and if you ask different people within the same organization, you'll get different answers. But we really want to
understand across an entire company, what are the top five most critical
services? Then choose one of these critical services.
So, for example, I'll give you three different types of critical
services that you'll often see. One is monitoring and alerting
or observability. That's definitely a critical service.
The next is cache, for example, Redis or Memcached.
And then a third, for example would be payments. Another one
that you may be thinking about is databases. That's another critical
service. Next up, step three, whiteboard the
service with your team. Then step four, select the scenario
that you'd like to run. For example, maybe you'll validate
monitoring and alerting. Maybe you'll validate that autoscaling works as expected. Maybe you'll figure out what happens if a dependency is unavailable using a black hole attack. Or maybe you'll use a shutdown attack or a black hole attack to trigger host
or container failure. Then step five,
determine the magnitude, the number of servers, and the length of time.
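One way to keep yourself honest on those five steps is to write the plan down before running anything. A small sketch, with illustrative field names (this is not a Gremlin API):

```python
# Sketch of a written-down experiment plan covering the five steps.
from dataclasses import dataclass, field

@dataclass
class ChaosScenario:
    hypothesis: str          # what we expect to happen
    service: str             # the critical service under test
    attack: str              # e.g. "blackhole", "shutdown", "packet_loss"
    magnitude: str           # e.g. "80% packet loss"
    blast_radius: int        # how many hosts/pods are impacted
    length_seconds: int      # how long the attack runs
    abort_conditions: list = field(default_factory=list)

ad_blackhole = ChaosScenario(
    hypothesis="Graceful: no outage, website still functions, ads are hidden",
    service="adservice",
    attack="blackhole",
    magnitude="all traffic dropped",
    blast_radius=1,
    length_seconds=60,
    abort_conditions=["checkout error rate spikes"],
)
```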
So what is the value of chaos engineering? To me, there are
so many different ways that you can get value from chaos engineering. And I think
it's one of the most important things for an SRE to do. I say this because for
me, it's been the number one way that I've been able to reduce incidents,
avoid downtime, and it's super impactful in a
really short amount of time and you don't need tons of people to make it
happen. It's really nice and proactive compared to other types of work
that you can do. For example, automation,
building completely automated systems that don't involve any
humans, that takes years of work often, and it
takes a lot of engineering effort and often there's actually a lot of failures
that come because of the automation. So to me this is a great way to
first reduce issues before you do all of that automation work
because automation is just going to uncover more problems.
So what are some of the values? First off, find your
monitoring gaps and improve your signal-to-noise ratio. Next,
validate upstream and downstream dependencies.
Thirdly, train your teams with fire drills,
making sure that folks know who needs to do what when they get paged.
Inject chaos, inject real failure to then trigger real alerts
to specific folks on different teams. And fourthly,
get a good night's sleep. Being up at 2:00 a.m., having to get
paged and resolve an incident, that's not good. We want to reduce that
with chaos engineering. Something else I wanted to talk about
is MTTD. This is one of my secrets.
So my personal favorite metric is MTTD
and focusing on improving mean time to detection. I care about
it so much that I wrote a book on it with some friends for O'Reilly,
some friends from LinkedIn, Twitter, Amazon. We got together
and we put down all of our best practice ideas and principles for
how you can do this and share that in this book. So check that out
for sure. Why do I care about mean time to detection?
Because, for example, if it takes you a week to detect an incident
and five minutes to resolve it, it still means that it took
you a week and five minutes to fix the issue. For a customer,
that's bad. The slower your time to detection is, the worse
the impact is for the customer. So if you focus on mean time to detection,
you can get in there faster and you can resolve issues faster.
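A toy illustration of that arithmetic, with made-up timestamps: total customer impact is detection time plus resolution time, so a huge MTTD swamps even a five-minute fix.

```python
# Made-up incident timestamps to show why MTTD dominates customer impact.
from datetime import datetime, timedelta

# (started, detected, resolved)
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 8, 9, 0), datetime(2021, 3, 8, 9, 5)),
    (datetime(2021, 3, 10, 2, 0), datetime(2021, 3, 10, 2, 4), datetime(2021, 3, 10, 2, 30)),
]

mttd = sum((d - s for s, d, _ in incidents), timedelta()) / len(incidents)
mttr = sum((r - d for _, d, r in incidents), timedelta()) / len(incidents)
print(f"MTTD: {mttd}, MTTR: {mttr}")

for started, _, resolved in incidents:
    # Customers feel the whole window, not just the time after detection.
    print(f"customer-facing impact: {resolved - started}")
```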
Now let's dive into network chaos. Today I want to share some
demos. We're going to be using a demo web
application that's an ecommerce store. I'm going to show you three different kinds
of demos. One is a black hole attack. We'll be doing this
on the ad service. Second is a black hole attack on the recommended products service. And thirdly is a packet loss
attack, where we'll actually be injecting packet loss into the cart.
One of the main things that I like to think about at the start of
this is use a scientific method and determine a hypothesis.
So if we run a black hole attack and make the ad service unavailable, we think nothing really that bad will happen.
It should gracefully degrade. We should have no outages.
That should just disappear. If we blackhole recommended products,
also, we shouldn't have any outages. Everything should gracefully degrade.
We just wouldn't see recommended products. You know, like when you're on Amazon
and you see other things that they recommend that you should buy based on what
you've already bought. It should just disappear and you shouldn't even notice if it's
not there. If we inject packet loss into the cart, we expect
that we shouldn't actually lose any items and our cart should
remain intact during this attack. We don't
want the cart to drop items. We don't want the cart
to not allow us to add extra items. We should still be able to use
our cart. So first off, what we want to do, like we talked
about before, is whiteboard our architecture. So you
can see here for this demo application, the hipster shop,
we've got twelve services. Which of these services matter most
to our customers? The cart service, the checkout
service, the payment service, shipping.
Which ones do we think? So let's think about it
like this. What can we remove from the critical path and what do we have
to keep? There are a few things here that look really important.
Definitely the checkout service, the payment service. We want
to understand, like what items are going where with the shipping service.
There's also a cache that looks pretty important. Cart service looks important too.
Product catalog service, that's probably important. So this, to me,
looks like the architecture of what we would need to keep.
And you can see I've removed a few of the different items.
So if we look back, I removed the email service,
currency service, recommendation service, which recommends products.
So those are a few of the things that we think are not in the
critical path. But everything left here is in the critical path.
And the little boxes with emojis,
those are items that are even more critical for the user, like the payment service.
So getting out of the critical path is good. Always remember this,
if you are a critical path service team and you own a critical service, then you're going to have more eyes on you and more things that you need to pass: more checks,
more compliance. So we've got seven critical services.
One thing that we want to think about, based on our hypothesis, is does
black holing a noncritical path service cause unexpected failures
for critical services? So based on our initial
scenario that we want to run and our hypothesis, we think that if we
make the recommended product service unavailable,
we shouldn't have any issues. If that service is not found,
we think the path should be okay. Another thing we might want to think about,
like we talked about initially, is does injecting some type
of network impact, like packet loss in the cart service, cause us to lose
data, like for example, our customer cart items?
Another thing that you might think here is where are your data backups and restores?
We seem to only have a Redis cache. You know that feeling
when you've been shopping for ages, you've added all these items to your cart and
then suddenly your entire cart gets wiped and you have to go and re add
all of those items to the cart. It's annoying for the user and it's also
really bad for the organization.
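When we get to the packet loss demo, one way to verify "no cart items lost" is to read the backing Redis cache directly before and after the attack. A sketch, with an assumed key layout ("cart:<user_id>" as a hash of product to quantity); the demo app's real schema may differ.

```python
# Sketch: confirm cart contents survive an attack by checking Redis directly.
# Host name and key layout are assumptions for illustration.
import redis

r = redis.Redis(host="redis-cart", port=6379)

def cart_item_count(user_id: str) -> int:
    """Sum item quantities stored in a hypothetical cart hash."""
    items = r.hgetall(f"cart:{user_id}")  # {product_id: quantity}
    return sum(int(qty) for qty in items.values())

before = cart_item_count("demo-user")
# ... run the packet loss attack here, then re-check ...
after = cart_item_count("demo-user")
assert after == before, f"cart dropped items: {before} -> {after}"
```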
All righty, so now let's get started with our first real demo.
We're going to do a black hole attack on the ad service. So this means
we're going to make the ad service unavailable. So this is our
demo application. Here you can see, I click on the vintage camera lens and now I scroll down.
This is the recommended product section. Other products you
might like. And then here's the ad service. So,
advertisement Film camera for sale, 50% off.
I can then click on that link and it takes me to the film camera.
And then it has another ad for that film camera for sale.
If I go to a different product like the terrarium, you can
see that there's an ad down here that says airplanes for sale. Buy two,
get the third one free. So we think we should still be able to
use this site. Click on an item, scroll down.
But we think this will probably be gone. This box here, the advertisement
box, based on our hypothesis, but we do think we should still be able
to add items to cart. See, we've added this
item and now we have USD 70.22
as our total cost. We should be able to check that out and
see that the order is complete. So now what we want to do is we
want to run that black hole attack. So this is our baseline good state; everything looks nice. So we can see here the ad
service. Now I'm going to go and say scenario,
new scenario, black hole
ad service.
So we think that this should be graceful. No outage,
website still functions.
Ads are hidden. That's what we expect to
happen. That's our hypothesis. So now what we want to do is
click over here on Kubernetes and then search for
the Kubernetes service. We've selected our Kubernetes cluster group. Now I'm
going to go over to the ad service. I just type that in based
on our architecture diagram. And you can see here it selected the deployment for ad
service and the pod and the
replica set for ad service. So we've got that there
and we can see that across our two hosts. So we've got
two hosts. It actually only has one pod across
all of them. So that's already interesting from a reliability perspective.
It's not going to fail over gracefully because we only have one pod.
If we did have two pods, then potentially, yeah,
we wouldn't see an issue at all if we were just going to black hole
one of the pods. So now what I'm going to do is scroll down,
click network, and go to black hole. Do this
for 60 seconds. I don't need to do anything in particular extra,
but I could do something more granular. Like, say, I only want to impact traffic to these IP addresses; or I only want to impact these specific remote ports, so traffic going from here to somewhere else; or local ports, impacting outgoing and incoming traffic to and from these local ports. Or I could drop down and select specific providers.
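Under the hood, a black hole is conceptually just "drop the matching traffic." Here's a rough Python-driven iptables illustration of that idea; this is not how Gremlin implements its attacks, the IP and port are placeholders, and it needs root, so only try it on a disposable test host.

```python
# Conceptual black hole: drop outbound TCP traffic to one destination.
# Placeholder IP/port; requires root; NOT Gremlin's implementation.
import subprocess
import time

RULE = ["OUTPUT", "-d", "10.0.0.42", "-p", "tcp", "--dport", "9555", "-j", "DROP"]

subprocess.run(["iptables", "-A", *RULE], check=True)   # start dropping packets
try:
    time.sleep(60)                                      # attack length: 60 seconds
finally:
    subprocess.run(["iptables", "-D", *RULE], check=True)  # always clean up
```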
All right, let's add that to our scenario. Click save.
Now let's run it. So this is a cool live
demo. We'll actually be able to see this in real time.
So now I'm starting to run this and we'll see what happens.
It's currently running. Let's pull that up.
Pending is the first state that you'll see it in.
All righty. Let's see. We'll go back over to our site.
I'm going to leave this open. You can see here, I've got that there.
Let's refresh the page. You can see now
the ad service has disappeared. So it was there before, now it's
gone. But let's check everything else still works. If I click on film
camera, no ad service is there. But I
can add an item to my cart.
Great. And now I can place my order.
Add a home barista kit to my cart.
And I can place my order. All right,
excellent. Everything looks good. So I would say that that
actually passed. So what we can do here is we know
that it passed already. So we can halt that scenario. We can say,
yes, it passed; no incidents were uncovered; and yeah, any incident was mitigated. No issues. Graceful.
All right, so we've stopped that scenario because everything
was good and we saved our results into the system.
So that's awesome. That's a really good example of a successful chaos engineering
experiment. And we could think there what else might we do to this system to
make it even better? We could make it so that you could fail over to a different pod, since there's only one; make the ad service still work if one pod is not available. That could be a future experiment.
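As a sketch of that follow-up, you could scan for single-replica deployments and scale the ad service up, assuming kubectl access to the demo cluster:

```python
# Sketch: flag single-replica deployments, then scale the ad service to two
# pods so black holing one of them isn't fatal. Assumes kubectl is configured.
import json
import subprocess

out = subprocess.run(
    ["kubectl", "get", "deployments", "-o", "json"],
    capture_output=True, text=True, check=True,
).stdout

for deploy in json.loads(out)["items"]:
    name = deploy["metadata"]["name"]
    replicas = deploy["spec"].get("replicas", 1)
    if replicas < 2:
        print(f"{name}: only {replicas} replica, no graceful failover")

# Scale the demo's ad service so one pod can fail without taking ads down.
subprocess.run(["kubectl", "scale", "deployment/adservice", "--replicas=2"], check=True)
```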
So there are some other different experiments that we can run. Now let's black hole the recommended product service. So what we want to do is go back over here; recommended products is this section here, other products you might like.
So I'm just going to reload that page. You can see here
the ad service is now back and we see our recommended product section. So we
think it should be the same. This section should just disappear.
All right, so let's go scenarios, new scenario
black hole recommended products,
no outage.
And let's find it over here in Kubernetes. You can see we've selected it. Network, black hole. Leave it for 60 seconds.
Add to scenario, save the scenario.
Now let's run it.
Okay, so we're just going to be black holing the recommended products
service. It's currently in the status of pending.
Looks like there's an issue. So I'm trying to click on vintage typewriter and
it's just trying to load but it's not allowing me to get to
that page. I've just got it loading. I'm kind of stuck
here. Clicked on film camera, same thing.
Nothing's happening. Let me try and add something to my cart.
Oh, that's not working either. So there's definitely
an issue here. I'm going to try and go back to the home page.
Okay, so I can get to the home page. Can I click on an item?
No. So the only thing that I can do right now, the only thing
functioning is enabling me to go to the home page. Let's see if I can
get to the cart. No, I can't get to the cart either. So this page is
basically not working at all. This website is not working at all.
I can't really do anything much.
I can't get to my cart. I can't add items. So we would consider this
to be an outage. The website's just not working at all.
It's not enabling us to make any purchases. So we're
not going to be fulfilling anything from any customer. So that's definitely a
bad example. So let's say, all right, we know that this is
not good. Let's halt it. No, it failed. A potential
incident was uncovered and the incident
was not mitigated.
So let's write: full outage, website not working at all to process orders or navigate.
All righty.
Click save there.
So that was a bad example. Now that I've halted
it, let's just check. Everything's still good. Yep. So really
easy, quick way to be able to test what happens if you blackhole a
service. Obviously there's some type of issue with hard coding that
blackhalling this recommended product service breaks the entire website.
Obviously we don't want that. That's really bad. All righty.
So we've saved our results into the system and we said
it was supposed to be graceful with no outage. But actually the result was
a full outage website wasn't working to process orders and we couldn't navigate either.
So we learned a lot from doing that scenario and experiment too.
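The shape of the fix we'd be looking for: call the recommendation dependency with a short timeout and degrade to an empty list, so a missing service hides the section (like the ads did) instead of hanging the whole page. The URL and response shape below are illustrative assumptions, not the demo app's actual interface.

```python
# Sketch: fail fast on a non-critical dependency and degrade gracefully.
# The endpoint and JSON shape are assumptions for illustration.
import requests

def get_recommendations(product_id: str) -> list:
    try:
        resp = requests.get(
            f"http://recommendationservice/recommend/{product_id}",
            timeout=0.2,  # fail fast instead of blocking the page render
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # Graceful degradation: show the page with no recommendations.
        return []
```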
Now let's see,
the last one that we want to run is a packet loss attack on the
cart service. So let's go over and do that. All righty,
so now we're going to do our packet loss attack.
Let's go over there. Okay, so here's my shopping
cart. You can see I've actually got 21 items in my cart right now. I've added ten bikes, one record player, and ten vintage camera lenses.
Let's just add one more airplane. Now what I want to
do is create a new scenario and we're going to call this
packet loss to cart 80%. And we think this
should be actually graceful.
No cart items lost.
Let's go over to Kubernetes and then select
the cart. All right,
cool. Now we'll go to network, then packet loss, and run a packet loss attack. We're going to add that to our scenario, save the scenario
and run it.
Okay. So as that starts to get set up right now it's in
pending. We want to go over here and let's see. It should be starting to
run now. All right. Hasn't run yet.
Still in pending. All right. Now there's
some issues. I'm clicking on metal camping mug and it's not letting me load
anything. Clicking on city bike. Nothing happening there. I'm going
to go and click on the online boutique logo at the top to see if
I can get back to the home page. No,
It still says I've got 22 items in my cart, though, but I can't actually go to my cart.
So we do look like we're stuck here. This is definitely another outage situation.
So packet loss to the cart service is causing the entire site
to be unusable. So that's not what we
expected. And then we've also now got this 500 error. And we have logs
visible to all users, which is also really not good.
So obviously a lot of things that we learned there.
So what happens here is Gremlin is going to abort that attack because it realized
there's a problem and we'll say no, it failed.
A potential incident was uncovered and the incident was not mitigated.
So we got a 500 service error.
I'm going to actually copy this over
into our system to store those results and
click save so we can actually save the error message
and refer back to that later on.
So, yeah, that was not a good situation, but we did learn a lot.
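For reference, this is roughly what 80% packet loss looks like under the hood using tc/netem; again an illustration rather than Gremlin's implementation, with a placeholder interface name, root required, and disposable test hosts only.

```python
# Conceptual packet loss: randomly drop 80% of packets on one interface.
# Placeholder interface; requires root; NOT Gremlin's implementation.
import subprocess
import time

IFACE = "eth0"

subprocess.run(
    ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "loss", "80%"],
    check=True,
)
try:
    time.sleep(60)  # attack length
finally:
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)
```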
So now, in summary, these are the three different types of
experiments we ran.
First, black holing ads. That was awesome: graceful degradation, just like Netflix. Number two, black holing recommended products: major incident, not good. We have to do some work to figure out what the hard-coded issues are there between dependencies.
Number three, packet loss to the cart gave us a 500 error, but it looks like
we didn't have any data loss to our cart. Let's just
go and check that to confirm.
Cart has 22 items still in it, so no data loss.
So we're all good there.
All right,
so here are some questions you want to ask yourself at the end of all
of your chaos engineering work. Was it expected?
Did we uncover unknown side effects? Was it detected?
We can ensure that our monitoring is correctly configured. Was it mitigated?
Did our systems gracefully degrade like in the first example?
Did we fix the issues, whether it's code, config, or process? Can we automate this? We can then regularly run past failures
as exercises to prevent the drift back into failure.
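A minimal sketch of what "run past failures as exercises" can look like: a probe that watches the storefront while an attack replays, and fails loudly if the site stops responding. The URL and thresholds are placeholders; wire the attack itself up through whatever tool you use.

```python
# Sketch: automated regression check that runs alongside a replayed attack.
# STORE_URL is a placeholder for your frontend.
import time
import requests

STORE_URL = "http://localhost:8080"

def site_healthy() -> bool:
    try:
        return requests.get(STORE_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def probe_during_attack(duration_s: int = 60, interval_s: int = 5) -> None:
    deadline = time.time() + duration_s
    while time.time() < deadline:
        assert site_healthy(), "site went down during the experiment"
        time.sleep(interval_s)
    print("experiment passed: site stayed up during the attack")

if __name__ == "__main__":
    probe_during_attack()
```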
And let's share our results. Prepare an executive summary of what you
learned and what you fixed. You can find a blog post on the Gremlin website about what I recommend that you write: an internal chaos engineering report, like "10x reduction in incidents", what kind of charts to use, and how to explain it to folks. It's really important to share that story.
I also recommend that you join the chaos engineering Slack; go to gremlin.com/slack. And as a summary, what should you do next?
Pick one service or application to practice chaos engineering on. Form a hypothesis. Baseline your metrics. Set client-centric SLOs and SLIs
per service. Consider the blast radius. Determine what chaos to
inject. Run your chaos engineering scenario. Measure the results
of your chaos engineering scenario. Find and fix issues, or scale the
blast radius of the scenario. This is what I recommend as
you ramp up all of your work when you're doing chaos
engineering. Start by making sure your agents are correctly
installed. Onboard a team to be able to help work
with you. Run your first scenario. We also offer
Gremlin bootcamps, which are a really cool way to get trained up. Those are free.
Go to gremlin.com/bootcamps. Then automate your chaos engineering: run game days as a team,
do some scheduled attacks, actually meet with your
executive team to share those results, and then integrate with your
CI/CD, so you're actually using chaos engineering more as
a shift-left practice. If you go to gremlin.com/bootcamps, you'll find all of the 101 introduction to chaos engineering as well as the 201 (automation, CI/CD), the more advanced topics. And go to the Gremlin blog to
hear about the Gremlin Chaos champion program. There are lots of amazing people doing
work in this space. Thank you so much for coming along to my talk.
Really appreciate it. You can find me on Twitter if you want to ask me
any questions: Tammy X. Bryant. Thank you.