Conf42 Chaos Engineering 2021 - Online

In the kitchen: A sprinkle of fire and chaos

Abstract

Learn by fire! Chaos and Learning!

How do you ensure your food tastes good? Or maybe you don’t and everyone hates your cooking. How do you learn to avoid burning or cutting yourself while cooking? Is the kitchen on fire or is the smoke alarm just complaining that it needs a new battery? What does all of this have to do with the cloud? Chaos and Learning! She will talk about her favorite ingredient for shipping resilient cloud-native applications.

Summary

  • Today's first question is, how did you learn how to cook? Are you someone that had to learn by watching others cook? There's no right or wrong answer. Everyone learns differently. What are those learning methods that we use in the kitchen that we can apply to building reliable, complex applications and systems?
  • What is the best way that you learn? Are you someone that needs a very specific set of methods in order to pick up a concept? For me, my favorite way was building and breaking.
  • For me to learn and have growth in this cooking space, I have to cook for others. I love tasting the plates as I go. This is something very similar to what we do in building our applications. There's a lot of beauty in cooking, and that is because we can learn.
  • Ana Margarita Medina is a senior chaos engineer at Gremlin. She was born and raised in Costa Rica and her parents are from Nicaragua. One of the things that really, really matters to her is representation in tech.
  • To bring it back to today's focus, we're going to be talking about learning. What are those things that we can do every single day in order to push ourselves past our comfort zone and take a step into learning?
  • The world that we're building relies more and more on the stability of naturally brittle technology. When we are not able to have applications and systems that are up, we suffer downtime. With this current complexity of our systems, we really, really need experimentation.
  • Chaos engineering is thoughtful, planned experiments designed to reveal the weaknesses in our systems. This is not about breaking things for fun; we break things on purpose to learn from those failure points and improve our applications.
  • The scientific method involves observing your system and baselining your metrics. Abort conditions are those things that can happen to your systems, or that you might see in the monitoring or user experience, that tell you to stop the experiment. The last and one of the most important steps in chaos engineering is sharing the results.
  • I did want to go over some chaos engineering experiments that we can create. On Kubernetes, resource limits are there to make sure that things are scaling properly. With these types of experiments we get a chance to inject latency and block off traffic. If you're interested in learning about the effects of this experiment, come to one of my boot camps.
  • When we're talking about cloud native applications, we have to make sure that we're ready for failure. Practice is one of the key terms in building resilient applications. If you do perform some chaos engineering experiments, you get a chance to understand the failures and continuously improve on those issues.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey all, thank you very much for tuning into my talk, and thank you, Conf42, for putting this virtual event together. Today I'm going to be talking about sprinkles of chaos and fires and things that can happen in the kitchen. How is it that we can take a moment to learn in those situations? And what are those learning methods that we use in the kitchen that we can actually go ahead and apply to building reliable, complex applications and systems?

Today's first question is, how did you learn how to cook? Are you someone that had to learn by watching others cook, whether it was YouTube tutorials, cooking shows, or watching a family member? Are you someone that had to learn by getting hands on and actually attempting some of those recipes? Are you someone that just needs to have feedback early and often in the cooking process to be able to know that you're doing things right? This might just be asking yourself as you cook, how much salt am I supposed to be using? What temperature do I need to run this at? Or are you one of those folks that had to use a fire extinguisher to learn? There's no right or wrong answer. Everyone learns differently. And that's why I want to kick off today's talk with this question: how do you learn? Are you someone that needs a very specific set of methods in order to pick up a concept? And when we bring it back to non-cooking topics, what is the best way that you learn? Are you someone that does need that content? Are you someone that needs that hands-on learning, to be able to apply it? We see some folks really go to a lot of conferences in order for them to learn. A lot of them want to just watch YouTube content and go through tutorials and read blog posts in order to get an understanding of a technology, a different tool, or just take a deep dive into a certain technology topic. And I know for me, my favorite way was building and breaking. I always wanted to get hands-on, and then I had to break it, learn how to debug it, and be able to share a little bit about that.

So since I am talking about the kitchen, I do have to bring it back to things that happen in the kitchen. How is it that I, in my cooking non-career, have learned to make food taste good? For me, I have to spend a lot of time tasting as I cook, whether it's as I'm seasoning or just trying to make sure that the food is going to turn out tasty and that I cook the shrimp and the chicken at the proper temperatures. But the biggest part of me learning how to cook is that I'm not someone that likes cooking for one. Cooking for myself is not fun. For me to actually be able to learn and have growth in this cooking space, I have to cook for others. And as I cook, I do spend a lot of time reading content, watching YouTube tutorials, and I end up meshing four to five recipes together. But I'm someone that has to taste the food as I go. And I brought that example of, am I using enough salt, or is this too much salt, or is this so much salt that I need to throw away my plate and start over? It goes back to me learning that I have to go ask for feedback early and often. And that's okay. It's you having to learn what works best for you. And sometimes you're trying to learn a new plate. You might have to burn it one time, two times, three times, until you start getting the handle of the recipe and the cooking methods that you need to use. So, yes, Ana, I'm at a tech conference, supposed to be talking about chaos engineering.
Can I stop talking about cooking? Yes. Yes, I will. I want to bring it back to this: there's a lot of beauty in cooking, and that is because we can learn. We can constantly take a step back and improve. When I cook, I share that I love tasting the plates as I go. This is something very similar to what we do in building our applications. We want to go ahead and observe and do gradual rollouts of our applications, whether it's perfecting a recipe before you show it to a loved one, or whether it's taking the chicken out of the pan and making sure that it's actually cooked properly. And I also mentioned that other portion, that I love cooking for others. And that is because my focus is on that end user experience, those customers. And this is exactly where that beauty of experimentation comes in, whether I'm actually trying to add a little sweeter kick or a spicier kick to my plate, or I'm just trying to get my empanadas crispier. And sometimes, and often, this actually happens in my house, I actually need to burn my plate in order for me to learn. This is where we go ahead and take that concept into building applications, such as replicating past incidents. We have to learn and practice in order to perfect a skill. And if that means having to use a fire extinguisher along the way or throw away your plate, that's okay, because you're doing this for learning.

With that, I wanted to take a moment to introduce myself. My name is Ana, Ana Margarita Medina, senior chaos engineer at Gremlin. I love introducing myself as a self-taught engineer. I started coding in 2007, got a chance to do a lot of front end work, moved on to learn a little bit of back end, and I somehow transitioned to building iOS and Android applications. In 2016, I got a chance to come into this beautiful world of site reliability engineering and get a chance to actually learn: where is it that my code runs? How do I make sure it stays up and running? And that is also when I picked up chaos engineering. And it was that moment where I was like, oh, I love learning by building and breaking and being able to take these concepts and break them down into little chunks. Whether it was trying to understand Linux capabilities, a certain application, or trying to understand the complexity of microservices, all those things made it really fun. And one of the things that really, really matters to me is representation. If you can't see it, you can't be it. So I love making a comment about being a Latina. I was born and raised in Costa Rica. My parents are from Nicaragua. So for any underrepresented person in tech that's watching this, keep on going. You got this.

To bring it back to today's focus, we're going to be talking about learning, and what are those things that we can do every single day in order to push ourselves past our comfort zone and take a step into learning. Hopefully all of you have had a chance to think a little bit more about how you learn, and what is the best way that you can pick up something new. And maybe you learn very differently than me, and that's okay. Maybe I actually didn't even cover the way that you actually like learning. And I did want to touch upon some of the other ways that folks do learn. And that could just be by practicing, whether it's trying to pick up a new instrument and going through and doing that work to learn it, or just making sure that you're building that muscle memory in order for you to continue doing this practice.
And hey, if you have to use a fire extinguisher or burn yourself in the oven before learning, I've been there. I just burned myself two days ago trying to use my cast iron. It's always going to happen. And that's okay.

I am trying to bring this back to software and technology. As we know, the software and technology that we use every single day breaks. The world that we're building relies more and more on the stability of naturally brittle technology. The challenge that we face now is, how is it that we continue innovating and delivering products and services for our customers in a way that minimizes the risk of failure as much as possible? And when we talk about delivering these experiences to our customers, we have to understand that when we are not able to have applications and systems that are up, we suffer downtime. And downtime costs a lot of money. We have things that happen during the outage that are quantifiable, and those come down to revenue. Go and ask your accounting or sales team to try to understand what are some of those costs that come into play. We also have the portion of employee productivity: as your engineers are dealing with an outage, they're not working on features or things to make your product better. And then that brings us to things that can happen after the outage, which is customer chargebacks. Maybe you're breaking some of those service level agreements and you have to give money back to your customers. We have this other bucket that makes downtime really expensive, and those are the unquantifiable costs. These things can be seen as brand defamation, whether it's the media picking up that your company or systems are down, or maybe it's just happening all over Twitter. And the thing, too, is that customers don't want to use broken products or applications. And sometimes you can actually go ahead and see that happen pretty easily, especially in the stock market, overnight. One of those other portions of unquantifiable costs comes down to employee attrition. People don't want to work at places where they're constantly going to be firefighting. You're going to suffer burnout rates that are really, really high. And word gets around in the tech industry, where folks talk about this vicious cycle of fewer people to handle those incidents, which just leads to more burnout. The average company is expected to lose around $300,000 per hour that they're down, and that number has nothing to do with their high-traffic events or any new launch that they're coming up with.

And when we talk about building reliable applications, we also have to understand that the world that we're building is only getting more complex, which makes it very difficult for us to keep operating our applications reliably. The pressure for faster innovation is driving the adoption of new technologies, whether it's new types of infrastructure, new coding languages and architectures, or just new processes that we want to get a handle on. And when we talk about this complexity, we could also take a step back and understand that the complexity hasn't always been like this. In legacy applications, when we were doing waterfall processes in our companies, we only had one release. We only had one thing to care about. We also only had one service to keep up and running when we had monolith architectures, and maybe we were only managing hundreds of servers as our organizations were on prem. That complexity was a lot smaller.
But we've lifted and shifted and rearchitected our applications, and now we're in this world where a lot of things are cloud native. And thankfully, we've seen a lot of organizations adopt things like DevOps, and that allows us to have daily releases that let us deliver better experiences to our customers. We now have microservices. So instead of having one service to keep up and running during all this time, we now have hundreds of services to keep up, and all of those have interdependencies with each other or with third-party vendors. We've also seen that in this cloud native world, we now don't just have hundreds of servers to take care of, but hundreds of thousands of Kubernetes resources that we need to make sure are all reliable and tested, and that we have documentation on how to keep them up and running.

So with this current complexity of our systems, we really, really need experimentation. Folks just want to move fast and break things. But what if I tell you that there is a better world? A world where you can slow down just a bit and spend more time experimenting and verifying that you're building things reliably, specifically so that our users are constantly happy with our products and our services and continue being customers of our companies. At the end of the day, we're building a complex and distributed system, and there are things that we must test for, or we will suffer an outage. There are failures that you might see in the industry that happen every few months, that happen once a year, or just outages that get so large that we can take a moment to actually learn from other companies' pain points and make our systems better.

And that brings me to my favorite ingredient for today's talk: chaos engineering. We're going to talk about this for the entire rest of the conversation. The definition of chaos engineering is that it is thoughtful, planned experiments designed to reveal the weaknesses in our systems. And I have bolded the words thoughtful, planned, and reveal weaknesses in our systems, because this is not about just breaking production for fun or making sure that the team that you work with can actually handle their on-call rotation. This is about doing it in a very thoughtful, planned way where you communicate and you build that maturity. And the purpose is not just breaking things. We do this with the purpose of breaking things on purpose, to learn from those failure points and improve our applications.

As we talk about chaos engineering, I want to take a step back and just explain some of the terminology that's going to come up in today's talk. We're going to be running experiments. This goes back to using the scientific method to go ahead and learn from our systems. By following that scientific method that we learned many years ago, we have that fundamental step of creating a hypothesis: if such a failure happens to my system, this is what I expect will happen. We also have some safeguards that come into play with chaos engineering, such as blast radius. Blast radius is that surface area that you're running that experiment on. This can be seen as one server, ten servers, 10% of your infrastructure, or only one service out of your 100-microservice architecture. That is that blast radius. The other term that we have, very similar to blast radius, is magnitude. Magnitude is the intensity of the chaos engineering experiment that you're unleashing. This can be seen as increasing CPU by 10%, then gradually going to 20%, 30%, and such.
Or it can be seen as injecting 100 milliseconds of latency, going up to 300, and incrementing all the way to 800 milliseconds of latency. That is your magnitude. By using the blast radius and magnitude, you can really make your experiments thoughtful and planned. The last term that I want to cover in this section is abort conditions. Abort conditions are those conditions that can happen to your systems, or things that you might see in the monitoring or user experience, that will tell you that you need to stop this experiment. This portion is really critical when creating your experiment. You want to make sure to ask yourself: when is it that I stop running this experiment? When is it that I can make sure that the experiment rolls back?

Now that I've covered the terminology that gets used in chaos engineering experiments, let's actually talk about how this scientific method comes together. The first step that we take in a chaos engineering experiment is taking a step back and observing our systems. This can be done by just pulling up your architecture diagrams and trying to understand the mental models that you currently have of your application, or maybe it's trying to understand how all of your microservices talk to each other. You can also observe your system by just understanding the metrics that are coming in and how this ties into all the other systems in your application. The next step that we take after that is that we want to go ahead and understand how our system behaves under normal conditions. This can be done by just baselining your metrics. This is also a great opportunity to set some service level objectives and service level indicators, per service, that allow you to understand whether your application is healthy or not.

This allows us to move on to the next step: forming a hypothesis, along with abort conditions. This is one of those important steps where you get a chance to take a step back and try to understand, what is it that I think will happen to my application? But also, how is it that I can make sure that we don't cause a failure that can affect our customers? We set those abort conditions and are ready to take action on them. Then we can actually go ahead and define that blast radius and magnitude and say, I want to run a CPU experiment on just 20% of my infrastructure, and that experiment is going to increase CPU so that it's running at 70% on those hosts. We then go ahead and we run the experiment. This is that fun time that you get a chance to do with your team. But many teams don't always get a chance to run the experiment; that doesn't mean that they didn't learn anything from steps one through five. As you run that experiment, you want to take a moment to analyze those results. You want to understand, after you've put these conditions into your system, how did it behave? How did this behavior correlate to the hypothesis that you created? And if your experiment is successful, go ahead and expand your blast radius, expand that magnitude, and get ready to run that experiment once again. And if your experiment was unsuccessful, hey, that's okay. You just learned something. Take a moment to actually see what will make your application more reliable and work on that. Then go ahead and run this type of experiment again, just to make sure that the improvements that you've put in actually help your application's reliability.

And that last step is one of the most important steps that we have in the chaos engineering process, and that is sharing the results. This comes down to actually sharing the results with your leadership team and across your organization. And I always take it a step further and say, go ahead and share the results and share those learnings with the wider communities, whether it's the chaos engineering community, the open source communities of the tools that you're building with, or just any other type of tech conference, and talk a little bit more about some of the ways that you've been building and breaking things.
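To make that process concrete, here is a minimal sketch, in Python, of how one run of an experiment like this might be driven. It is only an illustration: get_error_rate, inject_cpu_load, and stop_all_attacks are hypothetical stand-ins for your real monitoring and chaos tooling, not any particular product's API.

```python
import random
import time

# Hypothetical stand-ins for real monitoring and chaos tooling.
def get_error_rate() -> float:
    return random.uniform(0.0, 0.02)      # pretend to query your dashboards

def inject_cpu_load(hosts, percent) -> None:
    print(f"injecting {percent}% CPU load on {hosts}")

def stop_all_attacks() -> None:
    print("halting all attacks and rolling back")

ABORT_ERROR_RATE = 0.05          # abort condition: more than 5% of requests failing
MAGNITUDE_STEPS = [10, 20, 30]   # CPU load to inject, in percent, growing gradually
BLAST_RADIUS = ["host-1"]        # start with a single host

def run_experiment() -> None:
    baseline = get_error_rate()                      # observe and baseline
    print(f"Hypothesis: error rate stays near {baseline:.2%} under CPU load")

    for magnitude in MAGNITUDE_STEPS:                # gradually increase magnitude
        inject_cpu_load(hosts=BLAST_RADIUS, percent=magnitude)
        time.sleep(1)                                # wait (minutes, in a real run)

        current = get_error_rate()                   # analyze the results
        print(f"CPU +{magnitude}%: error rate {current:.2%}")

        if current > ABORT_ERROR_RATE:               # abort condition hit
            stop_all_attacks()
            print("Abort condition met -- halting the experiment")
            return

    stop_all_attacks()
    print("Experiment complete -- share these results with your team")

run_experiment()
```

If the abort condition never fires, the next run can expand the blast radius or the magnitude; if it does fire, you have found a weakness to fix before your customers find it for you.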
I did want to go over some chaos engineering experiments that we can create, at least to get you all started in thinking about this. One of the big ones that I've been seeing across the board, whether it's folks on containerized Kubernetes environments, those that have adopted cloud technologies, or those that are hopeful that their applications will scale with regular use, is making sure that you're planning for and testing those resource limits. On Kubernetes, resource limits are there to make sure that things are scaling properly. But we can also take a step back and think: how is it that we're making sure that, when we're using cloud technologies, autoscaling is actually set up, and that you actually have an understanding of how long it takes autoscaling to bring a new node in, how long it takes for that new node to join the rest of them, and for it to report back to your monitoring and observability dashboards, in order for you to make sure that things are up and running? And these things can actually be implemented in a chaos engineering experiment by just having a resource impact. So for some of the autoscaling work that I do, I always start out by just saying: go ahead and run a chaos engineering experiment and have that increase be up to 60% of CPU on your servers, and go ahead and make sure to run that small experiment on all of your hosts. You create those abort conditions so that you'll stop that experiment if your application is not responsive, if you start seeing HTTP 400 or 500 errors, or anything that doesn't feel right for the customer. And you can also take a step back and understand what metrics you were looking at for your systems. It might be things like response rates or traffic rates slowing down. When we think about the hypothesis for an experiment like this, we want to ask: what is it that's going to happen to my system when CPU increases? Do I expect that in two minutes the new node will be up and running? Or do I expect that traffic from one server is also going to be routed to another one while this new node is coming up?
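As a sketch of how an experiment like this might be written down and shared before it runs, here is a hypothetical "experiment card" in Python. The class and its field names are made up for illustration; they are not Gremlin's schema or any other tool's.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExperimentCard:
    """A hypothetical write-up of one chaos engineering experiment."""
    hypothesis: str
    blast_radius: str
    magnitude: str
    abort_conditions: List[str] = field(default_factory=list)
    metrics_to_watch: List[str] = field(default_factory=list)

cpu_autoscaling = ExperimentCard(
    hypothesis=("When CPU reaches 60% on our hosts, autoscaling brings a new "
                "node up within two minutes and traffic is rerouted without "
                "customer-visible errors."),
    blast_radius="all hosts in the autoscaling group",
    magnitude="increase CPU usage up to 60%",
    abort_conditions=[
        "application becomes unresponsive",
        "HTTP 400/500 error rates rise above baseline",
        "anything that doesn't feel right for the customer",
    ],
    metrics_to_watch=[
        "response rate",
        "traffic rate",
        "time for the new node to join and report to monitoring",
    ],
)

print(cpu_autoscaling)
```

Writing the hypothesis, blast radius, magnitude, and abort conditions down before running anything is what keeps the experiment thoughtful and planned rather than just breaking things.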
One of the other ways that we can think about chaos engineering experiments is trying to understand what happens to our systems when one of our dependencies fails. This can be a dependency on an image provider, or a third-party vendor that actually processes payments. When our application can't access that resource, what does your user see? What is the user experience like? And with these types of experiments, we get a chance to do things like inject latency or block off traffic to a certain port, application, API, or URL, and we can start doing that to try to understand how the UI handles this failure, and how our microservices are coupled such that this becomes a single point of failure that can actually bring us down for a while. And when I set the slides up, the experiment that comes to mind is something running on a Kubernetes environment that, on your architecture diagram, might just not be seen as a primary dependency. We see it as just a caching layer, and that is this Redis cart that I have written down here. That hypothesis comes down to me thinking that when my caching layer has a latency increase, since this is just my caching layer, the application should still continue working without any issues. If you're interested in learning about the effects of this experiment, come to one of my boot camps and you'll get a chance to understand how this all couples together.
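Here is one way to sketch that caching-layer hypothesis in Python: delay the cache lookup and check whether the request path still returns something sensible. fetch_from_cache, fetch_from_database, and the timeout are all made up for illustration; in a real experiment you would inject the latency at the network or Redis level rather than inside the application code.

```python
import time
from typing import Optional

CACHE_TIMEOUT = 0.2                    # seconds the app will wait on the cache
INJECTED_LATENCIES = [0.1, 0.3, 0.8]   # magnitude: 100 ms, 300 ms, 800 ms

def fetch_from_cache(key: str) -> Optional[str]:
    return f"cached:{key}"             # stand-in for a Redis lookup

def fetch_from_database(key: str) -> str:
    return f"db:{key}"                 # stand-in for the slower source of truth

def get_cart(key: str, injected_latency: float) -> str:
    """Hypothesis: if the cache is slow, we fall back to the database."""
    start = time.monotonic()
    time.sleep(injected_latency)       # the chaos: delay the cache call
    value = fetch_from_cache(key)
    if value is None or time.monotonic() - start > CACHE_TIMEOUT:
        return fetch_from_database(key)
    return value

for latency in INJECTED_LATENCIES:
    print(f"{int(latency * 1000)} ms of cache latency ->", get_cart("user-42", latency))
```

If the application has no fallback path at all, an experiment like this is exactly what reveals that the "optional" caching layer is really a single point of failure.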
So I am in a kitchen talk, and I now have to talk about that recipe that I do have for building reliable applications. It first starts off by making sure that we have availability, that we have capacity to actually run our applications at the large scale that we need to. When we're talking about cloud native applications, we have to make sure that we're ready for failure, whether it's an entire region having issues, or being ready to fail over from one data center to the other, or from one cloud to the other if you're multicloud or hybrid. And that takes us back to that last step, where you also want to make sure that you have some form of disaster recovery and business continuity plan, and that you've been exercising those plans in a frequent manner. It also comes down to that portion of reliability: making sure that our systems can sustain these failures that happen day to day to our applications. It comes to that moment where we want our engineering teams to actually experiment and try to build better products and features, and where we also get a chance to continue innovating. As I've mentioned multiple times, practice is one of these key terms in building resilient applications. We're building really complex things that have a lot of dependencies. By practicing, we are able to understand a little bit more about how all these services and tools play together, and your team is also going to be better equipped to go back to that point of reliability and keeping things up and running. So the best thing is that all these things get a chance to come together and be tested and worked on and constantly improved on. If you do perform some chaos engineering experiments, you get a chance to understand the failures, constantly be learning from them, and continuously improve on those issues, across the people, processes, and technology portions of your systems.

Our applications live in such a distributed architecture that things are always going to be breaking. In complex systems, you have to always assume that it will break, or we take it back to Murphy's law: anything that can go wrong will go wrong. We have to prepare for those failures, and we have to always tell ourselves and our teams: always test it, go ahead and break it. Before you go ahead and implement it, you want to go ahead and battle test some of the technologies that you're trying to bring into your organization. This allows you to understand those dependencies, those bottlenecks, those black swans that you might not be able to see until you get a chance to put it all together with the rest of your applications. You want to understand what the default parameters of this tool are and whether or not this actually works straight out of the box. Are there any security concerns that you need to have in mind with any of these tools? And how does this tool or application need to be connected with the rest of my application in order for me to build it in a reliable manner? You also want to go ahead and always ask: what is it that's going to happen when X fails? X can be any URL, any API endpoint, any little box in your architecture diagram, or even just one of the processes that you have in place. And especially when you're looking at those architecture diagrams, please ask yourselves, what is going to happen if this tier two application goes down? Hopefully you have a good hypothesis for it. Hopefully you've gotten a chance to practice on it and ask that hypothesis question.

You also want to take a step forward and ask, what is it that your organization is doing day to day to focus on reliability? This is something that the entire company needs to be focused on in order to have the uptime that your customers might be needing, but these might be things that happen behind the scenes, the shadow work. Or this is you actually picking up technologies like chaos engineering to innovate in your engineering workspace. You can also just start asking what work is being done today that makes sure that we're actually not regressing into a past failure, that we're not about to relive that past incident that you were on call for five months ago: making sure that you've gone through those old tickets, making sure that you've actually gone through your systems and maybe replayed some of those conditions that caused that last incident, and asking, can our system sustain such failures if they were to happen again? And you want to understand how your system behaves on a day-to-day basis under normal conditions so that you can get ready for those peak traffic events, for those days when you're going to have more users on your website, or when other things might be breaking within the dependencies that you've built in. You want to remember that you have to practice. You have to question everything, whether it's your systems or general knowledge about your applications. And that is that beauty that always keeps me coming back to chaos engineering. It is a proactive approach to building reliable systems, and you get a chance to build reliable applications and systems, but you also build reliable people and organizations with that.

I would like to close out and offer you a nice little takeaway. If you're interested in joining the chaos engineering community and getting some of the chaos engineering stickers that you see up here on the slide, head on over to gremlin.com talk Ana Conf42. And if you have any questions about this talk, the topic, or anything to do with chaos engineering or Gremlin, feel free to reach out via email at ana@gremlin.com, or you can reach out via any of the social media platforms. I'm usually Ana_M_Medina.
And if you're interested in giving Gremlin a try for free, you can always go to go.gremlin.com slash Ana to sign up and try the full suite of Gremlin attacks. With that, thank you all very much. Have a great one.
...

Ana Margarita Medina

Senior Chaos Engineer @ Gremlin
