Conf42 Chaos Engineering 2024 - Online

Chaos engineering war games – prepare for the unexpected

Abstract

Chaos engineering war games is a potent tool to drive resiliency goals. We conducted several war games to prepare for the unexpected. This presentation serves as an introduction to chaos engineering war games, offering a collection of lessons learned to assist organizations in jump-starting their war gaming initiatives.

Summary

  • Gabor Gerencser: Chaos engineering war games can help prepare your organization for random events. He says it's not just about the war games, it's about spreading the word about chaos engineering and its benefits. Ultimately, chaos engineering war games help teams build confidence in their ability to respond effectively to unexpected challenges.
  • It's crucial to convince stakeholders that chaos engineering war games are an effective way to enhance the organization's resilience. War games help avoid outages and improve accessibility and availability, thus increasing customer satisfaction and revenue. While chaos engineering isn't cost free, its benefits outweigh the expenses.
  • There are two main categories: tabletop exercises and environment-based war games. Tabletop exercises are an easy way to get started with chaos engineering war games. The more complex war game types require dedicated environments and a longer duration.
  • Gamification is super important. You can use Monopoly-type chance cards to introduce chaos into the chaos war games. For tabletop war games you can use TV quiz show formats to make them more interesting. Just make sure that you keep it safe and keep it fun as well.
  • We began with tabletop war games, which proved to be an effective and quick method for identifying obvious issues. From there we progressed to more complex war games. The more complex the exercise, the more costly it will be to run. Keep it simple, just pull a cable.
  • Start with common resilience scenarios. Fix known issues first and then test them. As you learn about the war games, you increase your knowledge. The whole war game needs to be production-like to prepare your team to handle production incidents. Keep the exercise straightforward and realistic.
  • One of the most important aspects of war games is the participants. Their knowledge, skills, and readiness determine how well an organization can handle unexpected events. When setting goals, it's essential to consider both business and technical goals.
  • Nearly the final topic is delivering results: it's important to identify your improvement areas. Clear communication of the effort, findings, and results of war games is important. Continuous improvements are essential to maintain resilience and effectiveness.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Welcome, everybody, to the Conf42 Chaos Engineering conference and to Chaos Engineering War Games. I'm Gabor Gerencser. I work for Vodafone UK as a tech lead in the performance and chaos engineering team. By profession, I'm a software development engineer in test. Join me to see how chaos engineering war games can help prepare your organization for random events. How do you prepare your organization for the unexpected? In this session, we will explore random events and their impact on the organization, and how chaos engineering war games can help to deal with them. We will also discuss why it is important to know the state of your software solution and organization, and how to convince your stakeholders that chaos engineering war games are an effective way of building resiliency. We will dive into the different types of war games we used at Vodafone, along with the fun elements we added to keep participants engaged. I will share lessons learned from our war games, including how to get started, which exercises work best, managing participants, and setting goals. Lastly, we'll talk about sharing the results of chaos engineering war games within your organization and beyond. It's not just about the war games, it's about spreading the word about chaos engineering and its benefits. To add a bit of insight, I included a couple of quotes reminding us that chaos and chaos engineering have many aspects. This quote, for example, illustrates that chaos is often a product of our interpretation of events, influenced by our limited understanding. And as the quote suggests, our lives are filled with unpredictable events. In fact, amidst all the changes, these random events remain the only consistent aspect of the universe. Their impact is felt both personally and within the organizations we are part of, as well as in the software solutions we manage. Consider these organizations and software solutions as complex systems with numerous components interacting to produce outcomes.
The complexity of these systems makes it challenging to fully comprehend their behavior and functionality. To make sense of this complexity, we establish boundaries. These boundaries allow us to focus our attention on specific areas, such as our organization or software solution, making this flood of events, knowledge, and information more manageable. Yet it's essential to remember that we don't exist in isolation. We are influenced by our environment, which can have both positive and negative effects on our complex system. You have likely heard of the known knowns. Known knowns are the events we are familiar with and understand. We can plan for and manage these events because they hold no element of randomness. However, uncertainty arises when we encounter events we are unaware of or don't fully comprehend: the unknown knowns and the known unknowns. Software solutions can mitigate some of these uncertainties through resiliency practices like circuit breakers. As information decreases further, we encounter even more chaotic events. These are the unknown unknowns. Preparing for these is incredibly challenging, if not impossible, as they represent truly disruptive occurrences that catch us completely off guard. While it's difficult to provide examples of the unknown unknowns, acknowledging their existence is crucial, as they have the potential to significantly impact the organization. Think of chaos engineering war games as the equivalent of fire drills for an organization. Just as emergency services gather to simulate scenarios in specially designed buildings, these games provide a structured environment for organizations to test the resiliency of their software solutions and to be prepared for the unknown unknowns as well. Participants engage in simulated incidents, allowing them to practice and refine their responses to unexpected events. Similarly, firefighters practice different aspects of putting fires out.
Chaos engineering war games allow teams to improve their skills in handling various types of software failures or incidents. This might involve testing communication protocols, internal processes, or the application of specific knowledge to resolve issues efficiently. By actively participating in these simulations, organizations can identify weaknesses in their systems and processes before they become a problem. Ultimately, chaos engineering war games help teams build confidence in their ability to respond effectively to unexpected challenges, promoting a culture of preparedness, collaboration, and resilience within the organization. Understanding the current state of your system or organization is crucial for war gaming and for overall preparedness. Without a clear grasp of where your software and organization stand, it is difficult to spot weaknesses, analyze past incidents, set goals for war games, or identify knowledge gaps and inefficiencies in the processes. Knowing the current state is vital and must be coupled with good SRE practices like observability. Observability provides easy access to proactive measurements and analysis of the system. It can warn us of issues before they escalate, helping us anticipate and prevent problems. Through tools like logs, metrics, and other indicators, we can gain insight into the health of our systems. Metrics such as MTTR, defect numbers, incident numbers, code quality, and the others I listed on this slide are crucial indicators of system health. However, it is important not to rely solely on basic metrics. Understand your software and organization deeply and focus on what matters for improvement. Make this part of your business as usual, not just a one-off exercise. This wealth of information can guide better decision making, not just help war gaming, and keep your organization prepared for whatever comes its way.
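
As a rough illustration of tracking one of the metrics mentioned above, the following Python sketch computes MTTR from a list of incidents. It assumes MTTR means "mean time to resolve" (the talk does not expand the acronym), and the incident timestamps and the `mean_time_to_resolve` helper are hypothetical, not data or tooling from Vodafone.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 30)),     # 90 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 45)),  # 45 minutes
    (datetime(2024, 1, 21, 22, 15), datetime(2024, 1, 22, 0, 15)),  # 120 minutes
]


def mean_time_to_resolve(records):
    """Average of (resolved - detected) over all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),
    )
    return total / len(records)


mttr = mean_time_to_resolve(incidents)
print(f"MTTR: {mttr}")  # MTTR: 1:25:00
```

Comparing this number before and after a series of war games is one possible way to show whether incident handling is actually improving.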
In previous slides, I discussed the significance of chaos engineering and its role in preparing organizations for unexpected challenges. However, simply recognizing its importance isn't enough. It's crucial to convince stakeholders that chaos engineering war games are an effective way to enhance the organization's resilience, not just in terms of preparedness, but also in enhancing collaboration, knowledge sharing, and serving as a training platform. Stakeholders typically prioritize the quality and accessibility of the services provided to customers, as customer satisfaction directly impacts revenue. For instance, let's imagine an organization with 100 users interacting with the system, each generating 1,000 pounds per hour. That's a potential revenue of 100,000 pounds per hour. But if the software solution is inaccessible due to system problems, that revenue is lost. War games help organizations avoid outages, and this example underscores their importance in improving accessibility and availability, thus increasing customer satisfaction and revenue. Take the above numbers and change them by a factor or two to really see the risk of lost revenue. If you just increase the 100 users to 1,000 users, we are reaching a million pounds. That is the revenue at risk from an unplanned outage or, as SpaceX would call it, a rapid unscheduled disassembly. War games help identify problem areas and weaknesses, allowing teams to address them promptly. Additionally, they prepare incident resolution teams to be more effective, resulting in faster problem resolution. They help enhance code quality by encouraging reviews and the implementation of resiliency best practices, leading to decreased development costs. While chaos engineering isn't cost free, its benefits outweigh the expenses. Costs include preparation time, running the war games, and potentially environment and license costs.
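
The revenue example above is simple arithmetic, and it can be handy to have it as a small calculator when talking to stakeholders. This is a minimal sketch: the function name `revenue_at_risk` and the `outage_hours` parameter are illustrative additions, while the user counts and the 1,000 pounds per user per hour come from the example in the talk.

```python
def revenue_at_risk(users, revenue_per_user_per_hour, outage_hours=1):
    """Revenue lost if every user is locked out for the whole outage."""
    return users * revenue_per_user_per_hour * outage_hours


# Figures from the talk: 100 users, 1,000 pounds per user per hour.
print(revenue_at_risk(100, 1000))   # 100000
# Scaling the user base by a factor of ten reaches a million pounds per hour.
print(revenue_at_risk(1000, 1000))  # 1000000
```

Plugging in your own traffic and revenue figures gives a first-order estimate to weigh against the preparation, environment, and license costs of running war games.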
However, the impact of these war games on organizational resiliency far outweighs the cost in terms of revenue returns. Analyzing the current state, understanding the weaknesses of your complex system, and evaluating costs are essential for demonstrating the value of chaos engineering war games to stakeholders. Presenting this information to stakeholders helps them understand why the games are crucial for organizational resiliency and preparedness. Another quote. I like this quote because it reminds me that chaos, or random events, is not an enemy but an opportunity. If we didn't face chaos and random events daily, we would be content with our current solutions and wouldn't think of more efficient, better solutions to our problems. So it's good to see chaos not as a negative thing, but as an opportunity. We have talked about chaos engineering war games in general and why they are important. The next step is to discuss the types of war games we use at Vodafone. There are two main categories: tabletop exercises and environment-based war games. Tabletop exercises are an easy way to get started with chaos engineering war games. They involve gathering people, physically or online, around a table and going through various scenarios that could affect the software solution or the organization. This process helps identify weaknesses in processes, knowledge, communication, documentation, architecture, and, to a certain degree, the software solution itself. Tabletop exercises typically last one or two hours, require no additional environment, and incur minimal cost in terms of man-hours. They can be organized online, as I said, on site, or as a combination of both, and they can focus specifically on one team or involve multiple teams for a wider range of system coverage. However, it is essential to keep the group size manageable, ideally no more than 30 people.
Actually, the most we had at the tabletop war games at Vodafone was around 15 people, and that worked out quite well. Think about whether or not to record a war game: recording may affect participants' openness or willingness to talk about sensitive matters. If recording isn't strictly necessary, I suggest not recording the tabletop war games. Instead, use notes and have a scribe who records the important findings and discussions. Don't make the tabletop war games complicated; keep them simple. For example, at Vodafone we used a PowerPoint presentation to share different random events and scenarios with the participants. We started with simple exercises to warm people up, like missing vowels, asking for a specific meaning, such as what MTTR is, or leaving a couple of words out of MTTR so that people had to guess what it was. That was quite useful for making people relaxed, and then gradually you can continue with more complex scenarios and formats, like TV shows, just to make the tabletop war games more exciting. The format is not that important as long as it helps people to discuss topics, and making it fun helps as well, because that makes it easier to discuss sensitive topics. A tabletop exercise is a very efficient and low-cost way to start chaos engineering war games. The more complex war game type is the environment-based war game. It comes with a higher cost, it requires dedicated environments, and it typically lasts longer than a couple of hours. We run war games of up to 6 hours. It needs longer preparation due to its complex nature. It needs a briefing for participants and a longer retrospective as well. The participants need to understand the rules and what it means to participate in the war game. For example, if a test environment is used, it's not exactly like the real production environment. People need to know what they can and cannot do in that setting. The goal is different, although it may cover similar aspects to the tabletop war games, such as the software and the processes.
The main focus here is really on the software solution and how to handle real-life incidents and random events, and on the processes of the organization as well. As I said, the emphasis is more on the software side, which makes these war games more complex to organize. They require a longer duration, more participants, and the involvement of multiple teams. This increased complexity means that the teams' availability is crucial. We typically ask for primary and backup participants from each team to ensure there is a participant from each team even if the primary participant is unavailable. You can organize this similarly to a tabletop exercise: online, on site, or a mix of both. It's important that participants have access to the environment and can communicate as they would during a real production incident. Making the war game as production-like as possible generates really great value. Within the environment-based war games, we have different categories based on the target and the participants' availability. War games with a smaller focus area involve fewer people from a few teams. They are the most cost effective, as they may not require, for example, a full test environment. The larger-scale war games are full, end-to-end environment-based war games where we test the whole end-to-end system. They can involve more people, because production incidents in such an environment generally take more people to solve. The number of participants can easily reach 20 or even 30 people. We can differentiate further between test and production environment based war games. As you can guess, production environment war games are riskier, and the organization needs to reach a certain maturity level to run such a war game. It is important to keep in mind that it's pointless to test something if you already know it is broken. First, fix the known issues and test them in a controlled test environment. Once you have reduced production incidents to a rare occurrence, you can start to run war games there.
There is no point in running it in production if your incident numbers are not low enough. Use a test environment in that case. When you reach that maturity, you can switch to the production environment to ensure the team's preparedness. Participants in war games may have varying levels of experience, from junior to senior, and you can run war games in a test environment for all levels of knowledge, using them as a training exercise. However, avoid running war games in a production environment for people with limited or no domain knowledge. It is really risky to let somebody without the knowledge touch production environments. Instead, train them first and then expose them to random events in a controlled environment, to prevent self-generated major incidents in the production environment. Just like in tabletop war games, it's important to analyze past incidents. The goal is not to break everything, but rather to test system resilience and the organization's readiness for unexpected events. Don't blame; that is super important. Keep it safe. Focus on collaboration and identifying weaknesses together. Working as a team will generate better results. Again, keep it simple. It's easier to run a war game with less complex scenarios. You don't necessarily even need to start with automation; just use, for example, the AWS console and change something to generate a random incident. For example, we generally use Chaos Toolkit and manual steps in our war games, so it's not fully automated. And avoid causing panic. Make it clear that the war game is just a simulation and not a real emergency. With high maturity, you won't even need to notify people beforehand, because the organization will be well versed in handling unexpected events, but you need to reach that maturity first. Ensure that you have a plan to roll back or fix any issues caused by the war games. And as I said earlier, setting boundaries and focusing on the complex system is important.
Full environment-based war games may require simulating the external environment. While we are focusing on our complex system, we shouldn't forget about the environment surrounding our organization. This helps ensure that interfaces, like processes and communication with third-party suppliers, are clear for everybody involved. For example, consider your communication with the mentioned third-party supplier. Do you know who your contact person is there, and how quickly they need to respond? Before we ran the war games, we analyzed our communication with third-party suppliers. We created response templates to simulate their communication, and we used these during the war game. And to close the different types of war games: remember that these are powerful tools in your hands to improve your organization. Gamification is super important. We discussed the more serious side of war gaming first, but here are a couple of examples of how you can gamify the war games. You can use a time element to make it more competitive. You can run the incident resolutions as people against people or teams against teams. Again, that helps to bring a competitive spirit to the war game. You can use Monopoly-type chance cards to introduce chaos into the chaos war games: non-software-specific random events. For example, your CI/CD pipeline is broken, but you have a P1 incident to resolve. Or your communication channel, like a chat application, is broken. How can people communicate without it? So there are many random events you can introduce to the war games. For tabletop war games you can use TV quiz show formats to make them more interesting. You can have a leaderboard as a visual representation of progress, which can motivate participants to focus on the war games. And of course you can reward participants to boost their morale and maintain their focus and enthusiasm for the war game itself.
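
The chance-card idea above does not need any tooling, but a short script can do the shuffling for the host. The following Python sketch is only an illustration: the first two cards come from the talk, the other two are made-up examples in the same spirit, and the `draw_cards` helper and its `seed` parameter are hypothetical.

```python
import random

# First two cards are from the talk; the last two are invented examples.
CHANCE_CARDS = [
    "CI/CD pipeline is broken while a P1 incident is open",
    "Chat application is down: agree on another communication channel",
    "Primary participant is unavailable: the backup takes over",
    "Third-party supplier contact does not answer for 30 minutes",
]


def draw_cards(deck, count, seed=None):
    """Draw `count` distinct cards; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(deck, count)


for card in draw_cards(CHANCE_CARDS, 2, seed=42):
    print("Chance card:", card)
```

Fixing the seed lets the host rehearse the same sequence of events beforehand, while leaving it unset keeps the draw a surprise for everyone.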
So there are a lot of elements; just make sure that you keep it safe and keep it fun as well. This quote, similarly to the previous quote, shows that chaos is not always negative: it represents innovation. This is quite important to keep in mind in general. Moving forward, let me share a few lessons from our journey. We began with tabletop war games, which proved to be an effective and quick method for identifying obvious issues. As discussed earlier, this allowed the organization to address these quickly. From there we progressed to more complex war games. We started with a focus-area war game and then we switched to wider, environment-based war games. The goal was always to improve the organization's resilience in a cost-effective way, and not to show what brilliant chaos engineering gurus we are. As I mentioned a lot of times, analyzing your current situation is the first step to defining your goals. Keep it simple, just pull a cable. I mentioned going to the AWS console and changing something there. Kill an ECS instance, for example. Always keep cost in mind. The more complex the exercise, the more costly it will be to run. And complexity is not your friend. It increases the preparation time as well, so keep it simple. For example, for one of the war games, we aimed for a complex scenario, believing that we were prepared for the challenge. However, we faced delays in preparation and had to scale back to a simpler exercise to meet the deadlines. So essentially, build up your knowledge, gain confidence, and address the initial challenges before moving to more complex exercises. We talked about the war games. We talked about how to manage them, how to start them, and how to convince stakeholders. Let's talk about the exercises. It's very important to keep it simple. Start with common resilience scenarios. They are often valid for any system without needing to analyze the system extensively. For example, consider a common scenario like a slow response time from an API.
This is a very common scenario, so how can you handle it? You can test it with war games and you can introduce best practices like a circuit breaker as an outcome. So keep it simple. Use the most common scenarios first. Again, I'm repeating myself here, but it's important: don't test known issues. Fix them first and then test them. And analyzing the current state is the first step, as we discussed before. I also mentioned automation versus no automation. You can start war gaming without much automation or without any automation. It can be a manual process, and then you can automate more as you learn about the war games and increase your knowledge. The whole war game needs to be production-like to prepare your team to handle production incidents and to detect weaknesses in the production environment. That includes production-like traffic on the system. It's not a must, but usually when you test microservices, you need load on the system to trigger events. However, don't make it a blocker. You can use non-load-specific exercises like a database failover. So in summary, simplicity is key. Keep the exercise straightforward and realistic. One of the most important aspects of war games is the participants. Their knowledge, skills, and readiness determine how well an organization can handle unexpected events, including the unknown unknowns. This is similar to testing something you already know is broken: if you know that your participants' training level is not right, then train them first to make sure that they can handle incidents. There is no point in including them in a war game if they are not prepared for it, because you already know that something needs to be improved. When conducting environment-based war games, it's vital to provide participants with a detailed briefing about the environment. As I mentioned before, they need to understand the restrictions and the differences from production environments.
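
To make the slow-API scenario and the circuit breaker outcome mentioned above more concrete, here is a minimal, generic circuit breaker sketch in Python. It is not the implementation used at Vodafone; the class, the thresholds, and the simulated `slow_api` function are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def slow_api():
    # Stand-in for a dependency that times out during the war game.
    raise TimeoutError("simulated slow response")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(slow_api)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except TimeoutError:
        outcomes.append("timeout")
print(outcomes)  # ['timeout', 'timeout', 'fast-fail']
```

After two timeouts the breaker opens, so the third call fails immediately instead of waiting on the slow dependency. Whether callers actually behave this way under a slow API is the kind of thing a war game can verify.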
Often participants have more privileges in the test environment. You need to ask them to restrict themselves and not use those privileges, because they wouldn't be able to use them in the production environment either. Most war games are suitable for any skill level. As I mentioned, don't use participants without production environment knowledge or incident resolution knowledge in a production environment. It is a high risk. Train them up first. And I already talked about backup participants as well. A lot of random things can happen to people. They might fall ill, or they might simply need to go on holiday. So having backup participants ensures that you can run your war games successfully. Each war game needs a host to oversee it and ensure it runs smoothly. Having a scribe, especially if the game isn't recorded, allows for collecting information for the retrospective and for identifying improvement areas. This role can also serve as a training opportunity for junior team members. Changing roles during the game, for example, can help collaboration and knowledge exchange. Let a developer take an SRE role or vice versa. It really brings the team together. Basically, all of this helps the participants to take part in the war game effectively, and they are one of the key factors in having a successful war game. Take their feedback seriously to continuously improve the war game processes and the war games themselves. How do you set goals? It's essential to consider both business and technical goals. As I mentioned before, analyze the current state. Understand the stakeholders' priorities: they usually focus on improved revenue and improved customer satisfaction. Set your goals accordingly. Improving software resiliency helps to decrease development costs as well, and not just to improve customer satisfaction and general preparedness.
Collaboration, training, all of these can drive improved customer satisfaction, but they also decrease development costs, and they can increase job satisfaction as well. Setting goals and metrics go hand in hand. You need to be able to analyze the effect of the war game. You should see improvements as a result of the war games, the identified improvements, and the fixed problem areas. In summary, it's crucial to choose goals wisely, aligning them with organizational priorities and selecting metrics that provide insight into progress, improvements, and areas of enhancement. So there is another quote, and this quote serves as a reminder of the danger of becoming complacent with established practices and processes. Often there is a tendency to adhere to familiar methods of achieving goals, and that can hinder progress. Nearly the final topic is delivering results. It's important to identify your improvement areas. It's equally important to address these improvement areas, risks, or weaknesses. Merely identifying issues is insufficient. They must be actively tracked and resolved within the organization. When raising issues for improvement, it's vital to track them and ensure that they are effectively addressed. Each improvement area should be analyzed to assess its potential impact on the organization, and action should be taken accordingly. You can add a priority to each of the findings. Tracking is super important, and as I already mentioned, coupling these improvements with metrics actually shows the progress, the improvements, and the value the war games themselves deliver. In addition to these, it's crucial to recognize that fixing something once and testing it once is not sufficient. Do retest and make sure that a fixed issue stays fixed. It's super important to ensure that findings from war games are visible and have a positive impact on the organization.
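
One way to attach a priority to each finding, as suggested above, is a simple score. The sketch below assumes a hypothetical scheme where priority is impact multiplied by probability on 1 to 5 scales; the scales, the `Finding` class, and the sample findings are illustrative and not taken from the Vodafone reports.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    impact: int        # 1 (low) to 5 (high), hypothetical scale
    probability: int   # 1 (rare) to 5 (frequent), hypothetical scale
    resolved: bool = False

    @property
    def priority(self) -> int:
        return self.impact * self.probability


# Invented sample findings from an imaginary war game.
findings = [
    Finding("No documented contact at third-party supplier", impact=4, probability=3),
    Finding("Slow API response has no circuit breaker", impact=5, probability=4),
    Finding("Runbook for database failover is outdated", impact=3, probability=2),
]


def open_findings_by_priority(items):
    """Unresolved findings, highest priority first."""
    return sorted(
        (f for f in items if not f.resolved),
        key=lambda f: f.priority,
        reverse=True,
    )


for f in open_findings_by_priority(findings):
    print(f.priority, f.title)
```

Marking a finding as resolved only after a retest passes keeps the list honest about which fixes have actually stayed fixed.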
Continuous improvements are essential to maintain resilience and effectiveness when we are facing random events and chaos around us. The final aspect of chaos engineering war games is publicity. Clear communication of the effort, findings, and results of war games is important. We always create a report for each war game, detailing, as I said before, its goals, the exercises used, and the timeline, to provide a clear understanding and transparency about the execution and the objectives. In these reports we list the identified improvements with their impact and probability, helping us prioritize them. But we also include recommended actions. So we don't just list the problem areas, we also suggest how they can be addressed. Tracking these improvement areas until they are delivered is essential to make sure that the improvements are happening. It's important to mention participants, making sure that they are aware of what they delivered. Not only does it acknowledge their effort, but it also generates a positive buzz and interest in future war games as well. So it's important to recognize the participants. Educating the wider community is important. As we said before, don't be shy. Share what you achieved through chaos engineering war games. While it may seem like a small part of chaos engineering war games, publicity is important to ensure the continued use of chaos engineering war games to improve the organization's resiliency. And that took a long time. I appreciate your time and attention during this session. We only scratched the surface of this vast topic, so if you have further questions, feel free to reach out to me on LinkedIn. I'm always open to further discussion. For those who are interested in looking deeper, there are plenty of valuable resources available. Thank you once again for your time, and I hope you enjoy the rest of the Conf42 Chaos Engineering conference. Thank you.

Gabor Gerencser

Tech Lead @ Vodafone UK

Gabor Gerencser's LinkedIn account


