Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome, everybody, to the Conf 42 Chaos Engineering conference
and to Chaos Engineering War Games. I'm Gabor Gerencser.
I work for Vodafone UK as a tech lead in the performance
and Chaos engineering team. By profession, I'm a software
developer engineer in testing. Join me to see
how chaos engineering war games can help prepare your organization
for random events. How do you prepare
your organization for the unexpected? In this session,
we will explore random events and their
impact on the organization
and how chaos engineering war games can help to deal with them.
We will also discuss why it is important to know the
state of your software solution and organization, and how
to convince your stakeholders that chaos
engineering war games are an effective way of building
resiliency. We will dive into different types of
war games we used at Vodafone along with the fun elements
we added to keep participants engaged. I will share lessons learned from
our war games, including how to get started,
which exercises work best, managing participants and
setting goals. Lastly, we'll talk about sharing the results of chaos engineering
war games within your organization and beyond. It's not just about
the war games, it's about spreading the word about Chaos
engineering and its benefits.
To add a bit of insight, I included a couple of quotes
reminding us that chaos and chaos engineering
have many aspects. This quote, for example, illustrates that
chaos is often a product of our interpretation of events,
influenced by our limited understanding.
As the quote suggests, all lives are filled with
unpredictable events. In fact, amidst all the changes,
these random events remain the only consistent aspect of
the universe. Their impact is felt both personally and within
the organization we are part of, as well as in the software solution
we manage. Consider these organizations and
software solutions as complex systems with numerous
components interacting to produce outcomes.
The complexity of these systems makes it challenging to fully comprehend their
behavior and functionality. To make sense of this
complexity, we establish boundaries.
These boundaries allow us to focus our attention
on specific areas such as our organization
or software solution, making this flood of events,
knowledge and information more manageable.
Yet, it's essential to remember that we don't exist
in isolation. We are influenced by our environment,
which can have both positive and negative effects on our complex
system. You have likely heard of knowns and unknowns.
Known knowns are the events we are familiar with
and understand. We can plan for and manage these events because
they hold no elements of randomness. However,
uncertainty arises when we encounter events we
are unaware of or don't fully comprehend:
the unknown knowns and the known unknowns. Software solutions
can mitigate some of these uncertainties through resiliency
practices like circuit breakers. As information
decreases further, we encounter even more chaotic events.
These are the unknown unknowns. Preparing for these is incredibly
challenging, if not impossible, as they represent
truly disruptive occurrences that catch us completely
off guard. While it's difficult to provide examples
of these events, the unknown unknowns, acknowledging
their existence is crucial as they have the potential
to significantly impact the organization.
Think of chaos engineering war games as
the equivalent of fire drills for an organization. Just as
emergency services gather to simulate scenarios in
specially designed buildings, these games provide a structured
environment for organizations to test the resiliency of
their software solution and be prepared for the unknown unknowns
as well. Participants engage in simulated incidents,
allowing them to practice and refine their responses
to unexpected events. Similar to how firefighters
practice different aspects of putting fires out,
chaos engineering war games allow teams to improve their skills
in handling various types of software failures or incidents.
This might involve testing communication protocols, internal processes,
or the application of specific knowledge to resolve
issues efficiently. By actively participating in
these simulations, organizations can identify weaknesses in
their systems and processes before they become
a problem. Ultimately, chaos engineering war games help
teams build confidence in their ability to respond effectively
to unexpected challenges, promoting a culture
of preparedness, collaboration, and resilience
within the organization.
Understanding the current state of your system or organization
is crucial for war gaming and for
overall preparedness. Without a clear grasp of
where your software and organization stand, it is difficult to
spot weaknesses, analyze past incidents,
set goals for war games, or identify knowledge gaps and
inefficiencies in the processes.
Knowing the current state is vital and must be coupled
with good SRE practices like observability.
Observability provides easy access to proactive
measurements and analysis of the system. It can
warn us of issues before they escalate, helping us anticipate
and prevent problems.
Through tools like logs,
metrics, and other indicators, we can gain insight into
the health of our systems. Metrics such as
MTTR, defect numbers, incident numbers,
code quality and the others I listed on
this slide are crucial indicators of system health.
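As a small illustration of one of these metrics, the sketch below computes MTTR from a list of incidents. It assumes MTTR is read as mean time to recovery (one common expansion) and that each incident is stored as a (detected, resolved) timestamp pair; the function name and the sample incidents are hypothetical and not from the talk.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Return the average (resolved - detected) duration over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved) pairs.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),     # 30 minutes
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),  # 90 minutes
    (datetime(2024, 2, 1, 22, 0), datetime(2024, 2, 1, 23, 0)),     # 60 minutes
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:00:00
```

Tracking this value before and after a series of war games is one way to see whether incident resolution is actually getting faster.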
However, it is important not to rely solely on basic metrics.
Understand your software and organization deeply
and focus on what matters for improvements.
Make this part of your business as usual, not just a one-off
exercise. This wealth of information can guide better decision
making, not just for wargaming, and keep your
organization prepared for whatever comes its way.
In previous slides, I discussed the
significance of chaos engineering and its role
in preparing organizations for unexpected challenges.
However, simply recognizing its importance isn't enough.
It's crucial to convince stakeholders that chaos engineering
war games are an effective way to enhance organizational
resilience, not just in terms of preparedness, but also
in enhancing collaboration, knowledge sharing and serving
as a training platform. Stakeholders typically
prioritize the quality and accessibility of the services
provided to the customers, as customer satisfaction
directly impacts revenue. For instance, let's imagine
an organization with 100 users interacting
with the system, generating 1000
pounds each per hour. That's potentially
revenue of 100,000 pounds per hour. But if
the software solution is inaccessible due to system problems,
revenue is lost. War games help organizations
to avoid outages, which underscores
their importance in improving accessibility and
availability, thus increasing customer satisfaction and
revenue. Take the above numbers and change them by a factor
or two to really see the risk of lost
revenue. If you just increase the 100 users
to 1000 users, we are reaching a million pounds.
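The arithmetic in this example can be written down as a tiny sketch. The function name and the `outage_hours` parameter are assumptions added for illustration; the user counts and the 1000 pounds per user per hour are the figures from the talk.

```python
def revenue_at_risk(users, revenue_per_user_per_hour, outage_hours=1):
    """Revenue lost if the system is unavailable for the given number of hours."""
    return users * revenue_per_user_per_hour * outage_hours


baseline = revenue_at_risk(users=100, revenue_per_user_per_hour=1000)
scaled = revenue_at_risk(users=1000, revenue_per_user_per_hour=1000)

print(baseline, scaled)  # 100000 1000000
```

Plugging in your own user numbers and outage durations gives a quick figure to put in front of stakeholders.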
So you can see the risk of lost revenue due to an unplanned outage,
or, as SpaceX calls it, a rapid
unscheduled disassembly.
War games help identify problem areas and weaknesses,
allowing teams to address them promptly.
Additionally, they prepare incident resolution
teams to be more effective,
resulting in faster problem resolution. They help
enhance code quality by encouraging reviews
and implementation of resiliency best practices, leading
to decreased development cost. While chaos engineering
isn't cost free, its benefits outweigh the expenses.
Costs include preparation time, running the war games,
and potentially environment cost and license cost.
However, the impact of these war games on organizational
resiliency far outweighs the cost in terms of revenue
returns. Analyzing the current state,
understanding weaknesses of your complex system and
evaluating cost are essential for demonstrating the value
of chaos engineering war games to stakeholders.
Presenting this information to stakeholders helps
them understand why the games are crucial for organizational
resiliency and preparedness.
Another quote. I like this quote because it reminds me
that chaos, or random events, is not an enemy but an
opportunity. If we didn't
face chaos and random events daily, we would be
content with the current solution and wouldn't think of more efficient,
better solutions to solve our problems. So it's good to see
chaos not as a negative thing, but as
an opportunity.
We talked about chaos engineering war games in general and
why they are important. The next step is to discuss the types of
war games we use at Vodafone. There are two main
categories: the tabletop exercises and the environment based
war games. Tabletop exercises are an
easy way to get started with chaos engineering war
games. They involve gathering people physically or online
around the table and going through various scenarios that could
affect the software solution or the organization.
This process helps identify weaknesses in processes,
knowledge, communication, documentation, architecture,
and to a certain degree the software solution itself.
Tabletop exercises typically last one or two
hours, require no additional environment and
incur a minimal cost in terms of man hours. They can be organized
online, as I said, or on site, or as a combination of both,
and can focus specifically on a single team
or involve multiple teams
as well for a wider range of system
coverage. However, it is essential to keep
the group size manageable, ideally no more than
30 people. Actually, the most we
had at the tabletop War games at Vodafone was
around 15 people and that worked out quite
well. To record or not
to record a war game? Recording may
affect participants' openness or willingness to talk about
sensitive matters. If recording isn't strictly necessary,
I suggest not recording the tabletop war games. Instead,
use the goals and notes, and
have a scribe who records the important findings and discussions.
Don't make the tabletop war games complicated;
keep them simple. For example, at Vodafone we
used a PowerPoint presentation to share different random
events and scenarios with the participants. We started with a
simple exercise like missing words to
warm people up, such as asking for a specific meaning, like
what is MTTR? Or we left out a couple
of words from MTTR and then people had to guess what that is.
So it was quite useful to make people
relaxed, and then gradually you can continue
with more complex scenarios and formats like
TV shows, just to make the tabletop war
games more exciting. The format is not
that important as long as it helps people to
discuss topics, and making it fun
helps as well, because that helps to discuss sensitive
topics. The tabletop exercise
is a very efficient and low-cost way to start
chaos engineering war games.
The more complex war game type is the environment based war game.
It comes with higher cost and it
requires dedicated environments and typically lasts longer than a couple
of hours. We run war games up to 6 hours.
It needs longer preparation due to its complex nature.
It needs a briefing for participants and
a longer retrospective as well. The participants need to understand the
rules and what it means to
participate in the war game. For example, if a
test environment is used, it's not exactly like the real
production environment. People need to know what they can and
cannot do in that setting. The goal is
different. While it may cover
similar aspects to the tabletop war games, such as the
software and the processes, the main focus here is
really on the software solution and how to handle real-life incidents
and random events, and on the processes of
the organization as well. As I
said, the emphasis is more on the software side, which
makes these war games more complex to organize.
They require a longer duration,
more participants and the involvement
of multiple teams. This increased complexity means that
the teams' availability is crucial. We typically
ask for primary and backup participants from each
team to ensure there is a participant
from each team, even if the primary participant
is unavailable. You can organize this similarly
to tabletop: online, on site or a mix
of both. It's important that participants
have access to the environment and can communicate as they would
do during a real production incident. Making the war
game as production-like as possible generates
really great value.
Within the environment based war games, we have different categories
based on the target and the participants' availability.
Smaller, focus area war games involve
fewer people from a few teams.
They are the most cost effective as they may not require,
for example, a full test environment.
The larger scale war games are full or end to
end environment based war games where we test
the whole end to end system. They can involve more
people because generally production incidents in such an environment
take more people to solve. The number
of participants can easily reach
20 or even 30 people. We differentiate
further between test and production environment based
war games. As you can guess, production environment war games
are riskier, and the organization needs to reach a maturity
level to run such a war game.
It is important to keep in mind that it's pointless to
test something if you already know it is broken.
First, fix the known issues and test them in a controlled
test environment. Once you have reduced production
incidents to a rare occurrence,
you can start to run war games there. There is no point running
them in production if your incident numbers are not
low enough. Use a test environment in that case.
When you reach that maturity, then you can switch to the production
environment to ensure the team's preparedness.
Participants in war games may have varying levels of experience
from junior to senior, and you can run war
games in a test environment for all levels of knowledge,
using them as a training exercise.
However, avoid running war games
for people with limited or no domain knowledge in a production
environment. It is really risky to let somebody
without the knowledge touch production environments.
Instead, train them first and then expose
them to random events in a controlled environment to prevent
self-generated
major incidents in the production environment.
Just like in tabletop war games, it's important to analyze past
incidents. The goal is not to break
everything, but rather to test system resilience and the
organization's readiness for unexpected events.
Don't blame; that is super important. Keep it safe.
Focus on collaboration and identifying
weaknesses together. Working as
a team will generate
better results. Again, keep it simple.
It's easier to run a war game with less complex scenarios.
You don't necessarily even need to start with automation;
just use, for
example, the AWS console
and change something to
generate a random incident.
We generally use Chaos
Toolkit and manual steps in our war games, so it's
not fully automated. And
avoid causing panic. Make it
clear that the war game is just a simulation and
not a real emergency. With high maturity,
you won't even need to notify people beforehand,
but you need to reach that maturity, because by then
the organization will be well versed in handling unexpected
events. Ensure that you have a plan to
roll back or fix any issues caused by the war games.
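One way to make sure a rollback always happens, even when an exercise goes wrong, is to pair every injected fault with its undo step. The following is a minimal sketch under assumptions: the `injected_fault` helper, the `service_config` dictionary and the host names are hypothetical stand-ins for whatever manual or scripted change you make in your own environment.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(apply_fault, roll_back):
    """Apply a fault for the duration of the block and always undo it afterwards."""
    apply_fault()
    try:
        yield
    finally:
        # Runs even if the exercise raises, so the environment is restored.
        roll_back()


# Hypothetical stand-in for a real environment setting.
service_config = {"db_host": "db.internal"}


def break_db():
    service_config["db_host"] = "unreachable.invalid"


def restore_db():
    service_config["db_host"] = "db.internal"


with injected_fault(break_db, restore_db):
    during = service_config["db_host"]  # participants work on the incident here

after = service_config["db_host"]
print(during, after)
```

The same pattern applies when the fault is a manual console change: write down the undo step before you make the change.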
As I said earlier,
setting boundaries and focusing
on the complex system is important.
Full environment based war games may
require simulating the external environment.
While we are focusing on our complex system, we shouldn't forget
about the environment surrounding our organization.
This helps ensure
that interfaces, like processes and communication with
third party suppliers for example, are clear for everybody
involved. For example, consider your communication
with the mentioned third party supplier. Do you know who
your contact person is there, and how quickly they need to respond?
Before we ran the war games,
we analyzed our communication with third party suppliers.
We created response templates to simulate
their communication and we used these during the war game.
And to close the different types of war games: remember that
these are powerful tools in your hands
to improve your organization.
Gamification is super important.
We discussed the more serious side of war gaming first,
but here are a couple of examples of how you can gamify
the war games. You
can use a time element to make it more competitive.
You can run the incident
resolutions as people against people or teams against teams.
Again, that helps to have a competitive spirit
for the war game. You can introduce
Monopoly type chance cards to introduce chaos
into the chaos war games: non software specific
random events. For example, your CI/CD
pipeline is broken, but you have a P1 incident to
resolve, or your
communication channel, like a chat application, is
broken. How can people communicate without that? So there are
many random events you can introduce to the war games.
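If you want the host to draw such cards during a game, a few lines of code are enough. This is only a sketch: the deck contents paraphrase the examples mentioned in the talk, and the `draw_cards` helper and its `seed` parameter are assumptions for illustration.

```python
import random

# Deck of non-software-specific random events (paraphrased from the examples above).
CHANCE_CARDS = [
    "CI/CD pipeline is broken while a P1 incident is open",
    "Chat application is down; find another way to communicate",
    "Primary participant is unavailable; the backup takes over",
]


def draw_cards(deck, count, seed=None):
    """Draw `count` distinct cards; pass a seed to make the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(deck, count)


for card in draw_cards(CHANCE_CARDS, count=2):
    print("Chance card:", card)
```

Using a seed lets the host rehearse the same sequence of events before the real session.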
For tabletop war games you can use TV
quiz show formats to make it more interesting.
You can have a leaderboard to
have a visual representation of progress that can motivate
participants to focus on the war games. And of course
you can reward
participants to boost their morale and maintain
their focus and enthusiasm for the war game itself.
So there are a lot of elements; just make sure that you
keep it safe and keep it fun as
well. So this quote,
similarly to the previous quote, shows that
chaos is not always negative; it represents
an innovation. This is quite important to keep
in mind in general. Moving
forward, let me share a few lessons
from our journey. We began with tabletop
war games, which proved to be an effective
and quick method for identifying obvious issues.
As discussed earlier, this allowed
the organization to address these quickly.
From there we progressed to more complex war games.
We started with a focus area war game and then we switched to
wider environment based war
games. The goal was always to
improve the organization's resilience in a
cost effective way and not to show what brilliant
chaos engineering gurus we are. As I mentioned
a lot of times, analyzing your current situation is the first
step to defining your goals. Keep it simple,
just pull a cable. I mentioned going to the AWS
console and changing something there. Kill an ECS instance, for
example. Keep cost always
in mind. The more complex the exercise, the more costly
it will be to run.
And complexity is not your friend. It increases
the preparation time as well,
so keep it simple. For example,
for one of the war games, we aimed for a complex scenario,
believing that we were prepared for the challenge.
However, we faced delays in preparation and had
to scale back to a simpler exercise to
meet the deadlines. So essentially build
up your knowledge, gain confidence and address
initial challenges before moving to
more complex exercises.
We talked about the war games. We talked about how to manage
them, how to start them, how to convince stakeholders.
Let's talk about the exercises.
It's very important to keep it simple.
Start with common resilience scenarios.
They are often valid for any system
without needing to analyze
the system extensively.
For example, consider a common scenario like
a slow response time from an API.
This is a very common
scenario, so how can you handle it? You can test
it with war games and you can introduce best
practices like a circuit breaker as an outcome.
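For readers who have not met the pattern, here is a minimal circuit breaker sketch. It is a simplified illustration, not a production implementation and not the one used at Vodafone: the class, its thresholds and the `slow_api` function are hypothetical, and a slow response is modelled as a call that raises `TimeoutError`.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def slow_api():
    # Hypothetical dependency whose slow response surfaces as a timeout.
    raise TimeoutError("API did not respond in time")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for attempt in range(3):
    try:
        breaker.call(slow_api)
    except TimeoutError:
        print(attempt, "timed out")
    except CircuitOpenError:
        print(attempt, "rejected without calling the API")
```

After two timeouts the third call is rejected immediately, which is the behaviour a war game with a slowed-down API should let participants observe.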
So keep it simple. Use the most common scenarios first.
Again, I'm repeating myself here, but it's important:
don't test known issues. Fix them first and then
test. And analyzing
the current state is the first step, as we discussed before.
I also mentioned automation versus
no automation. You can start war gaming without
much automation or without any automation. It can be
a manual process, and then you can go
into automation more as you learn about the
war games and increase your knowledge.
The whole war game needs to be
production-like to prepare your team to handle production
incidents and to detect weaknesses
in the production environment. That includes production-like
traffic on the system. It's not a must, but usually when
you test microservices, you need a load on the
system to trigger events. However,
don't make it a blocker. You can use non
load specific exercises
like a database
failover. So in summary, simplicity is
key. Keep the exercise straightforward and
realistic.
One of the most important sides
of war games is the participants.
Their knowledge, skills and readiness determine how well an organization
can handle unexpected events, including the unknown
unknowns. This is similar to testing something you already
know is
broken: if you know that your
participants' training level is not right, then train them
first to make sure that they can handle
incidents. There is no point including
them in a war game if they are not prepared for it. You already
know that something needs
to be improved. When conducting environment based
war games, it's vital to provide participants
with a detailed briefing about the environment. As I mentioned
before, they need to understand the restrictions and
the differences from production environments. Often participants
have more privileges in the test environment. You need to
ask them not to use those privileges, because they wouldn't be able
to use them in the production environment either.
Most war games are suitable for any skill level.
As I mentioned, don't use
participants without production
environment knowledge or incident resolution knowledge in a production
environment; it is a high risk. Train them up first.
And I already talked about backup participants as
well. A lot of random things can happen
to people. They might fall ill,
they might simply need to go on holiday. So having
backup participants ensures that you can run your war games
successfully. Each war game
needs a host to oversee it and ensure it runs smoothly.
Having a scribe, especially if the game isn't
recorded, allows for collecting information
for the retrospective and identifying improvement areas.
This role can also serve as a training opportunity
for junior team members.
Changing roles during the game, for example,
can help collaboration
and knowledge exchange. Let a developer take an SRE
role or vice versa. It really brings the team
together. Basically,
all of this helps the participants to
participate in the war game effectively, and
they are one of the key factors in having
a successful war game. Take
their feedback seriously to continuously
improve the war game processes and the
war games themselves.
How to set goals? It's
essential to consider both business and technical goals.
As I mentioned before, analyze the
current state. Understand the stakeholders' priorities;
they usually focus on improved revenue and improved customer
satisfaction. Set your goals accordingly.
Improving software resiliency helps to decrease
development cost as well, and not just improve
customer satisfaction and general preparedness.
Collaboration and training can all drive
improved customer satisfaction, but they also
decrease development cost and
can increase job
satisfaction as well.
Setting goals and metrics go hand in hand.
You need to be able to analyze the effect
of the war game. You should see improvements as a result
coming out of the war games: the identified
improvements and the fixed problem
areas. In summary, it's crucial to
choose goals wisely, aligning with organizational
priorities, and to select metrics that provide
insight into progress, improvements and
areas of enhancement.
So there is another quote, and this quote
serves as a reminder of the danger of becoming complacent
with established practices and processes.
Often there is a tendency to adhere to familiar methods
of achieving goals, and that
can hinder progress.
Nearly the final topic: delivering results.
It's important to identify your improvement
areas. It's equally important to
address these improvement areas and risks
or weaknesses. Merely identifying
issues is insufficient. They must
be actively tracked and resolved within the organization.
When raising issues for improvement, it's vital
to track them and ensure that they are effectively
addressed. Each improvement area should be analyzed
to assess its potential impact on
the organization, and action should be taken accordingly.
You can add a priority to
each of the findings.
Tracking is super important, and I already mentioned
that coupling these improvements with metrics actually shows
the progress, the improvements and the value the
war games themselves deliver.
In addition, it's crucial to recognize that fixing
something once and testing it once is not sufficient.
Do retest and make sure that a
fixed issue stays fixed.
It's super important to ensure that findings from war games are
visible and have a positive impact on the organization.
Continuous improvements are essential to
maintain resilience and effectiveness
when we are facing random events
and chaos around us.
The final aspect of chaos engineering
war games is publicity. Clear communication of
the effort, findings and results of war games
is important. We always
create reports for each war game, as I said before,
detailing its goals, the exercises used and the timeline
to provide a clear understanding and
transparency about the execution and the
objectives. In these reports we list the
identified improvements with their impact
and probability, helping us prioritize
these. But we also include
recommended actions. So we don't just list the
problem areas, but we suggest how these can be addressed.
Tracking these improvement areas until they are
delivered is essential to make sure that
these improvements are happening.
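Prioritizing findings by impact and probability can be sketched as a simple score. The 1 to 5 scales, the multiplication of the two values, and the sample findings below are all assumptions made for illustration; the talk only states that impact and probability are recorded and used for prioritization.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    impact: int        # assumed scale: 1 (low) to 5 (high)
    probability: int   # assumed scale: 1 (rare) to 5 (frequent)
    recommended_action: str
    resolved: bool = False

    @property
    def risk_score(self):
        # One simple way to combine the two values; other weightings are possible.
        return self.impact * self.probability


def prioritize(findings):
    """Return open findings, highest risk score first."""
    open_findings = [f for f in findings if not f.resolved]
    return sorted(open_findings, key=lambda f: f.risk_score, reverse=True)


# Hypothetical findings from a war game report.
findings = [
    Finding("Supplier contact person unknown", 5, 2, "Document contact and response time"),
    Finding("No circuit breaker on slow API", 4, 4, "Add a circuit breaker"),
    Finding("Runbook out of date", 2, 3, "Update the runbook", resolved=True),
]

for finding in prioritize(findings):
    print(finding.risk_score, finding.title, "->", finding.recommended_action)
```

Keeping the `resolved` flag in the same record makes it easy to see which improvements are still open after each war game.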
It's important to mention participants, making
sure that they are aware of what
they delivered.
Not only does it acknowledge their effort,
but it also generates a positive buzz and
interest in future war games as well. So it's
important to recognize the participants.
Educating the wider community is important, as we
said before.
Don't be shy. Share your achievements,
what you achieved through chaos engineering war games.
While it may seem like a small part
of chaos engineering war games,
publicity is important to ensure
the continuous use of chaos engineering war
games to improve the organization's
resiliency.
And that took a long time.
I appreciate your time and attention during this session. We only
scratched the surface of this vast topic, so if you have further
questions, feel free to reach out to me on LinkedIn.
I'm always open to further discussion. For
those who are interested in looking deeper,
there are plenty of valuable resources available.
Thank you once again for your time, and
I hope you enjoy the rest of the Conf 42 Chaos Engineering conference.
Thank you.