Transcript
Hi, this is Oliver. I'd like to welcome you to our Chaos Engineering circus show. If you now think, "Whoa, whoa, wait. Circus? What circus? I thought this was about chaos engineering," then sit back and relax. I assure you, you are in the right talk.
It's all about chaos engineering today. So why are we talking about chaos engineering? Simple answer: we have been doing chaos engineering for over a year now, and we would like to share our experiences. And there's always a longer story. Our IT systems grew historically to a point where we had to renew them: support was discontinued, we had a lot of brain drain, there were ancient technologies like COBOL, and the systems were not cloud ready. Now we have a brand new, maintainable platform and the possibility to roll out stable features and changes within hours instead of weeks or months.
But that comes at a cost. We have more teams,
new technologies, new processes, and of course,
the old monolithic world still plays a big part for now.
New challenges everywhere.
So in this talk, we focus on the implementation and operation of chaos engineering and how we integrate chaos engineering into our daily work. On our chaos engineering journey, we took a lot of detours. We made errors and ran into burdens and hazards. So we talk about how we got back on track, in the hope that you will not be caught out by the same mistakes.
But what's this circus thing now? Well, we see parallels between the organization and operation of a circus and how software development might be organized. Please let us know afterwards if you also see some analogies.
Well, who is Deutsche Bahn and what is DB Vertrieb? Let me give you some context. Deutsche Bahn is the biggest train operating group in Germany, and DB Vertrieb is the interface between the customer and their train ticket. Basically, we sell train tickets, and for that reason we develop software systems to support that. Customers can buy tickets over multiple channels: vending machines, mobile applications, online, and of course in person, when there is no pandemic. We sell a couple of million tickets per day. Now you know what we are doing.
And let's look at today's topics. First, how the show evolved: this is about the roots of our historical systems and what kind of systems we have now. Next, we discuss whether chaos engineering is some sort of magic wand which turns every system into a super robust and resilient platform. And we talk about fire drills. Fire drills are practiced everywhere: on ships, among firefighters of course, and in the circus. So why not do fire drills in development? In the context of chaos engineering, we call a fire drill a game day. We explain the concepts behind game days and other chaos engineering practices later. And there's always a reason not to do things, so we talk about the most frequent and most absurd excuses, how we react to them, and the most painful experiences. Why we are rolling out chaos engineering at DB Vertrieb is the last topic we cover in this talk.
So, chaos engineering, what is it? Basically, it states that chaos engineering makes my system capable of withstanding turbulent conditions in production. That sounds great. So why doesn't everybody do chaos engineering then? Well, that's a good question. What I do know is the value that chaos engineering gives us. It helps us discover unknown technical debt that we were not aware of. It helps us prove and verify our non-functional requirements. And it lowers the time to recovery and increases the time between failures. To prove the success of chaos engineering, we advise measuring these metrics from the beginning.
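To make that a bit more concrete, here is a minimal sketch of how such metrics could be computed from a list of incidents. The incident data and timestamps are made up for illustration; in practice you would pull them from your incident management or monitoring tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (start of outage, time of recovery).
incidents = [
    (datetime(2021, 3, 5, 17, 0), datetime(2021, 3, 5, 18, 30)),
    (datetime(2021, 4, 12, 9, 15), datetime(2021, 4, 12, 9, 40)),
    (datetime(2021, 6, 1, 22, 5), datetime(2021, 6, 1, 23, 0)),
]

# Mean time to recovery: average duration of an incident.
mttr = sum(((end - start) for start, end in incidents), timedelta()) / len(incidents)

# Mean time between failures: average gap from the end of one incident
# to the start of the next one.
gaps = [incidents[i + 1][0] - incidents[i][1] for i in range(len(incidents) - 1)]
mtbf = sum(gaps, timedelta()) / len(gaps)

print(f"MTTR: {mttr}, MTBF: {mtbf}")
```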
Finally, by adopting chaos engineering, we gain more resilience and robustness as well as a better understanding of the process in case of failure. And now let's start.
Welcome to the show. Let's have a look at the actors now.
Yes, these were the old times, the monolithic age. The show was well known, with no surprises for the audience. The show ran in production and we only had one or two tents, painted blue with a sign that said WebSphere. This was very sturdy and durable over the last decades. But the surrounding parts of the system were much older and programmed in languages which I either hadn't heard of or thought were the names of fairy tales. So with this setup,
we could not fulfill the demands of the new audience, because it took us three months or longer to change a show or introduce a new act. Today the audience requests 24/7 shows, or even multiple shows in parallel. But on the other side, operations loved the old show. It was very stable and well known, no surprises. In contrast, business and developers want change, they want new features delivered to the show, of course at a high cadence. And business has the money. So this is how our show looks today. Now we have more tents,
24/7 shows, even multiple of them in parallel. And best of all, shows are updated while they are running. What have we done to support that? We changed everything in our software delivery process. We moved to a cloud-ready technology stack. We scaled up the number of teams and used some sort of domain-driven design to separate our concerns. Continuous delivery and continuous deployment enabled us to change the show while it is running. Our quality assurance had a change of mind, from manual testing procedures which took a lot of time towards fully automated testing. Also, new processes were set up to support these changes, for example new incident processes and operations procedures. So now we have a powerful CI/CD cannon which allows you to shoot features directly into production systems. Imagine what could happen if you change the actors' juggling props to chainsaws within a git commit.
So the question is, what could possibly go wrong?
Well, things did go wrong. The first feature we released on our new platform was KCI, also known as Komfort Check-in. This nice little feature allows you to check in at a seat on a train so you will not be bothered by the conductor asking for your ticket. Now you have quality time while traveling by train, have a nap. This was the first feature which connected the new platform and the old, aging monolith. Guess what happened on a Friday at 5:00 p.m. Yeah, the new feature went silently down. Customers started complaining online, and the social media team's weekend was moving far away.
So what chaos happened here? No time for analysis: restart the services and cross your fingers. Yes, we are online again. Phew. The services started working, but the error had side effects and was not only affecting the new KCI feature. We encountered a cascading error, and customers could no longer download their already purchased tickets. This was a serious problem and a bad thing: at Friday 5:00 p.m., people usually want to get home by train. As a result, more complaints, more headlines about the issue on big newspaper websites.
Well, this time we did not make it to the primetime news, and the only reason for that was that the incident managers did a great job coordinating the fixes for the issue. But uncomfortable questions were asked. Concerns regarding our new platform were expressed from all directions. And the most important question was asked. Maybe you can guess what the most important question is. I bet you know it. Yeah: who is responsible for the outage? Who is the guilty guy? A big postmortem revealed it very efficiently. Well, job done. But maybe we asked the wrong question, because only knowing who is responsible will not prevent future outages. It's more about asking why we failed instead of who's responsible.
Yeah, this was a process we had to learn. As a quick fix, we added the adjective blameless in front of postmortems. And with that, the improvement had already started. Remember, we had changed everything: technology, culture, responsibilities, and of course the processes. And as a side effect, we still have this monolithic system, which is still running and which is operated completely differently. So maybe something had to change. That's what we all agreed on.
But where should we start?
What has to change? Mike will
tell you about some of the ideas we came up with. Thanks, Ollie. Yeah.
Well, so we put our heads together, and if you're a big company, you have many experts and they all know something. Architecture, for example, was asking for more governance, the developers were asking for more coverage, Ops was asking for better tracking, and quality assurance for better documentation. And, well, overall it was kind of chaos. But at some point we agreed that we need more tests. These tests were called technical approval or technical acceptance tests. So we introduced them.
And, well, you may be asking: do we need more of those tests? Well, we didn't know better, so we introduced them because we thought they would fit our goals, which were basically getting faster, with fewer errors and a better UX, so basically making the customers happier. And to explain a little bit more about those tests, here's a slide that outlines it pretty well. So we introduced those. And what you have to know is that in our organization we have a gap between Dev and Ops. In fact, they are different departments. I will come back to the two tests outlined on the slide in our first game day story later on. But here you can basically see how that worked. The developers were doing all the performance testing, and at some point they threw the artifact over to Ops, who did a rolling update and tested whether that worked or not. And we thought, well, that will help. The idea was simple: right before going into production, these tests will make sure the deployed services work, and Ops will obviously take care of them, because we thought they have a high interest in stuff working in production. Well, let me say this much: that didn't really work out either. So this time we thought, well, something has to change. And I mean, this time for real.
So we did what every good company does and asked ourselves: what would Netflix do? Right? And that sounds mad, I know, but let's look at it. We do have fewer microservices, point taken. But we still have them, so the complexity is probably not too different, and you have to solve similar problems. Why Netflix? Well, because compared to on-prem, Netflix noticed at some point that different things are important with microservices in the cloud. So Netflix developed practices and methods, which you all know as chaos engineering, to mitigate those problems. And that led to a higher stability of the whole system. I know we're not Netflix, but let's take this graph here, for example. This is a part of our system. It shows only the teams, so there are many more microservices below that. And as you can see, it's a bit older now, it's like one year old. It's not a Death Star, right? But the problems look the same.
We have a complex technical system. And the funny thing is, it's a complex technical system, but there is also a complex social system behind it, with many teams, different responsibilities, slightly different approaches, a slightly different culture in every team, at least a little bit, and different levels of experience. And even though we decided to have guidelines for all the teams, like building in robustness and resilience, how would you even test that? So we thought, well, chaos monkey to the rescue, right? On the shoulders of giants: let's use the chaos monkey for Kubernetes. And we even got operations to agree on that. So we put the monkeys to good use and fired. Kaboom.
So without knowing much, we deployed the monkeys.
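To give you an idea of what such a monkey actually does under the hood, here is a minimal sketch using the official Kubernetes Python client. The namespace is made up, and real tools add scheduling, opt-in labels and safety nets on top of this.

```python
import random
from kubernetes import client, config  # pip install kubernetes

# Load credentials from the local kubeconfig; inside a cluster you would
# use config.load_incluster_config() instead.
config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "shop-staging"  # hypothetical target environment

# Pick one pod at random and delete it; the deployment controller should
# reschedule a replacement, and that is exactly the hypothesis to verify.
pods = v1.list_namespaced_pod(NAMESPACE).items
victim = random.choice(pods)
print(f"Killing pod {victim.metadata.name} in namespace {NAMESPACE}")
v1.delete_namespaced_pod(name=victim.metadata.name, namespace=NAMESPACE)
```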
And I tell you, they did a great job. They killed pods and containers in one of our gazillion environments. But to be honest, we didn't really know what we were doing. A fool with a tool is still a fool. And that's how we felt. Yeah, it ended in tears once again, because the microservices were not prepared and we started way too big. But guess what the worst part was? Yeah, the worst was that nobody really noticed. We didn't have the observability to even detect the errors. So something went wrong and the teams were upset that something was not working, but we had no observability.
What was going wrong here? We were not prepared. We didn't do our homework, we didn't communicate enough, we didn't have the needed observability. Without much of a plan, we did really dangerous things. We felt like chaos engineering had won the worst initiative of the year award. There was only one last chance to recover: do our homework and start over at the drawing board. So let's take another look at what chaos engineering is really about. Here, we didn't think or plan our experiments, we just deployed. We violated every principle of chaos engineering. So we started over again with chaos
engineering. Well, that sounds great. So what do we really need to do that? All these monkeys and tools need new cages. And of course we need someone who can train them and keep track of these tools, so we need specialized trainers. Get a bunch of consultants to develop Excel-based meal and training plans. Yes, get all departments to the round table and design a process that, printed out, hardly fits on the walls of a meeting room. This is really essential. And of course, what costs nothing is no good, so open the treasure chest and bring a big... Stop, stop, Ollie.
I don't think you will need all of that. I mean, let's focus on what we really need. I think you just need to start doing chaos engineering. And what do I mean by that?
Basically, first of all, you have to pick an aspect of the system you want to experiment on, right? Then you plan your experiment and prepare your environment: traffic, monitoring, access rights, all that stuff. Then you measure the steady state, monitor your system and conduct the experiment, essentially break something, and document your findings. And as you can see, I didn't talk about budgets, new cages or trainers. It's basically about doing it.
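If you prefer code over prose, that loop fits into a very small skeleton. This is only a sketch: the health endpoint is a placeholder, and the actual fault injection is left out because it depends on your system and tooling.

```python
import time
import requests

HEALTH_URL = "https://staging.example.org/health"  # placeholder endpoint


def error_rate(samples: int = 30) -> float:
    """Measure the steady state as the fraction of failed health checks."""
    errors = 0
    for _ in range(samples):
        try:
            if requests.get(HEALTH_URL, timeout=2).status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
        time.sleep(1)
    return errors / samples


# 1. Measure the steady state before touching anything.
baseline = error_rate()

# 2. Conduct the experiment: break something here, for example kill a pod,
#    block DNS or add latency (intentionally omitted, it is tool-specific).

# 3. Measure again and compare against your hypothesis.
during = error_rate()
print(f"Steady state error rate: {baseline:.0%}, during experiment: {during:.0%}")

# 4. Document the findings, whatever the outcome.
```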
So what does it look like? Well, this is an actual photo from one of our earlier game days. And as you can see, there are many different departments in one room, which in our opinion is already one of the biggest values of a game day, because you have Ops there, architecture, you can see Ollie and me, there is business operations and developers. And the person who took the photo was, I think, also a developer. So,
I mean, let's come back to our little first game day story.
Well, our first experiment was: let's do a rolling update, but with load. And the funny thing was, we had a development team there and ops people, and several people there actually had a lot of experience, like 20 or more years in development or operations. So we all were pretty confident that nothing would go wrong. We asked them beforehand whether they thought we would find any problems, and they all agreed that we would not find anything. So we designed the experiment: a rolling update under load. We basically just combined the two tests we showed you before, the performance test and the rolling update test; roughly speaking, something like the sketch that follows.
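Just to illustrate the idea, a very rough sketch of that combined experiment could look like this. The service URL and deployment name are made up, we assume kubectl is configured for the test cluster, and a real load test would of course use a proper load testing tool.

```python
import subprocess
import threading
import requests

URL = "https://staging.example.org/api/tickets"  # hypothetical service endpoint
results = {"ok": 0, "error": 0}
stop = threading.Event()


def generate_load():
    # Very naive load generator: keep firing requests until told to stop.
    while not stop.is_set():
        try:
            response = requests.get(URL, timeout=2)
            results["ok" if response.status_code < 500 else "error"] += 1
        except requests.RequestException:
            results["error"] += 1


workers = [threading.Thread(target=generate_load) for _ in range(10)]
for worker in workers:
    worker.start()

# Trigger the rolling update while the load is running and wait for it to finish.
subprocess.run(["kubectl", "rollout", "restart", "deployment/ticket-service"], check=True)
subprocess.run(["kubectl", "rollout", "status", "deployment/ticket-service"], check=True)

stop.set()
for worker in workers:
    worker.join()

total = max(results["ok"] + results["error"], 1)
print(f"Error rate during rolling update: {results['error'] / total:.0%}")
```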
And guess what happened? Well, it ended almost in tears. The error rate was about 100% for 15 minutes.
And to be honest, it was quite depressing for all of us. We expected some errors, but not a complete failure and broken services. And even though we used an orchestrator, the big Kubernetes thing you probably all know, we didn't expect the service to go down for 15 minutes. But that was also good. I mean, it was a wake-up call for everyone. On a side note, after that we never had any problems getting buy-in from management; we just showed them that there was something wrong and that there were hidden problems. Well, what happened? You guessed it: because of this gap between Dev and Ops, they never tested updating the service under load. They just deployed it without load on a test environment and did a rolling update. In retrospect, well, we could have known, I guess, but we didn't. So chaos engineering revealed that, and we learned something. And this is a big value already. So what's the value of chaos engineering? Well,
it increases cross-team communication to reduce friction. We use a game day to gather many teams in order to validate functional and non-functional requirements, and we ask ourselves: what would the customer see if we break this or that? It helps you verify the non-functional requirements, basically. And even if they're not really fixed, you can still guess the most important ones. Next thing on the list: you will find the unknown technical debt, right? The technical debt you don't know about right now. The technical debt you know about is the technical debt you have in your Jira or your issue tracker. But the real value is to find the technical debt that your customers will find, when they click on something and your whole system breaks.
Spoiler: usually on Saturdays. It reduces the time to recovery, because if you practice failure, you will know what to do and you're not stressed. And by that, I mean if you mitigate issues, or if you're able to mitigate issues, you will also increase the time between failures, which for us increases the overall resilience of the system, which is good. Let's talk a bit about tooling. What tools
could you use? Well, basically, you should invest in a great application performance monitoring solution. You could use Instana, but any other solution in that space is good too. In the beginning, we mostly used Pumba and Chaos Monkey for Spring Boot. At some point it makes sense to look at more sophisticated tooling and automation; right now we're evaluating Steadybit, but you could also go for Gremlin or any other platform which will help you. But this is only a small part of the tooling you could use. There is a good list on GitHub, Awesome Chaos Engineering, so go there and check out the links. There's also information on how to start with chaos engineering and other stuff.
So, well, we have all the protective gear in place. Everybody's informed and everybody's there, as you have seen. But we still had problems getting started. So let's look at the top five excuses not to start with chaos engineering. Number five: go-live is within a few days, you are messing up the test plan. So yes, we are having crunch time, please do chaos engineering in the next release. To cope with that excuse, maybe you have to integrate chaos engineering into the test plan. Or even better, shift left and integrate chaos engineering into your daily development process. In any case, keep talking and make QA a friend. They are on the same side; everybody wants a stable system. And convince them that chaos engineering experiments can become real tests in the future. So this is a win-win situation. Just keep on talking to QA. Number four:
we don't have enough access rights. Yeah, often this actually means that there is a technical gap. People often do not know where to find the information, which indicates that they are not aware of how to access the systems. Maybe you can ask questions like: have you tried accessing the development database? And often the developers say, oh, well, I can if you show me. And did you even try it? Of course you have to try it. Often these problems disappear when talking and discussing with Ops.
Number three: we can't think of any good experiments that are useful to us. Whoa. Yeah, this is an alarm signal. Your effort in the implementation and operation of chaos engineering was not enough. You have to find out the reasons. Maybe the team just doesn't know about chaos engineering or is ignorant of it. So give them more coaching and explain the goals of adopting chaos engineering, so it's also on their minds. Try to push the experiments to reflect use cases instead of thinking only about technical aspects like DNS or rolling updates. Give them other formats for game days; maybe red teaming is something they like more. And supply standard experiments so they can derive their own experiments from them. Number one. Number two,
sorry: there's no time, because right now we are only doing features, we take care of technical things later. This is a nice one, because use cases and technical and non-functional requirements are inherently connected. Dividing them is not a good idea, but sometimes the developers will be forced to make this distinction. So insist that everybody knows what will happen when you accumulate too much technical debt: the backlog gets poisoned, and items will eventually be ignored until they hit you in production. This approach is very, very risky. There are many reasons for that. What we often see is that product owners are more like backlog managers than a person who feels responsible for the whole product lifecycle. Number one:
we would love to do chaos engineering, but the product owner is not prioritizing it. Remember the last argument about the job of a product owner? Well, the product owner might have opposing aims, and maybe he or she is measured by the cadence or count of delivered features. Sadly, product stability suffers from this approach, and often this is pushed by management. But that's another story. To mitigate his or her possible fear and uncertainty, stress the value of chaos engineering for the product. For example, show them the value of chaos engineering by doing at least one game day with a small scope and,
in addition, recruit operations and/or security as a driving lever or force.
Regardless of the excuses, we did manage to host some game days and do some chaos engineering. More than 80 game days done up to today and, more important, a number of production-relevant weaknesses found, about 120 plus. These will not hit us in production anymore, and we gained enough trust to host the first game day in our production environment. Yeah, this is great. What we are very proud of is the fact that multiple teams are now hosting their own game days. Yeah, that's a great success. We also started other game day formats: red teaming, wheel of misfortune, practicing the incident process. And what was very useful to us is having a place where all our documentation about our experiments is collected and accessible for everyone. This helps teams get ideas for new experiments as well as the possibility to share their findings, because the problems are often the same regarding health checks, DNS, rolling updates, all this stuff. And build up a community: promote the finding of the month, have a place where teams can talk about chaos engineering and share their experiences, write some blog articles, record podcasts. All this will help to make chaos engineering more visible and successful. Let's get to our top five
learnings. First of all, people, processes and practices are key factors. You have to talk, talk, basically communicate with stakeholders, get them together in game days to test the system and to learn more about it. Don't just deploy the chaos monkey; tooling is the least of your problems. We started with a pretty simple game day and continued from there. More important is to share your findings with others so they can learn from them too. It took us half a year until we even thought about automating some of our experiments, and that was after around 50 game days. Well, remember, premature optimization is the root of all evil. So start small and aim for production. Well, we started with the technical system and the social system came later, but it depends. Most important is to start where nobody gets hurt. Usually that's in development or in a testing or staging environment. But don't forget you're aiming for production, otherwise chaos engineering doesn't really make sense at some point. Don't be afraid to bug people. Don't get
discouraged. Well, what we did: we organized the first game days, we moderated them, took care of the documentation, and we invited the people, basically to make it as easy as possible for the participants. Recruit allies like operations and QA, quality assurance. And what that means is,
well, take the plunge if you have to. There are jobs that nobody wants to do. But if you want chaos engineering to be a part of daily life in your big organization, well, then maybe you have to take the plunge and stay positive about it. And most important, for us at least: don't get eaten by the process lion. At some point, other departments want a piece of the cake, as soon as chaos engineering is successful. Quality assurance wanted to make specific experiments mandatory, so they asked us whether we could always do them in game days, like making a checklist. And this is bad, we thought, because game day experiments should be self-driven by the participating teams, because they know best what they need. Right? So this is how we came to hate checklists. There are better alternative solutions, because one size fits all often doesn't make sense.
Well, and one last thing. There are
no excuses. Even big organizations
can do chaos engineering if they want to. It's just a matter
of time, especially if you're going into the cloud.
So what do we want from you? Well, last but
not least, invite your team to your first game day. Share your findings
with other teams, go out and do chaos engineering.
Thank you.