Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, my name is Benjamin. I'm the co-founder and CEO of Steadybit,
and welcome to my talk, Reliability in the Face of Uncertainty.
If you would like to reach out after the talk, please feel free to do so via
Twitter or LinkedIn. I'm happy to answer all your questions, and hopefully you
will provide me some feedback. So let's get started.
Let's start the talk with an important question:
what do we strive for, and what goals does each of
us pursue every day? Or in other words,
what's the mission of software development? Because I assume that
we are all part of the software development process.
We are maybe developers, SREs, people who are running
and operating the systems. So we are all part of it, and we
are all on the same mission. And the
mission is to continuously improve and deliver a software solution
that reliably delivers value to its customers.
And that's the important part. All of us here have a
common goal and we work on it every day. Happy and satisfied
customers who use our software. Our customers
have high expectations from the software they use and therefore
from us. And from the point of view of our customers,
the software must always work and they are not interested
in the technical complexity that this requires
or even in outages.
What is reliable?
One very important characteristic of our systems is reliability.
So let's take a look at this characteristic and see what the
definition is, what's behind it: a
software system is reliable if it's consistently good
in quality and performance.
So it's important to note that this characteristic must
exist from the end user point of view. We all know that our systems
are in a constantly changing environment,
and as a customer, I expect a result from the
system that fits my request in a reasonable time.
And if the system does not give me an answer,
gives me an error, or even makes me wait too long,
it does not fulfill these characteristics, and
I as a customer will maybe move on to
another vendor.
But if the system is able to answer me in a reasonable
time and with a suitable result, I as
a customer, yeah, I build up trust and
I use this system, let's say gladly and repeatedly,
because I'm able to trust the system.
So in summary: we as customers
trust a system when it's consistently good in quality and
performance.
Today's systems: what are the systems we need to
build today to meet our customers' expectations?
Let's take a look at today's systems. I have brought you two illustrations,
two diagrams of a complex system, where
it's easy to see how many dependencies exist in
a microservice architecture. And from this quickly
comes the question: why is it so complex? Why do we need to build
it so complex? The answer is
yeah, we need to meet different
requirements. Like we need to handle high load, the system
and the services within,
they need to be resilient and the load needs to be distributed.
We need to be able to do rolling updates under load,
and maybe during load peaks we also
need to create new instances. They must be added automatically
and then cleaned up afterwards.
If not, your cloud provider is getting happy, but not you.
This resulting dynamic must be considered
during development and appropriate
precautions must be taken. However, the two
diagrams also only show part
of it and obscure the underlying layers
we interact with, like Kubernetes, cloud
providers, or maybe third-party providers
for, let's say, authentication or other things.
That's something we have to handle and
take care of. I really assume that everyone
here can share one or two or maybe more positive
and also negative experiences with today's
complex systems.
So with this in mind, with this knowledge, this experience we
all have, I really like the quote from Dr.
Richard Cook, which is very applicable.
It's not surprising that your system sometimes fails.
What is surprising is that it ever works at all.
And on top of that, the requirements
of our customers and the associated
expectations make it very difficult for us to
meet them. And why am I claiming
it's so hard?
Let's take a look at the incidents, the incidents we are all dealing
with every day, maybe. There was a report done by
the people at FireHydrant. They have analyzed about 50,000 incidents,
and you can see that even in small organizations
there are ten incidents each month they
have to deal with, because something is
not working normally. If the organization
is growing, the complexity of the system
is also increasing, and therefore the
number of incidents as well. And you can see that the
numbers go up in bigger
companies. And let's take a look at the time we are spending
on each incident. It's
about 24 hours on average for each
incident. So we are spending 24 hours from creating an incident,
to analyzing the incident, finding a fix, deploying
the fix, and closing the incident. And that's just the average
time we are spending. It does not mean that
our customer is not able to work with our product for 24
hours. No, it's the time we are spending from
opening to closing the incident.
There's a famous sentence
by Werner Vogels: everything fails all the time. And that's
something we have all realized in our daily life.
So we should ask, or we need to ask ourselves, what's normal?
The conditions I described earlier
give the impression that everything is chaotic
and unpredictable, and we must ask ourselves:
is this really normal?
And failures are normal. Failures are normal and are part of our
daily life. But errors or failures are much more.
They contain valuable information from which we can
derive knowledge and bring
up improvements in our systems. We can improve our systems
based on failures. But I would
go one step further and interpret
failures as the attempt of our
systems to get in touch with us. They would like to tell us something.
Our systems want to point
us to problems, to where we should optimize, where we should
fix something. It's like a communication channel from our systems,
and the system is calling for help: please support me,
here's something not working, that's something we can do better.
And yeah, we are working under chaotic conditions.
And why chaotic? Because we
are not able to know when
the next failure occurs. And under such
chaotic conditions, chaos engineering is necessary
to improve our systems and train them to deal with failures.
Failures are the foundation: failures
we know about are our starting point, and
they help us to develop the
appropriate test cases. So we can take a failure and transform
it into a test case, into an experiment.
What can we do exactly?
We need to be proactive. We need to proactively improve
the reliability of our system,
and we need to ask specific questions before
we are in production. And we need to know the answers before we are going
into production. Here are three examples.
The first question maybe is: can our users continue
shopping while we are doing a rolling update? That's something
you should figure out before you're in production, whether you are able to do a
rolling update under load or not. The second example:
will our service start when its dependencies
are unreachable? That's something we
can check, and we can validate whether the
service starts or not, or whether we need to start things in a specific
order. If so, that's not a good choice
(a rough sketch of such a check follows after these examples). Or the last example:
can our users buy products in the event of
a cloud zone outage? Maybe you are running on a
specific cloud provider, and sometimes there are zone outages.
Are you able to shift the load to a different zone?
Is that something done automatically? Have you checked it?
Have you tested it? So take care.
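To make the second question concrete, here is a minimal sketch of such a startup check. The start command, environment variable, and health endpoint are hypothetical, and this is not Steadybit tooling; it only illustrates the idea of starting the service with its dependency pointed at an unreachable address and seeing whether it still becomes healthy.

```python
# Minimal sketch: does the service still start when a dependency is unreachable?
# The start command, environment variable and health URL are hypothetical.
import os
import subprocess
import time
import urllib.error
import urllib.request

SERVICE_CMD = ["java", "-jar", "hot-deals.jar"]                     # hypothetical start command
ENV = {**os.environ, "INVENTORY_URL": "http://10.255.255.1:8080"}   # non-routable dependency address
HEALTH_URL = "http://localhost:8081/actuator/health"                # hypothetical health endpoint

process = subprocess.Popen(SERVICE_CMD, env=ENV)
try:
    deadline = time.time() + 120                                    # give it two minutes to come up
    while time.time() < deadline:
        try:
            if urllib.request.urlopen(HEALTH_URL, timeout=2).status == 200:
                print("service starts even though its dependency is unreachable")
                break
        except urllib.error.URLError:
            pass                                                    # not healthy yet, keep polling
        time.sleep(2)
    else:
        raise SystemExit("service never became healthy without its dependency")
finally:
    process.terminate()
```

If the service never reports healthy on its own, you have found a startup-order dependency before production did.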
Now, let's take an
example. Let's take a look at this showcase. It's a
Kubernetes-based example. There's a microservice
architecture inside; there are multiple services connected
to each other. Now let's identify our
key services. We need to know: what are our key players, which
services do we have to
deal with and take care of? You can see there
are some lines in between, there are some connections. So let's check
on a deeper level. You can see the entry service is
a gateway. The gateway is connected to four internal services.
That's, of course, one of our key services, because it's the entry point.
In front of the gateway, there's a load balancer, and the load balancer is distributing
the load across all the instances of the gateway, in this example
just two. But what's going on
inside after the gateway is called?
There are hot deals, fashion and toys,
and they are all connected to one specific service,
the inventory service. That's also a key service,
because if the inventory is not working, or maybe responding
slowly, there's an impact, maybe a high impact,
on hot deals, fashion and toys.
And that's something we need to know. It's not something like, let's say,
let's hope everything is fine. No, hope is not a strategy. So we have
to check it upfront. How can we
do this proactively?
We need to test under real conditions, as early as possible.
And to be honest, happy path testing is not enough.
We need to improve the confidence in our
systems. So we need to test under real conditions,
like those we are able to see in production, because production is not a happy place.
It's not that everything is fine in production. No.
And this is where experimentation comes in.
And I believe this is the true reason why chaos engineering exists:
to check what's going on under real
conditions, like in production.
Let's talk about the real conditions.
Let's go one step deeper, and let's use a
very technical example, based on the already known showcase from
some slides ago.
We know that the inventory service is one of our key services,
and therefore we need to know
in advance how it is called and what effects a non-normal behavior of the
inventory service will have.
We know from our monitoring, and also from our load
tests that the average response time of the inventory
service is about 25 milliseconds. Under these conditions,
we have a reliable behavior in the whole system.
That's what we already know from production,
what we are able to see in production and in our monitoring solution,
but also from our monitoring, we,
or I, have also learned that there
were some spikes in the response time of
the inventory service. And this data tells us, okay, we had some
response times up to 500 milliseconds,
and that's knowledge, or that's data we
can use. So let's use this knowledge and create an experiment to simulate
the high response time of the inventory service by injecting
latency of up to 500 milliseconds.
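As an illustration only, here is one way to approximate that latency injection locally: a tiny reverse proxy that sleeps for 500 milliseconds before forwarding requests to the inventory service. The upstream URL and ports are assumptions, and real chaos tooling would inject the latency at the network level rather than in a proxy like this.

```python
# Illustrative stand-in only: a tiny local reverse proxy that adds ~500 ms of latency
# in front of the inventory service. Upstream URL and ports are assumptions; real chaos
# tooling injects the latency at the network level instead of in a proxy like this.
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080"   # hypothetical inventory service address
DELAY_SECONDS = 0.5                  # the spike we observed in monitoring

class LatencyProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(DELAY_SECONDS)                                    # the injected latency
        with urllib.request.urlopen(UPSTREAM + self.path, timeout=5) as upstream:
            body = upstream.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                                       # forward the upstream answer

if __name__ == "__main__":
    # Point hot deals, fashion and toys at port 9090 instead of the real inventory port.
    HTTPServer(("localhost", 9090), LatencyProxy).serve_forever()
```

Pointing the dependent services at the proxy's port is enough to reproduce the spike we saw in monitoring.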
What we don't know is what impact
the high response times have and whether the services that
depend on the inventory service can handle them. That is
exactly what we are now testing. And we will run
an experiment in a preproduction environment and proactively test
how our system behaves under such conditions.
Let's take a look at the experiment,
what experiment we are running. So,
it's a complex failure scenario, and we
can recreate such scenarios with the right
tools, and we can generate knowledge
about how our system is reacting.
So you can see there's a gray element, it's just a wait step. And then
the blue line is a load test, so we can reuse,
and we should reuse, our existing load tests
and execute them, but in combination with
some bad behavior which is injected
during the execution. That's no longer happy path
testing; it's like testing under real conditions.
The yellow ones are verification steps. So, for example,
check if every instance is up and running, if a
specific endpoint is responding or not.
And there's a special one.
We need to define the expected behavior of
the system. And if it's not given, the experiment
fails, because the reliability is not given.
So the system is not working as it should.
And that's something we can test. We can write
a test case for it, and the experiment will fail if this check
is not true.
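Put together, the experiment roughly follows this shape. The sketch below mirrors the described steps (a wait step, a simple load phase, and a verification of the expected behavior) in plain Python; the endpoint, duration, and thresholds are assumptions, not the actual experiment definition.

```python
# Rough skeleton of the experiment steps described above: a wait step, a simple load
# phase, and a verification of the expected behavior. Endpoint, duration and thresholds
# are assumptions, not the actual experiment definition.
import time
import urllib.error
import urllib.request

ENDPOINT = "http://localhost:8080/products/hot-deals"   # hypothetical endpoint under test
DURATION_SECONDS = 60
MAX_RESPONSE_SECONDS = 1.0                               # the behavior we expect despite latency

def run_load_and_verify():
    ok = total = 0
    deadline = time.time() + DURATION_SECONDS
    while time.time() < deadline:                        # stand-in for a real load test
        total += 1
        started = time.time()
        try:
            with urllib.request.urlopen(ENDPOINT, timeout=5) as response:
                if response.status == 200 and time.time() - started <= MAX_RESPONSE_SECONDS:
                    ok += 1
        except urllib.error.URLError:
            pass                                         # errors count against the success rate
        time.sleep(0.1)
    success_rate = ok / total
    # The experiment fails if the expected behavior is not given.
    assert success_rate >= 0.99, f"success rate {success_rate:.2%} below expectation"

if __name__ == "__main__":
    time.sleep(10)                                       # wait step: let the injected latency take effect
    run_load_and_verify()
    print("expected behavior verified while latency was injected")
```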
Let's take a look at the experiment run. The experiment was successfully
executed, and our system proved that
it can handle short-term increased response times from the inventory
service.
If we now encounter these conditions in production,
we know that our system can handle them, so we are now
safer. We have tested a specific, complex scenario
from the past in a preproduction environment,
and the system is able to survive, which is good.
Next example.
Now, the real conditions are again something
we found in the data of the past. In the failure data of
the past, another failure appeared in our monitoring solution,
and that one hit our system quite hard.
Now we need to transform this failure into an experiment to validate
whether we can handle the situation or not. This time the error was
in the hot deals service. The hot deals service was
not responding within a specific time, so there was a delay.
And this delay was limited to two zones of our
cloud provider, followed by a DNS outage in
our Kubernetes cluster. That was the failure scenario.
Now let's recreate this failure scenario,
let's rebuild it, and we will inject the failure
events that occurred. And we will also check whether
the desired behavior is provided
or not. Again, the yellow one is
a check, a verification: a specific endpoint is called,
and it needs to be reachable
and needs to respond with a 200 HTTP status
code. If not, it counts against
a success rate we can tweak, and we would like to
get 100%. Again, the gray
one is a wait step, and then in the second line there is a green
one where we will inject latency in
two specific zones for the hot deals service only,
followed, in the third line, by a
DNS outage for our Kubernetes cluster. So no DNS
communication inside this Kubernetes cluster is possible.
That is what the data from our monitoring was telling us.
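The DNS part of this scenario can be hard to picture, so here is a unit-test-sized stand-in, assuming a hypothetical service hostname: it makes every DNS lookup fail and shows where a resilient client would need a fallback. In the real experiment, the outage is injected at the Kubernetes/CoreDNS level, not in application code.

```python
# Unit-test-sized stand-in for the DNS outage: make every DNS lookup fail and observe
# how client code reacts. The hostname is hypothetical; in the real experiment the outage
# is injected at the Kubernetes/CoreDNS level, not in application code.
import socket
import urllib.error
import urllib.request

def failing_getaddrinfo(*args, **kwargs):
    raise socket.gaierror("simulated DNS outage")

original_getaddrinfo = socket.getaddrinfo
socket.getaddrinfo = failing_getaddrinfo                 # every lookup now fails
try:
    urllib.request.urlopen("http://hot-deals.shop.svc.cluster.local/deals", timeout=2)
    print("unexpected: the call succeeded despite the DNS outage")
except urllib.error.URLError as error:
    # This is the point where a resilient client would fall back to a cache or degrade gracefully.
    print(f"call failed as expected during the DNS outage: {error.reason}")
finally:
    socket.getaddrinfo = original_getaddrinfo            # restore normal name resolution
```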
Now let's execute. And you can see, okay,
a lot of red elements, so something went wrong and you can see in
the top corner the experiment failed. So it's clear that our system is
once again unable to handle this failure; the
behavior we expect is not given, and our customers are not able
to use our service. Now it's up to
us to adapt our service to deal with the failure,
or at least to mitigate the impact.
As we said in the beginning, failures are normal and will always
occur, but we have it
in our hands to control the effects. Failures
will happen again, and our system
can learn to deal with them.
When should we run these experiments? As
an integral part of our software development process, that is the answer. And this
question, when should we run these tests,
is not new; it used to arise with unit tests and integration
tests as well. But over time, we
and the industry have started to run these tests before we
check in code, or in the case of integration tests,
before we merge or cut a new version of our system.
So I look at an experiment as,
let's say, a real end-to-end integration test based on production
situations, conditions that we want to be resilient
to, that we would like to be able to handle or mitigate. With
this described approach of creating experiments
and running them in a preproduction environment, we test the
reliability of our systems, and this results
in a list of experiments. So we are continuously creating new experiments
that we can integrate into our deployment process and run
continuously after a new deployment, to
check how big the risk is we are taking: is the
system still able to survive this past incident or not?
Is it able to survive a specific scenario?
And yeah, to make it an integral part of your development process,
you can trigger the experiment execution via API, or via
a GitHub Action that automatically starts
one or more experiments after running a new deployment in a
preproduction stage. After the execution of, let's say,
ten experiments, you will get a result back:
five of them are successful,
maybe five of them have failed, and now you can handle the risk
before you go into production.
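As a sketch of that integration, the following pipeline step triggers a list of experiments over HTTP and blocks the promotion if any of them fail. The API base URL, experiment keys, and response format are hypothetical placeholders, not the actual Steadybit API.

```python
# Sketch of wiring experiments into the deployment pipeline: trigger a list of experiments
# over HTTP and block the promotion if any of them fail. The API base URL, experiment keys
# and response format are hypothetical placeholders, not the actual Steadybit API.
import json
import os
import sys
import urllib.request

API_BASE = os.environ.get("EXPERIMENT_API", "https://experiments.example.com")  # hypothetical
TOKEN = os.environ["EXPERIMENT_API_TOKEN"]                                       # injected by the CI system
EXPERIMENTS = ["inventory-latency-500ms", "zone-latency-plus-dns-outage"]        # hypothetical keys

failed = []
for key in EXPERIMENTS:
    request = urllib.request.Request(
        f"{API_BASE}/experiments/{key}/run",
        method="POST",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)                     # assume the run result comes back as JSON
    if result.get("status") != "SUCCESSFUL":
        failed.append(key)

if failed:
    sys.exit(f"experiments failed, blocking promotion to production: {failed}")
print("all experiments passed, safe to promote")
```

A GitHub Action would simply run a script like this after the preproduction deployment job.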
Let's recap. Let's summarize once again.
Failures are normal and they will always occur.
It's how we handle them and how we build and operate our systems that
makes the difference. Just sitting there
and waiting until the next error or failure occurs in production is not
a strategy. We need to
proactively deal with possible failures and
test what impact they have on our system and, of course,
on our customers. So embrace failures and turn them
into experiments to understand how your system
reacts under such conditions.
And as said, chaos engineering is necessary. And with
our tool Steadybit, there's a very easy
way to get started with chaos engineering and improve
your system. If you have any questions, please reach out to me.
And yeah, thanks a lot for having me, and enjoy the conference.