Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. My name is Maria, and today I'm going to talk about architecting for resilience: strategies for fault-tolerant systems.
Before we go into the details, it is important to mention that it's not enough to design the system; you also need to maintain it, and to have the proper tools to maintain it properly. Otherwise, even if you do the best possible system design, your system might still fail. So today in my talk, I will touch on both of those aspects.
So let's go into the details. First, let's review the types of failures we might face while dealing with fault-tolerant systems. The first and most obvious one is hardware: a hard drive fails, or something like that. The second is network: our servers are fine, but the connections between them, or to them, are not. Another one is human error, and this happens quite often. I would also mention natural disasters like tsunamis, floods, or lightning; they might look low-probability, but depending on our requirements, we might need to take care of them as well. And another one is security breaches.
Okay, so we have defined the types of failures, and after that, what I suggest thinking about is requirements. Here we need to ask ourselves a question and answer it honestly: do we really want to protect ourselves from every possible failure? Do we really need a fault-tolerant system? In an ideal world, where we have an unlimited amount of resources, unlimited time, and an unlimited number of engineers to work on our system, we can design a really perfect system. But in the real world those things are limited. So we will need to prioritize which types of failures we are going to take care of first, what we want from our system, whether it is really required, and whether it is really worth it from a business perspective.
Also, one thing I suggest doing in the context of system requirements is to list all the system features and prioritize them against each other. Let me give you an example. Let's say we have an online shop for food delivery. If our customers can't check out, it's probably a big problem: people can't order food, and the business just stops. But if search is not working, or if search suggestions are not working, is that the same severity of problem as a checkout problem? I suggest ranking these. It will help prioritize the effort and understand which parts of the system we need to duplicate and which we don't, because, as I said, we don't have unlimited resources, and defining our requirements is important here.
Okay, now let's talk a bit about tools.
The first thing is monitoring. Monitoring is important because without it you wouldn't be able to tell whether your system is actually working or not. And you don't only want monitoring, you also want automatic detection on top of it. That means you will need to set up thresholds, deal with false positives, and so on. It can take time, but if we want fault-tolerant systems, we need automatic detection, because we can't just sit there day and night watching the dashboards.
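To make the thresholds-and-false-positives point concrete, here is a minimal sketch of automatic detection; the metric source and paging hook are hypothetical placeholders for whatever your monitoring stack actually provides.

```python
# Minimal sketch of threshold-based automatic detection.
# The metric source and paging hook are passed in as callables because they
# are hypothetical; a real setup would come from your monitoring stack.
import time

ERROR_RATE_THRESHOLD = 0.05   # alert if more than 5% of requests fail
CONSECUTIVE_BREACHES = 3      # require several bad samples to reduce false positives

def should_alert(recent_error_rates):
    """True only if the last few samples all breach the threshold."""
    tail = recent_error_rates[-CONSECUTIVE_BREACHES:]
    return (len(tail) == CONSECUTIVE_BREACHES
            and all(rate > ERROR_RATE_THRESHOLD for rate in tail))

def watch(get_error_rate, page_oncall, interval_s=60):
    samples = []
    while True:
        samples.append(get_error_rate())
        if should_alert(samples):
            page_oncall(f"error rate {samples[-1]:.1%} above threshold")
            samples.clear()
        time.sleep(interval_s)
```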
The next thing is testing. Testing is important. Everybody understands this, but I think it's worth repeating.
If we look at it from the fault-tolerance perspective, there are several things I'd like to mention. The first is the testing environment. We need a testing environment that is as identical to production as possible, like a small copy of the cluster or something similar. We don't want to touch production, but we want to run all our tests in an environment that is as close to production as possible. I would say: invest in your test environment. Another thing is stress testing. It sits a little to the side of fault tolerance, but it's also important. For example, if we are talking about the online shop again, and about Black Friday, we need to make sure that our system will withstand all the load, and stress testing is something that can identify the different bottlenecks.
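To give a feel for it, here is a very small load-generation sketch; the staging URL is an assumption, and for real stress testing you would normally reach for a dedicated tool such as Locust or k6.

```python
# Sketch of a very small stress test: hit one endpoint with concurrent
# requests and report tail latency and failures. The URL is an assumption.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

def one_request(url):
    start = time.monotonic()
    try:
        ok = requests.get(url, timeout=5).ok
    except requests.RequestException:
        ok = False
    return time.monotonic() - start, ok

def stress(url="https://staging.example.com/checkout", n=200, workers=20):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_request, [url] * n))
    latencies = sorted(latency for latency, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    p95 = latencies[int(0.95 * len(latencies))]
    print(f"p95 latency: {p95:.3f}s, failures: {failures}/{n}")
```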
And another thing I've found is rarely mentioned, but it's worth mentioning: test your tools. Test that your monitoring system is actually monitoring; test that your recovery script actually works. Your tools are what you will rely on.
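To show what "test your tools" can look like in practice, here is a small sketch that feeds synthetic data to the hypothetical should_alert() helper from the monitoring example above and checks that it would actually fire.

```python
# Sketch: verify that the detector actually detects.
# should_alert() is the hypothetical helper from the monitoring sketch above.
def test_detector_fires_on_sustained_errors():
    healthy = [0.01, 0.02, 0.01]
    broken = healthy + [0.20, 0.25, 0.30]   # sustained breach
    assert not should_alert(healthy)
    assert should_alert(broken)

def test_detector_ignores_single_spike():
    spiky = [0.01, 0.30, 0.01]              # a one-off spike should not page anyone
    assert not should_alert(spiky)
```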
So yeah, invest in your tools. That brings us to the next point: invest in automation as well. Automate as many things as possible. And of course, think about auto-recovery and how you would deal with it if, I don't know, the database is corrupted and you need to switch to the backup. Can it be automated? How can it be automated? Invest in this.
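Here is a minimal sketch of what an automated switch to a backup database could look like; the health probe, promotion step, and notification hook are hypothetical placeholders for your actual database tooling, and real failover needs more safeguards than this.

```python
# Sketch of automated failover to a backup database.
# is_healthy(), promote_replica() and notify() are hypothetical placeholders;
# real failover needs fencing, data-loss checks and more safeguards.
import time

def failover_loop(is_healthy, promote_replica, notify,
                  check_interval_s=30, failures_before_switch=3):
    failures = 0
    while True:
        if is_healthy("primary"):
            failures = 0
        else:
            failures += 1
            if failures >= failures_before_switch:
                notify("primary unhealthy, promoting replica")
                promote_replica("replica-1")
                return  # hand back to humans and the runbook after switching
        time.sleep(check_interval_s)
```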
And invest in runbooks and documentation. Documentation is important in general, but runbooks really matter when you have a disaster going on and need to fix the system ASAP: you don't want to spend time looking for some specific command or trying to recall how many parameters a specific function takes. Or if a new engineer joined recently, they might not know all the details, and when they need to do something, a runbook is helpful.
And the next thing is chaos engineering. Chaos engineering is a practice where you inject errors and failures into the system to see how it recovers. All the big companies do this, and they do it for one reason: because it is useful. It helps identify problems that were not obvious before, so you can fix them. And chaos engineering is a controlled process: if you do something that causes your system to fail, you can roll it back. Let me give you an example. If you have a system that is distributed among three data centers but designed to keep working with two, a good test is to disconnect one data center and check whether the system still works. That's one example of what can be done.
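A sketch of that data-center experiment might look like this; the traffic-blocking and health-check callables are hypothetical placeholders for real fault-injection tooling.

```python
# Sketch of a controlled chaos experiment: disconnect one data center,
# watch whether the system stays healthy, and always roll back at the end.
# The callables are hypothetical placeholders for real fault-injection tooling.
import time

def run_dc_failure_experiment(block_traffic_to, restore_traffic_to,
                              system_is_healthy, dc="dc-3", observe_s=300):
    block_traffic_to(dc)                   # inject the failure
    try:
        deadline = time.time() + observe_s
        while time.time() < deadline:
            if not system_is_healthy():
                return False               # design assumption violated: investigate
            time.sleep(10)
        return True                        # survived on two data centers
    finally:
        restore_traffic_to(dc)             # roll back even if the check failed
```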
And the next thing is incremental rollout. This protects us if there is a problem with a build of our software. The idea is to roll out to a small percentage of users, wait for some time depending on your system and on feedback, and, if all the metrics are okay, roll out to more users. The rollout will take longer this way, but if there is a critical bug in our system, it will save us: the bug will be detected on a small number of people, and the system will not go down fully.
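A minimal sketch of that gate-and-expand loop, with the deploy and metrics hooks as hypothetical placeholders for your release tooling:

```python
# Sketch of an incremental rollout gated on metrics.
# The three callables are hypothetical placeholders for real release tooling.
import time

STAGES = [1, 5, 25, 50, 100]   # percentage of users receiving the new build

def incremental_rollout(set_rollout_percentage, metrics_look_ok, roll_back,
                        soak_time_s=1800):
    for pct in STAGES:
        set_rollout_percentage(pct)
        time.sleep(soak_time_s)    # let this stage soak before expanding
        if not metrics_look_ok():
            roll_back()            # a critical bug is caught on a small slice
            return False
    return True                    # fully rolled out
```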
Okay, so we've talked about tools and processes; now let's talk about system design. Here are some concepts that you should keep in mind when designing a fault-tolerant system. First of all, architecture. Make sure it is modular, that it is distributed (for example, if we are talking about different data centers, the system has to be designed specifically to work across them), and that it is scalable. Yes, our favorite Black Friday example.
Another thing to keep in mind is dependency management and single points of failure. Here I would suggest an exercise: after you finish the draft design for your system, look at it and ask yourself what's going to happen if each of the components goes down. What happens if the database is down? What happens if this part is down? What happens if that process goes down? The answer should be that it would be okay. If the answer is that the whole system goes down, it means you probably need to think of some possible ways of recovery or duplication there, or something like that. So make sure that all the dependencies of your system are taken care of. And I'm not only talking about internal dependencies, I'm also talking about external ones. For example, if you're using some external API, think about what's going to happen if this API starts returning errors. Is your system still going to work or not? Try to design your system in a way that it wouldn't fail.
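As one common answer to that external-API question, here is a minimal sketch that wraps the call with a timeout and falls back to the last known good response; the URL and the cached fallback are assumptions made for the example.

```python
# Sketch: don't let a failing external API take your own system down.
# The URL and the cached fallback are assumptions made for this example.
import requests

def get_exchange_rates(cache, url="https://api.example.com/rates", timeout_s=2):
    try:
        resp = requests.get(url, timeout=timeout_s)
        resp.raise_for_status()
        cache["rates"] = resp.json()       # remember the last good answer
    except requests.RequestException:
        # The external dependency is failing: serve the last known value
        # instead of propagating the error to the user.
        pass
    return cache.get("rates", {})
```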
Another concept is graceful degradation and failover. Failover is when some process is down and we have a duplicate of that process that takes over and serves the requests. But graceful degradation I find to be quite an interesting concept. It's when some parts of your system are failing, but users are still able to use it; they just might have a reduced user experience. For example, if your system is running out of memory and a request is too heavy, you might have the system return some simplified version of the response at that point, and it will still work. Maybe the results won't be as nice, but the system will still work.
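A tiny sketch of that idea, with the memory-pressure check and the simplified search path as assumptions for the example:

```python
# Sketch of graceful degradation: under pressure, return a simpler, cheaper
# answer instead of failing the request outright.
# The callables are assumptions made for this example.
def handle_search(query, memory_pressure_high, full_search, cheap_search):
    if memory_pressure_high():
        # Degraded mode: skip ranking and personalization, keep the feature alive.
        return {"results": cheap_search(query), "degraded": True}
    return {"results": full_search(query), "degraded": False}
```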
One thing to pay attention to here is adaptive UI, for when your system is partially working. For example, if we are talking about the online shop, and for each item you have a button like "find similar" that shows similar items: if this feature is not working, consider what you want to do with that button. Do you want to gray it out? Do you want to hide it? What do you want to show your users if some parts of the system are not working?
Okay, the next concept is diversity. Usually, running different systems, different technologies, different clouds can be a problem because it means more maintenance. But if we are talking about fault tolerance, it might be beneficial. Let me give you an example. We have some company with internal resources and an internal chat. If for some reason the internal resources go down, what's going to happen? Employees would not be able to communicate with each other to actually recover the system. In this case it is important to have some backup channel, to make sure employees can communicate, prepare a recovery plan, and restore the system. Another example: if you're running your servers in some cloud, maybe you want to duplicate something in another cloud, and I mean not only in another data center. Yes, it will give you additional overhead, but it will protect you in the edge case where that specific cloud goes down.
Okay, and here's redundancy, probably the main topic for fault-tolerant systems. Redundancy is the duplication of parts of the system to withstand problems. Let's see what kinds of redundancy we might have. We can have hardware redundancy: the well-known RAID, or hot standby, which is when we have another server totally identical to the first one, and if the first goes down, the second one takes over all the work. In software, we can have process replication: again, if a process goes down, another one serves the requests. And we have N-version programming: if we have a critical system, it might be implemented in several ways, using different programming languages or something like that, and when we need the output of this system we actually compare the results of all the implementations. If they are not identical, it means there is an error somewhere.
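A toy sketch of that comparison, where the three implementations are trivial stand-ins; in real N-version programming the versions are developed independently, often in different languages:

```python
# Toy sketch of N-version programming: run independent implementations of the
# same computation and compare their outputs. These versions are trivial
# stand-ins; real ones would be developed independently.
def square_v1(x):
    return x * x

def square_v2(x):
    return x ** 2

def square_v3(x):
    return abs(x) * abs(x)

def n_version_result(x, versions=(square_v1, square_v2, square_v3)):
    outputs = [version(x) for version in versions]
    if len(set(outputs)) != 1:
        # Disagreement means there is an error somewhere in at least one version.
        raise RuntimeError(f"implementations disagree: {outputs}")
    return outputs[0]
```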
Next is network: some backup network paths, so if the main path goes down, we switch to the backup. For data, mirroring is when we have a second database identical to the first one, with all the data in sync, and if the first goes down, we switch to the second. And data backups, of course; backups are always useful. And geographical: we can have geographical redundancy, which is multi-data-center deployment, or maybe you even want to deploy in different countries; it's up to you. Remember the tsunami.
And here at the end, I wanted to mention a few cases that really happened and that I found really interesting, because they are the kind of thing where you think: what's the probability of this? Is it really possible? And actually, yes. When we are talking about fault-tolerant systems, sometimes we need to think about very low-probability problems. The first is Google: on 13 August 2015, four successive lightning strikes hit a Google data center. Another example is Twitter and Instagram, which both had problems on July 14, 2022, for totally unrelated reasons; totally independent systems still went down at the same time. So what's the probability of that? Pretty low, but it happened. And something that happened recently is that a lot of item descriptions on Amazon were replaced with "I apologize, but I can't fulfill this request." It happened just a few days ago, and it happened because people automated the descriptions of their items using ChatGPT. But yeah, external APIs; it is something we already talked about.
Okay, so I will stop here.
I hope it was useful for you.
Thank you for your time.