Transcript
This transcript was autogenerated. To make changes, submit a PR.
I want to share with you some stories, some findings, and some of my experience about how to make an application, a system, more available: how to improve its characteristics on several layers, why it is important to know about each of these layers, and how to incorporate them into your work as a platform engineer.
Let me introduce my talk: Many Layers of Availability, or how to achieve so many nines in practice. And as we are platform engineers, the only thing we care about is nines. Let me start.
Many people, while talking about achieving nines, talk about microservices: let's migrate to microservices, let's start with microservices. And often that doesn't solve the issue, because sometimes it's a little more complicated than that. It is like the Saga pattern: it works well in theory but often works poorly in practice, because the real world is just a more complex place than the books describe.
So I want to share what I found during my work on some very interesting projects. And who am I? My name is Ilya Kozachev. I'm a cloud native application architect at Capgemini, and previously I worked at MTS Cloud, Deutsche Telekom, and some other interesting companies. I'm a Google Developer Expert on cloud, and I have worked for some time in software engineering, software architecture, platform engineering, and public cloud building.
One of my recent projects was building a public cloud provider from scratch; Kubernetes as a Service, for example, was a core part of this project. And I learned a lot about availability, reliability, and the many layers that we often don't see because we don't have direct access to them. It's very important to know those layers, to understand them, and to understand how to design your applications so that they include, or rely on, the features of those layers. So let me share it with you.
But what about nines? Let me start with the application, then, and tell you what about nines.
When we're talking about applications, we always talk about patterns, approaches, architecture, standards, et cetera. So let's assume we have a poorly designed monolithic spaghetti-code application and we want to make it highly available. We will start by re-architecting this application and moving parts of the spaghetti code into separate spaghetti balls, so that each one can fail on its own without bringing down the whole application. If something fails inside the spaghetti-code monolith, on the other hand, it is a single point of failure and the whole application fails. So first we want to avoid single points of failure and design the application in such a way that if something breaks, the amount of broken stuff is very limited; we want to keep that number very low.
Next is redundancy. This is our main tool for achieving higher availability and higher reliability. It's that simple: humanity still hasn't invented anything better. So we have to run more replicas together, more pieces of the same application, so that if one of these replicas fails, we don't have a big issue: we can redistribute the load to the rest of the replicas and it will be fine. Second, it helps to process requests. One application can be overloaded, but when we have multiple similar applications, we can distribute the load among them and be sure there is enough capacity to process it. And if there is not enough capacity, we can scale: we can add more replicas. Since our system can already work with several replicas, we can add more replicas if needed, or remove replicas if we don't need them. If you have Black Friday, you can increase the number of replicas; but if you have, I don't know, some summer vacation time, you can remove replicas, because people maybe don't buy from your store then and you don't have to pay your AWS bills.
But to make this possible, you also have to deliver requests to these replicas somehow, and that is done by means of a load balancer. This tool distributes requests between the different replicas. It can simply circle through the list of addresses, or it can implement smarter balancing algorithms, for example distributing more or fewer requests based on the CPU load of each replica, or on something else. It also allows you to implement dynamic scaling: you can add or remove replicas dynamically, again based on CPU load for example, and the load balancer will automatically add the addresses of new replicas to the load-balancing list, or remove the addresses of removed replicas, to avoid sending requests to nowhere.
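As a minimal sketch of the simplest of these algorithms, here is a round-robin load balancer in Go; the backend addresses are made up for illustration, and a real balancer would add health checks and smarter strategies on top:

```go
// A toy round-robin load balancer: circle through a list of replica
// addresses and proxy each incoming request to the next one.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	// Hypothetical replica addresses; not from the talk.
	backends := []*url.URL{
		mustParse("http://10.0.0.1:8080"),
		mustParse("http://10.0.0.2:8080"),
		mustParse("http://10.0.0.3:8080"),
	}

	var counter uint64
	handler := func(w http.ResponseWriter, r *http.Request) {
		// Pick the next backend in the circle, atomically.
		i := atomic.AddUint64(&counter, 1) % uint64(len(backends))
		httputil.NewSingleHostReverseProxy(backends[i]).ServeHTTP(w, r)
	}

	http.ListenAndServe(":8000", http.HandlerFunc(handler))
}
```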
Next, we want to reduce the dependencies between applications, between different modules, so that they interact only via APIs, not via a shared database or anything else. Then, if we change something, we can do it transparently and be sure that our change will not break other systems by surprise, because the dependency between the two systems is clearly defined as an API.
Second, we want to eliminate implicit dependencies: strange logic duplicated across several applications, or shared libraries with internal magic variables, things like that. Sometimes it's useful, but it has to be limited, and you have to understand the trade-off. If it's a configuration thing and you just want to put your standard values somewhere, it's okay; but if you want to share code between applications, share logic between them, be very careful, because when this shared code evolves, it may bring unexpected problems to your applications.
Next, try to consolidate features. If features belong to one topic, one domain, put them together: into one microservice, into one module. It helps to avoid problems: miscommunication between modules, and evolutionary problems as your modules and applications grow and change. If you spread the features of one domain across several applications, those applications will grow closer together and bring you a thing called a distributed monolith, which will decrease availability by bringing more errors and problems. Try to avoid that; try to use domain-driven design, for example.
And last but not least, synchronous versus asynchronous communication. Synchronous calls like HTTP requests are okay, nothing bad about them. But if you have a chain of requests, one error can break the whole chain and you get a cascading failure. That's not very good, and that's why people use asynchronous communication via message brokers. It can be message-based, which is pretty similar to an HTTP request: you send a message, you send a request, and you expect to receive an answer, but you don't stop your work waiting for it. You continue working, do something else, and when the response is there, you can start processing it. The application can even restart in between, and there is no problem at all. The second style, event-based, is even better. The difference is that it's one-directional, fire and forget: you only send the message, and you as the sender don't care what happens next. Maybe it will be processed, maybe it will be processed by multiple applications, or maybe no one cares; it doesn't matter to you as the sender.
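Here is a minimal in-process sketch of the fire-and-forget style in Go, with a buffered channel standing in for a broker topic (in production this would be Kafka, RabbitMQ, or similar); the point is only that the sender never blocks on, or even knows about, its consumers:

```go
// Fire-and-forget: the producer publishes an event and immediately
// moves on; consumers process it whenever they get to it.
package main

import (
	"fmt"
	"time"
)

type Event struct{ Name string }

func main() {
	topic := make(chan Event, 100) // buffer absorbs bursts, like a broker

	// Consumer: runs independently of the producer.
	go func() {
		for e := range topic {
			fmt.Println("processed:", e.Name)
		}
	}()

	// Producer: send and continue working, no reply expected.
	topic <- Event{Name: "order-created"}
	fmt.Println("sender keeps working without waiting for an answer")

	time.Sleep(100 * time.Millisecond) // demo only: let the consumer drain
}
```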
So this is the first layer of our cake, the application.
Let's see what we have next.
Next, the platform. For every application, it is very important where it is running, and it is always running within some kind of infrastructure: a platform, virtual machines, hardware, et cetera. Without a stable and reliable platform, all your efforts toward higher availability on the application layer will count for nothing, because the unreliable platform will fail you. Let's take a look.
First, cluster architecture. A lot of platforms, like Kubernetes, Kafka, and Hadoop, are built in a cluster manner because it helps to prevent outages. If one node in the cluster fails, the cluster can very easily redistribute the load between the other nodes while it restarts or recreates the failed node, and there will be no, or almost no, problem for the availability of your applications. If your application is running in several replicas on top of a Kubernetes cluster, for example, there will be no issue for your application if one node with one replica fails (if you use autoscaling, of course).
Second is load distribution. A cluster also schedules applications in a way that evenly distributes the load between the nodes, so there will be no outages when one node overflows with work, and it also helps to avoid some not very pleasant surprises in production. Next, network replication. Again a Kubernetes example: each worker node is a network gateway into the internal Kubernetes cluster network. So when you send a request, node 1 may take this request and then redirect it to node 2, where your pod is running. And when node 1 fails, there is no issue, because node 2 is also a gateway into the network, and node 3 is a gateway, and you can load-balance the request to any node. That's what happens in such an environment.
Next, orchestration. We have started talking about orchestrators, so let's go a little further. Orchestrators are very nice because they help to control the life cycle of the applications. Yes, you can basically put the application onto an empty virtual machine with systemd or something like that, and it will even restart the application when it crashes, but an orchestrator helps with much more: for example, controlling the readiness of the application, auto-scaling the application, restarting it when it fails, of course, and more.
Next, auto-scaling. As we have said a couple of times, it's a very important thing: based on the metrics of your application, for example CPU load, the number of incoming requests, or the number of consumed messages, the orchestrator can scale your application by adding more replicas, or by removing some replicas if there is no need for such capacity. This is very useful if you have spikes in your load and you don't want to lose clients because you can't process their requests; see the sketch of the scaling rule below. One more very nice thing is anti-affinity, where the cluster takes care that your replicas run, if possible, on different nodes. If one node fails, it is then less likely that several replicas of your application were running on that node, and this has a big impact on your application's availability. This is also hard to achieve without orchestration.
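As a small sketch of how metric-based scaling decides on a replica count, here is the rule that Kubernetes' Horizontal Pod Autoscaler documents, expressed in Go; the numbers are made up for illustration:

```go
// desired = ceil(current * observed / target), the documented HPA rule.
package main

import (
	"fmt"
	"math"
)

func desiredReplicas(current int, observed, target float64) int {
	return int(math.Ceil(float64(current) * observed / target))
}

func main() {
	// 4 replicas at 90% average CPU against a 60% target -> scale to 6.
	fmt.Println(desiredReplicas(4, 0.90, 0.60))
}
```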
And the last very nice thing here is self-healing. If something goes wrong with a machine, a virtual machine, and it goes down, it can be restarted on the platform layer or on the infrastructure layer, because the infrastructure monitors its health by means of health checks. A load balancer, for example, does health checks to understand whether your node is running, whether it is healthy, whether requests can be sent to it, or whether it should avoid sending requests to a dead node so they will not be lost. And there is also health monitoring inside, to see how the node is doing, whether it is okay or something is wrong, and whether it's time to restart the node or stop sending it requests.
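A minimal sketch of what such a health check hits on the application side, in Go; the split into liveness and readiness follows the common Kubernetes convention, and the readiness logic here is a made-up placeholder:

```go
// /healthz says "the process is alive"; /readyz says "safe to send
// traffic right now". Balancers and orchestrators poll endpoints like
// these to decide whether to route to, restart, or replace a replica.
package main

import (
	"net/http"
	"sync/atomic"
)

var ready atomic.Bool // would be flipped once DB, broker, etc. are up

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	ready.Store(true) // placeholder: pretend dependencies came up
	http.ListenAndServe(":8080", nil)
}
```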
And it's very useful, because we always work in a dynamic environment, and even cloud infrastructure is not 100 percent invulnerable to failures. It fails from time to time, because the underlying hardware fails and the software on top of that hardware fails, so everything is in motion: something breaks, something restarts, something is being recreated, and such tools help a lot with this.
That's the second layer.
The third layer is the data. We often forget about the data when we're talking about availability; we only talk about the application. That's not 100 percent correct. In fact, data is key for the application; data is key for almost everything in our work as software engineers and platform engineers. So let's talk a little about data.
Again, redundancy is the most beloved tool in the data area too. We make database replicas: we run several instances of the database and, at runtime, in real time, share the changes between these replicas, so that if one replica fails, we are still able to redistribute the requests to the next replica and everything will be okay.
Second, message broker replicas work similarly, though a little differently. The point is that we have multiple brokers, for example in a Kafka cluster, and if one broker fails, the cluster will continue to work and your topics will continue to work; again, this is achieved by redundancy.
Backups: besides real-time replicas, we also have archived replicas, so that if the whole data center burns down, we can still restore the data from a backup.
And the next thing is caching. Sometimes the services you rely on are not available for some reason: maybe they are down, maybe the network connection to them is unstable. So what do we do? We save some data for later, to avoid this kind of cascading outage. The first option is application caching: you simply save the data in the application's memory and use it later, either for faster access or when the underlying service is not available anymore.
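Here is a small sketch of that idea in Go: a cache that serves a stale copy when the downstream call fails. All names and the TTL policy are illustrative, not from the talk:

```go
// An in-memory cache with a "serve stale on error" fallback.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string
	fetched time.Time
}

type Cache struct {
	mu   sync.Mutex
	data map[string]entry
	ttl  time.Duration
}

func NewCache(ttl time.Duration) *Cache {
	return &Cache{data: make(map[string]entry), ttl: ttl}
}

// Get returns a fresh cached value if it has one, otherwise calls
// fetch; if fetch fails, it falls back to the stale value instead of
// propagating the downstream outage.
func (c *Cache) Get(key string, fetch func() (string, error)) (string, error) {
	c.mu.Lock()
	e, ok := c.data[key]
	c.mu.Unlock()

	if ok && time.Since(e.fetched) < c.ttl {
		return e.value, nil // fresh hit
	}
	v, err := fetch()
	if err != nil {
		if ok {
			return e.value, nil // downstream is down: serve the stale copy
		}
		return "", err
	}
	c.mu.Lock()
	c.data[key] = entry{value: v, fetched: time.Now()}
	c.mu.Unlock()
	return v, nil
}

func main() {
	c := NewCache(0) // ttl 0 forces a refetch every call, to demo the fallback
	c.Get("user:42", func() (string, error) { return "Alice", nil })
	v, _ := c.Get("user:42", func() (string, error) { return "", errors.New("service down") })
	fmt.Println(v) // prints "Alice": stale data instead of a cascading failure
}
```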
Second is database caching. For many database engines you can configure caching on the database layer, again either for faster access or to collect data from several instances and avoid a single point of failure. And then there is standalone caching, like having a Redis instance, for example. It also helps when one service fails and you can no longer get the data from it directly.
There are many more options, but I want to cover the basics.
And of course we have to talk about content delivery networks, CDNs. With the replicas a CDN holds, your backend is not even needed: the data can be served from its network of caches around the globe. Normally it's more about front ends and static media content, but it's still a very useful technique, and it's worth keeping in mind that it can save your system from a big outage. For example, when your startup launches a campaign, you might expect a thousand people hitting your index page after some successful post on social media, and your application may not cope with that. A CDN can actually save you by redirecting those requests to saved copies of, basically, your index page.
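For that to work, the origin has to tell the CDN what it may cache. A minimal sketch in Go, with illustrative header values; stale-if-error (RFC 5861) asks caches to keep serving an old copy if the origin starts failing:

```go
// Mark the index page as cacheable by shared caches (CDNs), so edge
// nodes can keep serving it even when the origin is overloaded or down.
package main

import "net/http"

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Cache publicly for 5 minutes; on origin errors, a stale
		// copy may be served for up to a day.
		w.Header().Set("Cache-Control", "public, max-age=300, stale-if-error=86400")
		w.Write([]byte("<h1>Index page</h1>"))
	})
	http.ListenAndServe(":8080", nil)
}
```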
And that was the third layer. The next one is infrastructure: the virtual machines, the networks, and the disks, the beloved trio, the basis of everything we have. Why is it important? Because here, too, you have to know the features, what they can bring you, and what they can mean for you.
For example, for compute, many platforms and many cloud providers, public or private, like VMware, offer a hot migration option. That means that if the host where your virtual machine is running fails, your virtual machine, with its memory and its state, will be started on another host in the same or almost the same state, at almost the same position in memory, and will continue to work almost without change. It's a very nice feature, but it may bring some surprises if you don't know about it.
And instance groups. This is a feature for managing multiple virtual machines with the same configuration: auto-scaling them, updating them, managing their network, etc. It also helps us to improve availability, for example by auto-scaling the fleet of machines where you run your application or platform.
Second is networking. Almost always, a software-defined network running over commodity machines is used instead of hardware switches (the hardware is used too, but not for these purposes). It also has a lot of built-in availability features, and such SDN tooling includes many very useful techniques that you may have no idea about, but that help the overlay platforms to work smoothly and the virtual machines to communicate without interruptions, even if something goes wrong.
Load balancers: we almost never use bare-metal load balancers anymore; we use software-defined load balancers, and like any other application, a load balancer can itself be a single point of failure, which we want to avoid. So we always have multiple load balancers balancing and distributing the load. Know about that, think about that. It may bring some complexity, but it normally brings a lot of benefits and reliability to our system.
And next, the disks, the storage. We always replicate stuff, we make things redundant, we like that, and yes, we have to replicate storage too. For example, if you save a binary file, the file is split into several smaller chunks, and each chunk is copied to several machines, so that if one machine fails, you don't lose a chunk and find yourself unable to assemble the file again. This is, for example, how Ceph works: it puts different chunks on different machines, each chunk is copied two or three times, and if a machine fails, you can still recreate the initial data from the rest of the machines, with no issue at all.
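A toy sketch of that placement idea in Go; the chunk size, replica count, and node names are invented, and real systems (Ceph's CRUSH algorithm, for instance) place data far more cleverly:

```go
// Split a blob into fixed-size chunks and assign each chunk to several
// distinct nodes, so that losing any single machine loses no chunk.
package main

import "fmt"

func main() {
	data := []byte("some binary file contents to store durably")
	const chunkSize = 10
	const replicas = 3
	nodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e"}

	for i := 0; i*chunkSize < len(data); i++ {
		end := (i + 1) * chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[i*chunkSize : end]

		// Place each copy on a different node (simple modular placement).
		for r := 0; r < replicas; r++ {
			node := nodes[(i+r)%len(nodes)]
			fmt.Printf("chunk %d (%q) -> %s\n", i, chunk, node)
		}
	}
}
```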
And a network file system is used to attach a disk to a machine without a direct physical connection between them. This is, for example, how Kubernetes mounts disks to nodes: if a node fails, you won't lose the data on the disk. Kubernetes will just remount it to another node, run the pod there, and you can continue using the data.
That was the fourth level of our cake, our availability cake. Let's go to a harder level: hardware. We don't have so much fancy stuff to discuss on the hardware level; there is definitely a lot there, but I'm not going to go deep. I just want to mention a couple of things.
First, RAID: again, replication. Unfortunately, hard drives are not reliable, and no other drives are reliable either, so we cannot trust them. We have to put several disks together and, again, split the data into tiny chunks and redistribute copies of these chunks between the disks. If one disk fails, we can still reconstruct the initial data from the rest of the disks.
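A toy demonstration of the reconstruction trick behind parity-based RAID levels such as RAID 5, in Go: the parity block is the XOR of the data blocks, so any single lost disk can be rebuilt by XOR-ing the survivors:

```go
// parity = d1 XOR d2; losing d1, we rebuild it as d2 XOR parity.
package main

import "fmt"

func xorBlocks(blocks ...[]byte) []byte {
	out := make([]byte, len(blocks[0]))
	for _, b := range blocks {
		for i := range out {
			out[i] ^= b[i]
		}
	}
	return out
}

func main() {
	disk1 := []byte{0xDE, 0xAD}
	disk2 := []byte{0xBE, 0xEF}
	parity := xorBlocks(disk1, disk2) // stored on a third disk

	// disk1 fails: reconstruct its contents from disk2 and the parity.
	rebuilt := xorBlocks(disk2, parity)
	fmt.Printf("rebuilt disk1: % X\n", rebuilt) // prints: DE AD
}
```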
Second, rack distribution. We can expect that if we have all of our servers in one rack, something can go wrong: the rack can burn, there can be an electricity outage, there can be a fan outage, and it will stop working. So we want to have our servers in different racks, and some applications, like Kafka, have a rack-awareness feature where you can specify which brokers run in which racks. Kafka will then place the data replicas so as to minimize the chance of several replicas being hosted on the same rack; if a whole rack goes down, Kafka will still be able to continue operating. This also shows that the different levels are interconnected and interdependent, and you have to understand these dependencies so you not only use the underlying infrastructure but also leverage its features.
And next is the internet channel backup. A data center always has more than one internet provider, because one internet provider is a single point of failure. A single physical cable is a single point of failure too; it can meet an angry excavator or something like that, and that happens again and again, surprisingly.
And here we have our whole cake, but there is still room for the icing on top, which is global availability. Basically, all of our applications run in a pretty stable environment, but something may happen to the data center itself: it can burn, it can crash, a meteorite can crash into it, or a tsunami, or something else. And if availability is important, and in many cases it is, you can have several data centers, with a standby disaster-recovery replica of your whole architectural landscape. If something goes wrong with your main data center, you immediately switch to the second data center and continue working.
It requires constantly replicating all the data and doing a lot of training, but it works in all cases. A more interesting case is multi-zone regions at cloud providers: if a region has three or more zones, it is disaster tolerant. If one zone fails, the region and its services will continue to work. For example, a Kubernetes cluster with nodes in three zones will lose nothing but some capacity: the applications will continue to work, the databases will continue to work, the Kafka brokers will continue to work, and so on.
And one more interesting thing is the global server load balancer, or GSLB. Sometimes you still rely on a single region, or a single provider, or a single city. But something may happen to a city or even to a country, and for some businesses it's crucial to be available, like, all of the time. In that case, you may want to have your application running in different regions, maybe in different countries, maybe on different continents. How do you distribute traffic between those regions? That's very difficult, but there is a solution: GSLB. It works on top of BGP anycast, giving you the ability to reach the endpoint, the host, nearest to you. For example, if you are in Japan and there are several regional installations, one in China and one in the US, the nearest will be in China; but if you are in Mexico, the nearest will be in the US, and it works automatically. It also gives you the ability to do multi-region load balancing: if your application is very load-heavy, you can spread the load between different regions, different locations, which in turn improves your application's performance and availability.
One example of such an application is YouTube. So, we have our whole cake, and we have the icing on top. Let me sum it up. First, availability is made up of many layers, many levels. Second, each level must cooperate with the other levels, as we've seen. And third, we need to consider all the levels when developing highly available applications, so that they can leverage the availability features of the underlying levels.
And that's all. Thanks for your attention, and feel free to find me on LinkedIn; or, if it happens that we are in the same city, I'd be happy to drink a coffee with you. Thank you.