Transcript
This transcript was autogenerated. To make changes, submit a PR.
I want to share with you some stories, some findings, and some of my experience about how to make an application, a system, more available: how to improve its characteristics on several layers, why it is important to know about each of these layers, and how to incorporate them into your work as a platform engineer.
Let me introduce my talk: Many Layers of Availability, or how to achieve so many nines in practice. And as we are platform engineers, the only thing we care about is nines. Let me start.
Many people, while talking about achieving nines, talk about microservices: let's migrate to microservices, let's start with microservices. And often that doesn't solve the issue, because sometimes it's a little more complicated than that. It is like the Saga pattern: it works well in theory but often works poorly in practice, because the real world is just a more complex place than the books describe.
So I want to share what I found during my work on some very interesting projects. And who am I? My name is Ilya Kozachev. I'm a cloud native application architect at Capgemini, and previously I worked at MTS Cloud, Deutsche Telekom, and some other interesting companies. I'm a Google Developer Expert on cloud, and I have worked for some time in software engineering, software architecture, platform engineering, and public cloud building.
One of my recent projects was building a public cloud provider from scratch; Kubernetes as a Service, for example, was a core part of this project. And I learned a lot about availability, reliability, and the many layers that we often don't see because we don't have direct access to them. It's very important to know those layers, to understand them, and to understand how to design your applications so that they include, or rely on, the features of those layers. So let me share it with you.
But what about nines? Let me start with the application, then, and tell you what about nines.
When we're talking about applications, we always talk about patterns, approaches, architecture, standards, et cetera. So let's assume we have a poorly designed monolithic spaghetti-code application and we want to make it highly available. We will start by re-architecting this application and moving parts of the spaghetti code into separate spaghetti balls, so that each one can fail on its own without bringing down the whole application. If something fails inside the spaghetti-code monolith, on the other hand, it is a single point of failure and the whole application fails. So first we want to avoid single points of failure and design the application in such a way that if something breaks, the amount of broken stuff is very limited; we want to keep that number very low.
Next is redundancy. This is our main tool for achieving higher availability and higher reliability. It's that simple: humanity still hasn't invented anything better. So we have to run more replicas together, more pieces of the same application, so that if one of these replicas fails, we don't have a big issue: we can redistribute the load to the rest of the replicas and it will be fine. Second, it helps to process requests. One application can be overloaded, but when we have multiple similar applications, we can distribute the load among them and be sure there is enough capacity to process it. And if there is not enough capacity, we can scale: we can add more replicas. Since our system can already work with several replicas, we can add more replicas if needed, or remove replicas if we don't need them. If you have Black Friday, you can increase the number of replicas; but if you have, I don't know, some summer vacation time, you can remove replicas, because people maybe don't buy from your store then and you don't have to pay your AWS bills.
But to make this possible, you also have to deliver requests to these replicas somehow, and that is done by means of a load balancer. This tool distributes requests between the different replicas. It can simply circle through the list of addresses, or it can implement smarter balancing algorithms, for example distributing more or fewer requests based on the CPU load of each replica, or on something else. It also allows you to implement dynamic scaling: you can add or remove replicas dynamically, again based on CPU load for example, and the load balancer will automatically add the addresses of new replicas to the load-balancing list, or remove the addresses of removed replicas, to avoid sending requests to nowhere.
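As a minimal sketch of the simplest of these algorithms, here is a round-robin load balancer in Go; the backend addresses are made up for illustration, and a real balancer would add health checks and smarter strategies on top:

```go
// A toy round-robin load balancer: circle through a list of replica
// addresses and proxy each incoming request to the next one.
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	// Hypothetical replica addresses; not from the talk.
	backends := []*url.URL{
		mustParse("http://10.0.0.1:8080"),
		mustParse("http://10.0.0.2:8080"),
		mustParse("http://10.0.0.3:8080"),
	}

	var counter uint64
	handler := func(w http.ResponseWriter, r *http.Request) {
		// Pick the next backend in the circle, atomically.
		i := atomic.AddUint64(&counter, 1) % uint64(len(backends))
		httputil.NewSingleHostReverseProxy(backends[i]).ServeHTTP(w, r)
	}

	http.ListenAndServe(":8000", http.HandlerFunc(handler))
}
```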
Next, we want to reduce the dependencies between applications, between different modules, so that they interact only via APIs, not via a shared database or anything else. Then, if we change something, we can do it transparently and be sure that our change will not break other systems by surprise, because the dependency between the two systems is clearly defined as an API.
Second, we want to eliminate implicit dependencies: strange logic duplicated across several applications, or shared libraries with internal magic variables, things like that. Sometimes it's useful, but it has to be limited, and you have to understand the trade-off. If it's a configuration thing and you just want to put your standard values somewhere, it's okay; but if you want to share code between applications, share logic between them, be very careful, because when this shared code evolves, it may bring unexpected problems to your applications.
Next, try to consolidate features. If features belong to one topic, one domain, put them together: into one microservice, into one module. It helps to avoid problems: miscommunication between modules, and evolutionary problems as your modules and applications grow and change. If you spread the features of one domain across several applications, those applications will grow closer together and bring you a thing called a distributed monolith, which will decrease availability by bringing more errors and problems. Try to avoid that; try to use domain-driven design, for example.
And last but not least, synchronous versus asynchronous communication. Synchronous calls like HTTP requests are okay, nothing bad about them. But if you have a chain of requests, one error can break the whole chain and you get a cascading failure. That's not very good, and that's why people use asynchronous communication via message brokers. It can be message-based, which is pretty similar to an HTTP request: you send a message, you send a request, and you expect to receive an answer, but you don't stop your work waiting for it. You continue working, do something else, and when the response is there, you can start processing it. The application can even restart in between, and there is no problem at all. The second style, event-based, is even better. The difference is that it's one-directional, fire and forget: you only send the message, and you as the sender don't care what happens next. Maybe it will be processed, maybe it will be processed by multiple applications, or maybe no one cares; it doesn't matter to you as the sender.
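Here is a minimal in-process sketch of the fire-and-forget style in Go, with a buffered channel standing in for a broker topic (in production this would be Kafka, RabbitMQ, or similar); the point is only that the sender never blocks on, or even knows about, its consumers:

```go
// Fire-and-forget: the producer publishes an event and immediately
// moves on; consumers process it whenever they get to it.
package main

import (
	"fmt"
	"time"
)

type Event struct{ Name string }

func main() {
	topic := make(chan Event, 100) // buffer absorbs bursts, like a broker

	// Consumer: runs independently of the producer.
	go func() {
		for e := range topic {
			fmt.Println("processed:", e.Name)
		}
	}()

	// Producer: send and continue working, no reply expected.
	topic <- Event{Name: "order-created"}
	fmt.Println("sender keeps working without waiting for an answer")

	time.Sleep(100 * time.Millisecond) // demo only: let the consumer drain
}
```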
So this is the first layer of our cake, the application.
Let's see what we have next.
Next, the platform. For every application, it is very important where it is running, and it is always running within some kind of infrastructure: a platform, virtual machines, hardware, et cetera. Without a stable and reliable platform, all your efforts toward higher availability on the application layer will count for nothing, because the unreliable platform will fail you. Let's take a look.
First, cluster architecture. A lot of platforms, like Kubernetes, Kafka, and Hadoop, are built in a cluster manner because it helps to prevent outages. If one node in the cluster fails, the cluster can very easily redistribute the load between the other nodes while it restarts or recreates the failed node, and there will be no, or almost no, problem for the availability of your applications. If your application is running in several replicas on top of a Kubernetes cluster, for example, there will be no issue for your application if one node with one replica fails (if you use autoscaling, of course).
Second is load distribution. A cluster also schedules applications in a way that evenly distributes the load between the nodes, so there will be no outages when one node overflows with work, and it also helps to avoid some not very pleasant surprises in production. Next, network replication. Again a Kubernetes example: each worker node is a network gateway into the internal Kubernetes cluster network. So when you send a request, node 1 may take this request and then redirect it to node 2, where your pod is running. And when node 1 fails, there is no issue, because node 2 is also a gateway into the network, and node 3 is a gateway, and you can load-balance the request to any node. That's what happens in such an environment.
Next, orchestration. We have started talking about orchestrators, so let's go a little further. Orchestrators are very nice because they help to control the life cycle of the applications. Yes, you can basically put the application onto an empty virtual machine with systemd or something like that, and it will even restart the application when it crashes, but an orchestrator helps with much more: for example, controlling the readiness of the application, auto-scaling the application, restarting it when it fails, of course, and more.
Next, auto-scaling. As we have said a couple of times, it's a very important thing: based on the metrics of your application, for example CPU load, the number of incoming requests, or the number of consumed messages, the orchestrator can scale your application by adding more replicas, or by removing some replicas if there is no need for such capacity. This is very useful if you have spikes in your load and you don't want to lose clients because you can't process their requests; see the sketch of the scaling rule below. One more very nice thing is anti-affinity, where the cluster takes care that your replicas run, if possible, on different nodes. If one node fails, it is then less likely that several replicas of your application were running on that node, and this has a big impact on your application's availability. This is also hard to achieve without orchestration.
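As a small sketch of how metric-based scaling decides on a replica count, here is the rule that Kubernetes' Horizontal Pod Autoscaler documents, expressed in Go; the numbers are made up for illustration:

```go
// desired = ceil(current * observed / target), the documented HPA rule.
package main

import (
	"fmt"
	"math"
)

func desiredReplicas(current int, observed, target float64) int {
	return int(math.Ceil(float64(current) * observed / target))
}

func main() {
	// 4 replicas at 90% average CPU against a 60% target -> scale to 6.
	fmt.Println(desiredReplicas(4, 0.90, 0.60))
}
```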
And the last very nice thing here is self-healing. If something goes wrong with a machine, a virtual machine, and it goes down, it can be restarted on the platform layer or on the infrastructure layer, because the infrastructure monitors its health by means of health checks. A load balancer, for example, does health checks to understand whether your node is running, whether it is healthy, whether requests can be sent to it, or whether it should avoid sending requests to a dead node so they will not be lost. And there is also health monitoring inside, to see how the node is doing, whether it is okay or something is wrong, and whether it's time to restart the node or stop sending it requests.
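A minimal sketch of what such a health check hits on the application side, in Go; the split into liveness and readiness follows the common Kubernetes convention, and the readiness logic here is a made-up placeholder:

```go
// /healthz says "the process is alive"; /readyz says "safe to send
// traffic right now". Balancers and orchestrators poll endpoints like
// these to decide whether to route to, restart, or replace a replica.
package main

import (
	"net/http"
	"sync/atomic"
)

var ready atomic.Bool // would be flipped once DB, broker, etc. are up

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	ready.Store(true) // placeholder: pretend dependencies came up
	http.ListenAndServe(":8080", nil)
}
```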
And it's very useful, because we always work in a dynamic environment, and even cloud infrastructure is not 100 percent invulnerable to failures. It fails from time to time, because the underlying hardware fails and the software on top of that hardware fails, so everything is in motion: something breaks, something restarts, something is being recreated, and such tools help a lot with this.
That's the second layer.
The third layer is the data. We often forget about the data when we're talking about availability; we only talk about the application. That's not 100 percent correct. In fact, data is key for the application; data is key for almost everything in our work as software engineers and platform engineers. So let's talk a little about data.
Again, redundancy is the most beloved tool in the data area too. We make database replicas: we run several instances of the database and, at runtime, in real time, share the changes between these replicas, so that if one replica fails, we are still able to redistribute the requests to the next replica and everything will be okay.
Second, message broker replicas work similarly, though a little differently. The point is that we have multiple brokers, for example in a Kafka cluster, and if one broker fails, the cluster will continue to work and your topics will continue to work; again, this is achieved by redundancy.
Backups: besides real-time replicas, we also have archived replicas, so that if the whole data center burns down, we can still restore the data from a backup.
And the next thing is caching. Sometimes the services you rely on are not available for some reason: maybe they are down, maybe the network connection to them is unstable. So what do we do? We save some data for later, to avoid this kind of cascading outage. The first option is application caching: you simply save the data in the application's memory and use it later, either for faster access or when the underlying service is not available anymore.
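Here is a small sketch of that idea in Go: a cache that serves a stale copy when the downstream call fails. All names and the TTL policy are illustrative, not from the talk:

```go
// An in-memory cache with a "serve stale on error" fallback.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string
	fetched time.Time
}

type Cache struct {
	mu   sync.Mutex
	data map[string]entry
	ttl  time.Duration
}

func NewCache(ttl time.Duration) *Cache {
	return &Cache{data: make(map[string]entry), ttl: ttl}
}

// Get returns a fresh cached value if it has one, otherwise calls
// fetch; if fetch fails, it falls back to the stale value instead of
// propagating the downstream outage.
func (c *Cache) Get(key string, fetch func() (string, error)) (string, error) {
	c.mu.Lock()
	e, ok := c.data[key]
	c.mu.Unlock()

	if ok && time.Since(e.fetched) < c.ttl {
		return e.value, nil // fresh hit
	}
	v, err := fetch()
	if err != nil {
		if ok {
			return e.value, nil // downstream is down: serve the stale copy
		}
		return "", err
	}
	c.mu.Lock()
	c.data[key] = entry{value: v, fetched: time.Now()}
	c.mu.Unlock()
	return v, nil
}

func main() {
	c := NewCache(0) // ttl 0 forces a refetch every call, to demo the fallback
	c.Get("user:42", func() (string, error) { return "Alice", nil })
	v, _ := c.Get("user:42", func() (string, error) { return "", errors.New("service down") })
	fmt.Println(v) // prints "Alice": stale data instead of a cascading failure
}
```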
Second is database caching. For many database engines you can configure caching on the database layer, again either for faster access or to collect data from several instances and avoid a single point of failure. And then there is standalone caching, like having a Redis instance, for example. It also helps when one service fails and you can no longer get the data from it directly.
There are many more options, but I want to cover the basics.
And of course we have to talk about content delivery networks, CDNs. With the replicas a CDN holds, your backend is not even needed: the data can be served from its network of caches around the globe. Normally it's more about front ends and static media content, but it's still a very useful technique, and it's worth keeping in mind that it can save your system from a big outage. For example, when your startup launches a campaign, you might expect a thousand people hitting your index page after some successful post on social media, and your application may not cope with that. A CDN can actually save you by redirecting those requests to saved copies of, basically, your index page.
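For that to work, the origin has to tell the CDN what it may cache. A minimal sketch in Go, with illustrative header values; stale-if-error (RFC 5861) asks caches to keep serving an old copy if the origin starts failing:

```go
// Mark the index page as cacheable by shared caches (CDNs), so edge
// nodes can keep serving it even when the origin is overloaded or down.
package main

import "net/http"

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Cache publicly for 5 minutes; on origin errors, a stale
		// copy may be served for up to a day.
		w.Header().Set("Cache-Control", "public, max-age=300, stale-if-error=86400")
		w.Write([]byte("<h1>Index page</h1>"))
	})
	http.ListenAndServe(":8080", nil)
}
```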
And that was the third layer. The next one is infrastructure: the virtual machines, the networks, and the disks, the beloved trio, the basis of everything we have. Why is it important? Because here, too, you have to know the features, what they can bring you, and what they can mean for you.
For example, for compute, many platforms and many cloud providers, public or private, like VMware, offer a hot migration option. That means that if the host where your virtual machine is running fails, your virtual machine, with its memory and its state, will be started on another host in the same or almost the same state, at almost the same position in memory, and will continue to work almost without change. It's a very nice feature, but it may bring some surprises if you don't know about it.
And instance groups. This is a feature for managing multiple virtual machines with the same configuration: auto-scaling them, updating them, managing their network, etc. It also helps us to improve availability, for example by auto-scaling the fleet of machines where you run your application or platform.
Second is networking. Almost always, a software-defined network running over commodity machines is used instead of hardware switches (the hardware is used too, but not for these purposes). It also has a lot of built-in availability features, and such SDN tooling includes many very useful techniques that you may have no idea about, but that help the overlay platforms to work smoothly and the virtual machines to communicate without interruptions, even if something goes wrong.
Load balancers: we almost never use bare-metal load balancers anymore; we use software-defined load balancers, and like any other application, a load balancer can itself be a single point of failure, which we want to avoid. So we always have multiple load balancers balancing and distributing the load. Know about that, think about that. It may bring some complexity, but it normally brings a lot of benefits and reliability to our system.
And next, the disks, the storage. We always replicate stuff, we make things redundant, we like that, and yes, we have to replicate storage too. For example, if you save a binary file, the file is split into several smaller chunks, and each chunk is copied to several machines, so that if one machine fails, you don't lose a chunk and find yourself unable to assemble the file again. This is, for example, how Ceph works: it puts different chunks on different machines, each chunk is copied two or three times, and if a machine fails, you can still recreate the initial data from the rest of the machines, with no issue at all.
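A toy sketch of that placement idea in Go; the chunk size, replica count, and node names are invented, and real systems (Ceph's CRUSH algorithm, for instance) place data far more cleverly:

```go
// Split a blob into fixed-size chunks and assign each chunk to several
// distinct nodes, so that losing any single machine loses no chunk.
package main

import "fmt"

func main() {
	data := []byte("some binary file contents to store durably")
	const chunkSize = 10
	const replicas = 3
	nodes := []string{"node-a", "node-b", "node-c", "node-d", "node-e"}

	for i := 0; i*chunkSize < len(data); i++ {
		end := (i + 1) * chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[i*chunkSize : end]

		// Place each copy on a different node (simple modular placement).
		for r := 0; r < replicas; r++ {
			node := nodes[(i+r)%len(nodes)]
			fmt.Printf("chunk %d (%q) -> %s\n", i, chunk, node)
		}
	}
}
```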
And a network file system is used to attach a disk to a machine without a direct physical connection between them. This is, for example, how Kubernetes mounts disks to nodes: if a node fails, you won't lose the data on the disk. Kubernetes will just remount it to another node, run the pod there, and you can continue using the data.
That was the fourth level of our cake, our availability cake. Let's go to a harder level: hardware. We don't have so much fancy stuff to discuss on the hardware level; there is definitely a lot there, but I'm not going to go deep. I just want to mention a couple of things.
First, RAID: again, replication. Unfortunately, hard drives are not reliable, and no other drives are reliable either, so we cannot trust them. We have to put several disks together and, again, split the data into tiny chunks and redistribute copies of these chunks between the disks. If one disk fails, we can still reconstruct the initial data from the rest of the disks.
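A toy demonstration of the reconstruction trick behind parity-based RAID levels such as RAID 5, in Go: the parity block is the XOR of the data blocks, so any single lost disk can be rebuilt by XOR-ing the survivors:

```go
// parity = d1 XOR d2; losing d1, we rebuild it as d2 XOR parity.
package main

import "fmt"

func xorBlocks(blocks ...[]byte) []byte {
	out := make([]byte, len(blocks[0]))
	for _, b := range blocks {
		for i := range out {
			out[i] ^= b[i]
		}
	}
	return out
}

func main() {
	disk1 := []byte{0xDE, 0xAD}
	disk2 := []byte{0xBE, 0xEF}
	parity := xorBlocks(disk1, disk2) // stored on a third disk

	// disk1 fails: reconstruct its contents from disk2 and the parity.
	rebuilt := xorBlocks(disk2, parity)
	fmt.Printf("rebuilt disk1: % X\n", rebuilt) // prints: DE AD
}
```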
Second, rack distribution. We can expect that if we have all of our servers in one rack, something can go wrong: the rack can burn, there can be an electricity outage, there can be a fan outage, and it will stop working. So we want to have our servers in different racks, and some applications, like Kafka, have a rack-awareness feature where you can specify which brokers run in which racks. Kafka will then place the data replicas so as to minimize the chance of several replicas being hosted on the same rack; if a whole rack goes down, Kafka will still be able to continue operating. This also shows that the different levels are interconnected and interdependent, and you have to understand these dependencies so you not only use the underlying infrastructure but also leverage its features.
And next is the internet channel backup. A data center always has more than one internet provider, because one internet provider is a single point of failure. A single physical cable is a single point of failure too; it can meet an angry excavator or something like that, and that happens again and again, surprisingly.
And here we have our whole cake, but there is still room for the icing on top, which is global availability. Basically, all of our applications run in a pretty stable environment, but something may happen to the data center itself: it can burn, it can crash, a meteorite can crash into it, or a tsunami, or something else. And if availability is important, and in many cases it is, you can have several data centers, with a standby disaster-recovery replica of your whole architectural landscape. If something goes wrong with your main data center, you immediately switch to the second data center and continue working.
It requires constantly replicating all the data and doing a lot of training, but it works in all cases. A more interesting case is multi-zone regions at cloud providers: if a region has three or more zones, it is disaster tolerant. If one zone fails, the region and its services will continue to work. For example, a Kubernetes cluster with nodes in three zones will lose nothing but some capacity: the applications will continue to work, the databases will continue to work, the Kafka brokers will continue to work, and so on.
And one more interesting thing is the global server load balancer, or GSLB. Sometimes you still rely on a single region, or a single provider, or a single city. But something may happen to a city or even to a country, and for some businesses it's crucial to be available, like, all of the time. In that case, you may want to have your application running in different regions, maybe in different countries, maybe on different continents. How do you distribute traffic between those regions? That's very difficult, but there is a solution: GSLB. It works on top of BGP anycast, giving you the ability to reach the endpoint, the host, nearest to you. For example, if you are in Japan and there are several regional installations, one in China and one in the US, the nearest will be in China; but if you are in Mexico, the nearest will be in the US, and it works automatically. It also gives you the ability to do multi-region load balancing: if your application is very load-heavy, you can spread the load between different regions, different locations, which in turn improves your application's performance and availability.
One example of such an application is YouTube. So, we have our whole cake, and we have the icing on top. Let me sum it up. First, availability is made up of many layers, many levels. Second, each level must cooperate with the other levels, as we've seen. And third, we need to consider all the levels when developing highly available applications, so that they can leverage the availability features of the underlying levels.
And that's all. Thanks for your attention, and feel free to find me on LinkedIn; or, if it happens that we are in the same city, I'd be happy to drink a coffee with you. Thank you.