Transcript
So let's talk about what happens when clusters collide.
If you search any search engine for DR for Kubernetes or DR for EKS, you will feel like you're in the song: life could be a dream.
Everything is simple.
You just deploy two Kubernetes clusters in two different regions.
You set up some load balancer or a DNS in front of them.
Maybe you copy some data from one cluster to another.
Everything is super simple. But from working in the last years with many companies, from startups to large enterprises, I have experienced that nothing is simple about doing DR for Kubernetes, and it's actually pretty tricky.
And hi everyone.
I'm Lev, CTO at TerraSky, and TerraSky is a global solution integrator.
We help our customers make sure their Kubernetes is secure, monitored, resilient, and maintainable.
What we see is that Kubernetes provides us a lot of power, right?
It's very self-service focused.
And it's super convenient for developers, but it comes at a price in how easy it is to set up DR, and there's a reason for it.
If you take a look at how things were in the past, someone who was responsible for the infrastructure would probably create the compute, the servers, the disks, the load balancers, configure DNS, and issue the certificates.
So everything was provided for the developer, and it was maybe more manual and slower, but the DR process was more straightforward.
Whoever set up this environment knew how to actually do the entire failover and what was required to do it.
Today, developers can actually ask for many of those things by themselves: they can get load balancers through Ingress, they can get disks through PVs and PVCs, they can issue certificates through cert-manager, and we'll see how things become more complex when we're talking about DR.
So let's set the stage, right?
We will have two AWS regions in this talk, with one EKS cluster in each of the regions, and we want to have disaster recovery.
And I'm not going into the philosophy of RTO and RPO; you decide what those mean for you and what you are looking for. I will just show you a couple of examples.
So we will start with a simple, naive one that probably rarely happens in reality.
Let's say we have a simple stateless deployment, and this deployment is
receiving no traffic from outside.
It just runs, maybe processes something.
Easy stuff.
So I take this YAML of a deployment of a sample app, two replicas.
This is our image.
I will just deploy it to two different regions: us-east-1 and us-east-2.
Two deployments, easy, right?
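Roughly, that deployment manifest might look like the sketch below; the names and the image are placeholders, not the exact ones from the talk.

```yaml
# Hypothetical stateless deployment, two replicas; apply the same file to the
# cluster in each region (for example us-east-1 and us-east-2).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: sample-app
          image: nginx:1.25   # placeholder image
          ports:
            - containerPort: 80
```

You would simply apply the same file against each cluster's context.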
We deploy to the first cluster, we deploy to the second cluster.
In case of an emergency, we can switch the traffic or deploy on demand, or we can have them active in both clusters.
Nothing to it.
Let's do a little more.
Let's say we have a simple stateless application or
software with incoming traffic.
No encryption at this point, which probably never happens in a real production environment.
And at this point, we're doing manual DNS management, right?
So what will it look like?
So same deployment.
We will have our container with port 80.
We will have our service that is forwarding the traffic to this port 80.
And we will have the ingress that will actually listen on this FQDN, main.levlabs.link, and it'll forward everything that is going to / into our service.
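A rough sketch of that service and ingress pair, assuming an nginx ingress class; the object names are placeholders.

```yaml
# Hypothetical Service/Ingress pair for the plain-HTTP example.
apiVersion: v1
kind: Service
metadata:
  name: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app
spec:
  ingressClassName: nginx          # assumption: an nginx ingress controller
  rules:
    - host: main.levlabs.link      # the FQDN clients will use
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app
                port:
                  number: 80
```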
Again, easy: we just deploy it to the first cluster, and we can deploy it at the same time to the second cluster.
So in case of an emergency, we can create some kind of a CNAME that will point to one of the clusters, or we can have it active-active with round robin or some kind of weighted DNS.
We just have to adjust DNS.
Not complicated.
Now, if we're talking about persistent volumes, there are
many ways to achieve that.
And this is actually what you will find when you're searching in the search engines.
So here's an example of how you can do it with Velero.
You will just have a cluster, you will create a backup, and put it in your object storage.
Then you can have another cluster that can take it from the object storage and restore it.
Here's a link if you want to check how to do that.
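As a rough illustration of that flow, here is what the Velero objects might look like: a backup in the source cluster and a restore in the target cluster, both pointing at the same backup storage location (names and namespaces are placeholders).

```yaml
# Hypothetical Velero Backup, created in the source cluster.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: sample-app-backup
  namespace: velero
spec:
  includedNamespaces:
    - sample-app                 # back up only the application namespace
  snapshotVolumes: true          # include persistent volume snapshots
  storageLocation: default       # object storage both clusters can reach
  ttl: 720h0m0s
---
# Hypothetical Velero Restore, created in the target cluster, which pulls the
# backup from the shared object storage and recreates the resources.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: sample-app-restore
  namespace: velero
spec:
  backupName: sample-app-backup
```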
Another option, you can do it with commercial products like Portworx.
So Portworx will be running continuously.
This is asynchronous replication disaster recovery.
Portworx will take the data, periodically put it in object storage, then another
cluster will pull it from the object storage and continuously restore it.
And then you can execute commands to failover to another cluster
and just launch the workload.
Very convenient, very easy in terms of data.
But that's not why we're here.
We want to talk about Kubernetes and self service, right?
And there are two great projects.
One of them is called ExternalDNS and the other one is cert-manager.
So what do they do?
Imagine a world where you can create DNS records automatically and you
can generate certificates on demand.
Now stop imagining and just create this ingress.
And this ingress, because of this annotation and the cert-manager that is installed, will go and issue the certificate for main.levlabs.link because of this TLS section, it will save the certificate in this secret, and it will also generate the A record in Route 53.
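A minimal sketch of such an ingress, assuming cert-manager and ExternalDNS are already installed and a ClusterIssuer named letsencrypt-prod exists (that name and the secret name are assumptions, not taken from the talk).

```yaml
# Hypothetical self-service ingress: cert-manager issues the certificate for the
# TLS host via the cluster-issuer annotation, and ExternalDNS creates the
# Route 53 record from the rule host.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumed ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - main.levlabs.link
      secretName: sample-app-tls       # cert-manager stores the certificate here
  rules:
    - host: main.levlabs.link          # ExternalDNS creates the DNS record for this host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app
                port:
                  number: 80
```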
Very convenient.
Great self-service for any developer who just wants to deploy their software.
This is what it'll look like: you will create an ingress, the ingress will go to Route 53 and update the record, and it'll go to Let's Encrypt and generate the certificate.
This is great, but with great power comes great responsibility, as we know.
So let's talk about the problems that it creates.
The first one is the traffic routing problem.
I can't deploy the same ingress twice because of what I just showed you.
If I deploy the same ingress twice, external DNS will now fight
over who the record belongs to.
So that's a problem.
So I need an ingress for the active cluster and I need an ingress for the DR cluster, and I need a CNAME for a main record that will actually forward the traffic to the relevant cluster.
But there's a problem.
If I create this CNAME for main and I actually access my clusters, where I only have an ingress for active or an ingress for DR, no one will listen to the main host header.
So what I actually have to do is something that looks like this.
I will have two clusters, with an ingress for primary.levlabs.link and an ingress for dr.levlabs.link, that will auto-configure Route 53.
And I will have a separate record called main that I control, that will point to those ingresses, which are actually forwarding to the same service.
This way I can control which cluster I want to access and when.
Let's see how that looks in terms of configuration.
So here we have an example of the DR side, right?
This is an ingress for the DR app that listens on dr.levlabs.link and forwards to the DR application.
And here we have an ingress for the primary, so it will listen on the host primary.levlabs.link.
So those will create the route 53 records automatically.
And then I will have the ingress for the main.
As you can see, what I'm doing here is saying that the ExternalDNS annotation for this one sets the hostname to empty.
This way, main.levlabs.link will not try to create any records in Route 53.
And this is how I can create my CNAME and manually control which is the active cluster.
And then in each cluster it will point to the relevant service, whether it's active or DR, and it's the same name in both of them.
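Sketched as manifests, the primary cluster might carry an ingress pair like the one below; the DR cluster would be the mirror image with dr.levlabs.link, and the names are placeholders.

```yaml
# Hypothetical ingress pair in the primary cluster. The first lets ExternalDNS
# create primary.levlabs.link in Route 53; the second serves main.levlabs.link
# but tells ExternalDNS to create nothing, so the main CNAME stays manual.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app-primary
spec:
  rules:
    - host: primary.levlabs.link       # ExternalDNS creates this record
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app
                port:
                  number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app-main
  annotations:
    external-dns.alpha.kubernetes.io/hostname: ""   # empty, so no record is created
spec:
  rules:
    - host: main.levlabs.link          # answered only once the manual CNAME points here
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app
                port:
                  number: 80
```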
And this is the CNAME in my Route 53, right?
So you will see that it's a CNAME, and it's actually forwarding now to primary.levlabs.link.
So this is the record, and this is what it points to.
And I have to have a very short TTL so I can adjust it and fail traffic over quickly.
Let's talk about the second problem, the certificates.
Now we have an FQDN; this is how our customers will access it, right?
So we have an FQDN called main.levlabs.link, but it doesn't match dr.levlabs.link or primary.levlabs.link, which are auto-generated from our ingresses.
So this is what it looks like.
And those are the ingresses that are creating this certificate.
And we have to figure out some way to solve it.
So here's how I'm doing it.
I will have the ingress of our sample app, let's say in the primary, but inside of the TLS section I will have multiple hosts.
This is how I will create the SAN, the subject alternative name, right?
So I will have multiple FQDNs here.
This will cause cert-manager to go and issue the certificate: one certificate with both FQDNs, and it will place it in this secret name, ingress-cert.
Then I will actually create my main ingress for main.levlabs.link, but it's not actually going to Let's Encrypt at all.
Instead, I have the same TLS hosts, and I'm leveraging the same secret that was issued here.
So this way, I will have an ingress that is listening on main, that actually has a certificate matching its FQDN, and it will be able to receive the traffic.
But it's not trying to generate the Route 53 record or the certificate for itself.
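Put together, the TLS parts might look like this sketch: one ingress asks cert-manager for a certificate with both names as SANs, and the main ingress just reuses the resulting secret (the issuer name and secret name are assumptions).

```yaml
# Hypothetical primary ingress: cert-manager issues one certificate whose SANs
# cover both primary.levlabs.link and main.levlabs.link, stored in ingress-cert.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app-primary
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumed issuer name
spec:
  tls:
    - hosts:
        - primary.levlabs.link
        - main.levlabs.link
      secretName: ingress-cert
  rules:
    - host: primary.levlabs.link
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app
                port:
                  number: 80
---
# Hypothetical main ingress: no cert-manager annotation and no DNS record; it
# only terminates TLS with the secret issued above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sample-app-main
  annotations:
    external-dns.alpha.kubernetes.io/hostname: ""      # no Route 53 record
spec:
  tls:
    - hosts:
        - primary.levlabs.link
        - main.levlabs.link
      secretName: ingress-cert       # reuse the certificate issued above
  rules:
    - host: main.levlabs.link
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sample-app
                port:
                  number: 80
```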
And this is what it will look like.
Here's my Safari that was able to successfully access the site main.levlabs.link.
And you can see that I actually have a DNS name for main and a DNS name for primary.
And if I fail over, I will actually have main and dr.
So here is what this solution actually looks like, if I take everything and just show it in one picture.
I will have two AWS regions, with one EKS cluster in each region.
I will have self-service for certificates, self-service for DNS, and a functional DR where I can control where the traffic is going.
And this is how, right?
So I will have my ingresses that will access Amazon Route 53 and Let's Encrypt.
And I will have my ingresses that are forwarding to a service that is forwarding into my deployment, and potentially I can have a persistent volume with data replication that I can copy continuously with, let's say, Portworx, or periodically with Velero or similar commercial products.
It's important to remember that this is a true Pandora's box, right?
For example, not everything has to be replicated.
There are a couple of gotchas.
One example is that you cannot replicate or copy the same ingress into both places, especially if you're doing things like self-service with cert-manager or ExternalDNS.
Another example is that you probably want to change some of the labels that you're restoring in the other cluster.
Maybe you need to change them, adjust them, something like that.
So you can't just copy-paste.
A lot of people or a lot of companies forget about the dependencies.
They're really focused on just making sure they can switch stuff
from one cluster to another.
But there are things potentially close to the cluster that can fail.
For example, the manifest repository, where you're pulling all your YAMLs, all your Helm charts, all your manifests.
The second thing, obviously, is the image repository: it has to be accessible and available in the second site, or from the second site, when you're doing the failover.
And obviously you have to address the monitoring, the CI/CD, and authentication.
Now, in addition, as you can see, many of the companies are really focused on the data.
They copy the data, and from their perspective, this is all they had to do.
But the reality is that the connectivity aspects are usually ignored, and they're super important to take into consideration.
And you have to drill your DR, do your chaos engineering and just test this.
Thanks a lot.
I hope it was helpful.
If you have any questions, please talk to me.
I will be available in the chat.
Or email me.
Thanks a lot.