Transcript
This transcript was autogenerated. To make changes, submit a PR.
My name is Yuri Besonov, and I'm a Partner Solutions Architect at AWS.
I work with different AWS customers and partners and help
them build solutions on AWS.
One of the areas where I'm most involved is containers, especially Kubernetes.
Kubernetes on AWS is delivered by Amazon Elastic Kubernetes
Service, in short, Amazon EKS.
And today, I will share with you how we can build and scale a production-grade,
multi-cluster Kubernetes environment using GitOps on EKS.
But before we go multi-cluster, let's start from the very beginning,
from one cluster, and define why we need cluster management at all.
We generally see customers with a Git source control system; some kind
of infrastructure-as-code tooling like Terraform, CloudFormation,
CDK, or Pulumi; some infrastructure-as-code state, like an S3 bucket
for Terraform; and then the resources, including an EKS cluster.
In the pipeline, customers usually run some checks on the source code,
build the application, run unit tests, and enforce policies, and after that they
need to deploy the application to the cluster.
And as they grow, they may add additional environments, for
development, staging, testing, or for different projects.
It is still manageable.
If growth continues, this can become unmanageable.
Too many versions, too many clusters, too many pipelines: tracking which
cluster is up to date, which cluster has which version, which needs to be
updated this quarter or the next.
And this is even without taking into account what is inside the clusters:
what the applications are, and what their update policies are.
And it's not uncommon to see this many pipelines in an organization.
Why do we have so many clusters?
Because we have different lines of business, different
applications, and different workloads.
We could have back-end systems, front-end web applications,
or even stateful workloads: workloads hosting databases, or streaming workloads.
There are plenty of legacy applications running in the clusters, and
also modern applications, like artificial intelligence and
machine learning workloads for training and inference.
And besides deploying the clusters, we also need to think about how teams
will access those clusters and share resources; how we will configure each
cluster with add-ons, maybe with different configurations for development
and production, for example, a different number of replicas in the
development and production environments; and also how teams will deploy
different workloads to different clusters.
So there are plenty of challenges when you need to work with containers,
with EKS, with Kubernetes at scale.
And of course, our pipeline could help us manage the add-ons,
deploy the infrastructure, and enforce our compliance and security.
It is possible with a pipeline, and in this case the model is a push model:
we trigger the deployment manually each time we decide to
deploy a new version of an application.
But what happens in case of some issue in the chain?
And what is the source of truth?
If problems arise, do we consider Git the source of truth?
The Terraform state file?
The pipeline itself?
Or maybe the EKS etcd database?
With this setup, the environment may not match your intention.
It is not a source of truth anymore; it's more, I would say, a source of hope.
And with this architecture, you need a lot of auditing and compliance mechanisms.
And what if we need to integrate new services?
Say, for example, we would like to use generative
AI services like Amazon Bedrock.
Then we need a product manager for the pipelines to integrate this,
and if we have many business units, they can be overwhelmed, and maybe
new services will never be deployed.
If we rely on GitOps tooling for fleet management, we can have a process that
constantly tries to reconcile the source of truth in the Git repository
with the actual state of the cluster.
So we can, and we should, solve the problems we had with the push model.
In this case, Git becomes a source of truth instead of a source of hope,
and it enables reproducible, automated deployments, cluster
management, and monitoring.
So we can use GitOps and various tooling to solve our deployment problems,
and we apply the principles of GitOps to achieve it.
The first principle is declarative configuration: we define what we want
to achieve, not how to achieve it, which reduces complexity.
The second principle is versioning: all the configuration is in Git,
versioned and immutable, which enhances auditability.
We know who did what and when.
We also use a pull model, where an agent in the cluster pulls the desired state
from Git instead of the pipeline pushing that state to the cluster.
And that agent continuously reconciles the state to the cluster.
It knows how to translate the declarative configuration in the Git repository
into the real state of the cluster,
and it does this constantly, so after each change we always have a consistent environment.
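To illustrate what this declarative desired state looks like, here is a minimal sketch of an Argo CD Application manifest; the repository URL, path, and namespaces are placeholders:

```yaml
# Minimal Argo CD Application: we declare what we want (manifests from Git,
# target cluster and namespace), and the controller pulls and reconciles it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sample-workload
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/workloads.git  # placeholder repo
    targetRevision: main
    path: apps/sample-workload
  destination:
    server: https://kubernetes.default.svc  # the local cluster
    namespace: sample-workload
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # continuously reconcile drift back to the Git state
```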
Now let's go deeper and talk about how we deploy the clusters.
Many of our customers are using Terraform to deploy EKS clusters.
We have a nice accelerator called EKS Blueprints that can help you achieve
your deployment goals quickly by allowing you to rely on maintained,
best-practice patterns of EKS deployment.
It can bring you patterns like a fully private cluster, or clusters
which use IPv6, with managed node groups or Karpenter, for
analytics or machine learning.
For various purposes, we have various patterns already developed for you,
which you can use in your own environment.
Let's see the process to deploy an add-on, like a load balancer controller.
The cluster itself is just a compute environment, and in order to use it,
you need various components, various add-ons.
For the AWS Load Balancer Controller, for example, in order to make it work,
we need an IAM role and a policy which give the controller the ability to
access AWS resources, such as load balancers.
The role will reference a policy with the appropriate rights.
If we decide to install the add-on's Helm chart with Terraform, we can
configure it with the previously created role.
Terraform then installs the configured Helm chart into the
cluster, and the service account can reference the proper IAM role.
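For illustration, this is roughly the service account the chart would create; the eks.amazonaws.com/role-arn annotation is what ties the controller's pods to the IAM role (the account ID and role name are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    # IRSA: pods using this service account assume the Terraform-created role
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/aws-load-balancer-controller
```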
But this is again a push model from Terraform.
We still hit the problem of which Terraform state is the source of truth,
and terraform apply only runs when manually triggered.
Can we do better?
Of course we can.
Can we use, for example, a GitOps approach with Argo CD?
We can ask Argo CD to talk to Kubernetes and do the Helm install of the add-ons
via the creation of an Application object that references the Helm chart repository.
But then we still need a way to provide it with the appropriate configuration,
in this case, the IAM role for the load balancer controller.
With Argo CD, there is the notion of an ApplicationSet.
ApplicationSets allow us to dynamically generate
Argo CD Application objects.
An ApplicationSet is able to read a secret in the cluster and apply
configuration based on the labels and annotations of that secret.
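As a sketch, an ApplicationSet with a cluster generator might look like this; the label and annotation keys follow the GitOps Bridge convention, but treat the exact names and chart version as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: addon-aws-load-balancer-controller
  namespace: argocd
spec:
  generators:
    - clusters:  # one Application per matching cluster secret
        selector:
          matchLabels:
            enable_aws_load_balancer_controller: "true"  # placeholder label
  template:
    metadata:
      name: 'aws-load-balancer-controller-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://aws.github.io/eks-charts
        chart: aws-load-balancer-controller
        targetRevision: 1.7.1  # example version
        helm:
          values: |
            clusterName: {{name}}
            serviceAccount:
              annotations:
                # IAM role ARN read from the cluster secret's annotation
                eks.amazonaws.com/role-arn: '{{metadata.annotations.aws_load_balancer_controller_iam_role_arn}}'
      destination:
        server: '{{server}}'
        namespace: kube-system
      syncPolicy:
        automated: {}
        syncOptions:
          - CreateNamespace=true
```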
But first, we need a way for our Terraform to create the secret, with
specific annotations on that secret.
In this case, Terraform creates the load balancer controller IAM role, references
this role as an annotation on the secret, and creates the secret in Kubernetes.
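The secret itself, as materialized by Terraform, could look roughly like this; the names, labels, and role ARN are placeholders following the GitOps Bridge convention:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: dev-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster  # marks it as an Argo CD cluster secret
    environment: dev
    enable_aws_load_balancer_controller: "true"  # matched by the ApplicationSet selector
  annotations:
    # created by Terraform and consumed by the ApplicationSet template
    aws_load_balancer_controller_iam_role_arn: arn:aws:iam::111122223333:role/aws-load-balancer-controller
type: Opaque
stringData:
  name: dev-cluster
  server: https://kubernetes.default.svc  # in-cluster; remote clusters use their API endpoint
```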
This way, Argo CD has all the necessary material
to create an Application and properly install the add-on.
Once Kubernetes has the secret configured, Argo CD can create an Application from
its ApplicationSet manifest with the appropriate configuration and install the add-on.
So we have a complete cycle.
To recap: Terraform creates a role; Terraform creates the Kubernetes secret
with the appropriate annotation for this role; then Argo CD creates
the Application from the ApplicationSet.
This is what we call the GitOps Bridge.
It is fully integrated with EKS Blueprints, and you can rely on this pattern
to keep your Terraform infrastructure-as-code management but shift the
Kubernetes deployment part to Argo CD.
In this case, Terraform is responsible for creating the clusters
and the secrets, but after that Argo CD takes over and
deploys all the workloads and all the necessary add-ons using ApplicationSets.
This can be used for add-ons, but also for workloads, when your workloads
can be configured dynamically from the metadata created by Terraform.
From there, we have different options for managing Argo CD, our clusters,
and the Git repositories for add-ons and workloads.
We can have different clusters pointing to the same Git add-ons repository,
where we store the generic ApplicationSets.
Each ApplicationSet can then be transformed by Argo CD
into different Applications,
which can have different values depending on the target cluster they run on.
So you have a set of ApplicationSets acting like templates, and
specific parameters for specific clusters
can be generated on the fly.
We can also use a centralized Argo CD instance in a hub cluster that
is able to manage application deployment in different
target clusters from a centralized place.
This is what we call a hub-and-spoke model:
you have a hub management cluster, and you have workload clusters for
different environments or different projects, all controlled
by one instance of Argo CD.
We can leverage Terraform to create our hub and spoke clusters with
a simple hub-and-spoke topology.
We can also leverage Terraform workspaces to manage different Terraform
configurations for our different clusters while reusing the same Terraform code.
With this setup, we need many secrets created in the hub cluster,
one secret for each target cluster: the hub cluster itself, dev clusters,
staging, or production clusters.
Argo CD is able to leverage those different secrets when it needs
to communicate with each cluster, so Argo CD has the necessary
credentials to access the workload clusters from the centralized hub cluster.
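As a sketch, a spoke cluster secret in the hub might look like this; the API endpoint, account ID, and role name are placeholders, and the IAM role is what Argo CD assumes instead of holding a static token:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: spoke-dev
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    environment: dev
type: Opaque
stringData:
  name: spoke-dev
  server: https://ABCDEF.gr7.eu-west-1.eks.amazonaws.com  # placeholder endpoint
  config: |
    {
      "awsAuthConfig": {
        "clusterName": "spoke-dev",
        "roleARN": "arn:aws:iam::111122223333:role/argocd-hub-to-spoke-dev"
      },
      "tlsClientConfig": {
        "caData": "<base64-encoded cluster CA>"
      }
    }
```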
ApplicationSets stored in Git will generate different
Applications in the hub cluster.
We can also specify different add-on versions and other configurations by
matching labels on the target clusters.
Within a single repository, you can still control the different versions
to be deployed on different clusters, centralized in one location.
So you have the benefits of centralized configuration, but also
the flexibility to have specific parameters for various clusters.
You can then scale this to many other clusters:
several clusters for staging, or, for example, complete
sets of clusters, dev, staging, and production, for various projects
or various lines of business.
As I said, many of our customers are using Terraform and EKS Blueprints for
Terraform to create their infrastructure.
While Terraform is a really powerful tool for talking to AWS, we have
seen several issues when letting Terraform manage Kubernetes resources
inside the clusters, like add-ons and workloads, as we just saw.
We can delegate some of the Kubernetes work from Terraform to Argo CD using
the GitOps Bridge.
Terraform will still create AWS resources and Kubernetes secrets
with metadata annotations.
From there, Argo CD is able to configure and install the Kubernetes
add-ons, like External DNS or External Secrets, to one or many clusters.
With the GitOps approach, we can also manage teams.
We can create namespaces with Argo CD, and we can enforce role-based
access control, policies, and everything else using GitOps,
with the configuration stored in Git.
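For illustration, team onboarding through Git could be as simple as a namespace plus a role binding that Argo CD syncs; the team and group names are hypothetical:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-checkout
  labels:
    team: checkout
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-checkout-edit
  namespace: team-checkout
subjects:
  - kind: Group
    name: team-checkout  # group name as mapped from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit  # built-in ClusterRole granting read/write in the namespace
  apiGroup: rbac.authorization.k8s.io
```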
Some customers are even able to go one step further, for example,
and let Argo CD hand off to tooling like ACK or Crossplane
to also create the necessary AWS resources, from VPCs to EKS clusters.
In this case, we don't talk about Kubernetes as a container orchestration
system, but as a platform framework.
We are building a fabric for people to bring different pieces of
infrastructure together using native constructs like mutating admission
webhooks, schema validation admission webhooks, and all the other
benefits of the Kubernetes API.
With that, with the help of GitOps and the Kubernetes API, we can
deploy not only workloads to the clusters, but also the clusters themselves
and all the resources your application requires in AWS,
like S3 buckets, databases, and so on.
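For example, with the ACK S3 controller installed, an S3 bucket becomes just another manifest that Argo CD can sync; the bucket name is a placeholder, and S3 bucket names must be globally unique:

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-app-assets
  namespace: team-checkout
spec:
  name: my-app-assets-111122223333  # actual S3 bucket name; placeholder
```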
Of course, in these clusters we will need some workloads.
For those workloads, we need configuration management for
different applications and for different versions of those applications.
We can also use one Git repository to store the configuration
for different applications.
In this case, for example, you have a UI application
and a database application.
If needed, you can also differentiate the configuration
for different clusters or environments of clusters in the Git repository.
So you may have different values for your development clusters
and your production clusters: different parameters for secrets or
database access, of course, but also a different number of replicas or
other parameters which vary from environment to environment.
Argo CD can then reconcile the appropriate directories.
The directory depends on the target cluster, so you can
keep the configuration of a specific cluster in a specific directory.
With this, you can create different configurations for versions and scale,
and optimize cost for compute and storage.
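A hypothetical layout: the same chart with one values file per environment, each directory reconciled by the matching cluster (paths and numbers are placeholders):

```yaml
# apps/ui/values/dev.yaml -- hypothetical path
replicaCount: 1
resources:
  requests:
    cpu: 100m
    memory: 128Mi
---
# apps/ui/values/prod.yaml -- hypothetical path
replicaCount: 3
resources:
  requests:
    cpu: 500m
    memory: 512Mi
```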
As we saw, you may also choose to create additional resources with
GitOps, using manifests and add-ons for AWS Controllers for Kubernetes (ACK)
or Crossplane, for example, to create Amazon RDS databases
associated with a specific application.
So you have configuration for the application, but also for the
resources this application needs in the AWS cloud.
And of course, no talk can be complete without mentioning security.
Security is the top priority for AWS, and clusters are no exception.
You would want to store the secret material for your application,
for example private SSH keys or other secrets, in a secure
location like AWS Secrets Manager.
Then a controller like External Secrets Operator creates and
updates the Kubernetes secret based on the material from AWS Secrets Manager.
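As a sketch, an ExternalSecret that turns a Secrets Manager entry into an Argo CD repository credential might look like this; the store name, secret name, and repository URL are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: argocd-repo-creds
  namespace: argocd
spec:
  refreshInterval: 1h  # keep the secret in sync with Secrets Manager
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager  # assumes a store backed by AWS Secrets Manager
  target:
    name: private-repo-creds
    template:
      metadata:
        labels:
          argocd.argoproj.io/secret-type: repository  # Argo CD picks it up
      data:
        type: git
        url: git@github.com:example-org/workloads.git  # placeholder repo
        sshPrivateKey: "{{ .sshPrivateKey }}"
  data:
    - secretKey: sshPrivateKey
      remoteRef:
        key: argocd/git-ssh-key  # hypothetical Secrets Manager secret name
```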
When External Secrets Operator creates the Kubernetes secret with Git credentials
and the URL hostname, it can also be validated by Kyverno, a policy agent.
Kyverno can have a policy that only allows Argo CD credentials of
a certain type, for example SSH only.
And it can also check that those credentials come only from a
specific Git repository, Git server, or region.
This can be quite secure.
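A hedged sketch of such a Kyverno policy; the hostname is a placeholder, and note that a real policy may need to base64-decode the data fields if the secret is not created with stringData:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-argocd-repo-creds
spec:
  validationFailureAction: Enforce
  rules:
    - name: ssh-only-from-our-git-server
      match:
        any:
          - resources:
              kinds: [Secret]
              namespaces: [argocd]
              selector:
                matchLabels:
                  argocd.argoproj.io/secret-type: repository
      validate:
        message: "Argo CD repo credentials must be SSH and point at git.example.com"
        pattern:
          stringData:
            url: "git@git.example.com:*"  # SSH URL on the approved server only
```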
With Argo CD, another tip: you should never
access remote EKS clusters using service account tokens that never expire.
Argo CD controllers like the application controller, API server, and repo server
should always be configured to use an IAM role, via IRSA or EKS Pod Identity.
When adding another remote cluster, specify the IAM role
that Argo CD should assume when connecting to the remote cluster.
This can also be stored in the Kubernetes cluster secret, as in the
hub-and-spoke example earlier.
A Kyverno policy can be configured to ensure that the cluster secrets
you create only contain an IAM role to assume, and never a kubeconfig token.
Of course, you can replicate the same approach across many clusters
and use it to access not only staging clusters, but also
production clusters and any other required clusters, in a secure way.
To sum up, EKS Blueprints and the GitOps Bridge give you the ability
to create and manage multi-cluster environments with security and
at scale using GitOps principles.
You can check those QR codes and links if you want to dive deeper, get additional
information, or even try out, hands-on in your own AWS account,
the Argo CD on EKS workshops which we prepared for you.
With that, I would like to say thank you.
I hope you enjoyed the session, and please feel free to reach out
on LinkedIn and have a discussion about platform engineering,
containers, and application modernization.
I would really appreciate that.
Also, please spend a minute to answer the short survey on the
session; your feedback is very valuable.
Thank you.