Transcript
            
            
Hello and welcome to our session. Today we will be talking about the efficiency and the resiliency of large-scale Kubernetes environments.
            
            
            
My name is Eli Birger and I'm a co-founder and Chief Technology Officer of PerfectScale. Prior to establishing PerfectScale, I managed DevOps and infrastructure teams for many years and built multiple large-scale SaaS systems, in recent years mainly based on Kubernetes. My talk today will focus on day-2 operations challenges, and specifically on right-sizing Kubernetes environments.
            
            
            
Day-2 operations basically start when your environment goes live and you start serving real customers. This is not a single milestone but the beginning of a long journey: the journey of day-to-day development and operations across the environment.
            
            
            
The entire day-to-day operation has a single purpose: to provide customers with the best possible experience using the system and, from the executive perspective, to do so at the lowest possible cost. To achieve this, the Kubernetes ecosystem provides us with two types of tools: the Horizontal Pod Autoscaler (I personally prefer KEDA here) and the Cluster Autoscaler (some may prefer Karpenter). The combination of the Horizontal Pod Autoscaler and the Cluster Autoscaler allows us to dynamically change the entire environment: it scales up and down horizontally according to the demands and the needs.
            
            
            
So it seems we just need to set up an HPA and the Cluster Autoscaler and start enjoying the best possible experience at the lowest possible cost.
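
(For illustration only, not from the talk: a minimal KEDA ScaledObject that scales a Deployment on CPU utilization could look roughly like the sketch below; the names and thresholds are placeholders. Note that the CPU trigger is evaluated against the container requests, which is exactly the coupling discussed later.)

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: api-scaledobject        # placeholder name
    spec:
      scaleTargetRef:
        name: api                   # placeholder Deployment to scale
      minReplicaCount: 2
      maxReplicaCount: 20
      triggers:
        - type: cpu
          metricType: Utilization
          metadata:
            value: "70"             # target 70% of the CPU *request*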
            
            
            
So now, when both the Horizontal Pod Autoscaler and the Cluster Autoscaler are installed and configured, we expect our environments to have a high resilience level combined with a steady cost pattern that follows the demand fluctuations.
            
            
            
But when we look at the real data, we will find something like this: a not always satisfying resilience level and a constantly growing cost. This is a good sign that our system is not properly right-sized.
            
            
            
Despite the presence of the HPA and the Cluster Autoscaler, there is no magic here. Kubernetes horizontal scalability relies heavily on the proper vertical sizing definitions of pods and nodes. Let's see how it works in detail.
            
            
            
Here is a pod with a request of four cores of CPU and eight gigabytes of memory. Those request values define how many resources the node should allocate for this specific pod when the pod is assigned to a node. Here we are now looking at an example node with eight cores of CPU and 16 gigabytes of memory: the relevant fraction of the node's resources is reserved for our pod.
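
(In manifest form, the requests from this example would look roughly like the sketch below; the pod and image names are placeholders.)

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod                     # placeholder name
    spec:
      containers:
        - name: app
          image: example.com/app:latest     # placeholder image
          resources:
            requests:
              cpu: "4"                      # four cores, as in the example
              memory: 8Gi                   # eight gigabytes of memory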
            
            
            
Now, when Kubernetes needs to schedule an additional pod, it will place it on the same node only if the remaining allocatable capacity of the node fits the pod's request. For example, this red pod with a twelve-gigabyte memory request cannot be assigned to the node.
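
(The scheduler's check here is pure arithmetic on requests, not on actual usage. A rough sketch of the "red" pod from the example, with placeholder names:)

    apiVersion: v1
    kind: Pod
    metadata:
      name: red-pod                         # placeholder name
    spec:
      containers:
        - name: app
          image: example.com/other:latest   # placeholder image
          resources:
            requests:
              memory: 12Gi   # the node has 16Gi total and 8Gi already reserved,
                             # so at most ~8Gi allocatable remains: this pod does not fit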
            
            
            
Instead, this pod will go to the unschedulable queue. The Cluster Autoscaler is constantly monitoring this unschedulable queue, and once there is a pod in it, it will simply add a node to our cluster.
            
            
            
So both the Cluster Autoscaler and the HPA are tightly coupled to the pod requests. Let's see how. As we saw on the previous slide, the Cluster Autoscaler will scale up the number of nodes only when a pod can't be scheduled on existing nodes, and it will scale down a particular node only if the sum of requests on that node is less than a threshold. By the way, the default threshold is 50% of the node's allocatable capacity. So if the total requests on your node are more than 50%, this node will not be removed from the cluster, even if there is enough space for the pods running on this node to be hosted on other nodes.
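
(That threshold is configurable on the Cluster Autoscaler itself. An illustrative excerpt from a cluster-autoscaler container spec; 0.5 is the documented default, the other value is just an example:)

    command:
      - ./cluster-autoscaler
      - --scale-down-utilization-threshold=0.5   # a node is a scale-down candidate only if
                                                 # sum(requests) / allocatable is below this
      - --scale-down-unneeded-time=10m           # how long the node must stay below it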
            
            
            
The same goes for the HPA, or specifically for the resource-based HPA: new replicas will start when the utilization of the current pods exceeds some percentage of the pod requests. And I would like to stress this again: the utilization is measured against the request amount, not the limit.
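
(A minimal resource-based HPA sketch, with placeholder names and numbers, showing that the target is expressed as a percentage of the container requests:)

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-hpa                  # placeholder name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api                    # placeholder target
      minReplicas: 2
      maxReplicas: 20
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # 70% of the CPU *request*, not the limit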
            
            
            
So now that we understand the importance of pod requests, how do we actually right-size our pods, and what are the correct values for the requests and limits? Here is a simple answer: we need to provision as few resources as possible, but without compromising our performance. The request should guarantee enough resources for proper operation, and the limit should protect our nodes from overutilization.
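
(That guidance maps directly onto the two halves of the resources block; the numbers below are purely illustrative:)

    resources:
      requests:           # guaranteed to the container; used for scheduling, HPA and Cluster Autoscaler math
        cpu: 500m
        memory: 512Mi
      limits:             # hard ceiling; protects the node from a misbehaving container
        cpu: "1"
        memory: 1Gi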
            
            
            
So let's see what happens in the misprovisioning scenarios. If pod requests are too big, we cause waste and excessive CO2 emissions. If the requests are under-provisioned, Kubernetes will not guarantee that the pod has enough resources to run. If we forgot to set requests at all, Kubernetes will not allocate enough resources for the pod on the node during the assignment; the same as under-provisioning, this may cause unexpected pod eviction under memory pressure or CPU pressure.
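
(For context, this is standard Kubernetes behavior rather than something specific to the talk: a pod whose containers set no requests or limits at all lands in the BestEffort QoS class, which is the first to be evicted when a node comes under pressure. A sketch with placeholder names:)

    apiVersion: v1
    kind: Pod
    metadata:
      name: no-requests-pod               # placeholder name
    spec:
      containers:
        - name: app
          image: example.com/app:latest   # placeholder image
          # no resources block at all: the pod gets QoS class BestEffort
          # and is among the first candidates for eviction under node memory pressure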
            
            
            
As for the limits: under-provisioned limits will cause CPU throttling or out-of-memory kills, and the service will fail for lack of resources during load bursts, even if there are plenty of free resources available in the entire cluster, or even on that particular node. Over-provisioned limits will set a wrong cutoff threshold, ending up with the failure of the entire node. The failure of a node under a load spike can easily turn into a domino effect and cause a complete outage of our system.
            
            
            
Specifically for the CPU limit: in some situations it is okay to remove the CPU limit completely. And only the CPU limit; we are not talking here about the memory limit at all. This is because of the compressible nature of CPU: the Completely Fair Scheduler of the operating system will figure out how to distribute CPU time between the different containers.
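
(In manifest terms that pattern looks like the sketch below: keep the CPU request so scheduling and CFS weighting still work, keep the memory limit, and simply omit the CPU limit. The values are illustrative:)

    resources:
      requests:
        cpu: 500m        # still needed: scheduling and CFS shares are based on this
        memory: 512Mi
      limits:
        memory: 1Gi      # memory is not compressible, so keep a limit here
        # no cpu limit: bursts are allowed and the CFS arbitrates contention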
            
            
            
So finally our mission of right-sizing is clear. Let's roll up our sleeves and set each and every pod with as few resources as possible without compromising the performance. But how do we actually decide what the right values are? Is it half a core or four cores? Is it 100 megabytes or one gigabyte? Intuitively, we can try to calculate it based on the metrics, or maybe we will just have a VPA recommend some values for us.
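
(If you go the VPA route, it can run in recommendation-only mode so it suggests values without evicting pods; a minimal sketch with a placeholder target:)

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: api-vpa                # placeholder name
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api                  # placeholder target
      updatePolicy:
        updateMode: "Off"          # only compute recommendations, do not resize or evict pods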
            
            
            
It seems like an easy task: we just need all the service owners to go workload by workload, look at all the metrics, and adjust them accordingly, for hundreds of workloads across multiple clusters. And we will also ask those service owners to keep going and do it again every time there is a code change, a change in architecture, or a change in traffic patterns. Unfortunately, that does not sound like a realistic plan, and this level of complexity definitely requires a solution.
            
            
            
From my personal experience, good DevOps solutions consist of 70% philosophy and 30% technology. The philosophy part of such a solution for our problem is to establish an effective feedback loop to pinpoint, quantify, and address the relevant problems. The technology part is the shift from data to intelligence.
            
            
            
What is the difference between data and intelligence? Data is not considered intelligence until it is something that can be applied or acted upon. In other words, humans are not good at analyzing massive amounts of data; it is boring and time-consuming. Switching from data to actionable intelligence will streamline the decision-making process.
            
            
            
This approach allows us to shift from continuous firefighting to proactively pinpointing, predicting, and fixing problems, and to switch from guesstimation mode to data-driven decision making. The end result of such an approach is improved resilience, fewer SLA and SLO breaches, reduced waste and carbon footprint, and effective governance of the platform.
            
            
            
Now let's see it in action. Let's see how the PerfectScale approach can help with right-sizing Kubernetes. Here we see a cluster. This cluster contains 240 different workloads: Deployments, StatefulSets, DaemonSets, and Jobs. The total cluster cost for the last month is $3,687.
            
            
            
Let's see the big picture of our cluster. Over the last month, 99% of the time our cluster as a whole utilized 61 cores of CPU or less and 261 gigabytes of memory or less. The combined requests set across all the workloads, 99% of the time, were 156 cores of CPU or less; the same goes for the memory, 407 gigabytes or less. Now let's look at the total allocated. This is the size of our cluster, and we can easily see that our cluster is nearly four times bigger than what we would actually need 99% of the time.
            
            
            
However, this picture only shows us that we have enough resources to run any workload in this cluster. We still detected 131 different resilience issues related to missing or misconfigured resources, such as requests or limits.
            
            
            
Let's see an example. This is a Couchbase workload. It is a StatefulSet running in the prod namespace, and it ran for 924 hours within the last month. This number represents the total uptime of all the replicas this workload has.
            
            
            
For example, if we observed a one-hour time frame and had one replica, the number would be one; and if we had three replicas during that same hour, the number for that hour would be three. We then understand which node this workload is running on, and we also understand what fraction of the node is actually allocated toward this workload, so we eventually know how much the workload costs.
            
            
            
We are indicating a high resilience risk for this workload. Let's see what that risk is. What do we know about this workload? It ran with somewhere between two and four replicas, with an average of three, during the last month, and we see heavy throttling happening on the CPU. Why is this throttling happening? Because this particular workload is defined with 1000 millicores as the request and 3000 millicores as the limit, while 95% of the time our utilization was two cores of CPU, and the highest spike we observed is very, very close to the limit that we set.
            
            
            
This is why the throttling happens. Those values might have been correct at the moment they were set, but since then many things have changed: maybe you have more customers, maybe you have more data in the database, maybe you have a less efficient query, or more microservices pulling data from the same database.
            
            
            
So PerfectScale comes in, analyzes the behavior of all the replicas of this workload, and comes up with recommendations for how many resources you would need to set in order to run this workload smoothly.
            
            
            
Those recommendations are also combined into a convenient YAML file that you can simply copy and paste into your infrastructure as code and run through CI/CD in order to actually fix the problem.
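
(To make the idea concrete, a fix for the throttling case above would be a resources patch along these lines; the exact numbers here are hypothetical, not the product's actual recommendation:)

    resources:
      requests:
        cpu: "2"       # raised toward the observed 95th-percentile usage of ~2 cores
      limits:
        cpu: "4"       # headroom above the observed spikes so throttling stops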
            
            
            
But in some situations you are not the person who makes the actual fix; there is a service owner, and he needs to address the issue. So we can simply create a task. This task goes directly to Jira and later on can be assigned to the relevant stakeholder, fitting into the normal workflow of the development lifecycle.
            
            
            
An additional perk: we can set different resilience levels for our workloads. For example, if we are running a production database, we would like to set much wider boundaries for that workload. If we set it to the highest resilience level, our recommendations will be much bigger, and we also calculate the impact of the change. So this particular database, at the highest resilience level, would increase the monthly cost by about 70 to 80%.
            
            
            
In the same way, we detect the over-provisioned workloads. For example, this collector-catcher is a Deployment running in the prod namespace, and we spent $94 on this workload during the last month, out of which $76 were completely wasted.
            
            
            
Let's see how. This workload contains two different containers: the Jaeger agent that collects traces, and the actual business-logic container. The business-logic container is provisioned with ten gigabytes of memory for each replica. It runs with somewhere between one and six replicas, with an average of three, and the utilization is somewhere around 2 GB of memory. So we are basically throwing away 8 GB of memory for each replica that we are running. Again, we have a handy YAML to fix the problem.
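
(Again purely as an illustration of what such a fix looks like; the numbers are hypothetical, not the actual recommendation. The memory request for the business-logic container would come down toward the observed usage plus some headroom:)

    resources:
      requests:
        memory: 3Gi    # down from the 10Gi provisioned; observed usage is around 2Gi per replica
      limits:
        memory: 4Gi    # headroom above the request; hypothetical value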
            
            
            
And we can create a task in a similar way. We pinpoint all the different problems that you have in your cluster and categorize those problems by risk, so you can either focus on the highest risks in a particular namespace, or dive into a particular type of problem, for example an under-provisioned memory limit.
            
            
            
Let's see it in action again. So we have a workload here. This workload suffers from a very low request: 95% of the time we need three times more resources, and the limit is very, very close to the actual utilization.
            
            
            
We also observed an upward trend in memory utilization, so we are basically predicting here that at some point in time an out-of-memory event will occur, and we suggest fixing the problem by increasing the amount of allocated resources and increasing the limit. This is going to be the cost impact of the change, but we will have this workload running smoothly.
            
            
            
Now let's see the multi-cluster, multi-cloud view. In this view, we see each and every cluster running in different clouds. We see all the problems that a particular cluster has, all the waste, the total cost, and even the carbon footprint that this particular cluster generates.
            
            
            
We see how those numbers sum up at the organization level of view: how much the cost is, how much the waste is, how much savings we have generated, and how many risks still exist out there.
            
            
            
So I hope you enjoyed our session today and learned something new about right-sizing and right-scaling Kubernetes. Feel free to ping me on LinkedIn or contact me through our website. Thank you very much for your time.