Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello and welcome to our session. Today we will
be talking about the efficiency and resiliency of large-scale
Kubernetes environments.
My name is Eli Birger and I'm a co-founder and Chief
Technology Officer of PerfectScale. Prior to establishing
PerfectScale, I managed DevOps and infrastructure teams for
many years, and in recent years I have built multiple large-scale SaaS
systems, mainly based on Kubernetes.
My talk today will focus on day-two operation challenges, and
specifically on the right-sizing of Kubernetes environments.
Day-two operations basically start when your environment
goes live and you start serving real customers.
Day two is not a single milestone; it is
the beginning of a long journey, the journey of
day-to-day development and operations across the environment.
The entire day-to-day operation has a single purpose:
to provide customers with the best possible experience using
the system, and, from the executive perspective,
to deliver that experience at the lowest
possible cost. To achieve
this, the Kubernetes ecosystem provides us with two types
of tools: the horizontal pod autoscaler (I personally
prefer KEDA here) and the cluster autoscaler (some may
prefer Karpenter). The combination of the horizontal pod
autoscaler and the cluster autoscaler allows us to dynamically
change the entire environment. The environment scales up
and down horizontally according to the demand.
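As a rough illustration of that combination, here is a minimal KEDA ScaledObject sketch that scales a hypothetical `api` Deployment on CPU utilization; the Deployment name, namespace, and target values are placeholders, not something from this talk. KEDA manages an HPA under the hood, and the cluster autoscaler (or Karpenter) then adds or removes nodes as the replica count changes.

```yaml
# Minimal KEDA ScaledObject sketch (Deployment name, namespace, and values are assumptions).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
  namespace: prod
spec:
  scaleTargetRef:
    name: api              # Deployment to scale (hypothetical)
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"        # target average CPU utilization, as a % of requests
```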
So it seems we just need to set up an HPA
and the cluster autoscaler and start enjoying the best possible
experience at the lowest possible cost.
Now that both the horizontal pod autoscaler
and the cluster autoscaler are installed and configured, we expect
our environments to have a high resilience level combined with a
steady cost pattern that follows the demand fluctuations.
But when we look at real data, we will often find something like
this: a not always satisfying resilience
level and constantly growing cost. This is
a good sign that our system is not properly right-sized.
Despite the presence of the HPA and the cluster autoscaler,
there is no magic here. Kubernetes horizontal scalability relies heavily
on proper vertical sizing definitions for
pods and nodes. Let's see how it works in detail.
Here is
a pod with a request of four cores of CPU and eight gigabytes
of memory. Those request values define how many
resources the node should allocate for this specific
pod when the pod is assigned
to the node. Here we are looking at an example node
with eight cores of CPU and 16 gigabytes of memory: the relevant
fraction of the node's resources is reserved for our pod.
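In manifest form, the pod from this example would declare its requests roughly like this; the pod name, container name, and image are placeholders for the example.

```yaml
# Pod requesting 4 CPU cores and 8 GiB of memory, as in the example above.
# Name, container name, and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
    - name: app
      image: example/app:latest
      resources:
        requests:
          cpu: "4"        # 4 cores reserved on the node for this pod
          memory: 8Gi     # 8 GiB reserved on the node for this pod
```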
Now, when Kubernetes needs to schedule additional pods,
it will place them on the same node only if the remaining allocatable
resources of the node fit the pod requests. For example,
this red pod with a twelve-gigabyte memory request
cannot be assigned to the node.
Instead, this pod will go to the unschedulable queue,
and the cluster autoscaler constantly monitors
this unschedulable queue. Once there is a pod in it, it will
simply add a node to our cluster.
So both the cluster autoscaler and the HPA are tightly
coupled to the pod requests. Let's see how. As
we saw on the previous slide, the cluster autoscaler will scale up the number
of nodes only when a pod can't be scheduled
on existing nodes, and it will
scale down a particular node only if the sum of requests
on the node is less than a threshold. By the way, the default threshold is
50% of allocation. So if the total allocations on
your node are more
than 50%, this node will
not be removed from the cluster, even if there is enough space
for the pods running on this node to be
hosted on other nodes.
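For reference, that threshold is controlled by the cluster autoscaler's `--scale-down-utilization-threshold` flag (default 0.5). A sketch of how it might appear in the cluster-autoscaler Deployment, with the image tag chosen for illustration and all other flags omitted:

```yaml
# Fragment of a cluster-autoscaler Deployment spec (other flags omitted).
# A node becomes a scale-down candidate only when the sum of its pods'
# requests falls below this fraction of the node's allocatable resources.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
    command:
      - ./cluster-autoscaler
      - --scale-down-utilization-threshold=0.5   # the 50% default mentioned above
```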
The same goes for the HPA, or
specifically for the resource-based HPA. New replicas
will start when the utilization of the current pods
exceeds some percentage of the pod requests,
and I would like to stress this again: the utilization
is measured against the requests.
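Here is what that looks like for a resource-based HPA: `averageUtilization` is expressed as a percentage of the pods' CPU requests, so the same load triggers scaling earlier or later depending on how the requests are sized. The target name and the numbers below are placeholders.

```yaml
# Resource-based HPA: averageUtilization is a percentage of the pods'
# CPU *requests*, which is why request sizing directly drives scaling.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api               # target Deployment (hypothetical)
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # 70% of the CPU request, not of the limit
```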
So now that we understand the importance of pod
requests, how do we actually right-size our pods,
and what are the correct values for the requests and limits?
Here is a simple answer: we need to provision as few resources
as possible, but without compromising our performance.
The requests should guarantee enough resources for proper operation,
and the limits should protect our nodes from overutilization.
So let's see what happens in the misprovisioning
scenarios. If pod requests are too big, we cause
waste and excessive CO2 emissions. If the requests are
under-provisioned, Kubernetes will not guarantee that the pod has enough
resources to run. If we forget to set
requests at all, Kubernetes will not allocate enough
resources for the pod on the node during assignment.
This, the same as under-provisioning, may
cause unexpected pod evictions under node pressure,
for example memory pressure.
As for the limits: under-provisioned limits will
cause CPU throttling or out-of-memory kills. The service
will fail for lack of resources during load bursts,
even if there are plenty of free resources available in the entire cluster, or
even on that particular node.
Over-provisioned limits will set a wrong cutoff
threshold and can end up with the failure of the entire node.
Failure of a node under a load spike can easily turn into a domino
effect and cause a complete outage of our system.
Specifically for the CPU limit: in some situations it
is okay to remove the CPU limit completely, and
only the CPU limit; we are not talking here about the memory limit at
all. This is because of the compressible nature of CPU:
the operating system's Completely Fair Scheduler
will figure out how to distribute CPU
time between the different containers.
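A sketch of what that looks like in practice: requests for both resources, a memory limit, and the CPU limit deliberately left out. The values here are illustrative only.

```yaml
# Illustrative container resources: keep the memory limit, omit the CPU limit.
# CPU is compressible, so the kernel's Completely Fair Scheduler shares CPU
# time between containers instead of anything being killed.
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 1Gi     # memory limit stays: memory is not compressible
    # no CPU limit on purpose
```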
So finally, our mission of right-sizing
is clear. Let's roll up our sleeves and set each
and every pod with as few resources as possible without compromising
performance. But how do we actually decide
what the right values are?
Is it half a core or four cores? Is it 100 megabytes
or one gigabyte? Intuitively, we
can try to calculate it based on the metrics, or maybe we will just have
a VPA recommend some values for us.
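For instance, a VPA can be deployed in recommendation-only mode, so it publishes suggested requests without evicting or resizing pods. A minimal sketch, with the target Deployment name assumed:

```yaml
# VPA in recommendation-only mode: it reports suggested requests in its
# status but does not apply them (target name is a placeholder).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only, do not act
```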
It seems like an easy task. We just need all the service
owners to go workload by workload,
look at all the metrics, and adjust them accordingly, for hundreds
of workloads in multiple clusters. And we will also
ask those service owners to keep going and
do it again every time there is a code change,
a change in architecture, or a change in traffic patterns.
Unfortunately, that does not sound like a realistic
plan, and this
level of complexity definitely requires a solution.
From my personal experience, good DevOps solutions consist of
70% philosophy and 30% technology.
The philosophy part of such a solution for our
problem is to establish an effective feedback loop to pinpoint,
quantify, and address the relevant problems. The
technology part is the shift from data to intelligence.
What is the difference between data and intelligence?
Data is not considered intelligence until it is something
that can be applied or acted upon.
In other words, humans are not good at analyzing massive
amounts of data. It is boring and time consuming.
Switching from data to actionable intelligence will streamline the
decision-making process.
This approach allows us to shift from continuous firefighting to
proactively pinpointing, predicting, and fixing
problems, and to switch from guesstimation
mode to data-driven decision making.
The end result of such an approach is improved
resilience, fewer SLA and SLO breaches, reduced waste
and carbon footprint, and effective governance of the platform.
Now let's see it in action.
Let's see how the PerfectScale
approach can help with right-sizing Kubernetes.
Here we see a cluster. This cluster contains 240
different workloads. Here they are: Deployments,
StatefulSet applications,
DaemonSets, and Jobs. The total cluster
cost for the last month is $3,687.
Let's look at the big picture of our cluster.
Our cluster, combined, for the last
month utilized, 99%
of the time, 61 cores of CPU
or less and 261 gigabytes of memory
or less. The combined
number of requests set across
all the workloads is, 99% of the
time, 156 cores
of CPU or less. The same goes for the memory:
407 gigabytes of memory or less.
Now let's look at the total allocated.
This is the size of our cluster, and we can easily
see that our cluster is nearly four times bigger than what we
would actually need 99% of the time.
However, this picture does show
us that we have enough resources to run any workload
in this cluster. Still, we
detected 131 different resilience
issues related to missing or misconfigured
resources such as requests or limits.
Let's see an example. This is a Couchbase.
It is a StatefulSet, running in the
prod namespace. It ran
for 924 hours
within the last month. This number represents
the total uptime of all the replicas that this workload has.
For example, if we observed a one-hour time frame and we
had one replica, the number would be one. And if
we had three replicas during that same hour, the number for that hour
would be three. Then we understand on top
of which nodes this workload is running. We also
understand what fraction of the node is actually
allocated to this workload,
so we eventually know how much the workload costs.
We are indicating a high resilience risk for
this workload. Let's see what the risk is.
What do we know about this workload? This workload
has somewhere between two and four replicas, with an average of three
replicas during the last month. And we
see heavy throttling happening on the CPU. Why is
this throttling happening? It is happening
because this particular workload is defined with 1000 millicores
as a request and 3000 millicores as a limit.
95% of the time our utilization was two cores of
CPU, and the highest spike that we observed is very,
very close to the limit that we set.
This is why the throttling happens. Those values
might have been correct at the moment they were set, but since
then many things have changed. Maybe you have more customers, maybe you
have more data in the database, maybe you have a less efficient query, or
more microservices pulling data from the same database.
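In manifest terms, the CPU settings described for this workload would look roughly like the fragment below, which is why the throttling appears once 95th-percentile usage reaches two cores and the spikes approach the limit. This is a reconstruction from the numbers quoted in the demo, not the actual manifest.

```yaml
# The Couchbase container's CPU settings as described in the demo:
# the request sits below typical usage, and peaks run very close to the
# limit, so the CFS quota kicks in and the container gets throttled.
resources:
  requests:
    cpu: 1000m    # requested: 1 core, while 95th-percentile usage is ~2 cores
  limits:
    cpu: 3000m    # limit: 3 cores, barely above the highest observed spike
```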
So PerfectScale comes in, analyzes the behavior
of all the replicas of this workload, and comes up with
recommendations for how many resources you would need to set in
order to run this workload smoothly.
Those recommendations are also combined into a
convenient YAML file that you can simply copy-paste into
your infrastructure as code and run through CI/CD in order to
actually fix the problem.
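The exact numbers come from the product; conceptually, though, the generated YAML boils down to an updated resources block dropped back into the workload's manifest. A purely hypothetical sketch of that shape, with illustrative values rather than PerfectScale's actual recommendation:

```yaml
# Hypothetical example of what such a recommendation might look like once
# pasted back into the StatefulSet manifest (values are illustrative only).
spec:
  template:
    spec:
      containers:
        - name: couchbase
          resources:
            requests:
              cpu: "2"     # sized around the observed 95th-percentile usage
            limits:
              cpu: "4"     # headroom above the highest observed spike
```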
But in some situations you are not the person who makes the actual fix; there is
a service owner who needs to address the issue.
So we can simply create a task. This task will go directly
to Jira, and later on it can be assigned to the
relevant stakeholder and fit into the
normal workflow of the development lifecycle.
An additional perk: we can set different resilience levels for
our workloads. For example, if we are running a
production database, we would like to set much wider
boundaries for that workload. And if we
set it to the highest resilience level, our recommendations
will be much bigger, and we will also
calculate the impact of the change.
So this particular database, at the highest level
of resilience, would increase the monthly cost by about
70 to 80%.
In the same way that we detect under-provisioned workloads, we also detect over-provisioned ones.
For example, this collector-catcher is a Deployment running in
the prod namespace, and we
spent $94 on this workload during the last month, out of
which $76 was completely wasted.
Let's see how. This workload contains two different containers:
the Jaeger agent that collects traces, and the actual business-logic
container. The business-logic container is provisioned
with ten gigabytes of memory for each replica. It
runs with somewhere between one and six replicas,
with an average of three replicas, and the utilization is
somewhere around two gigabytes of memory. So we are basically
throwing away eight gigabytes of memory for each replica that we
are running. Again, we have a handy YAML to fix the problem,
and we can create a task in
a similar way. We pinpoint all the different problems
that you have in your cluster and categorize those problems by
risk, so you can either focus on
the highest risks in a particular namespace,
or dive into a particular type
of problem, for example an under-provisioned memory limit.
Let's see it in action again.
So we have the workload here. This workload suffers
from a very low request:
95% of the time we need three times
more resources, and the
limit is very, very close to the actual utilization.
We also observe an upward trend
in memory utilization. So we are basically predicting
here that at some point in time an out-of-memory
event will occur, and we are suggesting
to fix the problem by increasing the amount of allocated resources
and increasing the limit. This is going to be the impact
of the change, but we will have this workload
running smoothly.
Now let's see the multi-cluster, multi-cloud view.
In this view, we see each and every cluster
running in the different clouds. We see all the problems that
each particular cluster has, all the waste, the total cost,
and even the carbon footprint that the cluster generates.
We see how those numbers sum up
at the organization level of the view: how much
the cost is, how much the waste is, how much savings
we generated, and how many risks are still out there.
So I hope you enjoyed our session today
and learned something new about right-sizing and right-scaling
Kubernetes. Feel free to ping me on
LinkedIn or reach out through our website.
Thank you very much for your time.