Transcript
This is Giovanni Gibilisco, and in the next 20 minutes or so I'll
share with you some of our experiences in tuning applications running
on Kubernetes. These are the contents that
we will cover. We'll start by identifying some challenges
of modern applications for ensuring performance and reliability.
We'll then review how Kubernetes manages container
resources and the factors we need to be aware of if we want
to ensure high performance and cost efficiency.
We will introduce a new approach we implemented at Akamas,
which leverages machine learning to automate the optimization process,
and we will do that with a real world example.
Finally, we will conclude by sharing some takeaways.
Before proceeding, let me introduce myself. My name is Giovanni Paolo
Gibilisco and I serve as head of engineering at Akamas.
Okay, let's start with a quick overview of some of the main challenges that
come with the development of modern applications. The advent of
agile practices allowed developers to speed up the development cycle
with the goal of getting rapid feedback and iteratively
improving applications, thus increasing the release frequency.
It's now common to see applications, or parts of them,
released to production weekly or even daily. At the same time,
the underlying frameworks and runtimes, such as the JVM
that are used to build those applications, have grown in complexity.
The emergence of architectural patterns such as microservices
have also brought an increase in the number of frameworks and technologies
used within a single application. It's now common to
see applications composed of tens or even hundreds of services,
written in different languages and interacting with multiple
runtimes and databases. Kubernetes provides a
great platform to run such applications, but it has its
own complexities. Kubernetes Failure Stories is a website specifically created
to share incident reports in order to allow the community
to learn from failures and prevent them from happening again.
Many of these stories describe teams struggling with
Kubernetes application performance and stability issues such
as unexpected CPU slowdowns and even sudden container
terminations. Engineers at Airbnb even got
to the point of suggesting that Kubernetes may actually hurt the
performance of latency sensitive applications. But why is it
so difficult to manage application performance, stability, and efficiency
on Kubernetes? The simple answer is that Kubernetes is a
great platform to run containerized applications, but it requires
applications to be carefully configured to ensure high performance
and stability, as we're going to see. To answer
this question, let's now get back to the fundamentals and see how Kubernetes
resource management works to better understand the
main parameters that impact Kubernetes application performance,
stability, and cost efficiency. Let's go through
five key aspects and their implications.
The first important concept is resource requests.
When a developer defines a pod, she has the possibility to
specify resource requests. These are the amount of CPU
and memory the pod, or better, a container within the pod,
is guaranteed to get. Kubernetes will schedule
the pod on a node where the requested resources are
actually available. In this example,
pod A requires two CPUs and is scheduled on a
four-CPU node. When a new pod B of the
same size is created, it can also be scheduled on
the same node. This node now has all of its four
CPUs requested. If a pod C is created,
Kubernetes won't schedule it on the same node as its
capacity is full. This means that the numbers developers
specify in the deployment YAML directly affect the
cluster capacity. A key difference with respect to virtualization
and hypervisors is that with Kubernetes there is
no overcommitment on the requests. You cannot request more
CPUs than those available in the cluster.
Another important aspect is that resource requests are
not equal to utilization. If pod requests are
much higher than the actual resource usage, you might end up with a cluster
that is at full capacity even though its CPU utilization is
only 10%. So the takeaway
here is that setting proper pod requests is
paramount to ensure Kubernetes cost efficiency. The second
important concept is resource limits.
Resource requests are guaranteed resources that
a container will get, but usage can be higher.
Resource limits are the mechanism that allows you to define
the maximum amount of resources that a container can use,
like two CPUs or 1 GB of memory.
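To make this concrete, here is a minimal sketch of how requests and limits are declared in a pod spec; the names, image, and values are illustrative, not from the case study discussed later:

```yaml
# Minimal sketch of container resource settings (hypothetical names and values)
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
spec:
  containers:
  - name: app
    image: registry.example.com/sample-app:1.0   # placeholder image
    resources:
      requests:
        cpu: "1"        # guaranteed CPU, used by the scheduler
        memory: 512Mi   # guaranteed memory
      limits:
        cpu: "2"        # CPU is throttled above this
        memory: 1Gi     # container is terminated (OOM-killed) above this
```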
All this is great, but what happens when resource usage hits
the limit? Kubernetes treats CPU and memory
differently here. When CPU usage approaches the limit,
the container gets throttled. This means that the CPU is artificially
restricted and this usually results in application performance
issues. Instead, when memory usage hits the limit,
the container gets terminated, so there is no application
slowdown due to paging or swapping as we had in traditional
operating systems. With Kubernetes, your pod will
simply disappear and you may face serious application stability
issues. The third fact is an important and less-known effect
that CPU limits have on application performance.
We have seen that CPU limits cause throttling, and you may think that
this happens only when CPU usage hits the limit.
Surprisingly, the reality is that CPU throttling starts
even when CPU usage is well below the limit. We did
quite a bit of research on this aspect in our labs and found
that CPU throttling starts when CPU usage is as
low as 30% of the limit. This is due
to the particular way CPU limits are implemented at the Linux kernel level,
via the CFS bandwidth quota mechanism.
This aggressive CPU throttling has a huge impact on
service performance. You can get sudden latency spikes
that may breach your SLOs without any apparent reason,
even at low CPU usage.
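As a side note, one way to observe this effect, assuming your cluster exposes cAdvisor metrics to Prometheus, is to track the fraction of throttled CFS periods; the following rule is an illustrative sketch with hypothetical names and thresholds:

```yaml
# Illustrative alert on CPU throttling (assumes Prometheus Operator and cAdvisor metrics)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling-alerts   # hypothetical name
spec:
  groups:
  - name: cpu-throttling
    rules:
    - alert: HighCPUThrottling
      expr: |
        sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod)
          /
        sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod)
          > 0.25
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Container CPU is being throttled, even if usage is below the limit"
```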
Now, some people, including engineers at Buffer, tried to remove CPU limits.
What they got was an impressive reduction in service latency.
So is it a good idea to get rid of CPU limits?
Apparently not. CPU limits exist to
bound the amount of resources a container can consume.
This allows many containers to coexist without competing
for the same resources. So if CPU limits are
removed, a single runaway container can disrupt the performance
and availability of your most critical services.
It might also make the Kubelet service unresponsive and effectively
remove the entire node from the cluster. Using
CPU limits is a best practice also recommended by Google.
Properly setting your CPU requests and limits is critical
to ensuring your Kubernetes cluster remains stable and efficient
over time. To ease the management of
limits and requests for many services, Kubernetes comes with
autoscaling. Let's discuss built in autoscaling
capabilities that are often considered as a way to automate this
process. In particular, the vertical pod autoscaler
or VPA provides recommended CPU and memory requests
based on the observed pod resource usage.
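For reference, a VPA is typically attached to a workload with a manifest along these lines; this is an illustrative sketch with hypothetical names, not a configuration from the example that follows:

```yaml
# Sketch of a VerticalPodAutoscaler attached to a Deployment (hypothetical names)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Auto"   # "Off" only publishes recommendations without applying them
```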
However, our experience with the VPA is mixed.
In this example, a Kubernetes microservice is serving a typical
diurnal traffic pattern. The top left chart shows the
latency of this service and its service level objective,
while below you can see the resource requests, CPU and memory,
and the corresponding resource utilization.
We let this service run for a couple of days with some initial resource
sizing, then activated the VPA and let it apply the new
recommended settings to the pod.
It's interesting to see that the VPA immediately decided to
reduce the assigned resources. In particular, it cut the CPU requests in
half. This is likely due
to some apparent overprovisioning of the service, as the CPU
utilization was below 50%.
However, with the new settings suggested by the VPA,
the latency of the microservice skyrocketed, breaching our SLOs.
What is the lesson learned here? Kubernetes autoscaling, and
the VPA in particular, is based on resource usage and
does not consider application level metrics like response time.
We need to evaluate the effect of the recommended settings as they
might be somewhat aggressive and cause severe service performance
or reliability degradations. As
we've seen so far, optimizing microservice applications on Kubernetes
is quite a challenging tuning task for developers,
SREs, and performance engineers. Given the complexity of tuning
Kubernetes resources and the many moving parts we have
in modern applications, a new approach is required
to successfully solve this problem and this is where machine learning
can help. AI and machine learning have revolutionized
entire industries and the good news is that ML can be
used also in the performance tuning process. ML can automate
the tuning of the many parameters we have in the software stack with
the goal of optimizing application performance, resiliency, and cost.
In this section I would like to introduce you to this new methodology.
The real-world case is about a European leader in accounting,
payroll, and business management software. Their Java-based microservice
applications run either on Azure or AWS Kubernetes
services. The target system of the optimization
is the B2B authorization service running on Azure. It's a
business-critical service that interacts with all the applications powering
the digital services provided by the company.
The challenge of the customer was to avoid overspending and
achieve the best cost efficiency possible by enabling development teams
to optimize their applications while continuing to release the application
updates required to introduce new business functionalities and align
with new regulations. So what is the goal
of this optimization? In this scenario, the goal was to reduce
the cloud costs required to run the B2B authorization service on Kubernetes.
At the same time, we also wanted to ensure that the service would
always meet its reliability targets, which are expressed as latency,
throughput, and error rate SLOs. So how can we
leverage ML to achieve this high level business goal?
In our optimization methodology, the ML changes
the parameters of the system to improve the metric that we have defined.
In this case, the goal is simply to optimize the application cost.
This is a metric that represents the cost we pay to run the application on
the cloud, which depends on the amount of CPU and memory resources
allocated to the containers. The ML-powered optimization
methodology also allows you to set constraints to define
which configurations are acceptable. In this case, we state
that the system throughput, response times and error rate should
not degrade more than 10% with respect to the baseline.
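Expressed as a sketch, and not in the exact format used by our tooling, the goal and constraints look roughly like this, with all field names being illustrative:

```yaml
# Hypothetical, illustrative study definition (field names are not a real schema)
goal:
  objective: minimize
  metric: application_cost   # cost of the CPU/memory allocated to the containers
constraints:
  - metric: throughput
    degradation_vs_baseline: "<= 10%"
  - metric: response_time
    degradation_vs_baseline: "<= 10%"
  - metric: error_rate
    degradation_vs_baseline: "<= 10%"
```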
Once we have defined the optimization goal, the next step is
to define the parameters of the system that machine learning
can optimize to improve our goal. In this scenario,
nine tunable parameters were considered in total. Four parameters
are related to Kubernetes container sizing: the CPU and memory requests and
limits, which play a big role in the overall service performance,
cost, and reliability. Five parameters are related
to the JVM, which is the runtime that runs within the container.
Here we included parameters like the heap size, the garbage collector,
and the size of the regions of the heap, which are important options to improve the
performance of Java apps. It's worth noticing that
the ML optimizes the full stack by operating on all these
nine parameters at the same time, thereby ensuring that the
JVM is optimally configured to run within the
chosen container resource sizing.
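As an illustration, the resulting parameter space can be summarized as follows; the Kubernetes fields and JVM flags are real, but the bounds shown here are made-up examples rather than the ones used in the study:

```yaml
# Illustrative tunable-parameter space (bounds are examples, not the study's actual values)
kubernetes_container:
  cpu_request:    { range: [0.5, 4.0] }     # cores
  cpu_limit:      { range: [0.5, 4.0] }
  memory_request: { range: [1Gi, 8Gi] }
  memory_limit:   { range: [1Gi, 8Gi] }
jvm:
  max_heap:         { flag: -Xmx, range: [1g, 6g] }
  min_heap:         { flag: -Xms, range: [512m, 6g] }
  gc_type:          { options: ["-XX:+UseG1GC", "-XX:+UseParallelGC"] }
  heap_region_size: { flag: -XX:G1HeapRegionSize, range: [1m, 32m] }
  # a fifth JVM parameter was also tuned in the study; it is not named in the talk
```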
Let's now see how the ML-powered optimization methodology works in practice.
The process is fully automated and works in five
steps. The first step is to apply the new configuration suggested
by the ML algorithms to our target system.
This is typically done leveraging Kubernetes APIs to
set the new value to the parameters, for example the CPU request.
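For example, assuming the service runs as a Deployment, a new CPU request could be applied with a strategic merge patch along these lines; the deployment and container names and the values are hypothetical:

```yaml
# patch.yaml -- illustrative patch for the CPU request (hypothetical names and values)
# apply with: kubectl patch deployment auth-service --patch-file patch.yaml
spec:
  template:
    spec:
      containers:
      - name: auth-service
        resources:
          requests:
            cpu: "1"
```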
The second step is to apply a workload to the target system
in order to assess the performance of the new configuration.
This is usually done by leveraging performance testing tools.
In this case, we used a JMeter test that was already available to
stress the application with a realistic workload.
The third step is to collect KPIs related to the target
system. The typical approach here is to leverage observability
tools. In this case, we integrated elastic APM,
which is the monitoring solution used by this customer.
The fourth step is to analyze the result of the performance test
and assign a score based on the specific goal that you have defined.
In this case, the score is simply the cost of running the application
containers, considering the prices of the Azure cloud.
The last step is where the machine learning kicks in by taking
the score of the tested configurations as input and producing as
an output the most promising configuration to be tested in the
next iteration. In a relatively short amount of
time, the ML algorithm learns the dependencies between the configuration
parameters and the system behavior, thus identifying better and
better configurations. It's worth noticing that the whole optimization
process becomes completely automated.
So what are we getting as an output of the ML-based optimization?
The main result is the best configuration of the software stack parameters
that maximizes or minimizes the goal we have defined.
These parameters can then be applied in production environments,
but the value this methodology can bring is actually much
higher. The ML will evaluate many different configurations
of the system, which can reveal important insights about the overall
system behavior in terms of other KPIs like cost, performance, or
resiliency. This supports performance engineers and
developers in their decisions on how to best configure the application to
maximize specific goals. So,
to assess the performance and cost efficiency of a new configuration suggested
by the ML optimizer, we stress the system with a load
test. Here you can see the load test scenario that we used,
designed according to performance engineering best practices.
The traffic pattern mimicked the behavior seen in production,
including API call distribution and think times.
Before looking at the results, it's worth commenting on
how the application was initially configured by the customer.
We call this the baseline configuration. Let's look
at the Kubernetes settings first. The container powering the
application was configured with resource requests of 1.5 CPUs
and 3.42 GB of memory. The team also
specified resource limits of 2 CPUs and
4.39 GB of memory. Remember,
the requests are the guaranteed resources that Kubernetes
will use for scheduling and capacity management of the cluster.
In this case, requests are lower than the limit.
This is a common approach to guarantee resources for the application
to run properly, but at the same time allow for some room
for unexpected growth.
Besides looking at the container settings, it's important to also see
how the application runtime is configured. The runtime is
what ultimately powers our application, and for Java apps
we know that JVM settings play a big role in app
performance, but the same happens for Golang applications.
For example, the JVM was configured with a minimum
heap of half a gigabyte and a max heap of 4 GB.
Notice that the max heap is higher than the memory request,
which means that the JVM can use more memory than the amount
requested. As we're going to see, this configuration will have
an impact on how the application behaves under load and the associated
resiliency and costs.
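Putting the baseline together, the container and JVM settings just described correspond roughly to the following sketch; the container name and the JAVA_OPTS wiring are assumptions, while the resource and heap values are the ones mentioned above:

```yaml
# Baseline configuration sketch (container name and env wiring are assumptions)
spec:
  template:
    spec:
      containers:
      - name: auth-service
        env:
        - name: JAVA_OPTS
          value: "-Xms512m -Xmx4g"   # max heap (4 GB) exceeds the 3.42 GB memory request
        resources:
          requests:
            cpu: "1.5"
            memory: 3.42G
          limits:
            cpu: "2"
            memory: 4.39G
```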
It's worth noting that the customer also defined autoscaling
policies for this application, leveraging the KEDA autoscaling
project for Kubernetes. In their environment,
both CPU and memory were defined as scalers, with
triggering thresholds of 70% and 90% utilization,
respectively.
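Assuming KEDA's built-in CPU and memory scalers, the policy described would look roughly like this sketch; the object and deployment names are hypothetical:

```yaml
# Sketch of the KEDA scaling policy described above (hypothetical names)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: auth-service-scaler
spec:
  scaleTargetRef:
    name: auth-service        # the Deployment to scale
  minReplicaCount: 1
  triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"             # % of the CPU request
  - type: memory
    metricType: Utilization
    metadata:
      value: "90"             # % of the memory request
```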
What is important to keep in mind is
that these utilization percentages are relative to the resource requests,
not the limits. So, as you can see in the diagram on
the right, an action to scale out the application will happen,
for example, when CPU usage goes above one
core (70% of the 1.5-CPU request). Okay, we've covered how the
application is configured. Let's now look at the behavior of the application when
subjected to the load test we've shown before with the baseline configuration.
In this chart you can see the application throughput, response time, and the
number of replicas that were created by the autoscaler.
Two facts are important to notice. When the load increases, the autoscaler
triggers a scale-out event, which creates a new replica.
This event causes a big spike in response time, which impacts
service reliability and performance. This is due to the
high CPU usage and throttling during the JVM startup.
When the load drops, the number of replicas does not scale down,
even though the container CPU usage is idle.
It's interesting to understand why this is happening. This is
caused by the configuration of the container resources, the JVM
tuning inside, and the autoscaler policies, in particular for the
memory resources. The autoscaler in this case
is not scaling down because the memory usage of the container is
higher than the configured threshold of 70% usage with
respect to the memory request. This might be due to the JVM
max heap being higher than the memory request, as we've seen
before, but it may also be due to a
change in the application memory footprint, for example due to a new
application release. This effect clearly impacts the
cloud bill, as more instances are up and running than
required. This shows that configuring Kubernetes apps
for reliability and cost efficiency is actually a tricky process.
Let's now have a look at the best configuration identified by ML
with respect to the defined cost efficiency goal.
This was found at experiment number 34, after
about 19 hours, and it almost halved the cost of
running the application with respect to the baseline.
First of all, it's interesting to notice how our
ML-based optimization increased both memory and CPU
requests and limits, which is not at all obvious and
may seem at first counterintuitive, especially as Kubernetes is
often considered well suited for small and highly scalable
applications. The other notable changes are related to the
JVM options. The max heap size was increased by 20% and is
now well within the container memory request, which was increased
to 5 GB. The min heap size has also
been adjusted to be almost equal to the max heap, which is a configuration
that can avoid garbage collection cycles, especially in the startup
phase of the JVM. So let's now see how
the application performs with the new configuration identified by ML
and how it compares with respect to the baseline. There are two important differences
here. Response time always remains within the SLO,
and there are no more spikes. So this configuration not only improves
on cost, but it's also beneficial in terms of performance
and resilience. Autoscaling is not triggered with this
configuration, as the full load is sustained by just one pod.
This is clearly beneficial in terms of costs.
Let's also compare in detail the best configuration with respect to
the baseline. Here we can notice that the pod is significantly
larger in terms of both CPU and memory, especially for
the requests. This configuration has the effect of triggering the auto
scaler less often, as we have seen. Interestingly and
somewhat counterintuitively, while this implies a kind of
fixed cost considering the prices of the container resources,
it turns out to be much cheaper than a configuration where autoscaling
is triggered, and it also avoids performance issues.
The container and runtime configuration are now better aligned.
The JVM max heap is now below the memory request, which
has a beneficial effect as it also enables the scale-down
of the application, should scaling be triggered by higher loads.
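To summarize, the best configuration corresponds roughly to the following sketch; the memory and heap values follow the figures above, while the container name and the CPU values are placeholders, since the talk does not give the exact numbers:

```yaml
# Best (lowest-cost) configuration sketch -- CPU values are placeholders
spec:
  template:
    spec:
      containers:
      - name: auth-service
        env:
        - name: JAVA_OPTS
          value: "-Xms4800m -Xmx4800m"   # ~4.8 GB: baseline 4 GB max heap + 20%, with min heap ~= max heap
        resources:
          requests:
            cpu: "2"          # placeholder: the talk only says the CPU request was increased
            memory: 5G        # increased memory request, now above the max heap
          limits:
            cpu: "3"          # placeholder: the talk only says the CPU limit was increased
            memory: 6G        # placeholder
```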
Let's now have a look at another configuration found by the ML
at experiment number 14, after about 8 hours
of automated optimization. We labeled this configuration "high
reliability" for a reason that will be clear in a minute.
The score for this configuration, while not as good as the best configuration's,
also provided about a 60% cost reduction.
So this can also be considered an interesting configuration with
respect to the cost efficiency goal. As regards the parameters,
what is worth noticing is that this time the ML picked
settings that significantly changed the shape of the container.
It now has a much smaller CPU request with respect to the
baseline, but the memory is still pretty large, which is
pretty interesting. The JVM options were
also changed. In particular, the garbage collector was
switched to Parallel, which is a collector that can be much
more efficient in its use of CPU and memory.
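Again as a rough sketch, the high-reliability shape described above, with a smaller CPU request, a large memory request, higher limits, and the Parallel collector, could look like this; all numeric values are placeholders, since the talk describes them only qualitatively:

```yaml
# "High reliability" configuration sketch -- numeric values are placeholders
spec:
  template:
    spec:
      containers:
      - name: auth-service
        env:
        - name: JAVA_OPTS
          value: "-XX:+UseParallelGC -Xms2g -Xmx4g"   # collector switched to Parallel
        resources:
          requests:
            cpu: "0.5"        # much smaller CPU request than the baseline's 1.5
            memory: 4G        # memory request still fairly large
          limits:
            cpu: "3"          # limits higher than the baseline
            memory: 5G
```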
Let's compare the behavior of this configuration with respect to the baseline.
There are two important differences here. The peak in response
time upon scaling out is significantly lower. It's still
higher than the response time SLO; however, the peak is
less than half the value of the baseline configuration.
This clearly improves the service resilience.
Autoscaling also works properly: after the high-load phase,
replicas are scaled back to one. This behavior is what we expect from
an autoscaling system that works properly. Notice that the response time
peaks could be further reduced. It would simply
be a matter of creating a new optimization with the goal of minimizing
the response time metric instead of the application cost.
Let's now also compare in detail the high-reliability configuration
with respect to the baseline. Quite interestingly,
this configuration has a higher memory request and a lower CPU request,
but higher limits, than the baseline. As you may remember,
the lowest-cost configuration instead had a higher CPU request than
the baseline. Without getting into much detail in the analysis
of this specific configuration, what these facts show is
that as the optimization goal changes, CPU and memory
requests and limits may need to be increased or decreased,
and multiple parameters at the Kubernetes and JVM
levels also need to be tuned accordingly.
This is a clear confirmation of the perceived complexity of
tuning Kubernetes microservice applications,
as here we are just discussing one microservice out of the hundreds
or more found in today's applications. There are many other interesting
configurations found by ML that we would like to discuss,
but I think it's time to conclude with our takeaways.
Our first takeaway is that when tuning modern applications,
the interplay between different application layers and technologies requires
tuning the full-stack configuration to make sure
that both the optimization goal and the SLOs are
met, as we've seen in our real-world example. A second takeaway
is that the complexity of these applications under varied workloads and
in a context of frequent releases with agile practices
requires a continuous performance tuning process. Developers cannot
simply rely on manual tuning or utilization based autoscaling
mechanisms. Finally, in order to explore
the vast space of possible configurations in a cost-
and time-efficient way, it's mandatory to leverage ML-based methods
that can automatically converge to optimal configurations within hours
without requiring deep knowledge of all the underlying technologies.
Many thanks for your time. I hope you enjoyed the talk. Please reach
out to me if you have found this talk interesting. I would love to share
more details and hear your Kubernetes challenges.