Abstract
SREs’ main goal is to achieve optimal application performance, efficiency and availability. A crucial role is played by configurations (e.g. JVM and DBMS settings, container CPU and memory, etc.): wrong settings can cause poor performance and incidents. But tuning configurations is a manual and lengthy task, as there are hundreds of settings in the stack, all interacting in counterintuitive ways.
In this talk, we present a new approach that leverages machine learning to find optimal configurations of the tech stack. The optimization process is automated and driven by performance goals and constraints that SREs can define (e.g. minimize resource footprint while matching latency and throughput SLOs).
We show examples of optimizing Kubernetes microservices for cost efficiency and latency by tuning container sizing and JVM options.
With the help of ML, SREs can achieve higher application performance, in days instead of months, and have a lot of fun in the process!
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE? A developer? A quality engineer who wants to tackle the challenge of improving
reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative.
Create your free account at ChaosNative Litmus Cloud.
Hi, and welcome to my talk. The topic of this session is a new approach SREs can use
to automate the performance tuning of system configurations by leveraging machine learning techniques.
We will start by discussing the SRE challenges of ensuring application performance and reliability.
We will then introduce a new approach based on machine learning techniques, and we will see
how it can be used to automatically tune today's complex applications, focusing on optimizing
a microservice application running on Kubernetes and the Java Virtual Machine. Finally,
we will conclude by sharing some takeaways on how this approach benefits SRE teams.
Before proceeding, allow me to introduce myself.
My name is Stefano Doni and my role is CTO at Akamas.
Performance engineering has always been my topic of interest and passion,
so let's start by discussing why system configurations are so important
to the job and mission of SREs. I would like to start by reviewing, at a high level,
what the job of SREs really is, citing the seminal SRE book by Google. It's no surprise
that any SRE deeply cares about service performance, efficiency and reliability.
To understand what that means in practice, let's have a look at some of the core tenets.
Pursuing maximum change velocity without violating SLOs: this tenet is well known and refers to
the goal for SREs to accelerate product innovation and releases while matching agreed service level objectives,
for example ensuring request latency stays below a target value.
Demand forecasting and capacity planning: SREs should have a clear understanding of the capacity
of the service, in terms of the maximum load that can be supported. This is typically achieved
via load testing, to determine the right resources to provision for the service,
for example before a product launch. Efficiency and performance: efficiency means being
able to serve the target load with the required response time using the least amount of resources,
which means lower cost. SREs and developers can change service software to improve its efficiency,
and here is a major area where configurations can provide big gains,
as we are going to see in a moment. So what is the role of
system configurations, and why are they so important for SREs?
System configurations are key for SREs as they can really impact service
performance, efficiency and reliability. Let's consider a real-world example related to a Java service.
After a JVM reconfiguration on day three, the CPU usage of the application service dramatically changed:
a 75% reduction, while the system was still able to support the same traffic load.
So this configuration change translated into significant gains in service efficiency.
On the right, you have another example, where an optimal configuration significantly
improved service resiliency: while the baseline configuration crashed under load,
the best configuration supported the target load with much greater stability.
So tuning system configurations really matters. And this does not apply only to Java:
it also applies to all the infrastructure levels, from databases to containers,
middleware and application servers, to any other technology.
But how easy is it to do that for modern applications? Well, not so easy. Actually, the contrary:
the problem is that performance optimization is getting harder and harder these days.
There are three key factors that I would like to highlight here.
First of all, the explosion of configuration parameters: today, a JVM has more than 800 parameters,
a MySQL database more than 500, and on the cloud, AWS added more than 100 EC2 instance types just last year.
Second, configurations can have unpredictable effects: in this example, allocating less memory to a database,
a MongoDB database, made it run much faster. Third, release velocity is increasing;
this is well known for SREs, and it's worth mentioning that it also has negative impacts on performance tuning,
which is largely a manual task done by experts and, as such, takes a lot of time.
So system configurations can play a big role in service performance and reliability,
but optimizing them is a daunting task for SREs, given the increasing complexity of technology
and modern delivery processes. This is why we
think we need a new approach. At Akamas, we have
identified four key requirements that we believe are needed to
effectively solve this class of problems.
First, the new approach should allow full-stack optimization,
meaning that it has to be able to optimize several different technologies at the same time.
For example, the optimization might focus on container parameters and
runtime settings like the JVM options at the same time.
Second, the new approach should allow a smart exploration of the
space of potential configurations. If we consider all the possible values for all the possible parameters,
we really have billions of configurations, so a brute-force approach is
simply not feasible and would not work for this kind of problem.
To be effective, a solution needs to be identified in a reasonable time frame and at an acceptable cost.
Third, the new approach should allow us to align with each specific need,
priority and use case of interest by defining custom optimization goals.
For example, in some cases SREs may want to reduce the service latency and provide the best possible performance
at peak loads; in other cases, reducing cost will be the key driver for the optimization,
without violating service level objectives. And last but not least,
the new approach should be fully automated, to reduce toil as much as possible
and ensure a reliable process with the fastest
convergence to the optimal solution. So this figure shows the five phases of the automated optimization process.
The first step is to apply a configuration to the target service of the optimization;
this can be a new JVM option or new Kubernetes container resource settings, for example.
The second step is to apply a workload to the target system, so as to assess the impact of
the applied configuration, typically by integrating with load testing tools.
The third step is to collect key performance indicators related to the target system,
that is, information, data and metrics about the behavior of the system under the workload;
this step typically leverages existing monitoring tools. The fourth step is to score the
configuration against the specified goal by leveraging the collected KPIs;
this step is where the system performance metrics are fed back to the machine learning optimizer,
that is, the tested configuration is scored against the goal.
The last step is where machine learning kicks in, by taking this score as
input and producing as output the most promising configuration to be tested
in the next iteration of the same process. In a relatively short amount of time,
the machine learning algorithm learns the dependencies between the configuration and the system behavior,
thus identifying better and better configurations.
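To make this loop concrete, here is a minimal Python sketch of the five phases, assuming the apply, load-test, metrics-collection, scoring and suggestion steps are provided as functions; all names here are hypothetical placeholders for illustration, not an actual product API.

```python
# Minimal sketch of the five-phase optimization loop described above.
# The callables passed in are hypothetical placeholders for the real
# integrations (deployment, load testing, monitoring, scoring, ML optimizer).

def optimize(initial_config, apply_configuration, run_load_test,
             collect_kpis, score, suggest_next, iterations=50):
    config = initial_config
    best_config, best_score = None, float("-inf")
    history = []                        # (config, score) pairs the optimizer learns from

    for _ in range(iterations):
        apply_configuration(config)     # 1. apply JVM options / container resources
        run_load_test()                 # 2. drive a workload against the target system
        kpis = collect_kpis()           # 3. gather metrics from the monitoring tools
        current = score(kpis)           # 4. score the configuration against the goal
        history.append((config, current))
        if current > best_score:
            best_config, best_score = config, current
        config = suggest_next(history)  # 5. ML proposes the next configuration to test

    return best_config, best_score
```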
An important aspect is: what is the role of the SRE in this new, AI-driven optimization process?
SREs have a key role in the new automated optimization process, as they define the optimization goals
and constraints and then let the machine-learning-based optimization
automatically identify the optimal configurations.
Let's see how this approach applies to a real-world example,
where we will automate the optimization of a service by tuning Kubernetes and the JVM options.
In our example, the target system is Google Online Boutique, a cloud-native
application running on Kubernetes made of ten microservices.
This application features a modern software stack, with services
written in Golang, Node.js, Java and Python, and it also includes a load generator based on Locust,
which generates realistic traffic to test the application. We will leverage this application
to illustrate two different but related use cases.
The first use case is related to tuning the Kubernetes CPU and memory limits that define how many resources
are assigned to a Kubernetes container, so as to ensure application performance,
cluster efficiency and stability. For SREs, the challenge is represented by the need to ensure
that the overall service will sustain the target load while also matching the defined SLOs
on response time and error rate. Service efficiency is also very important:
in this case we want to minimize the overall cost, so we don't want to assign more resources
than are actually needed. Also, as SREs, we want to keep
operational toil minimal and stay aligned with the delivery milestones.
So this is the overall architecture. The parameters that we are optimizing in this example
are the CPU and memory limits of each single microservice. These parameters
are applied via the Kubernetes APIs. In order to assess the performance of the overall application
with respect to the different sizes of the containers, we leverage the built-in Locust load injector.
We then leverage Prometheus and the Istio service mesh to gather, respectively,
pod resource consumption metrics and service-level metrics.
The whole optimization process is completely automated.
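As an illustration of how such a configuration could be applied, the sketch below patches the CPU and memory limits of one microservice with the Kubernetes Python client; the deployment and container names are assumptions based on Online Boutique's manifests, not details given in the talk.

```python
from kubernetes import client, config

def set_resource_limits(deployment, cpu, memory,
                        namespace="default", container="server"):
    """Patch the CPU/memory limits (and requests) of one microservice."""
    config.load_kube_config()          # use load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    patch = {"spec": {"template": {"spec": {"containers": [{
        "name": container,
        "resources": {"limits":   {"cpu": cpu, "memory": memory},
                      "requests": {"cpu": cpu, "memory": memory}}}]}}}}
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)

# Example: one of the sizings the optimizer might try for a single service
set_resource_limits("recommendationservice", cpu="300m", memory="512Mi")
```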
So what is the goal of this optimization? In our example,
we aim at increasing the efficiency of the Online Boutique application,
that is, to both increase the service throughput and decrease the cloud cost.
At the same time, we want our service to always meet its reliability targets,
which are expressed as latency and error rate SLOs.
Therefore, our optimization goal is to maximize the ratio between the service throughput
and the overall service cost. Service throughput is measured by Istio at the frontend layer,
where all user traffic is processed. The service cost is the cloud cost that we would pay
to run the application on the cloud, considering the CPU and memory resources allocated to each microservice.
In this example, we use the pricing of AWS Fargate, which is a serverless Kubernetes offering by AWS
that charges about $29 per month for each CPU requested and about $3 per month for each gigabyte of memory.
The machine learning algorithms that we have implemented at Akamas also allow setting constraints
on which configurations are acceptable. In this case, we state that a configuration
should have a 90th-percentile latency of at most 500 milliseconds and an error rate lower than 2%.
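Putting the goal and the constraints together, the scoring of a tested configuration could look roughly like the sketch below; the Fargate prices are the rounded figures quoted in the talk, while the function and metric names are illustrative assumptions.

```python
# Illustrative scoring: cost efficiency (throughput per dollar) under SLO constraints.
CPU_PRICE_MONTH = 29.0   # ~$29 per vCPU per month (AWS Fargate, as quoted in the talk)
MEM_PRICE_MONTH = 3.0    # ~$3 per GB of memory per month

def monthly_cost(limits):
    """limits: list of (cpu_cores, memory_gb) tuples, one per microservice."""
    return sum(cpu * CPU_PRICE_MONTH + mem * MEM_PRICE_MONTH for cpu, mem in limits)

def cost_efficiency_score(throughput_tps, p90_latency_ms, error_rate, limits):
    # Constraints: 90th-percentile latency <= 500 ms and error rate < 2%
    if p90_latency_ms > 500 or error_rate >= 0.02:
        return float("-inf")                         # reject SLO-violating configurations
    return throughput_tps / monthly_cost(limits)     # transactions/s per dollar/month

# e.g. 30 tps at 420 ms p90 and 0.5% errors, ten services sized 0.2 vCPU / 0.25 GB each
print(cost_efficiency_score(30.0, 420.0, 0.005, [(0.2, 0.25)] * 10))
```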
That is it. At this point, the machine-learning-based optimization can run automatically
while we are relaxing or doing some other important stuff. Let's see what results we got.
In this chart, each dot represents the cost efficiency of the application,
which is the service throughput divided by the cloud cost, as a result of the different configurations
of microservice CPU and memory limits chosen by the machine learning optimizer.
The result of the machine-learning-based optimization is quite astonishing:
a configuration improving the cost efficiency by 77% was automatically identified in about 24 hours.
The baseline configuration, which corresponds to the initial sizing of the microservices,
has a cost efficiency of 0.29 transactions per second per dollar per month;
the best configuration machine learning found by tuning the microservice resources
achieved 0.52 transactions per second per dollar per month.
This chart also helps in understanding how our machine-learning-based optimization works:
by learning from each tested configuration, it quickly converges toward the optimal configurations.
At this point, I guess
you may want to know what this best configuration looks like.
Let's inspect the best configuration by focusing on the ten most important parameters.
It's interesting to notice how our machine-learning-based optimization reduced the CPU limits
for many microservices; this is clearly a winning move from a cost perspective.
However, it also automatically learned that two particular microservices,
the product catalog and the recommendation service, were under-provisioned,
and thus it increased both their assigned CPU and memory.
All these changes at the microservice level were critical to achieving the optimization goal
that we set on the overall, higher-level service, which is to maximize the throughput and lower the cost
while still matching the SLOs. Let's now see how the overall service performance changes
when the best configuration is applied. The chart on the left shows how the baseline and
best configurations compare in terms of service throughput. Besides being much more cost efficient,
the best configuration also improved throughput by 19%.
On the right, we are comparing the service 90th-percentile response time:
the best configuration cut the latency peaks by 60% and made the service latency
much more stable. Let's now consider another use
case for the same target system. Here the SRE challenge is how to tune the JVM options
of a critical microservice so as to ensure the service can support the higher target load that is expected,
while still matching the defined SLOs. The container size is not changed here,
as we don't want to increase cost. Again, we want to keep operational toil to a minimum
and be quick, to stay aligned with the product launch milestones. As mentioned,
the target architecture is the same as in the previous picture;
here we are emphasizing the specific JVM options, about 32, that are taken into account,
and the specific microservice, the ad service, that is being targeted.
Let's see how the optimization goal and constraints are defined in this case.
The goal for this use case is to increase the number of successful transactions processed
by the ad microservice. The SLO for this service is on the average response time,
which should be kept no higher than 100 milliseconds.
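In the same spirit as the previous use case, the goal and constraint of this second study might be captured along these lines; this is a hypothetical, illustrative structure, not the actual study definition used in the talk.

```python
# Hypothetical, illustrative description of the second optimization study.
ad_service_study = {
    "target": "adservice",                 # the Java microservice being tuned
    "parameters": "jvm_options",           # ~32 JVM options explored by the optimizer
    "goal": "maximize successful transactions per second",
    "constraints": {"avg_response_time_ms": {"max": 100}},   # SLO: average RT <= 100 ms
    "fixed": {"container_cpu_limit": "unchanged",            # container size is not touched,
              "container_memory_limit": "unchanged"},        # so cloud cost stays the same
}
```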
Let's see the results machine learning achieved for this optimization.
This chart shows the service throughput and response time under increasing levels of load during a load test.
First of all, consider the blue and green lines, representing the microservice's
throughput and response time for the baseline configuration.
As you can see, the baseline configuration reaches 74 transactions per second
before violating the 100-millisecond SLO on response time. Let's now look at the best configuration:
its throughput, which is the black line, reaches 95 transactions per second
before violating the response time SLO.
Therefore, the best configuration identified by the machine-learning-powered optimization
provides a 28% increase in transactions per second
while also meeting the defined service level objectives.
So what makes this configuration so good?
The maximum heap memory was increased, by 250% actually. It's also interesting to notice
that the garbage collector type was changed from G1 to Parallel in this case,
and machine learning also decreased the number of garbage collection threads,
down to three for the parallel garbage collection threads and down to one for the concurrent ones.
Machine learning also adjusted the heap regions and the object aging threshold to maximize performance.
As in the previous case, we also got eight JVM options automatically selected as the most impactful,
out of the dozens and dozens, actually hundreds, of potential JVM options to be considered.
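As a rough illustration, a configuration with those characteristics could be expressed as JVM flags like the ones below; the talk does not give the absolute values, so the numbers here are purely illustrative assumptions.

```python
# Purely illustrative JVM options reflecting the kind of changes described above;
# the concrete values found by the optimizer are not reported in the talk.
jvm_options = [
    "-Xmx1400m",                  # max heap raised substantially vs. the baseline
    "-XX:+UseParallelGC",         # garbage collector switched from G1 to Parallel
    "-XX:ParallelGCThreads=3",    # parallel GC threads reduced to three
    "-XX:ConcGCThreads=1",        # concurrent GC threads reduced to one
    "-XX:MaxTenuringThreshold=4", # object aging threshold adjusted
]
# e.g. injected into the ad service container via the JAVA_TOOL_OPTIONS env variable
print("JAVA_TOOL_OPTIONS=" + " ".join(jvm_options))
```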
All right, it is time to conclude with our takeaways. Our first takeaway is that tuning modern applications
is a complex problem that is hard to solve: any traditional tuning approach causes relevant toil
for SRE teams. A second takeaway is that with our new approach, based on fully automated,
machine-learning-based optimization, it becomes possible for SREs
to really ensure that applications have higher performance and reliability.
And third, this new approach also makes it possible to reduce the operational toil
and stay aligned with release milestones, a huge improvement for SRE teams.
Many thanks for your time. I hope you enjoyed the talk.
Please send me any comments and questions by leveraging the conference Discord channels or these contacts.