Abstract
            
SREs’ main goal is to achieve optimal application performance, efficiency and availability. A crucial role is played by configurations (e.g. JVM and DBMS settings, container CPU and memory, etc.): wrong settings can cause poor performance and incidents. But tuning configurations is a manual and lengthy task, as there are hundreds of settings across the stack, all interacting in counterintuitive ways.
In this talk, we present a new approach that leverages machine learning to find optimal configurations of the tech stack. The optimization process is automated and driven by performance goals and constraints that SREs can define (e.g. minimize resource footprint while matching latency and throughput SLOs).
We show examples of optimizing Kubernetes microservices for cost efficiency and latency by tuning container sizing and JVM options.
With the help of ML, SREs can achieve higher application performance, in days instead of months, and have a lot of fun in the process!
           
          
          
          
            
              Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hi,
            
            
            
and welcome to my talk. The topic of this session is a new approach SREs can use to automate performance tuning of system configurations, leveraging machine learning techniques. We will start by discussing the SRE challenges of ensuring application performance and reliability. We will then introduce a new approach based on machine learning techniques, and we will see how it can be used to automatically tune today's complex applications, focusing on optimizing a microservice application running on Kubernetes and the Java Virtual Machine. Finally, we will conclude by sharing some takeaways on how this approach benefits SRE teams. Before proceeding, allow me to introduce myself.
            
            
            
My name is Stefano Doni and my role is CTO at Akamas. Performance engineering has always been my topic of interest and passion. So let's start by discussing why system configurations are so important to the job and mission of SREs.

I would like to start by reviewing at a high level what the job of SREs really is, citing the seminal SRE book by Google. It's no surprise that any SRE deeply cares about service performance, efficiency and reliability. To understand what that means in practice, let's have a look at some of the core tenets.

Pursuing maximum change velocity without violating SLOs. This tenet is well known and refers to the goal for SREs to accelerate product innovation and releases while matching agreed service level objectives, for example ensuring request latency stays below a target value.
            
            
            
Demand forecasting and capacity planning. SREs should have a clear understanding of the capacity of the service, in terms of the maximum load that can be supported. This is typically achieved via load testing, to determine the right resources to provision for the service, for example before a product launch.
            
            
            
Efficiency and performance. Efficiency means being able to achieve the target load with the required response time using the least amount of resources, which means lower cost. SREs and developers can change service software to improve its efficiency, and here is a major area where configurations can provide big gains, as we are going to see in a moment.

So what is the role of system configurations, and why are they so important for SREs? System configurations are key for SREs, as they can really impact service performance, efficiency and reliability. Let's consider a real-world example related to a Java service.
            
            
            
After a JVM reconfiguration on day three, the CPU usage of the application service dramatically changed: a 75% reduction, while the system was still able to support the same traffic load. So this configuration change translated into significant gains in service efficiency. On the right, you have another example where an optimal configuration significantly improved service resiliency: while the baseline configuration crashed under load, the best configuration supported the target load with much greater stability.

So tuning system configurations really matters. And this does not apply only to Java: it also applies to all the infrastructure levels, from databases to containers, middleware and application servers, to any other technology.
            
            
            
But how easy is it to do that for modern applications? Well, not so much. Actually, the contrary: the problem is that performance optimization is getting harder and harder these days. There are three key factors that I would like to highlight here.

First of all, the explosion of configuration parameters: today a JVM has more than 800 parameters, a MySQL database more than 500, and on the cloud, AWS added more than 100 EC2 instance types just last year.

Second, configurations can have unpredictable effects. In this example, allocating less memory to a database, a MongoDB database, made it run much faster.

Third, release velocity is increasing. This is well known to SREs, but it's worth mentioning that it also has negative impacts on performance tuning, which is largely a manual task done by experts, and as such takes a lot of time.
            
            
            
So system configurations can play a big role in service performance and reliability, but optimizing them is a daunting task for SREs, given the increasing complexity of technology and modern delivery processes. This is why we think we need a new approach. At Akamas, we have identified four key requirements that we believe are needed to effectively solve this class of problems.
            
            
            
First, the new approach should allow a full-stack optimization, meaning that it has to be able to optimize several different technologies at the same time. For example, the optimization might focus on container parameters and runtime settings, like the JVM options, at the same time.

Second, the new approach should allow a smart exploration of the space of potential configurations. If we consider all the possible values for all the possible parameters, we really have billions of configurations, so a brute-force approach is simply not feasible and would not work for this kind of problem.
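To get a feel for the scale, a back-of-the-envelope calculation already lands in the billions. The per-layer counts below are purely illustrative, not taken from any real system:

```python
# Illustrative arithmetic: a handful of tunable layers, each with a modest
# number of candidate combinations, multiply into billions of configurations.
candidate_values = {
    "jvm_options": 10 ** 4,      # e.g. 4 heap/GC flags with ~10 values each
    "container_cpu": 20,         # CPU limit steps
    "container_memory": 20,      # memory limit steps
    "db_settings": 10 ** 3,      # e.g. 3 DBMS knobs with ~10 values each
}

total = 1
for n in candidate_values.values():
    total *= n

print(f"{total:,} possible configurations")
```

Even this toy space has four billion combinations, which is why exhaustive testing is off the table.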
            
            
            
To be effective, a solution needs to be identified in a reasonable time frame and at an acceptable cost.

Third, the new approach should allow us to align to each specific need, priority and use case of interest by defining custom optimization goals. For example, in some cases SREs may want to reduce the service latency and provide the best possible performance at peak loads; in other cases, reducing costs will be the key driver for the optimization, without violating service level objectives.

And last but not least, the new approach should be fully automated, to reduce toil as much as possible and ensure a reliable process with the fastest convergence to the optimal solution.

So this figure
            
            
            
shows the five phases of the automated optimization process. The first step is to apply a configuration to the target service of the optimization; this can be a new JVM option or new Kubernetes container resource settings, for example. The second step is to apply a workload to the target system so as to assess the impact of the applied configuration, typically by integrating with load-testing tools. The third step is to collect key performance indicators for the target system, that is, information, data and metrics about the behavior of the system under the workload; this step typically leverages existing monitoring tools. The fourth step is to score the configuration against the specified goal by leveraging the collected KPIs; this is where the system performance metrics are fed back to the machine learning optimizer, that is, where the tested configuration is scored against the goal. The last step is where machine learning kicks in, taking this score as input and producing as output the most promising configuration to be tested in the next iteration of the same process. In a relatively short amount of time, the machine learning algorithm learns the dependencies between the configuration and the system behavior, thus identifying better and better configurations.

An important aspect is the role of the SRE in this new AI-driven optimization process. SREs have a key role in the new automated optimization process, as they define the optimization goals and constraints and then let the machine learning based optimization automatically identify the optimal configurations.
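The five phases above can be sketched as a loop. Everything in this sketch is hypothetical: the parameter space, the stubbed load test, and the random search standing in for a real machine learning optimizer (which would learn from the history instead of sampling blindly):

```python
import random

# Hypothetical search space: two parameters of a Kubernetes container.
# Names and ranges are illustrative only.
SPACE = {
    "cpu_millicores": [250, 500, 1000, 2000],
    "memory_mb": [256, 512, 1024, 2048],
}

def apply_configuration(config):
    """Step 1: apply the configuration to the target service (stubbed)."""
    return config

def run_load_test(config):
    """Steps 2-3: drive a workload and collect KPIs. Stubbed with a toy model:
    throughput saturates with CPU, latency grows when memory is scarce."""
    throughput = min(100, config["cpu_millicores"] / 10)
    p90_latency_ms = 50 + 100_000 / config["memory_mb"]
    return {"throughput": throughput, "p90_latency_ms": p90_latency_ms}

def score(kpis):
    """Step 4: score the configuration against the goal, subject to an SLO."""
    if kpis["p90_latency_ms"] > 500:   # constraint: p90 latency SLO
        return float("-inf")           # reject SLO-violating configurations
    return kpis["throughput"]

def suggest_next(history):
    """Step 5: propose the next configuration. A real optimizer would model
    the score surface (e.g. Bayesian optimization); random sampling keeps
    the sketch self-contained."""
    return {name: random.choice(values) for name, values in SPACE.items()}

def optimize(iterations=20, seed=42):
    random.seed(seed)
    history, best = [], None
    for _ in range(iterations):
        config = apply_configuration(suggest_next(history))
        kpis = run_load_test(config)
        s = score(kpis)
        history.append((config, s))
        if best is None or s > best[1]:
            best = (config, s)
    return best

best_config, best_score = optimize()
```

Swapping `suggest_next` for a model that exploits `history` is exactly where the machine learning described in the talk would plug in.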
            
            
            
Let's see how this approach applies to a real-world example, where we will automate the optimization of a service by tuning Kubernetes and the JVM options. In our example, the target system is Google Online Boutique, a cloud-native application running on Kubernetes made of ten microservices. This application features a modern software stack, with services written in Go, Node.js, Java and Python, and it also includes a load generator based on Locust, which generates realistic traffic to test the application. We will leverage this application to illustrate two different but related use cases.
            
            
            
The first use case is related to tuning the Kubernetes CPU and memory limits that define how many resources are assigned to a Kubernetes container, so as to ensure application performance, cluster efficiency and stability. For SREs, the challenge is the need to ensure that the overall service will sustain the target load while also matching the defined SLOs on response time and error rate. Service efficiency is also very important: in this case we want to minimize the overall cost, so we don't want to assign more resources than are actually needed. Also, as SREs, we want to keep operational toil minimal and stay aligned to the delivery milestones.
            
            
            
So this is the overall architecture. The parameters that we are optimizing in this example are the CPU and memory limits of each single microservice. These parameters are applied via the Kubernetes APIs. In order to assess the performance of the overall application with respect to the different sizes of the containers, we leverage the built-in Locust load injector. We then leverage Prometheus and the Istio service mesh, respectively, to gather pod resource consumption metrics and service-level metrics. The whole optimization process is completely automated.
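For instance, setting a container's CPU and memory limits through the Kubernetes API comes down to sending a patch like the one built below. This is a generic sketch (the container name and values are made up), not how Akamas itself talks to the cluster:

```python
import json

def resource_limits_patch(container_name, cpu_millicores, memory_mb):
    """Build a strategic-merge patch body that sets the CPU and memory
    requests/limits of one container in a Deployment. The optimizer would
    supply new values for cpu_millicores and memory_mb at each iteration."""
    quantity = {
        "cpu": f"{cpu_millicores}m",    # Kubernetes CPU quantity, millicores
        "memory": f"{memory_mb}Mi",     # Kubernetes memory quantity, mebibytes
    }
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": container_name,
                        "resources": {"requests": quantity, "limits": quantity},
                    }]
                }
            }
        }
    }

# The body could then be applied with, e.g.:
#   kubectl patch deployment recommendationservice --patch-file patch.json
patch = resource_limits_patch("server", cpu_millicores=500, memory_mb=512)
print(json.dumps(patch, indent=2))
```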
            
            
            
So what is the goal of this optimization? In our example, we aim at increasing the efficiency of the Online Boutique application, that is, to both increase the service throughput and decrease the cloud cost. At the same time, we want our service to always meet its reliability targets, which are expressed as latency and error rate SLOs. Therefore, our optimization goal is to maximize the ratio between the service throughput and the overall service cost. Service throughput is measured by Istio at the frontend layer, where all user traffic is processed. The service cost is the cloud cost that we would pay to run the application on the cloud, considering the CPU and memory resources allocated to each microservice. In this example, we use the pricing of AWS Fargate, a serverless Kubernetes offering by AWS, which charges about $29 per month for each CPU requested and about $3 per month for each gigabyte of memory. The machine learning algorithm that we have implemented at Akamas also allows setting constraints on which configurations are acceptable: in this case, we state that a configuration should have a 90th percentile latency of at most 500 milliseconds and an error rate lower than 2%.
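Putting the goal and constraints together, the score of a configuration could be computed along these lines. The pricing constants follow the approximate Fargate figures mentioned above; the container sizes in the example are invented:

```python
CPU_DOLLARS_PER_MONTH = 29.0   # approximate Fargate price per vCPU per month
GB_DOLLARS_PER_MONTH = 3.0     # approximate Fargate price per GB of memory

def monthly_cost(containers):
    """Cloud cost of the whole application: sum over each container's
    requested CPU (in vCPUs) and memory (in GB)."""
    return sum(
        CPU_DOLLARS_PER_MONTH * c["cpu"] + GB_DOLLARS_PER_MONTH * c["memory_gb"]
        for c in containers
    )

def cost_efficiency(throughput_tps, containers, p90_latency_ms, error_rate):
    """Optimization goal: throughput per dollar per month, subject to the
    latency and error-rate SLO constraints from the talk."""
    if p90_latency_ms > 500 or error_rate > 0.02:
        return None  # configuration rejected: SLO constraint violated
    return throughput_tps / monthly_cost(containers)

# Toy example: ten microservices, each sized at 0.5 vCPU and 0.5 GB.
containers = [{"cpu": 0.5, "memory_gb": 0.5} for _ in range(10)]
eff = cost_efficiency(30.0, containers, p90_latency_ms=420, error_rate=0.01)
```

The optimizer's job is then simply to find the container sizes that maximize this number without ever returning `None`.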
            
            
            
That is it. At this point, the machine learning based optimization can run automatically while we are relaxing or doing some other important stuff. Let's see what results we got. In this chart, each dot represents the cost efficiency of the application, which is the service throughput divided by the cloud cost, as a result of the different configurations of microservice CPU and memory limits chosen by the machine learning optimizer. The result of the machine learning based optimization is quite astonishing: a configuration improving the cost efficiency by 77% was automatically identified in about 24 hours. The baseline configuration, which corresponds to the initial sizing of the microservices, has a cost efficiency of 0.29 transactions per second per dollar per month; the best configuration machine learning found by tuning microservice resources achieved 0.52 transactions per second per dollar per month. This chart also helps understand how our machine learning based optimization works, by learning from each tested configuration to quickly converge toward the optimal configurations.

At this point, I guess
            
            
            
you may want to know what this best configuration looks like. Let's inspect the best configuration by focusing on the ten most important parameters. It's interesting to notice how our machine learning based optimization reduced the CPU limits of many microservices, which is clearly a winning move from a cost perspective. However, it also automatically learned that two particular microservices, the product catalog and the recommendation service, were under-provisioned, and thus increased both their assigned CPU and memory. All these changes at the microservice level were critical to achieve the optimization goal that we set on the overall, higher-level service: to maximize the throughput and lower the cost while still matching the SLOs.

Let's now
            
            
            
see how the overall service performance changes when the best configuration is applied. The chart on the left shows how the baseline and best configurations compare in terms of service throughput. Besides being much more cost efficient, the best configuration also improved throughput by 19%. On the right, we are comparing the service 90th percentile response time: the best configuration cut the latency peaks by 60% and made the service latency much more stable.

Let's now consider another use
            
            
            
case on the same target system. Here the SRE challenge is how to tune the JVM options of a critical microservice, so as to ensure the service can support the higher target load that is expected while still matching the defined SLOs. The container size is not changed here, as we don't want to increase cost. Again, we want to keep operational toil to a minimum and be quick, to stay aligned to product launch milestones. As mentioned, the target architecture is the same as in the previous picture; here we are emphasizing the specific JVM options, about 32, that are taken into account, and the specific microservice, the ad service, that is being targeted.
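As an illustration, a JVM tuning search space of this kind might be declared and rendered into command-line flags as below. The flag names are real HotSpot options, but the value ranges and the sample choice are hypothetical, not the 32 options Akamas actually tuned:

```python
# Hypothetical search space over a handful of HotSpot options.
jvm_search_space = {
    "heap_mb": [256, 512, 1024, 2048],             # -Xmx, max heap size
    "collector": ["UseG1GC", "UseParallelGC"],     # garbage collector type
    "parallel_gc_threads": [1, 2, 3, 4, 8],        # -XX:ParallelGCThreads
    "conc_gc_threads": [1, 2, 4],                  # -XX:ConcGCThreads
    "max_tenuring_threshold": list(range(1, 16)),  # object aging threshold
}

def render_java_opts(choice):
    """Turn one concrete choice into a java command-line flag string."""
    return " ".join([
        f"-Xmx{choice['heap_mb']}m",
        f"-XX:+{choice['collector']}",
        f"-XX:ParallelGCThreads={choice['parallel_gc_threads']}",
        f"-XX:ConcGCThreads={choice['conc_gc_threads']}",
        f"-XX:MaxTenuringThreshold={choice['max_tenuring_threshold']}",
    ])

opts = render_java_opts({
    "heap_mb": 1024, "collector": "UseParallelGC",
    "parallel_gc_threads": 3, "conc_gc_threads": 1,
    "max_tenuring_threshold": 7,
})
```

Each candidate string would be injected into the ad service's startup options before the corresponding load test.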
            
            
            
Let's see how the optimization goal and constraints are defined in this case. The goal for this use case is to increase the number of successful transactions processed by the ad microservice. The SLO for this service is the average response time, which should be kept no higher than 100 milliseconds.

Let's see the results machine learning achieved for this optimization. This chart shows the service throughput and response time under increasing levels of load during a load test. First of all, consider the blue and green lines, representing the microservice's throughput and response time for the baseline configuration. As you can see, the baseline configuration reaches 74 transactions per second before violating the 100-millisecond SLO on response time. Let's now look at the best configuration: the throughput, which is the black line, reaches 95 transactions per second before violating the response time SLO. Therefore, the best configuration identified by the machine learning powered optimization provides a 28% increase in transactions per second while also meeting the defined service level objectives.
            
            
            
So what makes this configuration so good? The maximum heap memory was increased, by 250% actually. It's also interesting to notice that the garbage collector type was changed from G1 to Parallel in this case, and machine learning also decreased the number of parallel garbage collection threads to three, and the concurrent ones to one. Machine learning also adjusted the heap regions and the object aging threshold to maximize performance. As in the previous case, we also got eight JVM options automatically selected as the most impactful, out of the dozens and dozens, hundreds actually, of potential JVM options to be considered.

All right,
            
            
            
it is time to conclude with our takeaways. Our first takeaway is that tuning modern applications is a complex problem that is hard to solve: any traditional tuning approach causes relevant toil for SRE teams. A second takeaway is that with our new approach, based on fully automated machine learning based optimization, it becomes possible for SREs to really ensure that applications have higher performance and reliability. And third, this new approach also makes it possible to reduce operational toil and stay aligned to release milestones, a huge improvement for SREs.

Many thanks for your time. I hope you enjoyed the talk. Please send me any comments and questions by leveraging the conference Discord channels or these contacts.