Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to my presentation. Today I'm going to be talking
about how we can use data science and machine learning
to work with AWS DevOps.
Right? If you manage a service deployed
to AWS, I think this presentation is for you.
So let me start by introducing myself. My name is Gustavo Beveramigo. I'm a senior software
engineer at Amazon Prime Video.
Before Amazon, I was working with startups, and I've
been doing this work for quite
a long time. And on a daily basis I've been using
Python, machine learning, and all
of its tooling to manage those infrastructures.
So let's start here.
Okay, the agenda for today: we're going to be talking about how
we can use Python and its tooling to fine-tune your
service to perfection. We're going to be using pandas for data manipulation,
Jupyter for notebooks, Locust for load
testing (it's a Python tool for load testing), and boto3
to access our AWS account.
Right. The target audience is software engineers, site reliability engineers,
and DevOps. And I'm assuming that you are
a little bit familiar with pandas for data manipulation and
also with Jupyter. But don't worry if you don't know them
that well. If you know Python and AWS, I think that should be
enough for you to understand this presentation.
All right, first of all, the first thing that
I want to talk about is scaling.
Scaling a service is really hard. I've been doing this for years and
it's always hard. It doesn't matter if it's at a small scale
or at a big scale, like you have at Amazon or
at the other big techs, or if you're in a startup: it's always hard.
And the reason it's hard is because testing
scalability is really hard.
It's a lot of work to understand whether
your service can scale, and when you have scaling issues,
it's a lot of work to fix them. The other thing
that makes autoscaling hard is that it's really hard to reason
about scaling: understanding what the bottlenecks are and
how you can make your service scale is
really not easy. And the other thing
is that infrastructure is expensive. If
infrastructure were cheap, you could just scale up
the number of resources you're using and that would be fine. But
infrastructure is really expensive, so you need to be really careful with
the resources that you use so it doesn't
get more expensive than it needs to be. So I'm going to
be talking a little bit about this. In this talk,
first, I'm going to present you a
problem, a problem that I've seen multiple times
in my career, which is: you
have a service that scales dynamically.
What I mean by that is that you have multiple hosts, right?
You have one host here, another host here, another host
here, and depending on the traffic that you're receiving, you want
to add more hosts, right?
We call that dynamic scaling, or sometimes
autoscaling. The problem here
is: how can we do this autoscaling, how can we set up
our infrastructure in a way that it
will scale at the right velocity, using
the right resources at the right time, without
having problems? What I mean by problems is
your service being unavailable. And how can we do that
efficiently? If you're too conservative,
you're going to be adding more hosts than necessary. If you're too aggressive,
you won't add hosts at the right time and at the right scale.
So the way that I set up this problem here,
I set up an infrastructure just for this talk
where I will do a scaling experiment,
right? And we're going to be using python
to analyze this problem and try to come up with a
solution. So that's my approach to this. So let's
say we have this infrastructure here which scales
dynamically. And let's say that we have traffic that
increases in this shape here, right?
It starts off: after ten minutes,
it scales to 30,000 requests per minute.
And then you have a surge here where, in eight minutes,
it scales from 30,000 rpm
to close to 150,000
requests per minute. This is a very aggressive
ramp-up, and that does happen in real life. All right,
so that's not something that we haven't seen;
it does happen. If you work with
retail, for example, e-commerce, we have that
on Black Fridays, or when you have promotions and
things like that, or a marketing campaign.
It's not uncommon to see this kind of shape. So the way
that we usually work, we prepare
our infrastructure and our service in order to handle
that kind of surge.
Sometimes this surge happens unexpectedly.
All of a sudden there's a promotion that was
sent out by the marketing team, and your
team was not prepared for that, but your
service needs to scale to that amount of
traffic in a way that
you won't have any problem, so you won't have unavailability
issues. So that's the problem that I'm presenting to you guys.
And that's the problem that I'm going to try to solve here
using Python machine learning and data science tools,
the tools that we usually use for
other types of problems; I'm going to be using them for this kind of
problem. A little bit of context:
I deployed an application to AWS Elastic Beanstalk.
Basically, under the hood it uses an Application Load Balancer to
route traffic to EC2 hosts. The application is
deployed on EC2 hosts, EC2 boxes.
The instance type is a t3.micro,
which is a very small instance. I created that just for
this experiment that we are running here.
We also use an Auto Scaling group to scale
that cluster, and these are the scaling parameters that
I have set up. This is quite standard: the scale-up
and scale-down increments are one and minus one. That means that every time
a scaling event is triggered, I'm going to
change the fleet by one host or by minus one host.
The upper threshold is going to be 25%.
It's 25% of CPU utilization,
right? So when the average CPU utilization across all of your hosts
is above 25%, it's going
to add one host. And then the lower threshold,
it's written down here, the lower
threshold is 15%.
This doesn't really matter for our experiment here, because we are
testing the ramp-up time, not how
it scales down your fleet.
And the metric that we are using is CPU utilization.
The other parameters here, like the time that it takes
to do this scaling: it waits for five minutes.
The CPU utilization needs to breach that limit for five minutes.
If that happens, it will scale the service. It takes like three to four
minutes to spin up a new host.
So these are our scaling parameters, and this is a quite
standard setup. I've seen that multiple times. Usually people use 25%,
30%. That's quite standard, quite
common. Sometimes we change
the scale-up and scale-down increments to try
to improve things. That's how it usually works.
The approach that I'm going to be using: I'm going to load test
one host, because I need an exact profile
of how one host behaves when it receives
traffic. I'm going to retrieve that data from AWS
CloudWatch into pandas (this is the second line
here) so I can work with that data.
Based on that, I can determine what the capacity for one host would be.
And based on that, I can try to find out what would
happen when this
cluster receives more traffic, how it will scale out.
Scaling out is when you are adding more hosts to handle
the additional traffic that you're receiving. And finally,
I will try to test the parameters that we have figured
out during our experimentation. I will
try the setup that I've come up with in production,
and we're going to analyze the results of that.
All right, first of all, I load test one
host. So I just use this product here.
It's called Locust; I think
that's how it's pronounced. It's an open source tool
for Python. It's really easy to use,
and I use it just to send traffic to one host
and check how much traffic it can handle.
After that, what I did: I
retrieved that data from AWS into
pandas. This is how you can do it: you can use boto3
to get the metrics.
This is quite standard AWS: this part here is a
query you create against CloudWatch,
where you're getting the CPU utilization.
You point it at a specific Auto Scaling group and
you set a period of time.
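As a rough sketch, that query can look something like this with boto3. The namespace and metric name are real CloudWatch identifiers; the helper names and the Auto Scaling group name are placeholders of mine:

```python
import datetime

import pandas as pd


def fetch_cpu_utilization(asg_name, start, end, period=60):
    """Query CloudWatch for the average CPU utilization of one ASG."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[{
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "AutoScalingGroupName",
                                    "Value": asg_name}],
                },
                "Period": period,   # seconds per data point
                "Stat": "Average",
            },
        }],
        StartTime=start,
        EndTime=end,
    )
    return response["MetricDataResults"][0]


def to_dataframe(result):
    """Turn one CloudWatch MetricDataResult into a pandas DataFrame."""
    df = pd.DataFrame({"timestamp": result["Timestamps"],
                       "cpu": result["Values"]})
    return df.sort_values("timestamp").reset_index(drop=True)
```

From there, something like `to_dataframe(...).to_csv("cpu.csv", index=False)` keeps a copy around after CloudWatch expires the data.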
You get that, and you transform it into
a DataFrame, which is the data type
that we're working with here. And one tip that
I give people: once you do that, save it
to a CSV file, because data
on CloudWatch will be erased after a period
of time, depending on the period that you're fetching here.
If it's 10 seconds, I think it's available for only 24
hours. I think for 1 minute, it's available
for a couple of days, and so on. So it's not going to be there
forever. So if you need that data for future work,
do save it to a CSV so you don't lose
that data. All right, moving forward here,
after I do this, I can just use pandas to actually
plot graphs
and start analyzing the data
that we have. This is pretty similar to what observability tools
would give you; it's not that different. But once you're using
pandas, you get access to all the tools
that are available in pandas, which are really sophisticated.
So here we can see the CPU utilization, and
then we see how the requests per minute
increase along with the CPU utilization.
That's quite interesting, but I think we can do a little better than
that. So what I did here:
I compared the CPU utilization against the
RPM, and I ran a linear regression to
find, based on the requests per minute that I'm
receiving, what the CPU utilization is going to be. And also,
based on that, I can create a predictor: I
can predict, based on the requests per minute, how much CPU
it's going to be utilizing, right?
And the other thing that I can check here is: is this relation linear?
Because if the relation between the traffic you're receiving
and the CPU utilization is not linear, you might have a problem,
right? It should be linear, but it's
not always linear. So we are validating a hypothesis
here, an assumption, that it's linear. And now we know the
coefficient for this, which means that
for 40k requests per minute, you're going to be using
about 30% of your CPU. This is
for one host, right? That's quite interesting.
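A minimal sketch of that regression with NumPy; the RPM and CPU numbers below are invented, standing in for the measurements from the single-host load test:

```python
import numpy as np

# Hypothetical single-host observations: request rate vs. CPU utilization.
rpm = np.array([5_000, 10_000, 20_000, 30_000, 40_000], dtype=float)
cpu = np.array([4.0, 7.8, 15.1, 22.6, 30.2])   # percent (invented numbers)

# Fit cpu = slope * rpm + intercept.
slope, intercept = np.polyfit(rpm, cpu, 1)


def predict_cpu(requests_per_minute):
    """Predict one host's CPU utilization at a given request rate."""
    return slope * requests_per_minute + intercept


# Check the linearity assumption: an R^2 close to 1 supports it.
residuals = cpu - predict_cpu(rpm)
r_squared = 1 - residuals.var() / cpu.var()
```

With numbers like these the slope comes out around 0.00075% CPU per rpm, which is what "40k rpm costs about 30% CPU" means.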
That's better than what we had. Now we have an idea of how many
hosts we need in order to handle
our traffic. There's another thing as well.
I noticed the response time, the latency here,
which is the time between when you send the request and when
you get a response. And we see
here that while the CPU utilization is below
50, in this period, it's pretty normal.
It does increase a little bit here, but it's not something
that becomes
a worry. But after 50, we notice that things get
out of hand. We see all these points here.
So my reasoning here is
that above 50% CPU utilization,
my application doesn't work. So I need to keep my application
below 50%, so it's always available.
If it goes above 50%, things will not work
correctly. So that's a really nice finding.
Right? And I'm going to be using that as
a parameter to scale my service. Sometimes when
you do that, for some applications it's 60%, for some of them
it's like 80%, but there's a cutoff percentage
that you need to find out.
For this application, it's 50%. It might be related to the
instance type, which is very small: it's
one of those t3 instances. If you're
using a c instance, which is meant for CPU-intensive
applications, it might behave differently. It might be related
to the application that I've deployed here, which is the sample application
from AWS. It's nothing sophisticated.
So this is how this application works,
and I'm going to be using that.
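One way to locate that cutoff in pandas, sketched here on synthetic data (the 50% knee and the latency numbers are invented to mimic the pattern in the scatter plot):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic per-minute samples: flat ~40 ms latency below 50% CPU,
# degrading quickly above it (the pattern described in the talk).
cpu = rng.uniform(5, 80, 600)
latency_ms = rng.normal(40, 5, 600) + np.maximum(0, cpu - 50) * 15

df = pd.DataFrame({"cpu": cpu, "latency_ms": latency_ms})

# p95 latency per 10%-wide CPU bucket.
df["bucket"] = (df["cpu"] // 10) * 10
p95 = df.groupby("bucket")["latency_ms"].quantile(0.95)

# Call the cutoff the first bucket whose p95 is over twice the lowest bucket's.
baseline = p95.iloc[0]
cutoff = p95[p95 > 2 * baseline].index.min()
```

The doubling rule is my own arbitrary choice; in practice you'd eyeball the latency-vs-CPU plot and pick the knee.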
We haven't actually answered the question of how we can scale,
but we have some clues here. The
CPU utilization should be kept below 50%,
but we don't know how to do that efficiently. We don't know how
to keep the CPU utilization below 50%
while still using most of the CPU at
all times. We don't want a fleet that is underutilized,
where the average CPU utilization is very low. We want
to be using our resources
efficiently. So what can we do? I think
that's the question that we've been asking here. We have
two options. We can test multiple parameters in
production. And the problem with that is
that, first of all, changing those parameters in production
is risky. It's doable, but it's risky.
Usually you have to do it at a very low
traffic time, overnight, things like that. It's high
effort, because you need
to make the change and wait to see whether there
were any problems. It's usually not easy
to do that kind of test in production. And also,
if you do that test in production, you want to run a
load test to check whether the parameters that you have
applied actually work.
I didn't talk about this, but usually with load testing you
want to run against the production environment,
but at the same time, you don't want that load test to cause
any outages, any issues. So it's usually a trade-off:
how can you do that? So,
we could do that in production, but we know it's very risky
and it's very high effort. So what I
decided to do for this presentation is
some local experimentation using Python.
Python is our answer, so we don't have to keep trying
out parameters in production. And if I can do
this in Python, it should be
a lot easier. So how do I solve this problem?
First of all, I create a scaling simulator.
The link
is going to be on my GitHub; the link is at the end.
I create an autoscaling simulator, modeling how scaling
works in AWS, one of the policies that we have there.
It's just simple code where you simulate
how it's going to work: wait for
the threshold
to be breached for five minutes; if that happens,
trigger a scale-up event where you're going to spin up
a machine; wait for three or four minutes for that machine to spin
up; after that, you can consider that machine to be
healthy and able to receive traffic;
and model how that balances out the CPU utilization
across all of your hosts.
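A compact sketch of such a simulator, which is my own simplification rather than the AWS implementation: CPU per host is assumed to scale linearly with rpm (the 0.00075 default matches the "40k rpm is about 30% CPU" slope), a scale-up fires after five consecutive breached minutes, and new hosts take four minutes to become healthy:

```python
def simulate(rpm_per_minute, upper_threshold, scale_up_increment,
             cpu_per_rpm=0.00075, breach_minutes=5, boot_minutes=4,
             initial_hosts=1):
    """Minute-by-minute sketch of a threshold-based scaling policy.

    Returns (cpu_history, host_history), one value per simulated minute.
    """
    hosts = initial_hosts
    pending = []            # countdowns until in-flight hosts become healthy
    breach_streak = 0
    cpu_hist, host_hist = [], []

    for rpm in rpm_per_minute:
        # Hosts that finished booting join the fleet.
        pending = [m - 1 for m in pending]
        hosts += sum(1 for m in pending if m == 0)
        pending = [m for m in pending if m > 0]

        # The load balancer spreads traffic evenly across healthy hosts.
        cpu = min(100.0, cpu_per_rpm * rpm / hosts)

        # The threshold must be breached for breach_minutes in a row;
        # as a simplification, no new event fires while hosts are booting.
        breach_streak = breach_streak + 1 if cpu > upper_threshold else 0
        if breach_streak >= breach_minutes and not pending:
            pending = [boot_minutes] * scale_up_increment
            breach_streak = 0

        cpu_hist.append(cpu)
        host_hist.append(hosts)
    return cpu_hist, host_hist
```

Feeding it a sustained high load shows the fleet growing one increment at a time while CPU stays capped at 100%.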
It's simulation code. It's really not hard to
do. And based on that, what I did:
these are the parameters that we have there,
25% for the upper threshold, 15%
for the lower threshold; it scales up by one increment and it
scales down by minus one increment.
And I run that and I can see
how my application would behave. The other thing I
have done: I created a shape generator,
right? Instead of doing a load test, it's like a simulation of
a load test. So I have the parameters here:
during ten minutes, ramp up to 30k
rpm, requests per minute;
during eight minutes, ramp up to 150k
rpm; and then keep that for another 30 minutes.
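Generating that shape as a pandas Series is only a few lines; `load_shape` is a helper name of mine, taking one (minutes, target_rpm) pair per ramp segment:

```python
import numpy as np
import pandas as pd


def load_shape(segments):
    """Build a per-minute RPM series from (duration_minutes, target_rpm) ramps.

    Each segment ramps linearly from the previous level to target_rpm.
    """
    rpm, level = [], 0.0
    for minutes, target in segments:
        ramp = np.linspace(level, target, minutes + 1)[1:]  # skip the start
        rpm.extend(ramp)
        level = target
    return pd.Series(rpm, name="rpm")


# The surge from the talk: 10 min to 30k rpm, 8 min to 150k, hold 30 min.
shape = load_shape([(10, 30_000), (8, 150_000), (30, 150_000)])
```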
And I create that and have it in a DataFrame.
And now I have here, like, this is the shape that
I'm going to run my simulation with. And what
happens is that when I run that,
when I feed my load shape into
my autoscaling simulator, what I find out is that if
I ramp up based on this graph here,
this is how my autoscaling would add hosts.
It's adding one host here, adding another host here,
adding another host here. It seems that it's adding
one every five minutes here.
What happens is that my CPU utilization
would be way above 50%, which is our threshold,
for quite some time.
Yeah, we can leave the exact number aside, but it seems that for about
15 minutes my application would probably be out of
service, maybe more. In real life,
it's probably more than that. So this configuration
that we have here is pretty bad.
Based on our simulation, it doesn't seem that it would
work. So we have problems.
Right. And let me talk a little bit more
about that. The reason that the
CPU utilization is so high is that here, at
this point, we are already receiving a
lot of traffic, but we don't have enough hosts
available. It's just one host, and then it's two.
That's not enough to handle all this traffic here.
When I have five hosts,
things start to get better. So during
this period here, I don't have enough hosts to handle
this traffic. That's why that configuration doesn't
work. So basically, our upper threshold
of 25% and incrementing
one host at a time doesn't work.
And as we said, the lower threshold and the scale-down
increment won't matter for the experimentation that we
are doing here. My first attempt: well,
let's try to scale up in two increments. Instead of
adding just one, it adds two at a time.
And what we see here is that the problem hasn't
gone away. We still have a problem;
it's not working. And the funny
part is that that's usually what teams do in production.
They see that problem, they see that it didn't work,
and what do they do? Their first insight
is that it's not scaling fast enough, so let's increase the scale-up increment
from one to two. But it didn't solve
the problem.
Sometimes they're clueless about what they should do, and I've seen
teams that just say, hey, just run ten hosts
at all times and problem solved, let's not use autoscaling anymore.
Some teams decide to abandon the
autoscaling, the dynamic scaling strategy.
But that's really not resourceful, because most of the time you are going
to have hosts provisioned that will not be used at all.
Continuing here: to solve this, we can try to find what
the best parameters would be. So I create two loops here.
One is going to test all increments
from one to ten, and then the upper threshold from
eleven to 35. I didn't try above that, but
I could. And then I run my autoscaling simulator
and I get the maximum CPU utilization that
it would be using, and I say,
hey, I just want the results that are below 50%. And
based on that, I will have all
the simulations that were successful here.
And I transform that into a DataFrame so I can use
it in pandas. What I get from that is a winner.
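Sketched end to end, again with my invented slope of 0.00075% CPU per rpm and a compact version of the simulator, the parameter sweep looks like this:

```python
import itertools

import pandas as pd


def run_sim(rpm_series, threshold, increment,
            cpu_per_rpm=0.00075, breach=5, boot=4):
    """Compact scaling simulation; returns (max_cpu, avg_cpu) over the run."""
    hosts, pending, streak, history = 1, [], 0, []
    for rpm in rpm_series:
        pending = [m - 1 for m in pending]
        hosts += sum(1 for m in pending if m == 0)
        pending = [m for m in pending if m > 0]
        cpu = min(100.0, cpu_per_rpm * rpm / hosts)
        streak = streak + 1 if cpu > threshold else 0
        if streak >= breach and not pending:
            pending, streak = [boot] * increment, 0
        history.append(cpu)
    return max(history), sum(history) / len(history)


# The same surge as before: 10 min to 30k rpm, 8 min to 150k, hold 30 min.
shape = ([30_000 * t / 10 for t in range(1, 11)]
         + [30_000 + 120_000 * t / 8 for t in range(1, 9)]
         + [150_000] * 30)

# Sweep increments 1..10 and upper thresholds 11..35, keep runs under 50%.
results = []
for inc, thr in itertools.product(range(1, 11), range(11, 36)):
    peak, avg = run_sim(shape, thr, inc)
    if peak < 50:
        results.append({"increment": inc, "threshold": thr,
                        "max_cpu": peak, "avg_cpu": avg})

# The "winner" is the surviving combination with the highest average CPU.
winners = pd.DataFrame(results).sort_values("avg_cpu", ascending=False)
```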
Like, I have winner parameters,
which is basically what my simulation is telling
me here: why don't you use two for the scale-up
increment and 15 for the
upper threshold, these two here.
The lower threshold doesn't matter a lot. And it
says that for the load
shape that I used here, the average CPU should be around
16% and the maximum CPU utilization should be 39%.
And I choose the best based on the average CPU utilization,
then upper threshold and scale increment. I'm
prioritizing a high average CPU utilization, because I
want to use the most CPU
that I can from my hosts, and then the
upper threshold. That one shouldn't
matter that much; it's just there as a tiebreaker
if there's a draw. But what I'm really interested
in is the average CPU utilization
of my hosts. And when I
actually run the simulator against those
parameters, things look a lot better. It does
spin up hosts faster,
they're added two at a time, and the CPU
utilization stays below 50% here. So it seems
that this setup is a lot better than what
we had. So I take these parameters and
I actually apply them to production.
And I run a real load test using Locust
again. There's a way
to set up the shape in Locust
as well, so it does create
an incremental ramp-up here.
It does this by adding new users.
And I ran that for maybe
40 minutes, about 40 minutes, and we see that things
stabilize. From the load testing
tool's perspective, the p95
does have a little spike, but it's like 40 milliseconds
in terms of response time. And it's quite stable: after
some time, my application runs very stably.
Let's see the parameters. So I get the
same parameters. This is the traffic that I was
able to generate. I wasn't able to get to 150k;
it peaked here at 120k
rpm. But it did work.
We see how the hosts were being created
here. We see that it added hosts in increments of
two at a time. And we see the CPU utilization:
it didn't reach 39%. I think the peak here was
about 26, 27%. So it's pretty good.
And the response time here was quite
stable as well. Right?
Based on that, we can compare
our simulator, our simulation,
to what happened in real life. This is
my simulation: when I use the requests-per-minute
data from my load test,
this is the CPU utilization that it predicted would be used.
So this is my simulation, and this is what actually
happened in real life. And we see here how
I predicted that the hosts would be created. My simulation
seems to be a little bit slower than what happens
in real life: AWS
starts adding new hosts earlier,
probably five minutes earlier. My simulation could
be changed so it
better represents how the actual
dynamic scaling in AWS works, to mimic this behavior. And we
see that my simulation adds hosts a little bit
faster than what we see on AWS. So it
starts a little bit later, but it
increments faster. The AWS Auto Scaling
group with the dynamic scaling policies
scales a little bit slower, but it starts
earlier. So that's the difference between my simulation
and what we see in real life. But I think
this is quite interesting, because we learned a lot about how
autoscaling works, how the Auto Scaling group and its
autoscaling policies work, by using Python to model
this and verify what parameters can be
used. The other part that I find really interesting is
that I ran the load test in production, and it
just worked. The first attempt just
worked. I think that's really powerful.
Usually you don't see that when you're doing load testing.
It's quite common for it to go badly, and you
have to stop the load test because you're having issues,
or because it's just not working, because you have the wrong parameters
set in your scaling configuration.
So I think those are basically the takeaways
that I have found here, and that's how we can use Python
to actually reason and think
about how we can do things in production,
in AWS infrastructure.
And the conclusion from my experimentation is that when I changed
the upper threshold from 25 to 15,
and the scale-up increment, it did work, it was
efficient, and I didn't have
problems. I think we
can assert that. The code:
I uploaded it here, so you
can access it. The code that I used for this is very simple;
there's nothing sophisticated in it. I'm using pandas
for data manipulation, Jupyter for notebooks,
so I can mix Python
code with the graphics that I created, which makes it more
dynamic to do this kind of thing, Locust for load testing,
and boto3 to get an AWS client.
You can use this technique for other topics as well:
for provisioning, for capacity planning, for caching,
for example. There are tools that you can use for that.
You can use the same tooling to understand
how you should be setting up your alarms. You can
use it to troubleshoot problems you have in your infrastructure.
I think it's a very powerful approach,
and I think there's a trend out there for
DevOps and senior site reliability engineers to start
using this kind of tooling in order to make their infrastructure
better. Using just the observability tooling,
you hit a limit on the things that you can do quite fast.
So Jupyter and Python,
I think, are very powerful tooling.
I hope you have enjoyed this, if you were able to follow along with
me; I hope you have enjoyed this approach.
You can reach me through my LinkedIn page; it's available
here. LinkedIn, or my
GitHub is this one, or you can drop me a message on Twitter as
well. So I hope you guys have enjoyed this, and thank
you for watching.