Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to my presentation. Today I'm going to be talking
about how we can use data science and machine learning
to work with AWS DevOps.
Right? If you manage a service deployed
to AWS, I think this presentation is for you.
So let me start by introducing myself. My name is Gustavo Beveramigo. I'm a senior software
engineer at Amazon Prime Video.
Before Amazon, I was working with startups, and I've
been doing this work for quite
a long time. And on a daily basis I've been using
Python, machine learning, and all
of its tooling to manage those infrastructures.
So let's start here.
Okay, the agenda for today: we're going to be talking about how
we can use Python and its tooling to fine-tune your
service to perfection. We're going to be using pandas for data manipulation,
Jupyter for notebooks, Locust for load
testing (it's a Python tool for load testing), and boto3
to access our AWS account.
Right. The target audience is software engineers, site reliability engineers,
and DevOps. And I'm assuming that you are
a little bit familiar with pandas for data manipulation and
also with Jupyter. But don't worry if you don't know them
that well. If you know Python and AWS, I think that should be
enough for you to understand this presentation.
All right, first of all, the first thing that
I want to talk about is scaling.
Scaling a service is really hard. I've been doing this for years and
it's always hard. It doesn't matter if it's at a small scale
or at a big scale, like you have at Amazon or
at the other big techs, or if you're in a startup: it's always hard.
And the reason it's hard is because testing
scalability is really hard.
It's a lot of work to understand whether
your service can scale, and when you have scaling issues,
it's a lot of work to fix them. The other thing
that makes autoscaling hard is that it's really hard to reason
about scaling: understanding what the bottlenecks are and
how you can make your service scale is
really not easy. And the other thing
is that infrastructure is expensive. If
infrastructure were cheap, you could just scale up
the number of resources you're using and that would be fine. But
infrastructure is really expensive, so you need to be really careful with
the resources that you use so it doesn't
get more expensive than it needs to be. So I'm going to
be talking a little bit about this. In this talk,
first, I'm going to present you a
problem, a problem that I've seen multiple times
in my career, which is: you
have a service that scales dynamically.
What I mean by that is that you have multiple hosts, right?
You have one host here, another host here, another host
here, and depending on the traffic that you're receiving, you want
to add more hosts, right?
We call that dynamic scaling, or sometimes
autoscaling. The problem here
is: how can we do this autoscaling, how can we set up
our infrastructure in a way that it
will scale at the right velocity, using
the right resources at the right time, without
having problems? What I mean by problems is
your service being unavailable. And how can we do that
efficiently? If you're too conservative,
you're going to be adding more hosts than necessary. If you're too aggressive,
you won't add hosts at the right time and at the right scale.
So the way that I set up this problem here,
I set up an infrastructure just for this talk
where I will do a scaling experiment,
right? And we're going to be using python
to analyze this problem and try to come up with a
solution. So that's my approach to this. So let's
say we have this infrastructure here which scales
dynamically. And let's say that we have traffic that
increases in this shape here, right?
It starts off: after ten minutes,
it scales to 30,000 requests per minute.
And then you have a surge here where, in eight minutes,
it scales from 30,000 rpm
to close to 150,000
requests per minute. This is a very aggressive
ramp-up, and that does happen in real life. All right,
so that's not something that we haven't seen;
it does happen. If you work with
retail, for example, e-commerce, we have that
on Black Fridays, or when you have promotions and
things like that, or a marketing campaign.
It's not uncommon to see this kind of shape. So the way
that we usually work, we prepare
our infrastructure and our service in order to handle
that kind of surge.
Sometimes this surge happens unexpectedly.
All of a sudden there's a promotion that was
sent out by the marketing team, and your
team was not prepared for that, but your
service needs to scale to that amount of
traffic in a way that
you won't have any problem, so you won't have unavailability
issues. So that's the problem that I'm presenting to you guys.
And that's the problem that I'm going to try to solve here
using Python machine learning and data science tools,
the tools that we usually use for
other types of problems; I'm going to be using them for this kind of
problem. A little bit of context:
I deployed an application to AWS Elastic Beanstalk.
Basically, under the hood it uses an Application Load Balancer to
route traffic to EC2 hosts. The application is
deployed on EC2 hosts, EC2 boxes.
The instance type is a t3.micro,
which is a very small instance. I created that just for
this experiment that we are running here.
We also use an Auto Scaling group to scale
that cluster, and these are the scaling parameters that
I have set up. This is quite standard: the scale-up
and scale-down increments are one and minus one. That means that every time
a scaling event is triggered, I'm going to
change the fleet by one host or by minus one host.
The upper threshold is going to be 25%.
It's 25% of CPU utilization,
right? So when the average CPU utilization across all of your hosts
is above 25%, it's going
to add one host. And then the lower threshold,
it's written down here, the lower
threshold is 15%.
This doesn't really matter for our experiment here, because we are
testing the ramp-up time, not how
it scales down your fleet.
And the metric that we are using is CPU utilization.
The other parameters here, like the time that it takes
to do this scaling: it waits for five minutes.
The CPU utilization needs to breach that limit for five minutes.
If that happens, it will scale the service. It takes like three to four
minutes to spin up a new host.
So these are our scaling parameters, and this is a quite
standard setup. I've seen that multiple times. Usually people use 25%,
30%. That's quite standard, quite
common. Sometimes we change
the scale-up and scale-down increments to try
to improve things. That's how it usually works.
The approach that I'm going to be using: I'm going to load test
one host, because I need an exact profile
of how one host behaves when it receives
traffic. I'm going to retrieve that data from AWS
CloudWatch into pandas (this is the second line
here) so I can work with that data.
Based on that, I can determine what the capacity for one host would be.
And based on that, I can try to find out what would
happen when this
cluster receives more traffic, how it will scale out.
Scaling out is when you are adding more hosts to handle
the additional traffic that you're receiving. And finally,
I will try to test the parameters that we have figured
out during our experimentation. I will
try the setup that I've come up with in production,
and we're going to analyze the results of that.
All right, first of all, I load test one
host. So I just use this product here.
It's called Locust; I think
that's how it's pronounced. It's an open source tool
for Python. It's really easy to use,
and I use it just to send traffic to one host
and check how much traffic it can handle.
After that, what I did: I
retrieved that data from AWS into
pandas. This is how you can do it: you can use boto3
to get the metrics.
This is quite standard AWS: this part here is a
query you create against CloudWatch,
where you're getting the CPU utilization.
You point it at a specific Auto Scaling group and
you set a period of time.
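As a rough sketch, that query can look something like this with boto3. The namespace and metric name are real CloudWatch identifiers; the helper names and the Auto Scaling group name are placeholders of mine:

```python
import datetime

import pandas as pd


def fetch_cpu_utilization(asg_name, start, end, period=60):
    """Query CloudWatch for the average CPU utilization of one ASG."""
    import boto3  # imported lazily so the rest of the sketch runs without AWS

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[{
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "AutoScalingGroupName",
                                    "Value": asg_name}],
                },
                "Period": period,   # seconds per data point
                "Stat": "Average",
            },
        }],
        StartTime=start,
        EndTime=end,
    )
    return response["MetricDataResults"][0]


def to_dataframe(result):
    """Turn one CloudWatch MetricDataResult into a pandas DataFrame."""
    df = pd.DataFrame({"timestamp": result["Timestamps"],
                       "cpu": result["Values"]})
    return df.sort_values("timestamp").reset_index(drop=True)
```

From there, something like `to_dataframe(...).to_csv("cpu.csv", index=False)` keeps a copy around after CloudWatch expires the data.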
You get that, and you transform it into
a DataFrame, which is the data type
that we're working with here. And one tip that
I give people: once you do that, save it
to a CSV file, because data
on CloudWatch will be erased after a period
of time, depending on the period that you're fetching here.
If it's 10 seconds, I think it's available for only 24
hours. I think for 1 minute, it's available
for a couple of days, and so on. So it's not going to be there
forever. So if you need that data for future work,
do save it to a CSV so you don't lose
that data. All right, moving forward here,
after I do this, I can just use pandas to actually
plot graphs
and start analyzing the data
that we have. This is pretty similar to what observability tools
would give you; it's not that different. But once you're using
pandas, you get access to all the tools
that are available in pandas, which are really sophisticated.
So here we can see the CPU utilization, and
then we see how the requests per minute
increase along with the CPU utilization.
That's quite interesting, but I think we can do a little better than
that. So what I did here:
I compared the CPU utilization against the
RPM, and I ran a linear regression to
find, based on the requests per minute that I'm
receiving, what the CPU utilization is going to be. And also,
based on that, I can create a predictor: I
can predict, based on the requests per minute, how much CPU
it's going to be utilizing, right?
And the other thing that I can check here is: is this relation linear?
Because if the relation between the traffic you're receiving
and the CPU utilization is not linear, you might have a problem,
right? It should be linear, but it's
not always linear. So we are validating a hypothesis
here, an assumption, that it's linear. And now we know the
coefficient for this, which means that
for 40k requests per minute, you're going to be using
about 30% of your CPU. This is
for one host, right? That's quite interesting.
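A minimal sketch of that regression with NumPy; the RPM and CPU numbers below are invented, standing in for the measurements from the single-host load test:

```python
import numpy as np

# Hypothetical single-host observations: request rate vs. CPU utilization.
rpm = np.array([5_000, 10_000, 20_000, 30_000, 40_000], dtype=float)
cpu = np.array([4.0, 7.8, 15.1, 22.6, 30.2])   # percent (invented numbers)

# Fit cpu = slope * rpm + intercept.
slope, intercept = np.polyfit(rpm, cpu, 1)


def predict_cpu(requests_per_minute):
    """Predict one host's CPU utilization at a given request rate."""
    return slope * requests_per_minute + intercept


# Check the linearity assumption: an R^2 close to 1 supports it.
residuals = cpu - predict_cpu(rpm)
r_squared = 1 - residuals.var() / cpu.var()
```

With numbers like these the slope comes out around 0.00075% CPU per rpm, which is what "40k rpm costs about 30% CPU" means.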
That's better than what we had. Now we have an idea of how many
hosts we need in order to handle
our traffic. There's another thing as well.
I noticed the response time, the latency here,
which is the time between when you send the request and when
you get a response. And we see
here that while the CPU utilization is below
50, in this period, it's pretty normal.
It does increase a little bit here, but it's not something
that becomes
a worry. But after 50, we notice that things get
out of hand. We see all these points here.
So my reasoning here is
that above 50% CPU utilization,
my application doesn't work. So I need to keep my application
below 50%, so it's always available.
If it goes above 50%, things will not work
correctly. So that's a really nice finding.
Right? And I'm going to be using that as
a parameter to scale my service. Sometimes when
you do that, for some applications it's 60%, for some of them
it's like 80%, but there's a cutoff percentage
that you need to find out.
For this application, it's 50%. It might be related to the
instance type, which is very small: it's
one of those t3 instances. If you're
using a c instance, which is meant for CPU-intensive
applications, it might behave differently. It might be related
to the application that I've deployed here, which is the sample application
from AWS. It's nothing sophisticated.
So this is how this application works,
and I'm going to be using that.
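One way to locate that cutoff in pandas, sketched here on synthetic data (the 50% knee and the latency numbers are invented to mimic the pattern in the scatter plot):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic per-minute samples: flat ~40 ms latency below 50% CPU,
# degrading quickly above it (the pattern described in the talk).
cpu = rng.uniform(5, 80, 600)
latency_ms = rng.normal(40, 5, 600) + np.maximum(0, cpu - 50) * 15

df = pd.DataFrame({"cpu": cpu, "latency_ms": latency_ms})

# p95 latency per 10%-wide CPU bucket.
df["bucket"] = (df["cpu"] // 10) * 10
p95 = df.groupby("bucket")["latency_ms"].quantile(0.95)

# Call the cutoff the first bucket whose p95 is over twice the lowest bucket's.
baseline = p95.iloc[0]
cutoff = p95[p95 > 2 * baseline].index.min()
```

The doubling rule is my own arbitrary choice; in practice you'd eyeball the latency-vs-CPU plot and pick the knee.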
We haven't actually answered the question of how we can scale,
but we have some clues here. The
CPU utilization should be kept below 50%,
but we don't know how to do that efficiently. We don't know how
to keep the CPU utilization below 50%
while still using most of the CPU at
all times. We don't want a fleet that is underutilized,
where the average CPU utilization is very low. We want
to be using our resources
efficiently. So what can we do? I think
that's the question that we've been asking here. We have
two options. We can test multiple parameters in
production. And the problem with that is
that, first of all, changing those parameters in production
is risky. It's doable, but it's risky.
Usually you have to do it at a very low
traffic time, overnight, things like that. It's high
effort, because you need
to make the change and wait to see whether there
were any problems. It's usually not easy
to do that kind of test in production. And also,
if you do that test in production, you want to run a
load test to check whether the parameters that you have
applied actually work.
I didn't talk about this, but usually with load testing you
want to run against the production environment,
but at the same time, you don't want that load test to cause
any outages, any issues. So it's usually a trade-off:
how can you do that? So,
we could do that in production, but we know it's very risky
and it's very high effort. So what I
decided to do for this presentation is
some local experimentation using Python.
Python is our answer, so we don't have to keep trying
out parameters in production. And if I can do
this in Python, it should be
a lot easier. So how do I solve this problem?
First of all, I create a scaling simulator.
The link
is going to be on my GitHub; the link is at the end.
I create an autoscaling simulator, modeling how scaling
works in AWS, one of the policies that we have there.
It's just simple code where you simulate
how it's going to work: wait for
the threshold
to be breached for five minutes; if that happens,
trigger a scale-up event where you're going to spin up
a machine; wait for three or four minutes for that machine to spin
up; after that, you can consider that machine to be
healthy and able to receive traffic;
and model how that balances out the CPU utilization
across all of your hosts.
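A compact sketch of such a simulator, which is my own simplification rather than the AWS implementation: CPU per host is assumed to scale linearly with rpm (the 0.00075 default matches the "40k rpm is about 30% CPU" slope), a scale-up fires after five consecutive breached minutes, and new hosts take four minutes to become healthy:

```python
def simulate(rpm_per_minute, upper_threshold, scale_up_increment,
             cpu_per_rpm=0.00075, breach_minutes=5, boot_minutes=4,
             initial_hosts=1):
    """Minute-by-minute sketch of a threshold-based scaling policy.

    Returns (cpu_history, host_history), one value per simulated minute.
    """
    hosts = initial_hosts
    pending = []            # countdowns until in-flight hosts become healthy
    breach_streak = 0
    cpu_hist, host_hist = [], []

    for rpm in rpm_per_minute:
        # Hosts that finished booting join the fleet.
        pending = [m - 1 for m in pending]
        hosts += sum(1 for m in pending if m == 0)
        pending = [m for m in pending if m > 0]

        # The load balancer spreads traffic evenly across healthy hosts.
        cpu = min(100.0, cpu_per_rpm * rpm / hosts)

        # The threshold must be breached for breach_minutes in a row;
        # as a simplification, no new event fires while hosts are booting.
        breach_streak = breach_streak + 1 if cpu > upper_threshold else 0
        if breach_streak >= breach_minutes and not pending:
            pending = [boot_minutes] * scale_up_increment
            breach_streak = 0

        cpu_hist.append(cpu)
        host_hist.append(hosts)
    return cpu_hist, host_hist
```

Feeding it a sustained high load shows the fleet growing one increment at a time while CPU stays capped at 100%.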
It's simulation code. It's really not hard to
do. And based on that, what I did:
these are the parameters that we have there,
25% for the upper threshold, 15%
for the lower threshold; it scales up by one increment and it
scales down by minus one increment.
And I run that and I can see
how my application would behave. The other thing I
have done: I created a shape generator,
right? Instead of doing a load test, it's like a simulation of
a load test. So I have the parameters here:
during ten minutes, ramp up to 30k
rpm, requests per minute;
during eight minutes, ramp up to 150k
rpm; and then keep that for another 30 minutes.
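Generating that shape as a pandas Series is only a few lines; `load_shape` is a helper name of mine, taking one (minutes, target_rpm) pair per ramp segment:

```python
import numpy as np
import pandas as pd


def load_shape(segments):
    """Build a per-minute RPM series from (duration_minutes, target_rpm) ramps.

    Each segment ramps linearly from the previous level to target_rpm.
    """
    rpm, level = [], 0.0
    for minutes, target in segments:
        ramp = np.linspace(level, target, minutes + 1)[1:]  # skip the start
        rpm.extend(ramp)
        level = target
    return pd.Series(rpm, name="rpm")


# The surge from the talk: 10 min to 30k rpm, 8 min to 150k, hold 30 min.
shape = load_shape([(10, 30_000), (8, 150_000), (30, 150_000)])
```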
And I create that and have it in a DataFrame.
And now I have here, like, this is the shape that
I'm going to run my simulation with. And what
happens is that when I run that,
when I feed my load shape into
my autoscaling simulator, what I find out is that if
I ramp up based on this graph here,
this is how my autoscaling would add hosts.
It's adding one host here, adding another host here,
adding another host here. It seems that it's adding
one every five minutes here.
What happens is that my CPU utilization
would be way above 50%, which is our threshold,
for quite some time.
Yeah, we can leave the exact number aside, but it seems that for about
15 minutes my application would probably be out of
service, maybe more. In real life,
it's probably more than that. So this configuration
that we have here is pretty bad.
Based on our simulation, it doesn't seem that it would
work. So we have problems.
Right. And let me talk a little bit more
about that. The reason that the
CPU utilization is so high is that here, at
this point, we are already receiving a
lot of traffic, but we don't have enough hosts
available. It's just one host, and then it's two.
That's not enough to handle all this traffic here.
When I have five hosts,
things start to get better. So during
this period here, I don't have enough hosts to handle
this traffic. That's why that configuration doesn't
work. So basically, our upper threshold
of 25% and incrementing
one host at a time doesn't work.
And as we said, the lower threshold and the scale-down
increment won't matter for the experimentation that we
are doing here. My first attempt: well,
let's try to scale up in two increments. Instead of
adding just one, it adds two at a time.
And what we see here is that the problem hasn't
gone away. We still have a problem;
it's not working. And the funny
part is that that's usually what teams do in production.
They see that problem, they see that it didn't work,
and what do they do? Their first insight
is that it's not scaling fast enough, so let's increase the scale-up increment
from one to two. But it didn't solve
the problem.
Sometimes they're clueless about what they should do, and I've seen
teams that just say, hey, just run ten hosts
at all times and problem solved, let's not use autoscaling anymore.
Some teams decide to abandon the
autoscaling, the dynamic scaling strategy.
But that's really not resourceful, because most of the time you are going
to have hosts provisioned that will not be used at all.
Continuing here: to solve this, we can try to find what
the best parameters would be. So I create two loops here.
One is going to test all increments
from one to ten, and then the upper threshold from
eleven to 35. I didn't try above that, but
I could. And then I run my autoscaling simulator
and I get the maximum CPU utilization that
it would be using, and I say,
hey, I just want the results that are below 50%. And
based on that, I will have all
the simulations that were successful here.
And I transform that into a DataFrame so I can use
it in pandas. What I get from that is a winner.
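Sketched end to end, again with my invented slope of 0.00075% CPU per rpm and a compact version of the simulator, the parameter sweep looks like this:

```python
import itertools

import pandas as pd


def run_sim(rpm_series, threshold, increment,
            cpu_per_rpm=0.00075, breach=5, boot=4):
    """Compact scaling simulation; returns (max_cpu, avg_cpu) over the run."""
    hosts, pending, streak, history = 1, [], 0, []
    for rpm in rpm_series:
        pending = [m - 1 for m in pending]
        hosts += sum(1 for m in pending if m == 0)
        pending = [m for m in pending if m > 0]
        cpu = min(100.0, cpu_per_rpm * rpm / hosts)
        streak = streak + 1 if cpu > threshold else 0
        if streak >= breach and not pending:
            pending, streak = [boot] * increment, 0
        history.append(cpu)
    return max(history), sum(history) / len(history)


# The same surge as before: 10 min to 30k rpm, 8 min to 150k, hold 30 min.
shape = ([30_000 * t / 10 for t in range(1, 11)]
         + [30_000 + 120_000 * t / 8 for t in range(1, 9)]
         + [150_000] * 30)

# Sweep increments 1..10 and upper thresholds 11..35, keep runs under 50%.
results = []
for inc, thr in itertools.product(range(1, 11), range(11, 36)):
    peak, avg = run_sim(shape, thr, inc)
    if peak < 50:
        results.append({"increment": inc, "threshold": thr,
                        "max_cpu": peak, "avg_cpu": avg})

# The "winner" is the surviving combination with the highest average CPU.
winners = pd.DataFrame(results).sort_values("avg_cpu", ascending=False)
```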
Like, I have winner parameters,
which is basically what my simulation is telling
me here: why don't you use two for the scale-up
increment and 15 for the
upper threshold, these two here.
The lower threshold doesn't matter a lot. And it
says that for the load
shape that I used here, the average CPU should be around
16% and the maximum CPU utilization should be 39%.
And I choose the best based on the average CPU utilization,
then upper threshold and scale increment. I'm
prioritizing a high average CPU utilization, because I
want to use the most CPU
that I can from my hosts, and then the
upper threshold. That one shouldn't
matter that much; it's just there as a tiebreaker
if there's a draw. But what I'm really interested
in is the average CPU utilization
of my hosts. And when I
actually run the simulator against those
parameters, things look a lot better. It does
spin up hosts faster,
they're added two at a time, and the CPU
utilization stays below 50% here. So it seems
that this setup is a lot better than what
we had. So I take these parameters and
I actually apply them to production.
And I run a real load test using Locust
again. There's a way
to set up the shape in Locust
as well, so it does create
an incremental ramp-up here.
It does this by adding new users.
And I ran that for maybe
40 minutes, about 40 minutes, and we see that things
stabilize. From the load testing
tool's perspective, the p95
does have a little spike, but it's like 40 milliseconds
in terms of response time. And it's quite stable: after
some time, my application runs very stably.
Let's see the parameters. So I get the
same parameters. This is the traffic that I was
able to generate. I wasn't able to get to 150k;
it peaked here at 120k
rpm. But it did work.
We see how the hosts were being created
here. We see that it added hosts in increments of
two at a time. And we see the CPU utilization:
it didn't reach 39%. I think the peak here was
about 26, 27%. So it's pretty good.
And the response time here was quite
stable as well. Right?
Based on that, we can compare
our simulator, our simulation,
to what happened in real life. This is
my simulation: when I use the requests-per-minute
data from my load test,
this is the CPU utilization that it predicted would be used.
So this is my simulation, and this is what actually
happened in real life. And we see here how
I predicted that the hosts would be created. My simulation
seems to be a little bit slower than what happens
in real life: AWS
starts adding new hosts earlier,
probably five minutes earlier. My simulation could
be changed so it
better represents how the actual
dynamic scaling in AWS works, to mimic this behavior. And we
see that my simulation adds hosts a little bit
faster than what we see on AWS. So it
starts a little bit later, but it
increments faster. The AWS Auto Scaling
group with the dynamic scaling policies
scales a little bit slower, but it starts
earlier. So that's the difference between my simulation
and what we see in real life. But I think
this is quite interesting, because we learned a lot about how
autoscaling works, how the Auto Scaling group and its
autoscaling policies work, by using Python to model
this and verify what parameters can be
used. The other part that I find really interesting is
that I ran the load test in production, and it
just worked. The first attempt just
worked. I think that's really powerful.
Usually you don't see that when you're doing load testing.
It's quite common for it to go badly, and you
have to stop the load test because you're having issues,
or because it's just not working, because you have the wrong parameters
set in your scaling configuration.
So I think those are basically the takeaways
that I have found here, and that's how we can use Python
to actually reason and think
about how we can do things in production,
in AWS infrastructure.
And the conclusion from my experimentation is that when I changed
the upper threshold from 25 to 15,
and the scale-up increment, it did work, it was
efficient, and I didn't have
problems. I think we
can assert that. The code:
I uploaded it here, so you
can access it. The code that I used for this is very simple;
there's nothing sophisticated in it. I'm using pandas
for data manipulation, Jupyter for notebooks,
so I can mix Python
code with the graphics that I created, which makes it more
dynamic to do this kind of thing, Locust for load testing,
and boto3 to get an AWS client.
You can use this technique for other topics as well:
for provisioning, for capacity planning, for caching,
for example. There are tools that you can use for that.
You can use the same tooling to understand
how you should be setting up your alarms. You can
use it to troubleshoot problems you have in your infrastructure.
I think it's a very powerful approach,
and I think there's a trend out there for
DevOps and senior site reliability engineers to start
using this kind of tooling in order to make their infrastructure
better. Using just the observability tooling,
you hit a limit on the things that you can do quite fast.
So Jupyter and Python,
I think, are very powerful tooling.
I hope you have enjoyed this, if you were able to follow along with
me; I hope you have enjoyed this approach.
You can reach me through my LinkedIn page; it's available
here. LinkedIn, or my
GitHub is this one, or you can drop me a message on Twitter as
well. So I hope you guys have enjoyed this, and thank
you for watching.