Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to Conf42 Kube Native. This is
really fun. I get to share with you horizontal autoscaling with
Kubernetes. Let's dive in. Here's the part where I tell you I am
definitely going to post the slides on my site tonight.
I've heard similar promises from speakers, and it's never worked out very well for me either.
So let's go to robrich.org where we can find the slides online right
now. We'll go to robrich.org and
click on Presentations, and we can see Horizontal Autoscaling
with Kubernetes. The slides are online right now.
While we're here on robrich.org, let's click on about me and see some
of the things that I've done recently. I'm a developer advocate for
Jetpack.io. If you're struggling with Kubernetes, I would love to learn with you. I'm also
a Microsoft MVP and MCT, a Docker captain and
a friend of Redgate. AZ GiveCamp is really fun. AZ GiveCamp brings
volunteer developers together with charities to build free software.
We start building software Friday after work. Sunday afternoon, we deliver
completed software to charities. Sleep is optional, caffeine provided. If you're in
Phoenix, come join us for the next AZ GiveCamp. Or if you'd like a
GiveCamp closer to where you live, hit me up here at the conference or
on email or Twitter, and let's get a GiveCamp in your neighborhood too.
Some of the other things that I've done: I do a lot with Kubernetes and
Docker, and one of the things I'm most proud of is that I
replied to a .NET Rocks podcast episode, they read my comments on air,
and they sent me a mug.
So let's dig into horizontal autoscaling with Kubernetes.
We talked about this guy first; let's talk about autoscaling now.
As we talk about scaling, part of what we're trying to accomplish
is to meet the capacity when we need it and to save money
when we don't. That's the nature of scaling.
When we talk about scaling, we can talk about both horizontal autoscaling and
vertical scaling. With horizontal scaling, we're adding
more items to be able to reach that capacity.
With vertical scaling, we're increasing the size of each item
to be able to reach that capacity.
So dialing in some more. With vertical scaling,
we're increasing the size of each item.
Now this might be great for things that need state, where we
don't want to manage synchronization, maybe a database. By
comparison, with horizontal scaling, we're increasing the number of items.
Now we need to coordinate between them. We may need to populate
data into a new node. We may need to ensure that
there's one main node and that they coordinate together.
This synchronization isn't necessary if we're using a
stateless service, perhaps a web server.
So horizontal and vertical scaling.
Now how did we get here? What are we building on top of? Whose shoulders
are we standing on top of? Well, back in the old day,
scaling was hard, it was slow, so we would
generally over provision. We're provisioning for the traffic
on our peak day. Maybe that's Black Friday, maybe that's Super
Bowl Sunday. Maybe that's when we go viral. But because
we're over provisioning for those worst case scenarios,
on a normal day our machines may sit completely idle.
Now why did we do this? Well, we did this because provisioning
was hard. It might take days or weeks or months
to get approval, buy the hardware, install it and the operating system,
then install our application and plug it into the load balancer.
That's definitely not something we can do on demand. If we discovered
additional load yesterday that we need to handle tomorrow, a process that
takes weeks or months won't help; we need to have it all the way done by
the time we reach that peak load. So we're over-provisioning to
be able to support the load in those extreme circumstances.
Today we don't need to do that. Today we
buy just what we need. Utility billing with clouds allows
us to scale easily and quickly to meet
the demand. And then when the demand eases, then we
can give those resources back and stop paying for them.
Previously we would run our machines mostly idle so that
we had additional capacity available. Today we run our machines
mostly at capacity. 80 or 90% is not uncommon
because we really want to use our hardware most effectively.
So utility billing makes this possible.
Now, all of that scaling applies to any scaling scenario.
Let's apply this specifically to kubernetes. Now in
kubernetes we could talk about scaling a cluster. We're not going to
do that today. But as you grab the slides from robrich.org,
dig into scaling the cluster and that can be a really fun topic.
Today we're going to talk about scaling the workload, talking about
pods. Now that presumes that your cluster is big enough to handle
this. Next up we could talk about vertical scaling
or horizontal scaling. Vertical scaling is definitely interesting.
It's about changing resource limits on our pods to match
increased demand. But today we're going to talk about horizontal
scaling. We're going to talk about increasing the count of pods
to match the demand that we have. So let's dig
into pod scaling. Well,
why isn't this automatic? Why isn't there just a button to push
so that we have scaling in our cluster? Well, it depends a lot on
the workload, in particular how your workload works and what
metric you're using to measure it. What metric?
Now, we might have a cpu bound workload, in which case we want to scale
based on our cpu. Or we might have an I/O bound workload,
in which case we want to scale based on concurrent requests,
or maybe the request queue length.
We may scale based on external factors. Maybe we're looking at
our message bus queue length. Or maybe we're looking at our
load balancer for details of how we should scale.
Or maybe we're looking at latency in a critical function.
Is it taking a long time to log in? Maybe we need to up the
servers associated with our authentication process. Now this
is definitely not an exhaustive list of metrics, but why isn't
this built in? Because it really depends on how our workload works.
If our workload is cpu bound and we're scaling based
on I/O metrics, then of course we're not going to scale correctly.
Let's take a look at a few use cases. In this case, we chose
to scale based on cpu, but it's an I/O bound workload.
Now, because it's an I/O bound workload, our system sits
mostly idle as we're waiting for our external data store.
Now in this case, because we're waiting for our external data store,
and well, our machine is idle, the cpu is low,
and our system never discovers that our system is under load. So we're
never going to scale up beyond the minimum number of pods.
Similarly, maybe we have an I/O bound workload,
so we're autoscaling based on concurrent requests.
And perhaps our application framework limits the number of
concurrent requests, putting back pressure on our load balancer to queue
the incoming requests. So we only ever see a certain number
of concurrent requests, and we'll never scale beyond
our minimum, because Kubernetes doesn't
know that our system is under load. We've chosen the wrong metric.
Let's look at a third use case here. We've chosen
to scale based on our service bus
queue length. So if there are messages in our
queue, we're going to scale up additional pods to be able to handle those messages.
Perhaps each message sends an email and then when the
queue length is short, then we'll scale back down
so that we're not using extra resources. In this case we matched
our metric with our business concerns and so
we're able to scale appropriately.
Now in each of these scenarios we looked at mechanisms where we
could choose a metric to scale and in many of the instances we chose
the wrong metric. Not to say that those metrics aren't good for scaling,
but that in those scenarios that's not how that application works.
Now this is definitely not an exhaustive list, but as you look at scaling
you might look to these and other metrics to understand the health of
your system and what your system looks like under load.
So let's take a look at the Kubernetes autoscaler and in particular
let's look at built in metrics.
Our first step in enabling the Kubernetes
autoscaler is to enable the metrics server. The metrics server captures
cpu and memory on all of our pods and
presents that to the horizontal autoscaler. So let's turn on the
metrics server first. Is it on? Let's do a
kubectl top for our pods, in this case across
all namespaces, and see if it errors. If it errors, we need to turn it
on. So let's head off to the metrics server part of the Kubernetes
project, grab the latest release, and apply components.yaml.
In our case we're using minikube, so I just enabled the metrics-server add-on.
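As a rough sketch, the commands involved look something like this (the last line applies only to minikube):

```sh
# Check whether the metrics server is already running; an error here means it isn't.
kubectl top pods --all-namespaces

# Install the metrics server from the Kubernetes project's latest release.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Or, on minikube, just enable the bundled add-on.
minikube addons enable metrics-server
```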
Now that we've got the metrics server enabled, let's deploy our workload. We do need to customize
our workload ever so slightly to ensure that the autoscaler will behave
as expected. We need a resource that knows
how to build pods: a deployment,
a stateful set, or a daemon set. That gives us a recipe, a template for how to build pods,
so we can scale out as we need to. In
our pod definition we also need to set resource limits.
Kubernetes is going to look to these resource limits to know how much
capacity we have left in that pod before it needs to create another
similar one. And we need to remove the replica count from
our deployment. The replica count is going to be controlled by the
autoscaler, not by the deployment anymore.
So here's our deployment, or maybe it's a stateful set or daemon set.
Notice that we've commented out the replicas; that line shouldn't be here, because
the autoscaler will manage it. We've also set limits so that we know how
much capacity each pod has. If we're using 100%
of the capacity of this pod, that means we're using half a cpu.
If we're using more than half a cpu, we're using more than 100% of this
pod. And if we're using a quarter of a cpu, we're using half the capacity
of this pod.
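A minimal sketch of a deployment shaped like this; the name, labels, and image are stand-ins rather than the ones from the talk:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-demo                              # hypothetical name
spec:
  # replicas: 2                               # commented out: the autoscaler owns the count now
  selector:
    matchLabels:
      app: cpu-demo
  template:
    metadata:
      labels:
        app: cpu-demo
    spec:
      containers:
        - name: web
          image: registry.k8s.io/hpa-example  # stand-in image from the Kubernetes HPA walkthrough
          resources:
            limits:
              cpu: 500m                       # half a cpu; the request defaults to match this limit,
                                              # and utilization percentages are measured against the request
```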
Next up, let's build the autoscaler.
The autoscaler is going to grab that metric,
look at the average across all of the pods,
and then increase the count of pods to bring that average down, or decrease
the count of pods to bring that average up.
The horizontal autoscaler checks these metrics
every 15 seconds. And once it notices that it needs to make a
change, then it'll make that change, but then it won't make additional changes
for five minutes. Now we can definitely customize these defaults,
but these defaults help us to not overly burden our
system and to let things settle once we've made a scaling
change. Maybe we need to populate a cache or redistribute
our content across all of our available nodes,
and that may take some time. So once we've reached stasis,
we can make additional scaling decisions.
So let's create a horizontal autoscaler. With this kubectl
command, we can quickly identify the deployment, say that it's cpu
bound, set the target percentage, in this case 50%, and set the minimum and
maximum number of pods we should create. Now, 50% probably
isn't a good target; we probably should run more like 80 or 90%,
but it makes for a good demo.
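A one-line sketch of that command; the deployment name is illustrative:

```sh
kubectl autoscale deployment cpu-demo --cpu-percent=50 --min=1 --max=4
```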
We could also deploy this as a Kubernetes YAML file.
We've identified the resource that we want to target,
and it is a deployment. We've specified the min and max replicas,
so we'll have between one and four pods. And we've identified the
type of metric we want to use. In this case, it's a resource metric,
cpu or memory. We've chosen cpu, and we
say that we want an average utilization of 50%. Now, 50%
is probably low, maybe we want to go 80%. But this is a percentage
of what's specified in our deployment. Here in
our deployment we have 0.5 cpu, so 50%
of our 0.5 cpu would be a quarter of a
cpu. If we go over a quarter of a cpu, it sounds
like we need more pods; if we go under a quarter of a cpu,
we'll need fewer pods.
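The same autoscaler as an autoscaling/v2 manifest; the names mirror the hypothetical sketch above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-demo
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # 50% of the requested 0.5 cpu is a quarter of a cpu
```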
So let's do this demo. We have
here a deployment that we will
build out. Now, this references a pod that is
just a basic Express server. It has a single
route that will allow us to grab some text. Notice that we've set
our resource limits so that we can autoscale based on those limits,
and we've chosen just the cpu. We
could specify other limits if we want, but we want to make sure that we
horizontally autoscale based on only a single metric.
We've also commented out the replicas. We don't want our deployment
to specify the count; we want our autoscaler to specify it.
So here's our autoscaler, and we can see that it
scales a deployment. This is the deployment name,
and we scale based on our cpu.
In this case, because it's a very small application,
we're going to scale on only 5% of that half a cpu.
So the first stop is to make sure that our metrics server is enabled.
Let's do kubectl top pod,
and let's look in all namespaces. And it looks like yes, in this case
our metrics server is enabled. We get the cpu and memory for
all of our pods whether we're using a horizontal autoscaler or
not. Great. kubectl apply -f
the cpu deployment, and we've got our deployment in place.
Let's also apply our horizontal autoscaler.
Now, when we first spin
up our horizontal autoscaler, its value will be unknown.
It takes 15 seconds for us to do a lap to go ask the
pods what their resources are. And so for those 15 seconds we
don't know if we need to do anything. So let's take a look.
Oh it looks like we have 31%. So it's going to go create a
whole bunch of pods to match that capacity.
Now I'm surprised that it got to 31% straight away. Let's change
this metric then to be 25%
and apply this again.
Now we probably have to deploy our
horizontal autoscaler again.
Yes. So now we should only need two pods. We're
in that window where we've just made an adjustment,
so it'll take a while for this to calm down. But let's
generate some load on our system and see if that changes things. So in
this case I have a busybox pod where I'm just curling into
that app, and it's
returning hello world. And we'll do that a whole bunch of times.
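A rough equivalent of that load generator; it assumes a Service named cpu-demo fronts the deployment, which isn't shown in the talk:

```sh
# Run a throwaway busybox pod that hammers the app in a loop.
kubectl run load-generator --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://cpu-demo; done"
```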
So now that we've got some load on the system, let's take a look at
our autoscaler. And it looks like we're down to zero, so
we've scaled down a whole lot. Now let's
watch the autoscaler and see if we capture some change in
metrics now that we've added some load to it.
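Watching it might look something like this; the autoscaler name is the hypothetical one from the earlier sketch:

```sh
# Watch the autoscaler recalculate its target as the metric changes.
kubectl get hpa cpu-demo --watch
```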
Now, it does run every 15 seconds, so we will have to wait
a little bit to see if this metric changes.
But once it changes, then we can recalculate the number of pods
that we need. Now, we can adjust that 15 seconds.
We can make it more aggressive, maybe 10
seconds or sooner, to look at our pods more frequently.
Or we can make it much larger so that we put less load on our pods,
because, well, it does need to go look. Okay, so now we've gotten up
to 24%. So now based on our 24%,
let's see if we need additional pods. No, it looks like we don't.
So I'm going to set this back to 10%
and apply our autoscaler again.
And now, based on that 24%,
it will spin up a whole lot of pods. Great. We saw how we
could spin up pods and spin down pods based
on the needs of our system. kubectl
delete -f the cpu deployment:
let's delete our deployment, and let's also delete our
autoscaler.
And now we can see that those are done. Oh, we're still
generating a load. Let's undo that.
Now, if we do kubectl top pod,
we can see that we're still collecting metrics even though there's no horizontal
autoscaler using those metrics. That's fine, it's nice data
to have; the metrics server is still running. Great.
So we got to see the horizontal autoscaler,
and we got to scale based on a cpu metric. That was great.
Let's take a look at other metrics that we might want to choose. Now,
as we looked at this, here's kind of our mental model: our horizontal
pod autoscaler goes and checks the application
every 15 seconds to see if anything needs adjusting.
Now, that's a little bit of a naive interpretation
because, well, we have our metrics server here. I love these
graphics. Grab these slides from robrich.org and click through
to the learnk8s.io page.
It's a great tutorial. Now, this too is
a little bit naive, because behind the metrics APIs we actually have
three different APIs: we have resource metrics,
we have custom metrics, and we have external
metrics. Zooming in on each of those: the resource metrics API
is about cpu and memory; those are the two built-in metrics. For custom
metrics, we might look to our pods to find other
details; any detail that we can get out of those pods would
work nicely here in the custom metrics API. Maybe we'll harvest
Prometheus metrics, for example. And then external metrics:
maybe we're looking at a request queue length or a
message queue length and taking action based on
resources that are external to our pods, maybe other Kubernetes
resources, maybe other cloud resources, other hardware within our environment.
We can look to those to make decisions about how many pods
we need. So we have these three APIs
and each of them will reach out to an adapter that will reach out
to a service to get data. So for example,
in the metrics API we reach out to the metrics server.
The metrics server in turn looks to cAdvisor, and cAdvisor
will harvest those metrics from our pods.
In the custom metrics API, perhaps we're using the Prometheus adapter.
The Prometheus adapter looks to Prometheus, and Prometheus
looks to all of our pods. When we're looking at external things,
maybe we're using Prometheus to monitor those as well, or maybe
we're using another adapter. So we
can feed the autoscaler in really interesting ways.
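If you want to poke at those three API groups directly, these are the aggregated paths the autoscaler reads from; note the custom and external endpoints only answer once an adapter is installed to serve them:

```sh
# Resource metrics (served by the metrics server): cpu and memory per pod
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods

# Custom metrics (served by an adapter such as the Prometheus adapter)
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

# External metrics (queue lengths, load balancers, and other out-of-cluster sources)
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
```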
Let's take a look at how we might use that with Prometheus.
So our first stop is to install the metrics server. We did
this previously and even though we're not using the metrics coming out of the metrics
server, this gets us the metrics APIs as well.
Then we need to install Prometheus. Now, I'm grabbing Prometheus
from the prometheus-community helm chart.
That chart installs just Prometheus; if I chose the
kube-prometheus-stack chart instead, I would also
get Grafana. Now, I could choose to put this in a
different namespace, and that would probably be a good best practice, but for the
sake of today's demo, I'll just put it in the default namespace.
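A sketch of that install, using the release name "prometheus" as an assumption:

```sh
# Add the prometheus-community repo and install plain Prometheus
# (the kube-prometheus-stack chart would add Grafana as well).
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus
```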
Next up, I'm going to install the metrics adapter. I'm going to pull this
from the prometheus-community helm charts as well, and
then I'll install the Prometheus adapter. I am going to
set some properties here. I could do this with a yaml file, or,
here, I'll just set them straight away. I'm going to point it at the Prometheus
URL. The Prometheus helm chart creates a Prometheus service,
and it's on port 80 instead
of 9090 as it typically is.
So I'll configure the adapter to point to my Prometheus service.
And now I've got the Prometheus adapter running.
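Roughly like this; the service name below assumes a release named "prometheus" in the default namespace (check kubectl get svc for the actual name):

```sh
# Install the Prometheus adapter and point it at the Prometheus service the chart created.
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url=http://prometheus-server.default.svc \
  --set prometheus.port=80
```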
Now I'm going to configure my workload. I could configure
Prometheus itself to point at where my workload is and configure
it that way, but because it's all running in my cluster, I can create
an annotation on that deployment that will allow Prometheus
to automatically discover it. So here in my deployment,
I'm going to create an annotation on my pod template
that identifies that I want Prometheus to scrape
the metrics, and what the URL and port are that it should scrape.
Now, this is perfect. I've identified that Prometheus should
scrape the metrics out of all of these pods.
I turn it on, I give it the path, and I give it the port.
Now, it is important that these values are quoted, because if they aren't, they would
be a boolean and an integer respectively, and they need to be
strings to be annotations. So I put quotes around them and
it works just fine.
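A fragment showing just that part of the deployment's pod template; the path and port are illustrative, and note every value is quoted so it parses as a string:

```yaml
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"    # turn scraping on
        prometheus.io/path: "/metrics"  # where the app exposes its metrics
        prometheus.io/port: "3000"      # which container port to scrape
```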
Now I'll deploy my workload, kubectl apply that deployment, and I've got my content. I know that Prometheus
is going to monitor that content, make those metrics available to the Prometheus
adapter, and now our horizontal autoscaler can
harvest those metrics to make scaling decisions. So next up,
let's put in the horizontal autoscaler. Now I've got this horizontal
autoscaler, and in this case it's going to target a deployment. Here's the
name of my deployment. And rather than a resource metric as
we saw previously, this one's going to scale based on a pods metric. I wish
this said Prometheus, but it doesn't.
We'll give it some interesting metric from Prometheus that the Prometheus
adapter knows how to get, and we can give it a target value.
If the metric is lower than that, we'll
reduce the number of pods, and if it's above it, we'll increase the number
of pods.
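A sketch of that autoscaler; the deployment name, metric name, and target value are all assumptions, not the ones from the talk:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prometheus-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prometheus-demo
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods                            # a pods metric served through the Prometheus adapter
      pods:
        metric:
          name: http_requests_per_second    # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "10"                # scale up when the per-pod average exceeds this
```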
Now, how might we look to other metrics? There are lots of different adapters that we can
look at, so let's take a tour. Here's a really great project
that allows us to create adapters from all kinds of sources.
So we could call HTTP into our pods
and harvest JSON, querying into that JSON
to grab a particular metric. There's also a Prometheus collector,
although the Prometheus adapter is usually a better choice.
We could look at the InfluxDB collector, which is a great example
of how we might query other data stores, or the AWS collector,
where we could look at, for example, SQS queue length. We could use
this as a template to grab things from blob storage
or GCP. And we can also scale
based on a schedule. If we know that our work starts in the morning and
ends in the evening, and that we won't use the cluster overnight, we can create
a schedule to automatically scale that up or down. Now, unlike the
other metrics that look to behaviors in our cluster, this will just
do it on a timer and maybe that's sufficient.
We could also look to Istio and grab metrics from
Istio. There are many other adapters that will allow
us to query other systems to be able to get metrics
both for pods and for external resources where we can make
choices about how to scale. We took a look
at built in sources for cpu and memory. Now those are
definitely the easiest, but if our workload isn't based on cpu
and memory, if it's based on another metric, perhaps we can look to
Prometheus to be able to find metrics specific for our application.
Anything that we can expose as a Prometheus metric we can
then use to autoscale our service.
Similarly, we could look to external metrics.
Perhaps we're looking at our load balancer or our queue length
and making decisions on how many pods we need based on that.
Let's take a look at some best practices. These best practices will
help us to make good scaling decisions.
What if we scale up too high? Maybe we're getting
attacked. Maybe we have a bug in our software,
or maybe we've just gone viral and we need to be able to scale up
to impossible scale. We probably don't want to scale infinitely
because, well, especially in the case of an attack, we don't want that big
cloud bill. It would be really easy for us to stay
scaled for the duration of the attack and end up paying a lot.
So let's create a max value that matches the
budget of our organization. We understand that our system may not
be able to reach that peak capacity, but it'll stay within our budgetary goals.
If it's an attack, we're not paying extra to handle that attack.
Or if we've gone viral, maybe we need to reconsider that upper bound.
The other reason we might scale up too high is maybe we're
monitoring more than one metric. Now, Kubernetes is going
to do the right thing here. If we're monitoring two metrics and one
is high and one is low, Kubernetes will use the high metric
to be able to reach the capacity to make sure that all the
metrics are within bounds. For example, if we're monitoring
both cpu and memory, and the cpu is low,
we might say, well, we shouldn't have scaled up here;
but maybe the memory is high, and so we did scale up.
So here we need
to define the max replicas in addition to
the min replicas. Don't just assume that infinitely high
is sufficient. Pick a budget that matches the needs of our organization and
understand that our system may be less available if our load
exceeds that, but we're staying within our budget goals, which is what
the business needs. Next up, we may scale too slowly.
Now, we noted how the horizontal autoscaler only checks every 15
seconds. That's definitely a configurable value. And in
the case of missing metrics, then Kubernetes will assume the best
case scenario. So if we're scaling out, Kubernetes will
assume that that pod is using 0%, therefore it won't
scale out unnecessarily. If we're scaling in, Kubernetes will
assume that that metric was at 100%. Therefore it probably won't scale
in until that metric is available. So if it takes
a long time for our pods to be able to surface these metrics,
then we may notice that the Kubernetes autoscaler won't take action.
What if, when we deploy a new version, it always resets to one pod
and then scales back up? Yeah, if we have a
hard-coded replica count, we might hit this scenario. Well, what happened
here? In our deployment or our other resource,
we had a hard-coded replicas count; as we deployed
the new version of our deployment or stateful set or replica
set, Kubernetes followed our instructions and killed off all the other
pods. So we only ended up with one, or maybe a few. Yeah,
we really want our autoscaler to handle this, so we
need to comment out the replicas line in our deployment.
We want our horizontal autoscaler to own it. Now, this does create an edge
case where, when we deploy it the very first time, we're only going to
get one pod until the autoscaler notices. Well, I'd rather
have 15 seconds of only one pod than
have it reset to one on every deployment. If you grab these slides
from robrich.org and click through to this post, you can find more about
that edge case and how to handle it gracefully.
Flapping or sloshing? The documentation talks about flapping
as if the door keeps opening and closing. I like to talk about sloshing like
we're just pushing the water up against the beach and so the water
just keeps going. Let's assume that we have a Java app that
takes a while to spin up. We notice that we don't have enough capacity,
so we spin up some pods. 15 seconds later we notice that those
pods aren't live yet, so we decide to scale up again.
And now, in a minute, once our Java app is booted and everything is
working, we have too much capacity. So we'll scale
back down. Then we'll notice we don't have enough, scale back up, and down and
up and on we go, sloshing back and forth. Well, what happened
here? Well, Kubernetes will automatically block additional scaling
operations until our system has reached stability.
By default, it's five minutes. In this case, I added some content
to my horizontal autoscaler to set that stabilization
window to 600 seconds, or ten minutes, so I'm giving
my application a little bit of extra time to get
stable. Alternatively, in our demos, we chose
to set that stabilization window to 1
second so that it would make additional scaling decisions right away, and that
worked out really well for our demo.
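As a sketch, that setting lives in the behavior block of the autoscaling/v2 HPA spec; only the relevant fragment is shown, and 600 is the ten-minute window described here:

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait ten minutes before another scale-down (the demo used 1)
```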
Scale to zero. Now, the horizontal autoscaler won't scale to zero; it'll only scale down to
one. And now we're paying for our application to
run even if there's no load at all. How do we overcome this?
Well, the problem is: how do we start the application back
up? If a request comes in and there's no deployment,
no pod running to handle it, then that request
will just fail. So to make this happen, we need a reverse proxy in
front of it to hold the request, notice there are
no pods, and kick the workload back up. Now, it may take a minute for
a pod to start on a completely cold start, so maybe that initial request
still fails. But that's our scale-to-zero problem: we need
a reverse proxy in front of it to notice that our
workload needs turning back on. Now, that's not built in. That's a
problem larger than just horizontal autoscaling between one and
a set number. So we might need to reach for external tools.
Knative and KEDA both offer this solution, and so
we can reach for one of these products to get to scale
to zero. Now, it does mean that we're going to have some pods running associated
with Knative or KEDA. So if we only have one microservice,
maybe just leaving that one microservice running might
be a simpler solution than a scale-to-zero solution.
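For a rough idea of what the KEDA route looks like, here's a sketch of a ScaledObject; the names, Prometheus address, query, and threshold are all assumptions rather than anything from the talk:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prometheus-demo
spec:
  scaleTargetRef:
    name: prometheus-demo                                # the deployment KEDA will scale
  minReplicaCount: 0                                     # KEDA can take the workload all the way to zero
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.default.svc:80   # hypothetical Prometheus service
        query: sum(rate(http_requests_total[2m]))                # hypothetical query
        threshold: "10"
```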
We took a look at horizontal autoscaling in Kubernetes and it was really
cool. Horizontal autoscaling is about scaling
up when we have additional demand and scaling down, paying less
when we don't have demand. This is made possible by utility billing.
We can go to a cloud and get some resources. When we're done,
we can hand them back and pay nothing, and the scaling operation can
happen in real time. So we don't need to pre-provision
our hardware, we don't need to over-buy our systems
to reach maximum capacity. We can scale up and down as
our needs change.