Conf42 Kube Native 2022 - Online

Horizontal Autoscaling with Kubernetes

Abstract

Now that the app is running in Kubernetes, how do we scale it to meet demand? What metric should we use? CPU? Requests? Something else? Let's dig into why we auto-scale and how we auto-scale, with lots of examples. Finally, we'll look at potential pitfalls and gotchas, like how to scale to 0 and how to avoid scaling too big for your budget. Come learn how to scale with Kubernetes.

Summary

  • I get to share with you horizontal autoscaling with Kubernetes. The slides are online right now. AZ GiveCamp brings volunteer developers together with charities to build free software. If you're in Phoenix, come join us for the next AZ GiveCamp.
  • When we talk about scaling, we can talk about both horizontal autoscaling and vertical scaling. Utility billing with clouds allows us to scale easily and quickly to meet the demand. When the demand eases, we can give those resources back and stop paying for them.
  • Today we're going to talk about scaling the workload in Kubernetes. It depends a lot on the workload and what metric you're using to measure it. Vertical scaling is about changing resource limits on our pods to match increased demand. Horizontal scaling involves increasing the count of pods to meet demand.
  • We have a deployment that we will build out. Notice that we've set our resource limits so that we can autoscale based on those limits. Let's generate some load on our system and see if that changes things.
  • There are three different APIs: resource metrics, custom metrics, and external metrics. We can use these to make decisions about how many pods we need. Let's take a look at how we might use that with Prometheus.
  • Kubernetes scales based on pods. How might we look to other metrics? There are lots of different adapters that we can look at. These best practices will help us to make good scaling decisions.
  • Next up, we may scale too slow. What if when we deploy a new version, it always resets to one pod and then scales back up? Kubernetes will automatically block additional scaling operations until our system has reached stability. How do we overcome this?

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, welcome to Conf42 Kube Native. This is really fun: I get to share with you horizontal autoscaling with Kubernetes. Let's dive in. Here's the part where I tell you I am definitely going to post the slides on my site tonight. I've heard similar promises from speakers, and it's never worked out very well for me either. So let's go to robrich.org, where we can find the slides online right now. We'll go to robrich.org, click on Presentations, and we can see Horizontal Autoscaling with Kubernetes. The slides are online right now. While we're here on robrich.org, let's click on About Me and see some of the things that I've done recently. I'm a developer advocate for Jetpack.io. If you're struggling with Kubernetes, I would love to learn with you. I'm also a Microsoft MVP and MCT, a Docker Captain, and a Friend of Redgate. AZ GiveCamp is really fun. AZ GiveCamp brings volunteer developers together with charities to build free software. We start building software Friday after work; Sunday afternoon, we deliver completed software to the charities. Sleep is optional, caffeine provided. If you're in Phoenix, come join us for the next AZ GiveCamp. Or if you'd like a GiveCamp closer to where you live, hit me up here at the conference or on email or Twitter, and let's get a GiveCamp in your neighborhood too. Among the other things I've done, I do a lot with Kubernetes and Docker, and one of the things I'm most proud of: I replied to a .NET Rocks podcast episode, they read my comment on air, and they sent me a mug. So let's dig into horizontal autoscaling with Kubernetes. That was the part about me; now let's talk about autoscaling. As we talk about scaling, part of what we're trying to accomplish is to meet the capacity when we need it and to save money when we don't. That's the nature of scaling. When we talk about scaling, we can talk about both horizontal scaling and vertical scaling. With horizontal scaling, we're adding more items to be able to reach that capacity. With vertical scaling, we're increasing the size of each item to be able to reach that capacity. So, dialing in some more: with vertical scaling, we're increasing the size of each item. This might be great for things that need state, where we don't want to manage synchronization, maybe a database. By comparison, with horizontal scaling we're increasing the number of items, and now we need to coordinate between them. We may need to populate data into a new node. We may need to ensure that there's one main node and that they coordinate together. This synchronization isn't necessary if we're running a stateless service, perhaps a web server. So that's horizontal and vertical scaling. Now how did we get here? What are we building on top of? Whose shoulders are we standing on? Well, back in the old days, scaling was hard and slow, so we would generally over-provision. We're provisioning for the traffic on our peak day. Maybe that's Black Friday, maybe that's Super Bowl Sunday, maybe that's when we go viral. But because we're over-provisioning for those worst-case scenarios, on a normal day our machines may sit completely idle. Now why did we do this? Well, we did this because provisioning was hard. It might take days or weeks or months to get approval, buy the hardware, install the operating system, then install our application and plug it into the load balancer. That's definitely not something we can do quickly.
If we had additional load yesterday that we need to handle tomorrow, this process may take weeks or months, so we need to have it all the way done by the time we reach that peak load. We're over-provisioning to be able to support the load in those extreme circumstances. Today we don't need to do that. Today we buy just what we need. Utility billing with clouds allows us to scale easily and quickly to meet the demand, and when the demand eases, we can give those resources back and stop paying for them. Previously we would run our machines mostly idle so that we had additional capacity available. Today we run our machines mostly at capacity; 80 or 90% is not uncommon, because we really want to use our hardware most effectively. Utility billing makes this possible. Now, all of that applies to any scaling scenario. Let's apply it specifically to Kubernetes. In Kubernetes we could talk about scaling the cluster. We're not going to do that today, but as you grab the slides from robrich.org, dig into scaling the cluster; that can be a really fun topic. Today we're going to talk about scaling the workload, scaling pods. That presumes your cluster is big enough to handle it. Next, we could talk about vertical scaling or horizontal scaling. Vertical scaling is definitely interesting; it's about changing resource limits on our pods to match increased demand. But today we're going to talk about horizontal scaling: increasing the count of pods to match the demand that we have. So let's dig into pod scaling. Well, why isn't this automatic? Why isn't there just a button to push, and now we have scaling in our cluster? Well, it depends a lot on the workload, in particular how your workload works and what metric you're using to measure it. What metric? We might have a CPU-bound workload, in which case we want to scale based on CPU. Or we might have an I/O-bound workload, in which case we want to scale based on concurrent requests, or maybe the request queue length. We may scale based on external factors: maybe we're looking at our message bus queue length, or maybe we're looking at our load balancer for details of how we should scale. Or maybe we're looking at latency in a critical function. Is it taking a long time to log in? Maybe we need to add servers associated with our authentication process. Now, this is definitely not an exhaustive list of metrics, but why isn't this built in? Because it really depends on how our workload works. If it's a CPU-bound workload and we're scaling based on I/O metrics, then of course we're not going to scale correctly. Let's take a look at a few use cases. In this first case, we chose to scale based on CPU, but it's an I/O-bound workload. Because it's I/O bound, our system sits mostly idle as we wait for our external data store. Because we're waiting on that external data store, the machine is idle, the CPU is low, and Kubernetes never discovers that our system is under load. So we're never going to scale up beyond the minimum number of pods. Similarly, maybe we have an I/O-bound workload and we're scaling based on concurrent requests. Perhaps our application framework limits the number of concurrent requests, putting back pressure on our load balancer to queue the incoming requests, so we only ever see a certain number of concurrent requests.
We'll never scale beyond our minimum, because Kubernetes doesn't know that our system is under load; we've chosen the wrong metric. Let's look at a third use case. Here we've chosen to scale based on our service bus queue length. If there are messages in the queue, we'll scale up additional pods to handle those messages; perhaps each message sends an email. Then, when the queue length is short, we'll scale back down so that we're not using extra resources. In this case we matched our metric with our business concerns, and so we're able to scale appropriately. Now, in each of these scenarios we looked at mechanisms where we could choose a metric to scale on, and in many of the instances we chose the wrong metric. That's not to say those metrics aren't good for scaling, but in those scenarios that's not how the application works. This is definitely not an exhaustive list, but as you look at scaling you might look to these and other metrics to understand the health of your system and what your system looks like under load. So let's take a look at the Kubernetes autoscaler, and in particular let's look at the built-in metrics. Our first step in enabling the Kubernetes autoscaler is to enable the metrics server. The metrics server captures CPU and memory on all of our pods and presents that to the horizontal pod autoscaler. So let's turn on the metrics server first. Is it on? Let's do a kubectl top for our pods, in this case across all namespaces, and see if it errors. If it errors, we need to turn it on: head off to the metrics-server part of the Kubernetes project, grab the latest release, and apply components.yaml. In our case we're using minikube, so I just enabled the metrics-server add-on. Now that we've got the metrics server enabled, let's deploy our workload. We do need to customize our workload ever so slightly to ensure that the autoscaler will behave as expected. We need a resource that knows how to build pods: a Deployment, a StatefulSet, or a DaemonSet. That gives us a recipe, a template for how to build pods, so we can scale out as we need to. In our pod definition we also need to set resource limits. Kubernetes is going to look to these resource limits to know how much capacity we have left in that pod before it needs to create another. Similarly, we need to remove the replica count from our deployment; the replica count is going to be controlled by the autoscaler, not by the deployment anymore. So here's our deployment, or maybe it's a StatefulSet or DaemonSet. Notice that we've commented out the replicas; that shouldn't be here, because the autoscaler will manage it. We've also set limits so that we know how much capacity each pod has. So if we're using 100% of the capacity of this pod, that means we're using half a CPU. If we're using more than half a CPU, we're over 100% of this pod, and if we're using a quarter of a CPU, we're using half the capacity of this pod. Next up, let's build the autoscaler. The autoscaler is going to grab that metric, average it across all of the pods, and either increase the pod count to get that average down or decrease the pod count to let that average come back up. The horizontal pod autoscaler checks these metrics every 15 seconds, and once it notices that it needs to make a change, it'll make that change, but then it won't make additional changes for five minutes. Now, we can definitely customize these defaults, but they help us not to overly burden our system, and to reach consensus once we've made a scaling change. Maybe we need to populate a cache or redistribute our content across all of the available nodes, and that may take some time. Once we've reached stasis, we can make additional scaling decisions. So let's create a horizontal autoscaler. With this kubectl command we can quickly identify the deployment, say that it's CPU-bound, give the target percentage (in this case 50%), and set the minimum and maximum number of pods we should create. Now, 50% probably isn't a good target; we should probably run more like 80 or 90%, but it makes a good demo. We could also deploy this as a Kubernetes YAML file. We've identified the target, and it's a deployment. We've specified the min and max replicas, so we'll have between one and four pods. And we've identified the type of metric we want to use; in this case it's a resource, CPU or memory. We've chosen CPU, and we say that we want an average utilization of 50%. Again, 50% is probably low, maybe we want 80%, but this is a percentage of the resources set in our deployment. Here in our deployment we have 0.5 CPU, so 50% of our 0.5 CPU would be a quarter of a CPU. If we go over a quarter of a CPU, it sounds like we need more pods; if we go under a quarter of a CPU, we need fewer pods.
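To make that concrete, here is a minimal sketch of the two manifests being described. It's an illustration, not the exact demo code: the name cpu-demo, the container image, and the file names are placeholder assumptions, while the shape (a half-CPU limit, no replicas field, and an HPA targeting 50% average CPU between one and four pods) follows the talk.

# cpu-deploy.yaml (sketch; names and image are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-demo
spec:
  # replicas intentionally omitted: the autoscaler owns the count
  selector:
    matchLabels:
      app: cpu-demo
  template:
    metadata:
      labels:
        app: cpu-demo
    spec:
      containers:
        - name: web
          image: nginx   # stand-in for the demo's Express app
          resources:
            limits:
              cpu: 500m  # with only limits set, requests default to the same value,
                         # so 100% utilization means half a CPU
---
# cpu-hpa.yaml (sketch), roughly equivalent to:
#   kubectl autoscale deployment cpu-demo --cpu-percent=50 --min=1 --max=4
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-demo
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50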
So let's do this demo. We have here a deployment that we will build out. It references a pod that is just a basic Express server with a single route that returns some text. Notice that we've set our resource limits so that we can autoscale based on those limits, and we've chosen just the CPU. We could specify other limits if we want, but we want to make sure that we horizontally autoscale based on a single metric. We've also commented out the replicas; we don't want our deployment to specify the count, we want our autoscaler to specify it. And here's our autoscaler: we can see that it scales based on a deployment, this is the deployment name, and we scale based on CPU. In this case, because it's a very small application, we're only going to scale on 5% of that half a CPU. First stop is to make sure that our metrics server is enabled. Let's do kubectl top pod across all namespaces, and it looks like yes, in this case our metrics server is enabled: we get the CPU and memory for all of our pods, whether we're using a horizontal autoscaler or not. Great. kubectl apply -f the CPU deployment, and we've got our deployment in place; let's also apply our horizontal autoscaler. When we first spin up our horizontal autoscaler, the value will be unknown. It takes 15 seconds to do a lap and ask the pods what their resource usage is, so for those 15 seconds we don't know if we need to do anything. Let's take a look. Oh, it looks like we have 31%, so it's going to create a whole bunch of pods to match that capacity. Now, I'm surprised it got to 31% straight away. Let's change the target to 25% and apply it again. We probably have to deploy our horizontal autoscaler again. Yes. So now we should only need two pods. Now we're in that window where we've just made an adjustment, so it'll take a while for this to calm down, but let's generate some load on our system and see if that changes things.
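Roughly, the command sequence for this demo looks like the following. It's a hedged sketch: the manifest file names and the cpu-demo service name are assumptions, the minikube add-on step only applies if you're running minikube, and BusyBox ships wget rather than curl, so the load loop uses wget.

# Is the metrics server running? If this errors, it isn't.
kubectl top pods --all-namespaces
# On minikube, the easiest way to turn it on:
minikube addons enable metrics-server

# Deploy the workload and the autoscaler, then watch the autoscaler.
# TARGETS shows <unknown> for the first ~15 seconds.
kubectl apply -f cpu-deploy.yaml
kubectl apply -f cpu-hpa.yaml
kubectl get hpa cpu-demo --watch

# Generate load from a throwaway pod (assumes a Service named cpu-demo on port 80).
kubectl run load-generator --rm -it --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://cpu-demo; done"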
So in this case I have a BusyBox pod that's just requesting that app in a loop, so it's returning "Hello world" a whole bunch of times. Now that we've got some load on the system, let's take a look at our autoscaler. It looks like we're down to zero, so we've scaled down a whole lot. Let's watch the autoscaler and see if we capture a change in metrics now that we've added some load. It does run every 15 seconds, so we'll have to wait a little bit to see this metric change, but once it changes, it can recalculate the number of pods that we need. We can adjust that 15 seconds: we can make it more aggressive, maybe 10 seconds or sooner, to look at our pods more frequently, or we can make it much larger so that we put less load on our pods, because each check has to go ask them. Okay, now we've gotten up to 24%. Based on that 24%, let's see if we need additional pods. No, it looks like we don't. So I'm going to set this back to 10% and redeploy our autoscaler, and now, based on that 24%, it will spin up a whole lot of pods. Great. We saw how we could spin up pods and spin down pods based on the needs of our system. kubectl delete -f to remove our deployment; let's also delete our autoscaler, and now we can see those are gone. Oh, we're still generating load. Let's stop that. Now, if we do kubectl top pod, we can see that we're still collecting metrics even though there's no horizontal autoscaler using them. That's fine; it's nice data to have, and the metrics server is still running. Great. So we got to see the horizontal autoscaler, and we got to scale based on a CPU metric. Now let's take a look at other metrics we might want to choose. As we looked at this, here's kind of our mental model: our horizontal pod autoscaler checks the application every 15 seconds to see if anything needs adjusting. Now, that's a little bit of a naive interpretation, because we have our metrics server in between. I love these graphics; grab these slides from robrich.org and click through to the learnk8s.io page, it's a great tutorial. This too is a little bit naive, because inside the metrics pipeline we actually have three different APIs: the resource metrics API, the custom metrics API, and the external metrics API. Zooming in on each of those: the resource metrics API is about CPU and memory; those are the two built-in metrics. With custom metrics, we might look to our pods to find other details; any detail we can get out of those pods would work nicely here. In the custom metrics API, maybe we'll harvest Prometheus metrics, for example. And then external metrics: maybe we're looking at a request queue length or a message queue length and taking action based on resources that are external to our pods, maybe other Kubernetes resources, other cloud resources, or other hardware within our environment. We can look to those to make decisions about how many pods we need. So we have these three APIs, and each of them reaches out to an adapter, which in turn reaches out to a service to get data. For example, for the resource metrics API we reach out to the metrics server; the metrics server in turn looks to cAdvisor, and cAdvisor harvests those metrics from our pods. For the custom metrics API, perhaps we're using the Prometheus adapter: the Prometheus adapter looks to Prometheus, and Prometheus looks to all of our pods.
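If you want to see those three APIs for yourself, you can query them through the Kubernetes API aggregation layer. A small sketch; the custom and external endpoints only respond once a matching adapter is installed.

# Built-in resource metrics (served by the metrics server)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/pods
# Custom metrics (served by an adapter such as the Prometheus adapter)
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
# External metrics (served by an adapter that watches resources outside the cluster)
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1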
When we're looking at external things, maybe we're using Prometheus to monitor those as well, or maybe we're using another adapter. So we can drive autoscaling in really interesting ways. Let's take a look at how we might use that with Prometheus. Our first stop is to install the metrics server. We did this previously, and even though we're not using the metrics coming out of the metrics server, this gets us the metrics APIs as well. Then we need to install Prometheus. I'm grabbing Prometheus from the prometheus-community Helm chart, which installs just Prometheus; if I chose the Prometheus stack chart, I would also get Grafana. I could choose to put this in a different namespace, and that would probably be a good practice, but for the sake of today's demo I'll just put it in the default namespace. Next up, I'm going to install the metrics adapter. I'll pull this from the prometheus-community Helm charts as well and install the Prometheus adapter. I'm going to set some properties here; I could do this with a values YAML file, but here I'll just set them directly. I'm going to point it at the Prometheus URL. The Prometheus Helm chart creates a Prometheus service, and it's on port 80 instead of 9090 as it typically is, so I'll configure the adapter to point at my Prometheus service. Now I've got the Prometheus adapter running. Next I'm going to configure my workload. I could configure Prometheus to point at where my workload is, but because it's all running in my cluster, I can create an annotation on that deployment that will allow Prometheus to discover it automatically. So here in my deployment, I'm going to create an annotation on my pod that says I want Prometheus to scrape the metrics, and gives the path and port that it should scrape. This is perfect: I've identified that Prometheus should scrape the metrics out of all of these pods. I turn it on, I give it the path, and I give it the port. It is important that these are quoted, because if they aren't, they would be a boolean and an integer respectively, and they need to be strings to be valid annotations. So I put quotes around them and it works just fine. Now I'll deploy my workload, kubectl apply that deployment, and I've got my content. I know that Prometheus is going to monitor that content and make those metrics available to the Prometheus adapter, and now our horizontal autoscaler can harvest those metrics to make scaling decisions. So next, let's put in the horizontal autoscaler. I've got this horizontal autoscaler, and in this case it's going to target a deployment; here's the name of my deployment. Rather than resources, as we saw previously, this one is going to scale based on pods. I wish this said Prometheus, but it doesn't. We'll give it some interesting metric from Prometheus that the Prometheus adapter knows how to get, and we'll give it a target value. If the metric is lower than that, we'll reduce the number of pods, and if it's above it, we'll increase the number of pods.
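Here is a hedged sketch of those steps. Chart values and service names can differ by chart version, the application port (3000) and the metric name are assumptions, and the custom metric only exists if the app actually exposes it and the adapter's rules pick it up.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Prometheus itself (the kube-prometheus-stack chart would also bring Grafana)
helm install prometheus prometheus-community/prometheus
# The adapter that serves the custom metrics API from Prometheus data;
# this chart's Prometheus service is named prometheus-server and listens on port 80
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url=http://prometheus-server.default.svc \
  --set prometheus.port=80

The pod-template annotations and the Pods-type autoscaler might then look something like this:

# Deployment pod-template excerpt: let Prometheus discover and scrape these pods.
# The values must be quoted so they stay strings rather than a boolean/integer.
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "3000"   # assumption: the app serves metrics on 3000
---
# HPA driven by a per-pod custom metric (metric and deployment names are hypothetical)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: requests-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: requests-demo
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "10"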
Now, how might we look to other metrics? There are lots of different adapters we can look at, so let's take a tour. Here's a really great project that allows us to create adapters from all kinds of sources. We could call HTTP into our pods and harvest JSON, querying into that JSON to grab a particular metric. There's also a Prometheus collector, although the Prometheus adapter is usually a better choice. We could look at the InfluxDB collector, which is a great example of how we might query other data stores, or the AWS collector, where we could look at, for example, SQS queue length. We could use that as a template to grab things from blob storage or GCP. And we can also scale based on a schedule: if we know that our work starts in the morning and ends in the evening, and that we won't use the cluster overnight, we can create a schedule to automatically scale up and down. Unlike the other metrics, which look at behaviors in our cluster, this one just runs on a timer, and maybe that's sufficient. We could also look to Istio and grab metrics from Istio. There are many other adapters that will allow us to query other systems to get metrics, both for pods and for external resources, so we can make choices about how to scale. We took a look at the built-in sources for CPU and memory. Those are definitely the easiest, but if our workload isn't based on CPU and memory, if it's based on another metric, perhaps we can look to Prometheus to find metrics specific to our application. Anything that we can expose as a Prometheus metric we can then use to autoscale our service. Similarly, we could look to external metrics: perhaps we're looking at our load balancer or our queue length and making decisions on how many pods we need based on that. Let's take a look at some best practices. These best practices will help us make good scaling decisions. What if we scale up too high? Maybe we're getting attacked, maybe we have a bug in our software, or maybe we've just gone viral and we need to scale up to impossible scale. We probably don't want to scale infinitely, because, especially in the case of an attack, we don't want that big cloud bill. It would be really easy to stay scaled up for the duration of the attack and end up paying a lot. So let's set a max value that matches the budget of our organization. We understand that our system may not be able to reach the demanded capacity, but it'll stay within our budgetary goals: if it's an attack, we're not paying extra to handle that attack, and if we've gone viral, maybe we reconsider that upper bound. The other reason we might scale up too high is that maybe we're monitoring more than one metric. Kubernetes is going to do the right thing here: if we're monitoring two metrics and one is high and one is low, Kubernetes will use the high metric to reach a capacity that keeps all the metrics within bounds. For example, if we're monitoring both CPU and memory and the CPU is low, we might say, well, you shouldn't have scaled up here, but maybe the memory is high, and so we did scale up. So here we need to define the max replicas in addition to the min replicas. Don't just assume that infinitely high is sufficient. Pick a budget that matches the needs of our organization and understand that our system may be less available if our load exceeds that, but we're reaching our budget goals, which is the need of the business.
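As a hedged sketch of that multi-metric case (the names are placeholders, and the 80% targets and the ceiling of 20 are illustrative rather than from the talk), here is an HPA watching both CPU and memory with an explicit maximum; the controller computes a desired replica count per metric, uses the largest, and never exceeds maxReplicas.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: budget-capped
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-demo               # placeholder
  minReplicas: 2
  maxReplicas: 20                # a ceiling chosen to match the budget, not "infinity"
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80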
Next up, we may scale too slowly. We noted how the horizontal autoscaler only checks every 15 seconds; that's definitely a configurable value. And in the case of missing metrics, Kubernetes will assume the best-case scenario: if we're scaling out, Kubernetes will assume that the new pod is using 0%, so it won't scale out unnecessarily; if we're scaling in, Kubernetes will assume that the missing metric was at 100%, so it probably won't scale in until that metric is available. So if it takes a long time for our pods to surface these metrics, we may notice that the Kubernetes autoscaler won't take action. What if, when we deploy a new version, it always resets to one pod and then scales back up? Yeah, if we keep a hard-coded replica count, we can hit that scenario. What happened here? In our deployment (or other resource) we had a hard-coded replicas count, so as we deployed the new version of our deployment, stateful set, or replica set, Kubernetes followed our instructions and killed off all the other pods, and we ended up with only one, or maybe a few. We really want our autoscaler to handle this, so we need to comment out the replicas line in our deployment and let the horizontal autoscaler own it. This does create an edge case where, the very first time we deploy it, we're only going to get one pod until the autoscaler notices. Well, I'd rather have 15 seconds of only one pod than have it reset to one on every deployment. If you grab these slides from robrich.org and click through to the linked post, you can find more about that edge case and how to handle it gracefully. Flapping, or sloshing: the documentation talks about flapping, as if a door keeps opening and closing. I like to talk about sloshing, like we're pushing water up against the beach and the water just keeps moving. Let's assume we have a Java app that takes a while to spin up. We notice that we don't have enough capacity, so we spin up some pods. Fifteen seconds later we notice that those pods aren't live yet, and we decide to scale up again. A minute later, once our Java app is booted and everything is working, we now have too much capacity, so we scale back down. Then we notice we don't have enough, so we scale back up, and down, and up, and on we go, sloshing back and forth. Well, what happened here? Kubernetes will automatically block additional scaling operations until our system has reached stability; by default, that's five minutes. In this case, I added a section to my horizontal autoscaler to set that stabilization window to 600 seconds, or ten minutes, because I want to give my application a little extra time to get stable. Alternatively, in our demos, we set that stabilization window to 1 second so that it would make additional scaling decisions right away, and that worked out really well for the demo.
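A hedged sketch of that tweak, using the behavior section available in the autoscaling/v2 API (the 600-second value mirrors the talk; the rest of the spec is a placeholder):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-demo                      # placeholder
  minReplicas: 1
  maxReplicas: 4
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes of stability before scaling down (default 300)
    scaleUp:
      stabilizationWindowSeconds: 0     # scale up immediately (the default)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80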
Scale to zero: the horizontal pod autoscaler won't scale to zero, it'll only scale down to one, and now we're paying for our application to run even if there's no load at all. How do we overcome this? Well, the problem is: how do we start the application back up? If a request comes in and there are no pods running to handle it, that request will just fail. To make this work, we need a reverse proxy in front of the application that can accept the request, notice there are no pods, and kick one back up. It may take a minute for a pod to start on a completely cold start, so maybe that initial request still fails, but that's the scale-to-zero problem: we need a reverse proxy in front of the workload to notice that it needs turning back on. That's not built in; it's a problem larger than just horizontal autoscaling between one and some set number, so we might need to reach for external tools. Knative and KEDA both offer this, and we can reach for one of those products to get to scale to zero. It does mean we'll have some pods running associated with Knative or KEDA, so if we only have one microservice, maybe just leaving that one microservice running is simpler than a scale-to-zero solution. So, we took a look at horizontal autoscaling in Kubernetes, and it was really cool. Horizontal autoscaling is about scaling up when we have additional demand, and scaling down, paying less, when we don't. This is made possible by utility billing: we can go to a cloud, get some resources, and when we're done, hand them back and pay nothing, and the scaling operation can happen in real time. So we don't need to pre-provision our hardware, we don't need to over-buy our systems to reach maximum capacity; we can scale up and down as our needs change.
...

Rob Richardson

Developer Advocate @ Jetpack.io



