Conf42 DevSecOps 2022 - Online

Proactive cluster autoscaling in Kubernetes

Abstract

TL;DR: Scaling nodes in a Kubernetes cluster can take several minutes with the default settings. Learn how to size your cluster nodes and proactively create nodes for quicker scaling.

Summary

  • Kubernetes lets you scale infinitely, auto-heal your cluster and so on. But how does the scheduler know how much memory and CPU a pod uses? Well, it doesn't, strictly speaking. You need to spoon-feed this with requests and limits.
  • But that has actually only cost me a penalty in this scenario of $1,154. In effect, my company is about $15,000 in profit compared to where it presumed it would be without any autoscaling. Ultimately, this is an experiment to play with and try to get your name on the leaderboard.
  • If we sized our machines correctly, we might be able to fit all of our workload in a node. We've done some of the hard work for you in this space and created an instance calculator, with which I can tune and tweak the configuration of my nodes. Next, a live demo.
  • What if we could always have at least one node that's ready for when you need it? To do this, we can create a placeholder pod that is evicted as soon as your workload comes along needing the resources. This continues as you scale into further nodes, keeping you always one step ahead.
  • How does that work when we oversubscribe our VMs in VMware? Oversubscribing is not how Kubernetes is designed to work. You need to give Kubernetes as much information as you can in order for it to make the best decision.
  • A demo of how this works with autoscaling. On the left-hand side, we can see the requests per second that we're serving, and you can see my nodes can each take up to four workload pods. How does this work in a more real-world scenario?
  • As the traffic builds up, that placeholder quickly becomes evicted and our workload pods become scheduled. The plan is to always have extra capacity ready and waiting for your workload to require it. You can tune your workload to make sure that you're not leaving gaps.
  • I've been Chris Nesbitt-Smith, thank you again for joining me today. Like, subscribe, whatever the kids do, on LinkedIn, GitHub, wherever you can. Rest assured there'll be little to no spam, since I post very little content.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Cool, awesome, thank you. So to kick things off, my name is Chris Nesbitt-Smith. I'm based in London, currently an instructor for Learnk8s and a consultant to various bits of UK government, and a tinkerer of open source stuff. I've been using and abusing Kubernetes in production since it was 0.4, so believe me when I say it's been a journey. I've certainly got the scars and the war wounds to show for it.

So you believe the hype that Kubernetes lets you scale infinitely, auto-heal your cluster and so on. Your cluster is self-monitoring, it's scaling up instances of your cloud native stateless applications on demand when you need more, but all of a sudden your nodes are full and you can scale no more. Well, enter the cluster autoscaler, and of course a splash of YAML, to save the day. It can integrate with your cloud vendor to provision the necessary extra nodes. And the good news is that the autoscaler is configurable. Though sadly, as we'll see, it's not quite as configurable as you might hope or expect. There are alternatives, but the official cluster autoscaler only scales up when there are pending pods to satisfy the demand, which is probably a good idea, since there's little point adding more nodes unless you have the workload that needs them.

Okay, so first let's refresh ourselves on how the Kubernetes scheduler works. If I create a deployment with two replicas, I do this by submitting some YAML to the API server, which then writes it to etcd. The controller is watching for this type of event, recognizes it needs to go and create some pods, which it does, and these are now pending. The scheduler is the component that looks for pending pods, sees these, and then schedules or assigns them to a node. The scheduling, however, is broken down into a few steps, from the initial queue through filtering viable nodes to then scoring them, before finally creating the binding.

But how does the scheduler know how much memory and CPU a pod uses? Well, it doesn't, strictly speaking. You need to spoon-feed this with requests and limits. If you don't specify requests and limits, Kubernetes is flying completely blind: your cluster will inevitably become overloaded, nodes will become oversubscribed, and you'll be constantly fighting fires. So if your only takeaway from any of this is that all your containers should definitely have requests and limits defined, then we've at least achieved something useful here. Requests are the initial ask, and limits are the point where the container will be throttled on CPU, or killed if it exceeds the memory.

Okay, so applications come in all sorts of shapes and sizes. You may have some applications that are more CPU intensive and don't require much memory, while on the other hand you may have others that have a greater memory than CPU footprint. Those applications have to be deployed inside computing units which have, again, their own CPU and memory characteristics. For every application deployed in a cluster, Kubernetes makes a note of the memory and CPU requirements. It then decides where to place the application in the cluster; in this case, it's on the far left node. If another application of the same size is deployed, Kubernetes goes through that same process and finds the best node to run the app. In this case, it picks the right-hand node.
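To make that concrete, here's a minimal sketch of a Deployment with requests and limits declared per container. The name, image and figures are illustrative, not taken from the talk:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                            # hypothetical example app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:1.0   # illustrative image
          resources:
            requests:                  # the initial ask the scheduler plans around
              memory: 256Mi
              cpu: 250m
            limits:                    # throttled above this CPU, killed above this memory
              memory: 256Mi
              cpu: 500m
```

Without the requests block, the scheduler has nothing to plan around and will happily over-pack nodes.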
As more applications are submitted to the cluster, Kubernetes keeps making notes of the CPU and memory requirements and allocating these applications in the cluster. If you play this game long enough, you might notice that Kubernetes appears, at least, to be a reasonably skilled Tetris player: your servers are the board, your apps are the blocks, and Kubernetes is trying to fit as many blocks as efficiently as possible.

But what about the size of the worker nodes? What kind of instance types can you use to build your cluster? Nowadays the cloud vendors make almost every instance type available to be part of the cluster, so you've got pretty much free choice. There is a catch, though. You could be forgiven for thinking that if you get an eight gig RAM and two CPU node from your cloud vendor, you could deploy four pods that each need one and a half gig of RAM and a quarter of a CPU. However, it's not quite so: one of those pods would remain pending, which, if configured, will of course cause the cluster autoscaler to go and create a new node, and then eventually your workload becomes scheduled.

But why is this so? When you provision a managed instance, you might think that the memory and CPU available can be used for running pods, and mostly you'd be right. However, some memory and CPU should be reserved for the operating system, and you should also reserve memory and CPU for the kubelet. But surely the rest is available to my pods, right? Well, not quite: you also need to reserve some memory for the eviction threshold. If the kubelet notices that memory usage is going over that point, it will start evicting pods. Your cloud vendor will usually choose these numbers for you. For example, AWS typically reserves 255 meg of memory for the kubelet, plus eleven meg for each pod that you can deploy on that instance; that's the memory reserved for Kubernetes. The CPU reserved is usually around 0.3 to 0.4 of a CPU. For the operating system, they reserve 100 meg of memory and 0.1 of a CPU, and for the eviction threshold, another 100 meg. So in AWS, if you select an m5.large, here's a visual recap of how the resources are subdivided: with this particular instance, you can deploy 27 pods.

The other thing to consider is all the time that this takes. Let's assume that you've configured your horizontal pod autoscaler, or HPA, to scale up your pods dynamically. Well, that's where the journey probably starts. To start with, about 90 seconds is what's needed for your horizontal pod autoscaler to react and decide to scale up your application. The cluster autoscaler then takes around 30 seconds to request a new node from the cloud vendor, then around three to four minutes for the machine to boot, and then around another 30 seconds for it to join the cluster and be ready to run workloads. You can of course add on time for pulling your container image, which won't be cached on this brand new machine.

So to help visualize the impact this can have, I built a library last year that fakes the Kubernetes scheduler. It allows you to specify many different types of pods and model their scaling dynamics, tracking container startup times and so on, and define your node properties. It takes a lot of shortcuts in order to provide hundreds of thousands of intervals, representing days, in tens of milliseconds. It's not the real Kubernetes scheduler. Pull requests are very welcome if you'd like to improve it.
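Going back to the reserved resources for a second: those reservations are the kind of thing that shows up in the kubelet's configuration. A minimal sketch, with figures that loosely echo the AWS-style numbers above rather than any recommendation (managed offerings normally set these for you):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:                  # reserved for the kubelet and other Kubernetes daemons
  cpu: 80m
  memory: 574Mi                # roughly 255Mi + 11Mi per schedulable pod
systemReserved:                # reserved for the operating system
  cpu: 100m
  memory: 100Mi
evictionHard:                  # the eviction threshold
  memory.available: 100Mi
```

Capacity minus these reservations is the node's allocatable, which is what the scheduler actually packs pods against.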
So to give you a way to play with that scheduler simulator, I also made a game as a novelty for KubeCon last year called Black Friday. The scenario is that you're an SRE team supporting a retailer facing a spike in traffic on Black Friday, and then again on Cyber Monday, with a lull between and a calm before and after. As you can see, it's a three tier service of a front end, back end and database, all of which have different scaling properties, startup times, et cetera. We can see some of the details here in the hints: the properties of our nodes, or different node types, and the profile of the front end, and similar for the back end. It's a bit zoomed in at the minute on this presentation, but you can see here's my Friday spike, here's my Cyber Monday spike, and how that trails off. The goal is to configure your cluster to follow the spike as closely as possible with just enough infrastructure. Failing some requests and getting a few SLA penalties might actually result in a greater profit ultimately.

So let's have a play with this. If I, for example, change my minimum node count down to one, and similarly drop everything down to start at one instance of everything, you can see I can change the point at which I scale up these pods. So let's see, hopefully that should schedule out and we'll see some failed requests. Yeah, gotta love live demos. Yep, that's the live demo failing. Maybe let's go with two; it's failing to schedule my first interval. Okay, so we can see that we've reached a point where we've got some failed requests right here, and some more failed requests on Cyber Monday. This is where my infrastructure scaling and my pod autoscaling failed to follow the spike of demand as closely as it needed to. But that has actually only cost me a penalty in this scenario of $1,154, so in effect my company is about $15,000 in profit compared to where it presumed it would be if it didn't have any autoscaling. So maybe not a bad idea, perhaps. And you can tune and tweak these numbers, so please do feel free to play with this. May the odds be ever in your favor. Ultimately, this is an experiment to play with and try and get your name on the leaderboard.

Cool. Okay, first live demo out of the way. What do we do in order to stack some of the odds in our favor? Well, we can not scale at all. That is always an option, and often overlooked. Or what if we could get a head start on the scaling? Maybe not scaling at all sounds a bit flippant, but what do I really mean by that? Well, going back to our scenario of fitting our pods on a machine, taking into account the reserves for Kubernetes, if we sized our machine correctly, we might be able to fit all of our workload in a node. This isn't easy, given the vast array of possible machine sizes, so we've done some of the hard work for you in this space and created an instance calculator. Next live demo. With this I can tune and tweak the configuration of my nodes: I can say that I want to look at the efficiency and how many pods I can fit on it, I can size my pods up in memory and see that dynamically move around, and I can see that I've not got a particularly efficient node size.
Now I can flip around and see that other nodes offer a varying level of efficiency if my pod requires these sorts of properties, and I can densely stack things. I can hunt around different cloud vendors and pick the right node type, and I can optimize for either CPU or indeed for memory, and change some of the properties around daemon sets and agents and things like that. Cool.

Okay, so finally onto the topic of this webinar. All the wait's over, he says, as his clicker stops working. What if we could always have at least one node that's ready for when you need it, removing that three and a bit minute wait? Well, to do this, we can create a placeholder pod that, as soon as your workload comes along needing the resource, becomes evicted, causing your cluster autoscaler to boot a new machine in order to host the new replacement placeholder. And this will continue as you scale into further nodes, keeping you always one step ahead.

Okay, now to pray to the demo gods again, where I do a real live demo and hope that everything works with a real Kubernetes cluster. I have backup plans of videos if not. So let's have a look. Behind the scenes here there is a real Kubernetes cluster, and I can prove that in a minute. We've got a simple application where we can see the effects of clicking on the scale buttons. So if I say to scale to five, we start with one replica, and clicking the five, hopefully that should start. Cross my fingers and hope that the thing that I built works. Cool. Hopefully. So they're pending. It never works; we have a backup plan. Fine, they are all there, bar the browser not loading the timer. Right, so what you would see is this timer starting to count up. We'll ignore it for the minute and check back in a sec, but what you'll see is this timer should elapse to around three minutes. So if you look at how long we've been talking here, it will be about three minutes in order to scale up to two nodes. At the minute the cluster autoscaler is going off and asking for a new node. We can do that in a separate instance over here; I've got two clusters, to at least show you a timer that actually works. It did. Okay, fine. Live demos. Okay, similarly, so assume that this one is plus 10 seconds and the other one's plus however many minutes; fine, we'll come back to them. We started with one replica and we scaled to five, so the current node gets saturated with four pods and one is left pending behind the scenes. So now the cluster autoscaler is going to request a new node from Linode.

So while I stall for about three minutes of what would otherwise be radio silence of me praying for it to work properly, are there any questions I can come to? So, Sam asks: if we create a pre-scaled node for each node pool, wouldn't that cost more, and how do we turn them off when we're not in use? Yeah, if you've got a node that's hanging around basically just running a placeholder, then yes, your plan is ultimately to spend some money keeping that spare. This isn't fundamentally an uncommon pattern; you might be used to seeing it in other worlds where you might have a RAID array with a hot spare in it, because sure, you can go and get a new drive or a new blade unit to throw in the chassis, but if you've got one already racked up, it saves you a trip to the data center.
So likewise, albeit that's a timeline that's normally measured in hours at best, probably days, here we're trying to shave minutes off. So we're in a better place than we might have been with more traditional infrastructure and hardware. But I'll come to that point at the end: how you might be able to lessen the financial impact of this sort of pattern.

Another question: we mentioned oversubscribing the nodes with requests and limits; how does that work when we oversubscribe our VMs in VMware? Well, if you oversubscribe your nodes in the hypervisor, then you are in for a bad day, I guess, is the very short answer. That's not how Kubernetes is designed to work, in the same way that you'll notice Kubernetes does not play well with swap for memory. It relies on real resources, and you need to give Kubernetes as much as you can to empower it to know what's going on. I think on the first cluster I can see... so I've got two clusters here, there's two tabs. This first cluster has now got its second node, so we should see a second pod. It's pending at the minute, waiting on the other node. There we go. Fine. Okay, so that took about three minutes. Sorry, I'll just finish off that question: yeah, you really need to give Kubernetes as much information as you can in order for it to make the best decision. So help it help you; ultimately, tell it what's going on.

Okay. So as you can see, that took about three minutes for my pods to scale from one to five. The first bit of it was done within about 10 seconds, and then the last pending pod took a bit longer. So let's see what the difference is if I use my placeholder pattern. What I'm going to do is scale back down to one, which should hopefully go and kill my pods. Cool. And then I'm going to deploy my placeholder. As I'll show in a minute, this is a real Kubernetes cluster; I've just got some JavaScript and a web page that handle submitting some YAML, just so you've got a visual view of what's going on. On my other screen, I am frantically watching the actual logs and the events. So now I've got a node here, and I'm back down to my original scenario where I have my one instance of the application running, and I have my placeholder pod keeping the other node present. It's a kind of hack on the autoscaler just to keep a node available. So now if I click scale to five, I'll quite quickly see that it's saturated the first node, and in a few seconds... 7 seconds, there we go, my container has now booted on the second node. So we've gone from near on four minutes down to about 7 seconds. Pretty good, right?

Cool. Okay, we can turn the placeholder back on, and what that will do in the background is go off and start another node. Sorry, the placeholder was already on, I didn't need to do it; the placeholder was there. So in the background, what's already happening is that the cluster autoscaler has started booting another node. We can do that again in my second cluster... it's just stalling showing the number of nodes. Fine. Okay, maybe we won't do it in the second cluster; something weird is going on there. Fine, at least I had two plans. Cool. Okay, that was the stressful bit out of the way. The slides don't change. Cool. So to prove that this is a real cluster, this is what was going on just a second ago, looking at the events that were streaming out of the cluster, demonstrating the nodes coming up and down and the pods being scaled up and down as I went.
We can skip past these, because these are the backup videos that I had of it working. Cool. So how did I make this all happen? Firstly, we need a placeholder, and that needs to be big enough that it will never be schedulable alongside any real workload on a node; it should be sized to fill the whole compute node. Then you need to specify a low priority class in order to make sure that it's evicted as soon as a real workload comes along and needs that capacity. The placeholder pod competes for resources, so we need to define that we want it to have that low priority, with a priority of minus one. All other pods, even default ones, will have precedence, and the placeholder is then evicted as soon as the cluster runs out of space. We should hopefully, almost, maybe... there we go. Bingo. So we've now got our overprovisioned node, and if I scale back down to one and turn the placeholder off, we can come back at the end and hopefully see that the autoscaler has reduced us back down to zero. It could take a minute or so for the cluster autoscaler to recognize that.

Okay, so now for a demo of how this works with autoscaling. Just now we demoed it with point and click to change the replica count; how does this work in a more real-world scenario where you have horizontal pod autoscaling? I'll be honest with you, I have definitely exhausted my credit with the demo gods, so I'm going to be playing some videos here and provide a bit of narration, which also saves us all hanging around for the three minutes or so that it takes. Before I start, a little orientation: on the left-hand side, we can see the requests per second that we're serving. On the bottom left, you can see the nodes and the pods on them, and you can see my nodes here again can take up to four workload pods. I've got a simple application that can handle a fixed number of requests, and I'm ramping up traffic. As you can see, we start with two nodes and we're able to sustainably handle the traffic as it increases, until we fill both nodes. At that point the HPA has decided to scale up, and then, in a minute or two's time that I can fast forward, we finally see the cluster autoscaler start to actually provide the extra nodes that will be required. Skip forward a bit. There we go: the cluster autoscaler has now come in and provided the extra nodes. I've manipulated these results a little so as not to leave you waiting too long, which would be around four minutes for the nodes to be available. In that time, while we were waiting for the nodes, the traffic that we were able to service as an application really flattens out. But as soon as we've got the resources, it goes up again, and we can see the recovery curve coming in.

Okay, so next round. Now we can compare that to our more proactive pattern, where we have a placeholder pod that's keeping a spare node ready at all times. As the traffic builds up, we can see that the placeholder quickly becomes evicted and our workload pods become scheduled on that node. A new placeholder pod is created as pending, causing the cluster autoscaler to go off and create a new node. Sometimes, however, as happens here, we'll see the traffic build-up and the HPA outpace the speed at which we were able to stand up the new nodes. There you go.
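Pausing the narration for a second: the placeholder described above can be sketched roughly as a negative-priority PriorityClass plus a pause Deployment sized to fill a node. The sizes here are illustrative and would need tuning to your instance type; this isn't the exact manifest used in the demo:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder
value: -1                            # lower than the default of 0, so it's evicted first
globalDefault: false
description: Placeholder pods that keep a spare node warm
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      run: placeholder
  template:
    metadata:
      labels:
        run: placeholder
    spec:
      priorityClassName: placeholder
      terminationGracePeriodSeconds: 0   # give up the node immediately when preempted
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:                    # sized so it can't share a node with real workload
              cpu: 1500m
              memory: 5Gi
```

When a real pod needs the space, the scheduler preempts the placeholder; its replacement goes pending, and that pending pod is what nudges the cluster autoscaler into booting the next node ahead of time.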
Back in the demo, the placeholder pod didn't actually land on the new nodes until some time later. As you can see, we're adding nodes, and immediately we're not even getting the chance to schedule the overprovisioning pod on them, though the overprovisioning pod has always made sure that we've always had a request in flight to get a new node. But the result of this ultimately, as you can see here, is that the traffic steadily paces up. There are some bumps, but there's no flattening out, so there's no point where we're actually failing requests.

This all comes at an inevitable cost: the plan, in this case, is to always have extra capacity ready and waiting for your workload to require it. What might some better answers be, though? Well, you can tune your workload in order to make sure that you're not leaving gaps. Or better yet, remember that pod priority thing that we used? You might have workload on your cluster that suits it: stuff that would like to run and would give you more return on investment than just a placeholder, that can handle stopping and starting when it's appropriate to. So perhaps some housekeeping, some analytics and machine learning, or maybe just less important services. You might want to, say, prioritize the shopping cart supporting applications over the customer service desk ones, allowing you to structure your cluster workload to be more aligned to your business benefits and goals.

I've been Chris Nesbitt-Smith, thank you again for joining me today. Like, subscribe, whatever the kids do, on LinkedIn, GitHub, wherever you can. Rest assured there'll be little to no spam, since I push little to no content at all, since I'm awful at self-promotion, especially on social media. cns.me just points at my LinkedIn; talks.cns.me contains this and some other talks, and they're all open.
...

Chris Nesbitt-Smith

Consultant @ UK Government
