Transcript
This transcript was autogenerated. To make changes, submit a PR.
Cool, awesome.
So to kick things off, my name is Chris Nesbitt-Smith.
I'm based in London and currently an instructor for Learnk8s,
a consultant to various bits of UK government, and a
tinkerer of open source stuff. I've been using and
abusing Kubernetes in production since it was 0.4,
so believe me when I say it's been a journey. I've certainly
got the scars and the war wounds to show for it.
So you believe the hype that Kubernetes
lets you scale infinitely, auto-heal your cluster and
so on. Your cluster is self-monitoring, it's scaling up instances
of your cloud native stateless applications on demand
when you need more,
but all of a sudden your nodes are full and you can
scale no more. Well, enter the cluster autoscaler,
and of course a splash of YAML, to save the day.
That can integrate with your cloud vendor to provision
the necessary extra nodes. And the good news is that
the autoscaler is configurable, though sadly,
as we'll see, it's not quite as configurable as you might hope
or expect. There are alternatives, but the
official cluster autoscaler only scales up when there are pending pods
waiting to be satisfied, which is probably a good idea,
since there's little point adding more nodes unless you have
the workload that needs them.
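To give a flavour of what is and isn't tunable, here's a rough sketch of the kind of flags the cluster autoscaler container typically takes in its Deployment. The manifest is trimmed right down and the values are illustrative only, not recommendations:

```yaml
# Trimmed, illustrative cluster-autoscaler Deployment; flag values are examples only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws                     # or gce, azure, etc.
            - --expander=least-waste                   # how to choose between node groups
            - --balance-similar-node-groups=true
            - --scale-down-utilization-threshold=0.5   # when a node counts as "empty enough"
            - --scale-down-unneeded-time=10m           # how long it must stay that way
            - --max-node-provision-time=15m            # give up waiting for a node after this
```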
Okay, so first let's refresh ourselves on how the Kubernetes
scheduler works. If I create a deployment
with two replicas, I do this by submitting some
YAML to the API server, which then writes it to etcd.
The controller is watching for this type of event,
recognizes it needs to go and create some pods, which it does,
and these are now pending. The scheduler is the
component that is looking for pending pods; it sees these and then
schedules, or assigns, them to a node.
The scheduling, however, is broken down into a few steps: from the
initial queue, through filtering viable nodes,
to scoring them, before finally creating the binding.
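To make that concrete, the YAML submitted for that two-replica deployment is along these lines; the name and image are just placeholders:

```yaml
# Minimal two-replica Deployment; name and image are placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.25
```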
But how does the scheduler know how much memory and CPU
a pod uses? Well, it doesn't,
strictly speaking. You need to spoon-feed it with requests
and limits. If you don't specify requests and limits,
Kubernetes will be playing completely blind: your cluster will inevitably become
overloaded, nodes will become oversubscribed, and you'll be constantly
fighting fires. So if your only takeaway from any of this is
that all your containers should definitely have requests and limits defined,
then we've at least achieved something useful here. Requests
are the initial ask, and limits are the point where the container will be throttled
on CPU, or killed if it exceeds its memory.
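In practice that means every container spec should carry a resources block along these lines; the numbers are purely illustrative:

```yaml
# Requests are what the scheduler books against the node; limits are the ceiling
spec:
  containers:
    - name: my-app
      image: nginx:1.25
      resources:
        requests:
          memory: "256Mi"   # used for scheduling decisions
          cpu: "250m"       # a quarter of a CPU
        limits:
          memory: "512Mi"   # exceeding this gets the container OOM-killed
          cpu: "500m"       # CPU usage above this is throttled
```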
Okay, so applications come in all sorts of shapes
and sizes. You may have some applications that are more CPU-intensive
and don't require much memory, while on the other hand you may have others
that have a greater memory footprint than CPU.
Those applications have to be deployed inside computing
units, which again have their own CPU and memory
characteristics. So for every application deployed
in a cluster, Kubernetes makes a note of the memory and CPU
requirements. It then decides where to place the application in
the cluster. In this case, it's on the far left node.
If another application of the same size is deployed, well, Kubernetes goes through that
same process and finds the best node to run the app.
In this case, it picks the right hand node.
As more applications are submitted to the cluster, Kubernetes keeps
making a note of their CPU and memory requirements
and allocating them across the cluster.
If you play this game long enough, you might notice that Kubernetes appears
at least, to be a reasonably skilled Tetris player: your servers are the
board, your apps are the blocks, and Kubernetes is trying to
fit as many blocks in as efficiently as possible.
But what about the size of the worker nodes? What kind of
instance types can you use to build your cluster? Well, nowadays the cloud vendors
make almost every instance type available to be part of
the cluster, so you've got pretty much free choice.
There is a catch, though. You could be forgiven for thinking
that if you get an eight gig RAM and two CPU node from your
cloud vendor, you could deploy four pods that
each need one and a half gig of RAM and a quarter of a CPU.
However, it's not quite so: one of those pods
would remain pending, which, if configured, will of course cause
the cluster autoscaler to go and create a new node, and
then eventually your workload gets scheduled.
But why is this so? When you provision a managed
instance, you might think that the memory and CPU available
can be used for running pods. And you are right.
However, some memory and CPU should be reserved for
the operating system, and you should also reserve memory
and CPU for the kubelet.
But surely the rest is available to my pods, right?
Well, not quite. You also need to reserve some memory for the eviction
threshold: if the kubelet notices that memory
usage is going over that point, it will start evicting pods.
Your cloud vendor will usually choose these numbers for you.
For example, AWS typically reserves 255 meg of memory
plus eleven meg for each pod that you
can deploy on that instance; this is the memory
reserved for Kubernetes itself. The CPU reserved is usually around
0.3 to 0.4 of a CPU. For the operating system
they reserve 100 meg of memory and 0.1 CPU,
and for the eviction threshold another 100 meg.
So in AWS, if you select an m5.large, here's a visual
recap of how the resources are subdivided. With this particular instance,
you can deploy 27 pods.
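Those reservations surface as kubelet configuration. As a hedged sketch, the equivalent KubeletConfiguration might look something like this; the exact numbers vary by vendor and instance size:

```yaml
# Illustrative KubeletConfiguration mirroring the rough AWS-style numbers above
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  memory: "574Mi"   # e.g. 255Mi + 11Mi per pod the instance can host
  cpu: "300m"       # roughly 0.3 to 0.4 of a CPU
systemReserved:
  memory: "100Mi"   # for the operating system
  cpu: "100m"
evictionHard:
  memory.available: "100Mi"   # the eviction threshold
```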
The other thing to consider is all the time that this takes.
Let's assume that you've configured your horizontal pod autoscaler, or
HPA, to scale up your pods dynamically. Well,
that's where the journey probably starts.
To start with, around 90 seconds is needed
for your horizontal pod autoscaler to react and decide
to scale up your application.
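For reference, a typical HPA driving that decision might look something like this; the target name and thresholds are made up for illustration:

```yaml
# Illustrative HorizontalPodAutoscaler scaling a Deployment on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up once average CPU passes 70% of requests
```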
Then the cluster autoscaler takes around 30 seconds
to request a new node from the cloud vendor, then around
three to four minutes for the machine to boot,
and then around another 30 seconds for it to join the cluster and be
ready to run workloads. On top of that you can of course add the
time for pulling your container image, which won't be cached on this brand
new machine. So to help visualize
the impact this can have, I built a library last
year that fakes a Kubernetes scheduler. It allows you to specify
many different types of pods, model their
scaling dynamics, track container startup times and so on,
and define your node properties. It takes a lot of shortcuts in
order to simulate hundreds of thousands of intervals, representing days,
in tens of milliseconds. It's not the real Kubernetes
scheduler. Pull requests are very welcome if you'd like to improve it.
So to give you a way to play with that, I also made a game
as a novelty for KubeCon last year, called Black Friday.
The scenario is that you're an SRE team supporting a retailer
facing a spike in traffic on Black Friday, and then again on
Cyber Monday, with a lull between and a calm before
and after. As you can see, it's a
three-tier service of a front end, back end and database,
all of which have different scaling properties,
startup times, et cetera. So we can see some
of the details here in the hints: we can see the properties
of our nodes, or different node types,
and we can see the profile of the
front end, and similar for the back end. It's a bit zoomed in at the minute
on this presentation, but you can see here:
here's my Black Friday spike, and here's my Cyber Monday
spike, and how that trails off.
The goal is to configure your cluster
to follow the spike as closely as possible with just enough
infrastructure. Failing some requests and getting a
few SLA penalties might actually result in a greater
profit ultimately. So let's
have a play with this. So if I, for example,
change my minimum node count down to one,
and similarly, let's see what this looks like if I
drop everything down to start at one instance of everything.
You can see I can change the point at which I scale up
these pods. So let's
see, hopefully that should schedule out and we'll see some failed requests. Yeah,
you've got to love live demos.
Yep, that's the live demo failing.
Maybe let's go with two.
It's failing to schedule my first interval.
Okay, so we can see that in this run we've reached a
point where we've
got some failed requests right here, and
some more failed requests on Cyber Monday. This is where my infrastructure scaling
and my pod autoscaling failed to follow the
spike in demand as closely as it needed to.
But that has only caused me a
penalty in this scenario of $1,154.
So, in effect, my company is in profit
by about $15,000 compared to where it presumed it would be if it didn't have any
autoscaling. So maybe not a bad idea.
And you can tune and tweak these numbers,
so please do feel free to play with this.
May the odds be ever in your favor. Ultimately, this is an
experiment to play with and try and get your name on the leaderboard.
Cool. Okay, first live demo out of the way.
What can we do in order to stack some of the odds in our favor?
Well, we can not scale at all. That is always an option
and often overlooked.
Or what if we could get a head start on the scaling?
Maybe not scaling at all sounds a bit flippant, but what
do I really mean by that? Well, going back to our scenario of fitting
our pods on a machine, taking into account the reserves
for Kubernetes, if we sized our machines correctly,
we might be able to fit all of our workload on a node. This
isn't easy, given the vast array of possible machine sizes.
So we've done some of the hard work for you in this space and created
an instance calculator. Next live demo.
So with this I can tune and tweak
the configuration of my nodes: I can say that I want to look
at the efficiency and how many pods I
can fit on a node, and I can size my pods up in memory and see
that dynamically move around. I can see that I've not got a particularly efficient
node size here. Now I can flip around and see that
other nodes offer a varying level of efficiency if
my pod requires these sorts of properties,
and I can densely stack things. I can hunt
around between different cloud vendors and pick the right node type,
I can optimize for either CPU or indeed for memory,
and I can change some of the properties around DaemonSets and agents
and things like that.
Cool. Okay, so finally onto the topic of this webinar.
All the wait's over, he says,
as his clicker stops working. So what if we could always
have at least one node ready for when you need it, removing
that three-and-a-bit-minute wait? Well, to do this, we can
create a placeholder pod, so that as soon as your workload comes
along needing the resources, the placeholder pod gets evicted,
causing your cluster autoscaler to boot a new machine in
order to host the new replacement placeholder. And this will continue
as you scale onto further nodes, keeping you always one
step ahead. Okay, now to praise the demo
gods again, where I do a real live demo and hope that everything works
with a real Kubernetes cluster. I have backup
plans of videos if not. So let's have a look.
So behind the scenes here, there is a real
Kubernetes cluster, and I can prove that in a minute. So we've got
a simple application where we can see the effects of clicking on
the scale buttons. So if I say to scale to five:
we start with one replica, okay,
and clicking the five, hopefully that should start. Cross
my fingers and hope that the thing that I built works.
Cool. Hopefully. So they're pending.
It never works. We have a backup plan.
Fine, they are all there, bar the browser not loading the timer.
Right. So what you would see is this timer starting to count
up. We'll ignore it for the minute; we'll check back in a
sec, but what you'll see is this timer should elapse to about three minutes.
So if you look at how long we've been talking here,
it will be about three minutes in order to scale up to
two nodes. At the minute the cluster autoscaler is going off and
asking for a new node. We can do that in a separate
instance over here, so I've got a second cluster to
at least show you the timer actually working. It did.
Okay, fine. Live demos.
Okay, similarly, assume that this is at plus 10
seconds and the other one's at plus however many minutes; fine, we'll come back to them.
We started with one replica and we scaled to five, so the current node
gets saturated with four pods and one is left pending behind the scenes.
Now the cluster autoscaler is going to request a new node from
Linode. So while I stall for about three minutes of what would
otherwise be radio silence of me praying for it to work
properly, are there any questions I can come
to? So, Sam asks: if we create
a pre-scaled node for each node pool,
won't that then cost more?
Right, and how do we turn them off when they're not in use?
So, yeah, if you've got a node that's hanging around that's
basically just running a placeholder, then yes,
your plan is ultimately to
waste some money in order to keep
that spare. This isn't fundamentally an uncommon pattern;
you might be used to seeing it in
other worlds where you might have a RAID array with a hot
spare in it, because sure, you can go and get a new
drive or a new blade unit to throw in the chassis, but if
you've got one already racked up, it saves you a trip to the data center.
Likewise here, albeit that's a timeline normally measured at
best in hours, probably days, whereas we're now trying
to shave minutes off. So we're in a better place than we might have
been with more traditional infrastructure and hardware.
But I'll come to a point at the end about how you
might be able to lessen the financial impact
of this sort of pattern.
So, we mentioned oversubscribing the nodes with
requests and limits; how does
that work when we oversubscribe our VMs in VMware? Well, if
you oversubscribe your nodes in
the hypervisor, then you are in
for a bad day, I guess, is the
very short answer. That's not
how Kubernetes is designed to work, in the same way that you'll notice Kubernetes does
not play well with swap for
memory. It relies on real resources, and you really
need to give Kubernetes as
much information as it can get, to empower it
to know what's going on. I think
on the first cluster I can see... so I've got two clusters here, in two
tabs. This first cluster has now got its
second node, so we should see a second pod;
the one that's pending at the minute wants the other node. There we go.
Fine. Okay, so that took about three minutes.
Sorry, I'll just finish off that question:
yeah, you really need to give Kubernetes as much information as you can in
order for it to make the best decision. Help it help you;
ultimately, tell it what's going on.
Okay. So as you can see, that took about
three minutes for my pods to scale from one to five.
The first bit of it was done within about 10 seconds and
then the last pending pod took a bit longer.
So let's see what the difference is
if I use my placeholder pattern. What I'm going to do is
scale back down to one, which should hopefully
go and kill my pods. Cool. And then
I'm going to deploy my placeholder. As I said a minute ago, this is
a real Kubernetes cluster; I've just got some JavaScript and
a web page that make it like submitting some YAML, just so you've
got a visual view of what's going on. On my other screen, I am frantically
watching the actual logs and the events. So now
I've got a node here, and I'm back down to my original scenario
where I have my one instance of the application running and I
have my placeholder pod keeping the other node
present. It's a kind of hack on the autoscaler just to keep
a node available.
So now if I click scale to five,
I'll quite quickly see that it saturates the first node, and in a
few seconds... 7 seconds, there we go, my container
has now booted on the second node. So we've gone from
near on four minutes down to about 7 seconds.
Pretty good, right? Cool.
Okay,
we can turn the placeholder back on, and what that will do in
the background is go off and start another node.
Sorry, the placeholder was already on; I didn't need to do it. The placeholder
was there, so in the background what's already happening is that the cluster autoscaler
has started booting another node.
We can do that again in my second cluster...
it's just stalling showing the number of nodes. Fine. Okay, maybe we
won't do it in the second cluster; something weird is going on there.
Fine. At least I had two plans.
Cool. Okay, that was the stressful bit out of the way.
The slides don't change.
Cool. So to prove that this is a real cluster, this is
what was going on just a second ago:
the events that were streaming out of the cluster,
demonstrating the nodes coming up and down and the pods
being scaled up and down as I went.
We can skip past these, because these are the backup videos that I had of
it working. Cool. So how did I make this all
happen, though? Firstly, we need a placeholder,
and that needs to be big enough that it will never be schedulable
alongside any real workload on a node;
it should be sized to fill
the whole compute node. Then you need to specify
a low priority class to make sure that it's evicted as
soon as a real workload comes along and needs that capacity.
The placeholder pod competes for resources,
so we need to define that we want it to have that low priority,
with a priority of minus one. All other pods,
even default ones, will then take precedence,
and the placeholder is evicted as soon as the cluster runs out of space.
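Pulled together, the placeholder looks roughly like this. It's a sketch: the pause image is the usual choice for a do-nothing container, and the requests would need tuning to nearly fill your node's allocatable resources:

```yaml
# Low-priority class for the placeholder; every other pod outranks it
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Placeholder pods that reserve spare capacity"
---
# Placeholder Deployment sized to (roughly) fill one node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing, just holds the reservation
          resources:
            requests:
              cpu: "1600m"    # tune to nearly fill the node's allocatable CPU
              memory: "6Gi"   # and its allocatable memory
```

When real workload arrives and there's no room, the scheduler preempts the pause pod; the replacement placeholder sits pending, and that pending pod is what nudges the cluster autoscaler into adding the next node.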
We should hopefully, almost, maybe...
there we go. Bingo. So we've now got our
overprovisioned node, and if I scale back down to
one and turn the placeholder off, we can come back
to that at the end and hopefully see that the
autoscaler has reduced us back down to zero. That could take
a minute or so for the cluster autoscaler to recognize.
Okay, so now for a demo of how this works
with autoscaling. Just now we demoed it with
point and click to change the replica count; how does this work
in a more real-world type scenario where you have your
horizontal pod autoscaling? I'll be
honest with you, I have definitely exhausted my credit with the
demo gods, so I'm going to be playing some videos here and provide a
bit of narration, which also saves us all hanging around for the three minutes or
so that it takes. Before
I start, to provide a little orientation: on the left hand side,
we can see the requests per second that we're serving. At the bottom
left, you can see the nodes and the pods on them, and you can
see my nodes here can again take up to four workload
pods. I've got a simple application that can handle
a fixed number of requests, and I'm ramping up traffic.
As you can see, we start with two nodes, and we're able to
sustainably handle the traffic as it increases
until we fill both nodes. At that point
the HPA has decided to scale up, and then we
finally see the cluster autoscaler, in a minute or
two's time that I can fast forward, start to actually provide
the extra nodes that will be required.
Skip forward a bit. There we go:
the cluster autoscaler has now come in and provided the extra nodes. I've
manipulated these results a little so as not to leave you waiting too long,
which would be around four minutes for
the nodes to be available. In that time,
while we were waiting for the nodes to be available, the traffic that we
were able to service as an application
really flattens out. But as soon as we've got the resources,
it goes up again, and we can see the recovery curve coming in.
Okay, so next round.
Yeah, cool. So now we can compare that to our more proactive pattern,
where we have a placeholder pod that's keeping a spare
node ready at all times. As the traffic builds up,
we can see that the placeholder quickly gets evicted and our workload pods
get scheduled on that node. A new placeholder
pod is created as pending, causing the cluster autoscaler
to go off and create a new node. Sometimes, however, as happens here,
we'll see the traffic build-up and the HPA outpace the speed at
which we're able to stand up new nodes.
There you go: the placeholder pod
didn't actually get to land on the new nodes until some time later.
As you can see, we're adding nodes and, immediately, we're not even
getting the chance to schedule the overprovisioning pod
on them, though the overprovisioning pod has always made sure
that we've had a request in flight to get a new node.
But the result of this ultimately, as you can see here, is that the
traffic steadily paces up. There are some
bumps, but we can see that there's no
flattening out, so there's no point where we're actually
failing requests.
This all comes at an inevitable cost: your plan,
in this case, is to always have extra capacity ready and waiting
for your workload to require it. What might therefore
be some better answers? Well, you can tune your
workload to make sure that you're not leaving gaps.
Or better yet, remember that pod priority thing we used? Well,
if you've got workload on your cluster that suits it, that would like
to run and would give you more return on investment than just a placeholder,
stuff that can handle stopping and starting when
it's appropriate to, then use that instead: perhaps some housekeeping, some analytics
and machine learning, or maybe just less important services.
You might want to, say, prioritize the shopping-cart-supporting
applications over, say, the customer service desk ones,
allowing you to structure your cluster
workload to be more aligned to your business benefits and goals.
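As a rough illustration of that idea (the class and job names here are hypothetical), you could give interruptible housekeeping work a low, non-preempting priority so it soaks up the spare capacity instead of a pause pod:

```yaml
# Hypothetical low-priority class for interruptible, nice-to-have work
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort-batch
value: -1                 # below every normal workload, so it's evicted first
preemptionPolicy: Never   # it should never evict anything else itself
globalDefault: false
description: "Batch work that fills spare capacity"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-analytics
spec:
  template:
    spec:
      priorityClassName: best-effort-batch
      restartPolicy: OnFailure
      containers:
        - name: analytics
          image: registry.example.com/analytics:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
```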
I've been Chris Nesbitt-Smith, thank you again for joining me today.
Like, subscribe, whatever the kids do, on LinkedIn, GitHub, wherever you can.
Rest assured there'll be little to no spam, since I barely produce any content
at all, since I'm awful at self-promotion, especially on social media.
cns.me just points at my LinkedIn; talks.cns.me
contains this and some other talks, and they're all open.