Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, thank you so much for joining me here today. It'd be great
to hear where you're all from, so please do leave a comment in the chat
and introduce yourself where you're coming from. Likewise,
please use the Q&A in the comments if you've got any questions
throughout this webinar, and I'll do my best to get to them at the end.
I'm also joined by some of my friends at Learnk8s, so shout out
to Salman and some of my friends at Linode who will also be kind
of assisting in the chat, who may get to your questions
before I do. So to kick things off, my name's Chris Nesbitt-Smith.
I'm based in London and currently an instructor for Learnk8s
and a consultant to various bits of UK government and
a tinkerer of open source stuff. I've been using and
abusing Kubernetes in production since it was 0.4,
so believe me when I say it's been a journey. I've certainly got the
scars and the war wounds to show for it. So you
believe the hype? Kubernetes lets you scale infinitely,
auto-heal your cluster, and so on. So your cluster is self
monitoring. It's scaling up instances of your cloud native,
stateless applications on demand when you need more,
but all of a sudden your nodes are full, you can scale
no more. Well, enter the cluster autoscaler
and of course a splash of yaml in order to save the day,
and that can integrate with your cloud vendor to provision
more necessary nodes. And the good news is that
the autoscaler is configurable.
Though sadly, as we'll see, it's not quite as configurable
as you might hope or expect. There are
alternatives, but the official cluster autoscaler only scales
up when there are pending pods in order to satisfy
the demand. Which is probably a good idea, since there's
little point adding more nodes unless you have the workload that
needs them. Okay, so first let's
refresh ourselves on how the Kubernetes kind of scheduler works.
So if I create a deployment with two replicas,
I do this by submitting some Yaml to the API server, which then writes
it to etcd. The controller is watching for this type of
event, recognizes it needs to go and create some pods,
which it does, and then these are now pending.
The scheduler is a component that is looking for pending pods,
sees these, and then schedules or assigns them to a node.
The scheduling, however, is broken down into a few steps from
the initial queue through filtering viable nodes
to then scoring them before actually finally creating the binding.
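To make that concrete, a minimal sketch of the kind of manifest being submitted here might look something like this (the names and image are illustrative, not from the demo):

```yaml
# A Deployment with two replicas, as described above. Submitting this
# to the API server results in two pending pods for the scheduler to place.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: nginx:1.25
```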
But how does the scheduler know how much memory and cpu
a pod uses? Well, it doesn't strictly
speaking. So you need to spoon feed this with requests and
limits. So if you don't specify requests and limits,
Kubernetes will fly completely blind. Your cluster will
inevitably become overloaded, nodes will become oversubscribed,
and you'll be constantly fighting fires. So if your only
takeaway from any of this is that all your containers should definitely have
requests and limits defined, then we've at least achieved something useful
here. So requests are the initial ask and limits are the point where
the container will be throttled at cpu or killed if it exceeds the memory.
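For example, a container spec with both values set might look something like this (the numbers are illustrative; size them for your own workload):

```yaml
# Requests are the initial ask used for scheduling; limits are where the
# container is CPU-throttled, or OOM-killed if it exceeds the memory.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
```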
Okay, so applications come in all sorts of shapes and
sizes. So you may have some applications that are more cpu
intensive and don't require much memory, while on the other hand you
may have others that have a greater memory than a cpu footprint
So those applications have to be deployed
inside computing units which have
again their own cpu and memory kind of characteristics.
So for every application deployed in a cluster,
Kubernetes makes a note of the memory and cpu requirements.
It then decides where to place the application in the cluster. In this
case, it's on the far-left node. If another application of
the same size is deployed well, Kubernetes goes through that same process and
finds the best node to run the app. In this case, it picks the right-hand
node. As more applications are submitted to
the cluster, Kubernetes keeps making notes of
the cpu and memory requirements and allocating
these applications in the cluster. If you play this game long
enough, you might notice that Kubernetes appears at least to be a reasonably
skilled Tetris player. So your servers are the board, your apps are the blocks, and
Kubernetes is trying to fit as many blocks as efficiently as possible.
But what about the size of the worker nodes? Well, what kind of instance
types can you use to build your cluster? Well, nowadays the cloud
vendors make almost every instance type available to be
part of the cluster, so you've got pretty much free choice.
There is a catch though, so you'd be forgiven for thinking
that if you get an eight gig ram and two cpu node from
your cloud vendor, you could deploy four pods
that are one and a half gig ram and need a quarter of a cpu.
However, it's not quite so. One of those
pods would remain pending, which if configured, will of course
cause the cluster autoscaling to go and create a new
node and then eventually your workload becomes scheduled.
But why is this so? When you provision a managed
instance, you might think that the memory and cpu available
can be used for running pods. And you are right.
However, some memory and cpu should be reserved
for the operating system and you should also reserve memory
and cpu for the Kubelet. But surely
the rest is available to my pods, right? Well, not quite yet.
So you also need to reserve some memory for the eviction threshold.
So if Kubernetes notices that the memory usage is going over
that point, it will start evicting pods.
Your cloud vendor will usually choose these numbers for
you. For example, AWS typically reserves 255 MB of memory
for Kubernetes, plus 11 MB for each pod that you can deploy
on that instance; this is the memory reserved for the kubelet.
The CPU reserved is usually around 0.3 to 0.4 of a CPU.
For the operating system, they reserve 100 MB of memory
and 0.1 of a CPU, and for the eviction threshold
another 100 MB.
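If you ran your own nodes, those reservations would be expressed in the kubelet configuration along these lines; on a managed cluster your cloud vendor sets them for you, and the values here are just the ballpark figures mentioned above:

```yaml
# Sketch of a KubeletConfiguration carrying the reservations described above.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:            # held back for the kubelet and Kubernetes components
  cpu: 300m
  memory: 255Mi
systemReserved:          # held back for the operating system
  cpu: 100m
  memory: 100Mi
evictionHard:            # the eviction threshold
  memory.available: 100Mi
```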
So in AWS, if you select an m5.large, here's a visual recap of how
the resources are subdivided. With this particular instance you can deploy
27 pods. The other thing to consider
is all the time that this takes. So let's
assume that you've configured your horizontal pod autoscaler or HPA
in order to scale up your pods dynamically. Well, that's
where the journey probably starts.
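For reference, the sort of HPA assumed in this scenario might look something like this (the target and thresholds are illustrative):

```yaml
# Scales the example Deployment between 2 and 20 replicas on CPU utilisation.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```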
So to start with, about 90 seconds is what's
needed for your horizontal pod autoscaler to react and decide
to scale up your application.
Then the cluster autoscaler then takes around say 30 seconds
to request a new node from the cloud vendor,
then around three to four minutes for the machine to boot,
and then around another 30 seconds for it to join the cluster and then be
ready to run workloads. Then you can of course add on
time for pulling your container image that won't be cached on this
brand new machine as well. So to help visualize the
impact this can have, I made a library last
year that fakes a Kubernetes scheduler. It allows
you to specify many different types of nodes, model their
scaling dynamics, track container startup time and so on,
and define your node properties. It takes a lot of shortcuts
in order to simulate hundreds of thousands of intervals, representing
days, in tens of milliseconds. It's not
the real Kubernetes scheduler; pull requests are very welcome if you'd like to improve
it. So, to give you a way to play with that,
I also made a game as a novelty for Kubecon last year
called Black Friday. The scenario is that you're an SRE team
supporting a retailer facing a spike in traffic
on Black Friday, and then again on Cyber Monday, with a lull
between, and a calm before and after.
It's a three tier service of a front end, back end,
and database, all of which have different kind of scaling
properties, startup times, et cetera.
So we can see some of the details here in the hints. So we
can see the properties on our node or different
node types, and we can see the profile
of the front end and similar for the back end. It's a bit zoomed
in at the minute on this presentation. So you can see here:
here's my Black Friday spike, and here's my Cyber Monday spike,
and how that trails off.
So the goal is to configure your cluster to follow the spike
as closely as possible with just enough infrastructure.
So failing some requests and getting a few SLA penalties might
actually result in a greater profit ultimately.
So let's have a kind of play on this.
So if I, for example, change my, say, minimum node count
down to one, and similarly,
just let's see what this looks like. If I drop everything
down to start at one instance of everything,
you can see I can change the point at which
I scale up these pods.
So let's see. Hopefully that should schedule out and we'll see some failed requests.
Ah, lovely. Live demos, eh?
Yeah, that's the live demo failing.
Maybe let's go with two.
It's failing to schedule my first interval.
Okay, so we can see that in this,
we've reached a point where
we've got some failed requests right here and
some more failed requests on cyber Monday. So this is where my infrastructure
scaling and my pod auto scaling failed to follow the
spike up of demand as closely as it needed to.
But that has actually only caused me a penalty
in this scenario of $1,154. So,
in effect, my company is about $15,000 in profit compared to
where it presumed it would be if it didn't have any autoscaling.
So maybe not a bad idea, perhaps. And you can
tune and tweak these numbers. So please do feel free to
play with this. May the nodes ever be in your favor.
Ultimately, this is an experiment to kind of play with and try
and get your name on the leaderboard.
Cool. Okay, first live demo out the way,
what do we do in order to stack some of the odds in our favor?
Well, we can not scale at all. That is always an option
and often overlooked.
Or what if we could get a head start on the scaling?
So maybe not scaling at all? Sounds a bit flippant, but what do
I really mean by that? Well, going back to our scenario of fitting our
pods on a machine, taking into account the reserves for
Kubernetes, if we sized our machine correctly,
we might be able to fit all of our workload in a node. So this
isn't easy given the vast array of possible machine sizes.
So we've done some of the hard work for you in this space and created
an instance calculator. Next live demo.
So with this I can tune and tweak the
configuration of my pod so I can say that I want to look
at the efficiency and how many pods
I can fit on it, so I can size my pods up in memory and
see that dynamically move around. So I can see that I've not got a particularly
efficient node size. Now I can flip around,
I can see that other nodes offer a varying
level of efficiency if my pod looks like it requires these
sorts of kind of properties, and I can densely
stack my things so I can hunt around for different cloud vendors
and pick the right node for the
job, and I can optimize for either cpu or
indeed for memory as well, and kind of change some
of the properties around the DaemonSets and agents and things like that.
Cool. Okay, so finally onto the topic of
this webinar. All the wait's over,
he says, as his clicker stops working. So what if we could always
have at least one node that's ready for when you need it?
So removing that three and a bit minute wait.
Well, to do this, we can create a placeholder pod
so that as soon as your workload comes along needing the resources,
the placeholder pod is evicted, causing your
cluster autoscaler to boot a new machine in order to
host the new replacement placeholder. And this will continue
as you scale into further nodes, keeping you always one
step ahead. Okay, now to praise
the demo gods again, where I do a real live demo and hope that
everything works with a real Kubernetes cluster.
I have backup plans of videos if not. So let's
have a look. So behind the scenes here,
there is a real Kubernetes cluster, and I can prove that in
a minute. So we've got a simple application where we can see the effects
of clicking on the scale buttons. So if I say to scale
to five. We start with
one replica. Okay. And clicking the
five. Hopefully that should start.
So I cross my fingers and hope that the thing that I built works.
Cool. Hopefully. So they're pending.
It never works. We have a backup plan.
They are all there, but the browser's not loading
the timer. Right. So what you would see is this timer starting to count
up. We'll ignore it for the minute.
We'll check back in a sec. But what you'll see is this timer should
elapse for three minutes. So if you look at the time of how long we've
been talking here, it will be about three minutes in
order to scale up to have two nodes. So at the minute the
cluster autoscaler is going off and asking for a new node.
We can do that in a separate instance over here. So I've got two
clusters, to
actually at least show you a timer that actually works. It did.
Okay, fine. Live demos.
Okay, similarly. So assume that this is like plus 10 seconds
and the other one's plus however many minutes. Fine, we'll come back to them.
We started with one replica. We scaled to five. So the current node
gets saturated with the four pods, and one is left pending behind the scenes.
So now the cluster auto scaler is going to request a new node from
Linode. So, while I stall for about three minutes of
what would otherwise be kind of radio silence of me praying for it to kind
of work properly, are there any questions I
can come to? So,
Sam, if we create a prescaled node for each node
pool, would that then cost high? Right,
so how to turn them off when we're not in use?
So, yeah, if you've got a node that's hanging around that's
basically just running a placeholder, then, yeah.
Your plan is, ultimately, to waste some money; to keep
that spare. This isn't fundamentally an uncommon
pattern that you might be used to seeing in
other worlds where you might have a raid array, where you might have a hot
spare in that, because, sure, you can go and get a
new drive or something or a new blade unit to throw in the chassis,
but if you've got one already wrapped up, it saves you a trip to the
data center. So likewise, albeit that's a timeline
that's normally measured in, at best, hours, probably days.
Here, we're trying to shave minutes off, so we're at a better place
than we might have been with more traditional infrastructure and hardware.
But I'll come to some of that point at the end how you might
be able to lessen the financial impact of
this sort of pattern.
So we mentioned oversubscribing the nodes
with requests and limits. How does
that work when we oversubscribe our vms in VMware? So if
you oversubscribe your nodes in the
hypervisor, then you are in for a
bad day, I guess, is the very short answer.
That's not how Kubernetes is designed to work, in the
same way that you'll notice Kubernetes does not play well with swap
for memory. It relies on real resources, and you need
to really give Kubernetes as much information as you can
so it knows what's going on.
I think on the first node I can see, so I've
got two clusters here, there's two tabs. This first
cluster has now got its second node, so we should see a second
pod. So one of those is pending at the minute, awaiting the other node.
There we go. Fine. Okay, so that took about
three minutes. So I'll just finish off that question.
So yeah, you really need to give Kubernetes as much information
as you can in order for it to make the best decision. So, like,
help it help you; ultimately, tell it what's going
on. Okay. So as
you can see, that took about three minutes for my pod to scale from one
to five. The first bit of it was done within about 10 seconds
and then the last kind of pending pod took a bit longer.
So let's see what
the difference is if I use my placeholder pattern.
So what I'm going to do is I'm going to scale back down to one,
which should hopefully go and kill my pods. Cool.
And then I'm going to deploy my placeholder. So I'll show you in
a minute. This is a real Kubernetes cluster.
I've just got some JavaScript and a web page that's
submitting some yaml, just so you get a visual view of what's
going on. On my other screen, I am frantically watching
the actual kind of the logs and the events.
So now I've got a node here. So I'm back down to
my original scenario where I have one instance of the application running
and I have my placeholder pod keeping the
other node present. So it's a kind of hack on the autoscaler just to
keep a node available. So now
if I click scale to five,
I'll quite quickly see that it saturated the first node and
in a few seconds,
7 seconds. There we go. My container has now booted
in the second node. So we've gone from kind of near on
four minutes down to about 7 seconds. Pretty good,
right? Cool. Okay,
we can turn the placeholder back on and what that
will do in the background is that will go off and start another node.
Sorry, the placeholder was already on. Sorry, I didn't need to do it. So the
placeholder was there. So in the background, what's already happening is that the
cluster autoscaler has started booting another node.
So we can do that again in my second cluster,
which has stalled, showing the number of pods. Fine.
Okay, maybe we won't do it in the second cluster. Something weird is going on
there.
Fine. At least I had two plans. Cool.
Okay. That was the stressful bit out of the way,
and then the slides don't change.
Cool. So to prove that this is a real cluster,
this is what was just going on just a second ago,
looking at the events that were kind of
streaming out the cluster, demonstrating the nodes coming up and down and the pods being
scaled up and down as I went. We can
skip past these because these are the backup videos that I had of it working.
Cool. So how did I make this all
happen though? So firstly, we need a placeholder,
and that needs to be big enough to know that it will never be schedulable
alongside any real workload on a node. So it should
be sized big enough in order to fill the whole
compute node. Then you need to specify a
low priority class in order to make sure that it's evicted as
soon as there's a real workload that comes along and needs that capacity.
So the placeholder pod competes for resources.
So we need to define that we want it to have that low
priority, with a priority of minus one. All other pods,
even default ones, will have precedence,
and the placeholder is then evicted as soon as the cluster runs out of space.
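A sketch of what that placeholder looks like in practice is below. This isn't the exact manifest from the demo: the requests need sizing to roughly fill one of your nodes, and the pause image is just a convenient do-nothing container.

```yaml
# Low priority class: anything else, even default-priority pods, preempts this.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Placeholder pods, evicted as soon as real workload needs the space."
---
# One placeholder pod per spare node you want to keep warm.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1.5"    # size these to roughly fill a whole node
              memory: 5Gi
```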
If I flip back, we should hopefully almost,
maybe. There we go. Bingo. So we've now got our
over provisioned nodes, and if I scale back down to one
and turn the placeholder off, we can come back to that at the end and
hopefully see that the auto scaler
has reduced us back down to zero. That will take a minute or so
for the cluster auto scaler to recognize that.
Okay, so now for a demo of how this works
with autoscaling, though. So just now we demoed
it with kind of point-and-click to change the replica
count. How does this work in a more real-world-ish scenario where
you may have your kind of horizontal pod autoscaling?
I'll be honest with you, I have definitely exhausted
my credit with the demo gods. So I'm going to be playing some videos here
and provide a bit of narration and also save us all hanging around for kind
of the three minutes or so that it takes.
So before
I start to provide a little orientation on the left hand side,
we can see the request per second that we're serving. The bottom
left, you can see the nodes and the pods on them. And you
can see my nodes in here again, can take up to four
workload pods. I've got a simple application that can handle
a fixed number of requests, and I'm ramping up traffic.
As you can see, we start with two nodes and we're able to kind of
sustainably handle the traffic as it increases, until
we fill both nodes. And at that point the HPA
has decided to scale up, and then we finally see the cluster
autoscaler, in a minute or so of time that I can fast-forward,
start to actually provide the extra nodes that will be
required.
Skip forward a bit. There we go.
The cluster autoscaler has now come in and provided the extra nodes.
So I've manipulated these results a little so as not to leave you waiting too
long; it took around four minutes for the
nodes to be available. So in that time, while we
were waiting for the nodes to be available, well, the traffic that we were able
to kind of service as an application really kind
of flattens out. But as soon as we've kind of got the resources,
it kind of goes up again and we can see the recovery curve coming
in. Okay,
so next round.
Yeah, cool. So now we can compare that to our more proactive pattern
where we have a placeholder pod that's keeping us a spare node ready
at all times. As the traffic builds up, we can see that that placeholder
quickly becomes evicted and our workload pods become scheduled on
that node, a new placeholder pod is created as pending,
causing the cluster autoscaler to go off and create a new node. So sometimes,
however, as happens here, we'll see the traffic build-up and the
HPA outpace the speed at which we were able to stand
up the new nodes.
The placeholder pod didn't actually get landed
on the node until sometime later.
As you can see, we're adding nodes, and immediately we're not even getting the
chance to schedule the overprovisioning pod on them.
Though the overprovisioning pod has always made sure that we've always had
a request in flight to get a new node.
But the result of this, ultimately, as you can see here, is that the
traffic steadily kind of paces up. There are some bumps,
but we can see that there's no kind of flattening
out, so there's no point where we're actually failing requests.
So this all comes at an inevitable cost. Your plan
is, in this case, to always have extra capacity ready and waiting
for your workload to require it. What might therefore
be some better answers, though? Well, you can tune your workload
in order to make sure that you're not leaving gaps.
Or better yet, remember that pod priority thing that we used?
Well, if you've got workload on your cluster that suits it, that would like
to run and would give you more return on investment than just a placeholder.
So, stuff that can handle stopping and starting when
it's appropriate to: perhaps some housekeeping,
some analytics and machine learning, or maybe just less important
services. So you might want to, say, prioritize the shopping-cart-supporting
applications and pods over, say, the customer service
desk ones, allowing you to structure your cluster workload to be more aligned to
your business benefits and goals.
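As a sketch of that idea, you could give interruptible background work the same low priority you'd otherwise give a placeholder, so it soaks up the spare node until something more important needs it (the names and sizes here are made up for illustration):

```yaml
# Interruptible background work at low priority: it keeps the spare node busy
# but is evicted the moment higher-priority pods need the space.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort-batch
value: -1
description: "Background jobs that can be preempted by normal workloads."
---
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-analytics
spec:
  template:
    spec:
      priorityClassName: best-effort-batch
      restartPolicy: Never
      containers:
        - name: analytics
          image: example.com/analytics:latest
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```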
I've been Chris Nesbitt-Smith; thank you again for joining me today.
Like, subscribe, whatever the kids do on LinkedIn, GitHub, wherever you can.
Rest assured there'll be little to no spam, since I barely post much content at all,
since I'm awful at self-promotion, especially on social media.
cns.me just points at my LinkedIn; talks.cns.me contains
this and some other talks, and they're all open, so that's
the end. Questions are very welcome on this or pretty much anything else.