Transcript
This transcript was autogenerated.
Hi, welcome to this session. The freedom of Kubernetes
requires chaos engineering to shine in production.
So obviously, if you're connected to this session, there is a big chance that you are interested in Kubernetes and also in chaos engineering. So grab a glass of water or a cup of coffee, enjoy yourself, relax, and let's spend half an hour together. Before we actually start the main content, I would like to briefly introduce myself. My name is Henrik Rexed.
I'm a cloud native advocate at Dynatrace.
But prior to Dynatrace, I was involved in the performance engineering market for more than 15 years. As a result of that, I have become one of the producers of a YouTube channel dedicated to performance engineers called PerfBytes. Check it out if you're looking for content for performance engineers. On the other hand, last year, in July 2021, I started a fresh new YouTube channel called Is It Observable? It's a YouTube channel dedicated to observability in general. So if you're looking for tutorials, content that explains a given framework or technology, check it out. It will be really helpful. And I'm also looking for feedback, so please connect and send me your feedback. So what are we going to learn if you stay with me for the next 30 minutes?
So, a couple of things. Because we're going to talk a lot about Kubernetes and the challenges and problems that we could face in production, it makes sense to start with a couple of reminders about Kubernetes. We will of course present the challenges themselves, and then, to validate them, we will use chaos engineering. So we'll introduce what chaos engineering is and then see what experiments we will need to design to be able to validate our Kubernetes settings. Then, because we're doing some testing, we also need to have observability in place, so we will see what type of metrics and events we will need to collect to be able to validate our experiments. And last, we will briefly explain how we could automate that process. So, the dark side
of Kubernetes. Kubernetes is an orchestration framework, everyone knows it, no surprises. And in Kubernetes there are two types of nodes. In fact, Kubernetes relies on nodes, and nodes in the end are physical or virtual servers. The master node is the one at the top of this slide; as you can see, there are various icons: the scheduler, etcd, the controller manager and the API server. If you're using a managed Kubernetes environment provided by any of the hyperscalers, AWS, Azure or GCP, then you probably don't see that master node. If you fully manage the cluster, you will have to manage the master node as well.
At the bottom you have the worker nodes. The worker nodes are there to basically host our workload. When we deploy any workload within our cluster, Kubernetes will basically move our workload through different states. Behind that there are a lot of events, and I need to remind you of those states because it's very important for understanding the various challenges that we're going to talk about in a few minutes.
So when we deploy using kubectl or anything else, maybe other systems, the first state of our workload will be the pending state. Pending means Kubernetes knows it has to deploy a new workload, so it will try to identify a node that is able to host that new workload, based on resources, on taints and tolerations and various policies. Once it has identified the right node, our workload moves to the creating state. Creating means Kubernetes knows which node is going to host my workload. And because our workload relies on containers, at that state Kubernetes will basically interact with our container registry to pull the images and validate that all the requirements are there. If we have any volumes, any config maps, any secrets, it will check that those exist to be able to deploy. Then, once we have all the requirements, our workload is in the running state. It doesn't mean that the app is officially running; it's just that Kubernetes has started the pod with the various containers in it. If you want to check whether the app is actually running, you need to check the readiness probes or health probes that Kubernetes provides.
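Just to make that last point concrete, here is a minimal sketch of what readiness and liveness probes can look like on a container; the image, path and port are hypothetical examples, not something from the talk:

```yaml
# Minimal sketch of readiness/liveness probes (image, paths and port are hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: demo
      image: ghcr.io/example/demo-app:1.0.0   # hypothetical image
      ports:
        - containerPort: 8080
      readinessProbe:          # "is the app ready to receive traffic?"
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:           # "is the app still alive, or should it be restarted?"
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 20
```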
All right, so now you know all the various states of our workload.
Once upon a time, Kubernetes killed my workload. All right, so here, let's say we have a cluster and I have two applications, so I've decided to have two pods. Let's pretend that it's not two pods, but two namespaces with a lot of various pods inside them. And I have more than one node; here we see only one worker node. During that time frame, one of the apps was basically consuming more resources, and as a consequence there were almost no resources left for the other workload. So, to avoid any node pressure events or infrastructure issues, Kubernetes will try to resolve that problem, and for that it will start an eviction process. The eviction process means it's going to select one of the existing pods running on that node and evict it. Evicted means: first I kill the pod on that node, and then I reschedule it on an available node that can take that pod, that workload. So here, in a few minutes, we were able to resolve our pressure situation and our users were almost not impacted.
But it could be worse, it could be much worse. Imagine that all your nodes are pretty much saturated, or you have designed some taint and toleration policies and there are no nodes for your workload anymore. So you have killed the workload on the previous node and that workload cannot be scheduled anywhere else. Basically: no app, nothing responding to our users. So here it's pretty critical for us, and we need to figure out how we can avoid that type of critical situation.
So how can we do that? Well, for that, there is the recommended Kubernetes approach. First, eviction works by quality of service. We need to define requests and limits in our pods. By doing this, Kubernetes will basically determine a specific quality of service class based on our requests and limits. If our requests equal our limits, then our workload is considered Guaranteed. If the requests are under the limits, then we are Burstable. And if nothing is defined, it's BestEffort. When eviction happens, it happens in that order: first it will try to evict the BestEffort pods, then Burstable, and Guaranteed last. The second recommendation, of course, is to put resource quotas on our namespaces. Remember our situation where one app was eating the resources of the second app. If I want to avoid that, because we usually silo our apps based on namespaces, I can define resource quotas on my namespace; then I'm pretty sure I won't have that situation, because my app will basically have dedicated resources and it won't be able to eat more.
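To illustrate both recommendations, here is a rough sketch in YAML, with made-up names and values that you would tune from your own load tests: a pod where requests equal limits, so it lands in the Guaranteed QoS class, and a ResourceQuota capping the whole namespace:

```yaml
# Sketch only: names and values are examples, tune them from your own load tests
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  namespace: team-a
spec:
  containers:
    - name: demo
      image: ghcr.io/example/demo-app:1.0.0
      resources:
        requests:              # the "shape" Kubernetes uses to schedule the pod
          cpu: 200m
          memory: 256Mi
        limits:                # requests == limits => Guaranteed QoS class
          cpu: 200m
          memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "2"          # the whole namespace cannot request more than 2 cores
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```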
So requests and limits are a hot topic in Kubernetes. Let's have a look at the requests and the limits: what is the value of them and how you can express them. So first, requests. Requests are a bit like a Tetris game; Kubernetes behaves like a Tetris game. Remember, you have shapes coming down the screen, we move them and place them on available spots, and the idea is to make lines. It's the same thing with Kubernetes: we deploy a new workload, Kubernetes sees the workload coming in, and then, based on the size of the workload, so the requests, it will try to place it on the right node. Basically, the request is there to tell Kubernetes: okay, I need at least 100 megs to run my workload. So it knows the shape of your pod and it will be able to play Tetris. If you don't specify any requests, Kubernetes doesn't know the shape of your workload: it sees a small square, places it somewhere, and then suddenly that square becomes a huge shape, so it's impossible to play the game properly.
We can also imagine that Kubernetes is a bit like a box of chocolates. Remember when you were a kid and you received a box of chocolates. Of course you don't read the manual, I mean, who reads the manual today? We pick a chocolate, we eat it, and suddenly we discover there's liquor inside. And remember, we were making those faces because we didn't like the liquor ones. Whatever, maybe today it's different, but it's the same thing: if you don't specify the requests and limits, Kubernetes will take the chocolate, imagine that it's a chocolate without liquor, start to deploy it, and then realize there's liquor inside. So setting requests is very helpful because it helps Kubernetes to properly orchestrate your workload. How do you express them? CPU, of course, in millicores and memory in bytes, nothing complicated. If you do the right tests, you know what the minimal resources are that you need to run your applications properly. Of course you can put very high values if you don't want to test and just want to guess; yes, you can do that, but keep in mind that those resources will be allocated by Kubernetes and will never be used, so in the end you're not optimizing the usage of your nodes properly.
The second concept is limits. So now we have defined that we need a three-bedroom apartment, and now we have a contract with Kubernetes saying: okay, you have the three-bedroom apartment, that's fine, but you won't be able to consume more power or more water during some periods of the day. That's basically the limit. We have a contract with Kubernetes saying how many resources I can officially utilize in that cluster at maximum, what will be tolerated by Kubernetes. For the memory, we can express it in bytes. This is very easy: I do a load test, I can see what's the maximum value that I need for memory, so I can basically define it. On the CPU side it's more difficult, and this is due to the fact that we have a heritage from the Docker world.
The way Docker shares resources within our host is that it uses the CFS, the Completely Fair Scheduler. With this, the CPU is basically split into time slices. Let's say that one core has work periods of 100 milliseconds, and we're going to determine the quota that we can use within that period. So we have CPU periods of 100 milliseconds, and if I define a quota of, let's say, 20, I will be able to consume only 20 milliseconds. Basically, I do some work, I consume 20 milliseconds, then the node, or Docker, will basically pause my work and wait until the next CPU cycle, and then I will be able to resume my work, again for 20 milliseconds, and so on and so forth. That mechanism of being paused during our work is called throttling: CPU throttled.
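As a hedged illustration of that mechanic, here is what the resources section of a container can look like, with comments restating the 20 milliseconds out of 100 milliseconds example; the values are arbitrary:

```yaml
# Sketch: how a container CPU limit maps to the CFS period/quota described above
# (default CFS period is 100ms; a limit of 200m = 0.2 core => roughly 20ms of CPU time per 100ms period)
resources:
  requests:
    cpu: 200m        # the scheduling "shape" of the container
    memory: 256Mi
  limits:
    cpu: 200m        # ~20ms quota per 100ms CFS period: exceeding it => CPU throttling
    memory: 256Mi    # exceeding this => the container is OOMKilled
```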
So if I define my value too low, as a consequence I will get a lot of CPU throttling. And on the memory side, if my value is too low, then Kubernetes will kill my workload and send an event called OOMKill. And if we put the value too high, in the end we are not optimizing the resources in our cluster properly. That's why it's really important to define those limits properly: on the memory side we can see that Kubernetes can kill our workload, and on the CPU side we are throttled, and as a consequence our workload will be very slow. In fact
it's funny, because when I started to work with those requests and limits, I said: let's do a test. Let's remove all the limits from our workload, run it, and measure the response times on one hand and the resources on the other. You can see on the graph at the top the CPU usage: without limits, I was able to consume almost 650 millicores, and when I applied limits, you can see that I'm consuming way, way less resources. That's normal, because I have a limit defined. But now let's have a look at the actual experience that we're giving to our users. At the bottom you can see the response times: without limits, we have a response time that is about 100 milliseconds or less. Pretty good. But then, when I apply the limits, boom, you can see that we have almost 3.6 seconds of response time, which is very, very high. And this is only due to CPU throttling. So why do I need to put CPU resource limits in my cluster, since in the end it seems to work better without them? Well, because it's a best practice in the industry; at least on the memory side you need to define requests and limits, otherwise you don't utilize your resources and your nodes properly.
And on this topic of requests and limits, there are tons of horror stories available in various presentations done at KubeCon. Here you can scan the QR code; it takes you to a website listing all those horror stories, and I definitely recommend watching them. There are plenty of interesting stories: performance issues related to CPU throttling, Airbnb and Zalando talk about it, and stability issues due to OOMKill, same thing, Airbnb and Zalando. So check it out, you can learn a lot of things. This topic is very important because, in the end, we know that it can have a major impact on the stability and the experience of our users. So how do we validate and avoid that type of situation? Well, obviously: chaos engineering.
So what is chaos engineering? If I take the definition from Wikipedia, it says chaos engineering is a process to discover vulnerabilities by injecting failures and errors. It even says in production. So first of all, don't do it directly in production; you don't improvise in production. First you do it on a non-production environment, and then, once you're mature enough, you move closer to a production environment.
So let's have a look at the workflow: how do you find those errors and vulnerabilities in our environments? The process follows a few steps. The first step is that we need to define hypotheses. So we can say: okay, I've designed my app, I know the architecture, I know the system, what could go wrong? What could fail? For example, I have a connection to a database; I may assume that I can have network connectivity issues between my system and the database. That could be a problem. Then I need to predict how my system will react: either it will handle it properly because I've designed an awesome architecture, an awesome software design, or I can predict that my system is going to fail or have some problems writing to the database. So we basically need to list what we expect from that situation. Then we need to define the metrics and events that we need to collect to be able to validate our experiments. Then we define our experiments. Usually an experiment is a workflow of tasks: first I want to inject latency between my app and my database, or I want to inject, let's say, packet loss or whatever to simulate network problems, or I can restart my app. Basically, you define the workflow for the specific situation that you want to test. Then we need to define how to roll back. Why? Because keep in mind that we're probably going to run that in production, so in case there is a problem, we need to have the process described and automated to be able to come back to a normal state. Then we also need to figure out how we're going to collect the various KPIs that are required to validate our experiment. And last, we basically run our test. All right, so what are the hypotheses
related to Kubernetes? Remember, we talked about it: requests and limits. So what are the various hypotheses? Well, I've got some ideas. First we have the Kubernetes settings. We know that if I change requests and limits, I probably have an impact on my users, so I want to validate that those settings are working fine. My expectation is that my Kubernetes setup is good, I have already defined the right requests and limits, so I expect that my app is stable, performs well, with no impact and no errors. Okay, fine, fair enough.
Then you have the maintenance scenario. What are we referring to with maintenance, by the way? Remember, you have to upgrade your nodes, because nodes in the end are physical or virtual servers, like I mentioned. So if you want to upgrade the version of Kubernetes, because you're upgrading your master nodes, then to do that you will have to drain the node: you remove that node from the actual work, you do your maintenance task, and then you reattach it back to the cluster. So this is a maintenance task. These will clearly happen during your production hours or night hours. But as an expectation, I say: if I have designed my cluster well, I should have no impact, everything stable, no user impacted, everything works smoothly.
And last, it's the story that we had before: eviction. So we're running into node pressure, or a situation where there is an eviction. My expectation is that, because I have defined the right priority on my pods, there is no downtime, stable performance, no impact on my users.
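As a sketch of what "the right priority on my pods" can look like, assuming a hypothetical class name, here is a PriorityClass and a pod referencing it:

```yaml
# Sketch: a PriorityClass and a pod referencing it (names and values are examples)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 1000000            # higher value = preempted/evicted later than lower-priority pods
globalDefault: false
description: "Pods serving end users; keep these running as long as possible."
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  priorityClassName: business-critical
  containers:
    - name: demo
      image: ghcr.io/example/demo-app:1.0.0
```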
What are the observability pillars that we need to collect? Of course, in observability there are plenty of pillars: you have logs, you have traces, you have metrics, and Kubernetes events are also one of the pillars. And then you can add profiling and others. But here, in our particular case, because it's pretty much an infrastructure-related topic, we will focus on metrics and Kubernetes events.
So let's have a look at those metrics and events that make sense for us.
First, let's have a look at the metrics. Keep in mind that Kubernetes applications are a bit like an onion: there are different layers. First you have the outside layer, which is the user. The user is interacting with our app, so I will probably collect some metrics about that user: response times, failure rate, basically user experience. Then I will go to the next layer. Because Kubernetes relies on nodes, I need to figure out how my nodes are behaving in terms of resources: CPU, memory, the number of pods running on that specific node, and maybe also the number of pods available, IP addresses, and so on. Then the pod: within that node I have pods running. Here I will keep track of what I have defined in terms of requests and limits for my CPU and memory, what the limits are and what the actual usage is, so I can figure out whether I'm far from the limits or not and optimize those settings.
And last, inside the pod I have containers. Remember, CPU throttling is a Docker concept, so we need to measure it from the container perspective. So we'll look at the CPU throttling, the memory usage and the CPU usage. On the events side, as we saw at the beginning, there are various states in Kubernetes, and in those states there are different types of events sent by Kubernetes.
Of course, the user won't send any major events, except maybe tweets or support cases, so we are focused mainly on the events coming from the cluster. On the nodes: node pressure, which is a sign of a problem. On the pods: failed scheduling, which means we are not able to place that workload on any of the nodes anymore, that could be a really important sign; eviction, of course; OOMKill; unhealthy; that type of events.
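To make the container-level part concrete, here is a hedged sketch of a Prometheus alerting rule built on the cAdvisor throttling counters; it assumes the Prometheus Operator's PrometheusRule CRD is available, and the threshold is an arbitrary example:

```yaml
# Sketch: alert when a container spends too many CFS periods throttled
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling-rules
spec:
  groups:
    - name: container-throttling
      rules:
        - alert: HighCPUThrottling
          expr: |
            rate(container_cpu_cfs_throttled_periods_total[5m])
              / rate(container_cpu_cfs_periods_total[5m]) > 0.25
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container is CPU throttled in more than 25% of CFS periods"
```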
So what are the kinds of experiments that we need for Kubernetes? First, we're going to test the eviction and the maintenance scenarios in the same experiment. Why? Because I probably have a lot of nodes, so to get to the situation where I'm putting high pressure on my nodes, I want to remove some nodes. I'm going to have fewer nodes than expected, because I want to reach the situation where I'm having node pressure. So first a node drain, and then I will simulate CPU stress on the nodes and memory stress on the nodes, and I will generate some load. Because I'm not going to run this type of experiment in production, I want to run it in a non-production environment, and I need to measure the impact on the user. So I will run a constant load, no spike test, nothing fancy; the load test will only be there to report the actual response time from the user's perspective, the error rates, and the stability of the application. For the Kubernetes settings, I may not necessarily need chaos experiments. If I want, I can use them to get to the right situation, but usually just a standard load test is fair enough: you run the test with a stable load, you measure the response time and the failure rates, and you compare them with requests and limits and without. You measure the CPU throttling and you tweak those settings to at least get the right settings that provide the best response times with the right stability of our apps. So what do we need
for this? First, I will need a Kubernetes cluster, and there are various tools that are going to be deployed. You can see that I have two colors here for the nodes; I've separated them because, if I run the experiments on the same cluster as my app, I want to make sure that the experiments are not impacting my tooling. I need an observability backend solution, Prometheus or Dynatrace; I need a chaos engineering tool, in my case I use Litmus; and a load testing product, I use k6 in my case. For this I have labeled the nodes to make sure that all the tooling that I need for my experiments is placed on dedicated nodes. And then I have other nodes that are dedicated to my app, so I know that my experiments will only impact the nodes for my app and not my testing tools.
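A minimal sketch of that separation, assuming a hypothetical node label such as nodegroup=tooling (with the app nodes labeled, say, nodegroup=apps): the tooling deployments pin themselves with a nodeSelector:

```yaml
# Sketch: pin tooling workloads to dedicated nodes via a hypothetical node label
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k6-runner            # hypothetical name for the load-testing tooling
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k6-runner
  template:
    metadata:
      labels:
        app: k6-runner
    spec:
      nodeSelector:
        nodegroup: tooling   # app workloads would instead target e.g. nodegroup: apps
      containers:
        - name: k6
          image: grafana/k6:latest
```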
Then, like I said, I need some tools. I need a chaos engineering product: I use Litmus Chaos, which is part of the CNCF. Really good product, by the way. There's a web UI called the Chaos Center. Litmus Chaos can be installed either in the same cluster as your app, or it can be installed on a dedicated cluster and it will interact with the cluster where your app is running. In any case, you will need to deploy an agent, either on the same cluster as Litmus Chaos or, if you have a remote cluster, you will have to install the agent there; it comes with the Chaos Exporter. The Chaos Exporter exposes a couple of metrics about our experiments in a Prometheus format, and then there are chaos workflows and so on. I'm not going to go into details on the architecture, but at least keep in mind that those components, the ones I'm showing here on the screen, are the main ones.
What is great with Litmus Chaos is that there is a ChaosHub. The ChaosHub provides 50-plus experiments for Kubernetes, and all the right experiments for our case, node drain, node CPU hog, node memory hog, are already there. So it's perfect: I don't need to reinvent the wheel, I can simply use them.
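To give a rough idea of what pulling one of those ChaosHub experiments looks like, here is a hedged ChaosEngine sketch for node-drain; the exact field and environment variable names vary between Litmus versions, so treat it as an illustration rather than a copy-paste manifest:

```yaml
# Sketch of a Litmus ChaosEngine running the node-drain experiment
# (env names depend on your Litmus version; check the ChaosHub docs for your release)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: node-drain-sa     # service account with permissions on nodes
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            - name: TARGET_NODE          # the node we want to drain (hypothetical value)
              value: "worker-node-1"
            - name: TOTAL_CHAOS_DURATION # how long the node stays drained, in seconds
              value: "120"
```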
The other advantage of Litmus Chaos is that it relies on Argo Workflows, so I can define a really specific workflow combining pure chaos experiments and load tests with k6 in parallel. I'm picking k6 because k6 has an output extension that I personally like: its Prometheus integration.
k6 provides results that are sent to the command line, to stdout, or in JSON or other formats. But in my case, I want k6 to write the statistics into Prometheus, so that I have all the observability in Prometheus or Dynatrace, because Dynatrace will be able to scrape the data from Prometheus. At least I need the response times, the requests, and the failure rates. And I also want to have all the data related to the health of the cluster itself.
For this I will also have a Prometheus in my cluster. If you install the Prometheus Operator, it comes with several components: of course the Prometheus stack itself, but also a couple of exporters. An exporter is a component producing metrics. So we'll have kube-state-metrics to see the state of the various Kubernetes objects, the node exporter for anything related to how healthy my nodes are, and cAdvisor to collect CPU throttling metrics at the container level. Then I have an exporter for Litmus, to be able to collect metrics from the Litmus perspective, and an exporter for k6, to collect metrics from the k6 perspective. With Dynatrace and those components, we will be able to collect the right metrics and push them to Dynatrace. So now you know all the tooling; let's have a look at how we could automate this process. To automate this, we can obviously define a pipeline in Jenkins or any CI/CD system that we have.
So we're going to build, we're going to deploy, we're going to deploy our exporters, we're going to configure Dynatrace or whatever; we can do that because all those tools can be configured through an API or with command lines. And then I'm going to run my test. Fair enough. But then, after the test, someone needs to approve, someone needs to look at the results. So it's not automation anymore, because we actually have a pause here in this automation. So how can I remove that pause? Well, for this I'm going to use another CNCF project
called Keptn. It's an open source solution provided by Dynatrace. Keptn provides several use cases. First you have progressive delivery: you can basically give Keptn the power to deploy, manage and test, similar to a CI/CD process, and also manage production use cases like automations and so on. Or, if I don't want to use everything and just want to rely on my traditional CI/CD system, I can use Keptn only for quality gates, and this is the use case I'm going to use. Or last, I can use the pure SRE and production use case, which is auto-remediation.
Keptn is very easy to configure. It's based on YAML files: a shipyard file, an SLI file and an SLO file. So we need to define the SLIs and the SLOs in Keptn, and those will be used for the quality gate.
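For illustration, here is a hedged sketch of what such an SLO file can look like in Keptn's format; the SLI names, like response_time_p95 and error_rate, are assumptions and must match the SLIs you declared, and the thresholds are just examples:

```yaml
# Sketch of a Keptn slo.yaml used by the quality gate (values are examples)
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  include_result_with_score: "pass"
objectives:
  - sli: response_time_p95      # assumed SLI name, defined in the matching sli.yaml
    pass:
      - criteria:
          - "<=500"             # milliseconds
    warning:
      - criteria:
          - "<=800"
  - sli: error_rate             # assumed SLI name
    pass:
      - criteria:
          - "<1"                # percent
total_score:
  pass: "90%"                   # overall score needed for the gate to be green
  warning: "75%"
```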
And then we can connect our tools.
Keptn is a framework based on CloudEvents, so basically it's an event-driven framework. It's very easy: I can easily connect and disconnect tools. And the beauty is that all the tools I'm going to use are part of Keptn. So if we look at the pipeline
that we had a few minutes ago: I deploy, I run my test, and then, right after the test itself, I will basically send an event to Keptn, there's an API for that, to say: hey Keptn, I just finished the test, could you evaluate the environment during that time frame? I've already expressed a couple of SLIs and SLOs, so Keptn will reach out to the SLIs and SLOs that we've defined for that particular service. There you say, for example: okay, pod failures need to stay under a given threshold, node pressure under 1%, and so on; you basically define what you expect. Once it looks at all the SLIs, it will reach out to the observability backend that has the values, so either Prometheus or Dynatrace, and then it will look at the values and match them against the objectives that we have defined. Then it will present the results of each individual SLO that we've defined in a heat map, like you can see here. And the great thing is that at the end it provides a score, and I will have one SLA based on the score: say my experiments are fine if we have a score of 90%, for example. And all of that happens basically within a minute: as soon as my test ends, Keptn triggers the workflow I just described, gets back the scoring, and from the scoring we say: okay, everything is green, or, the other way around, everything is red.
So let's have a few quick takeaways here, a couple of things. First, in Kubernetes, define quotas, resource quotas, to separate your namespaces; that is a really strong recommendation from the Kubernetes world. Define QoS, so the requests and limits we talked about for a long time. Observability: of course, we need to collect metrics, logs, traces; we need to understand what's going on in our environment, so make sure to have all of that. SREs: of course, define SLIs and SLOs. I mean, we want to be smart, we want to automate, and without SLIs and SLOs it will be very difficult to be efficient. And last, we know that we could face problems related to Kubernetes, so let's test for them utilizing load tests and chaos engineering to validate that our cluster is stable, our users are happy, and there are no surprises in production.
So before we actually finish this presentation, I will do another quick promo for my YouTube channel, Is It Observable? Check out the various episodes there; there is one dedicated to Litmus Chaos, chaos engineering, performance testing and so on. So check it out. And yeah, I'm trying to improve the content, so if you can send some feedback, that would be great, so I can continue producing content and help you guys in your projects. All right, thank you for your time and enjoy the conference.