Kubernetes monitoring - why it is difficult and how to improve it
Abstract
The popularity of Kubernetes has changed the way people deploy and run software. It has also brought additional complexity: Kubernetes itself, microservice architecture, short release cycles - all of these became a challenge for monitoring systems. The truth is, the adoption and popularity of Kubernetes has had a severe impact on the monitoring ecosystem, its design and its tradeoffs.
The talk covers the monitoring challenges of operating Kubernetes, such as increased metrics volume, service ephemerality, pod churn and distributed tracing, and how modern monitoring solutions are designed specifically to address these challenges - and at what cost.
Summary
-
Aliaksandr Valialkin: Kubernetes exposes huge amounts of metrics on itself. Many users of Kubernetes struggle with complexity and monitoring issues. He explains why monitoring Kubernetes is difficult and how to improve it.
-
There are no established standards for metrics at the moment. The community and different companies try to invent their own standards and promote them. This leads to big amounts of metrics in every application. It also leads to outdated Kubernetes dashboards for Grafana. New entities like distributed traces needed to be invented.
-
Kubernetes increases the complexity and metrics footprint of current monitoring solutions. The main complexities are the churn rate of active time series and the huge volumes of metrics at each layer. VictoriaMetrics believes there must be a standard for Kubernetes monitoring.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, today I will talk to you about Kubernetes monitoring: why it is difficult and how to improve it. Let's meet: I'm Aliaksandr Valialkin, VictoriaMetrics founder and core developer. I'm also known as a Go contributor and the author of popular Go libraries such as fasthttp, fastcache and quicktemplate. As you can see, these libraries start with "fast" and "quick" prefixes. This means that these libraries are quite fast, so I'm fond of performance optimizations.
What is VictoriaMetrics? It is a time series database and monitoring solution. It is open source, it is simple to set up and operate, it is cost efficient, highly scalable and cloud ready. We provide Helm charts and an operator for running VictoriaMetrics in Kubernetes.
According to recent surveys, the amount of monitoring data grows two to three times faster than the amount of actual application data, and this is not so good. For instance, some people on Twitter have also put this in terms of money and say that it is not so good because the cost of storing monitoring data increases much faster compared to the cost of storing application data.
According to a recent CNCF survey, many users of Kubernetes struggle with complexity and monitoring issues. As you can see, 27% of these users don't like the state of monitoring in Kubernetes.
So why is Kubernetes monitoring so challenging? The first thing is that Kubernetes exposes big amounts of metrics on itself. You can follow this link and see how many metrics Kubernetes components expose, and the number of exposed metrics grows over time. Let's look at this graph: it shows that the number of unique metric names exposed by Kubernetes components has grown from 150 in 2018, in Kubernetes version 1.10, to more than 500 in Kubernetes 1.24, which was released recently.
The number of unique metrics exposed by applications grows not only in Kubernetes components, but in any application. For instance, the node_exporter component, which is usually used in Kubernetes for monitoring hardware, also keeps increasing its number of unique metrics: the number of metrics in node_exporter increased from 100 to more than 400 over the last five years.
Every Kubernetes node exposes at least 2,500 time series, and this doesn't count application metrics. These series include node_exporter, kubelet and cAdvisor metrics. According to our own study, the average number of such metrics per Kubernetes node is around 4,000. So if you have 1,000 nodes, then your Kubernetes cluster exposes 4 million metrics which have to be collected.
What is the source of such big amounts of metrics? It is the multilayer architecture of modern systems. Let's look at this picture: a hardware server runs virtual machines, each virtual machine runs pods, each pod contains containers, and each container runs an application. All of these levels must have some observability, which means they need to expose some metrics. And if you have multiple containers and multiple pods in Kubernetes, then the number of exposed metrics grows with the number of pods and containers.
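To give a feel for the application layer alone, here is a minimal sketch in Go (not from the talk): a trivial service instrumented with the official Prometheus client already exposes dozens of go_* and process_* series before a single custom metric is registered.

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        // The default registry ships with the Go runtime and process collectors,
        // so /metrics already returns dozens of series for this empty service.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }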
Let's look at a simple example. When you deploy a few Nginx replicas in Kubernetes, they already generate more than 600 new time series according to cAdvisor. And this doesn't count application metrics, that is, the metrics exposed by Nginx itself.
Another issue with Kubernetes monitoring is time series churn, when old series are substituted by new ones. Monitoring solutions don't like a high churn rate because it leads to memory usage issues and CPU time issues.
Kubernetes tends to generate a high churn rate for active time series because of two things. The first is frequent deployments: when you roll out a new deployment, a new set of metrics for that deployment is generated, because every such metric usually contains a pod label whose value is generated automatically by Kubernetes, so every new pod name turns the same metric into a brand-new time series.
Another source of a high churn rate is pod autoscaling events: when pods are scaled, new pod names appear, metrics for these new pods have to be registered in the monitoring system, and this generates a high churn rate.
The number of new time series generated on each deployment or pod autoscaling event can be estimated as the number of per-container metrics for each instance of the application plus the number of application metrics, multiplied by the number of replicas in the deployment and by the number of deployments; a rough sketch of this estimate is shown below.
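As a back-of-the-envelope illustration of this estimate, here is a small Go sketch; all the numbers in it are assumptions made up for the example, not figures from the talk.

    package main

    import "fmt"

    func main() {
        // Illustrative assumptions, not measurements from the talk.
        perContainerMetrics := 100 // e.g. kubelet/cAdvisor series per container
        appMetrics := 150          // series exposed by the application itself
        replicas := 10             // replicas per deployment
        deployments := 50          // deployments affected by a rollout wave

        newSeries := (perContainerMetrics + appMetrics) * replicas * deployments
        fmt.Printf("new time series created by one rollout wave: %d\n", newSeries)
        // With these assumptions: (100+150)*10*50 = 125000 new series,
        // all of which hit the monitoring system as churn.
    }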
As you can see, the churn rate grows with the number of deployments and the number of replicas. So, do we need all these metrics?
The answer is not so easy. Some people say no, we don't need all these metrics, because our monitoring systems use only a small fraction of the collected metrics. But others say yes, we need to collect all these metrics because they may be useful in the future.
How can we determine the exact set of needed metrics? There is mimirtool from Grafana, which scans your recording and alerting rules as well as your dashboard queries, decides which metrics are used and which aren't, and then generates an allowlist of the metrics that are actually used.
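Conceptually, that allowlist is then applied at collection time (usually via relabeling rules) to drop everything that isn't on it. Here is a minimal Go sketch of the idea, with made-up metric names; it is not the mimirtool output format or any real relabeling implementation.

    package main

    import "fmt"

    // keepAllowlisted drops every metric name that is not in the allowlist,
    // which is what allowlist-based relabeling effectively does at scrape time.
    func keepAllowlisted(scraped []string, allowlist map[string]bool) []string {
        kept := make([]string, 0, len(scraped))
        for _, name := range scraped {
            if allowlist[name] {
                kept = append(kept, name)
            }
        }
        return kept
    }

    func main() {
        allowlist := map[string]bool{
            "node_cpu_seconds_total": true,
            "kube_pod_status_phase":  true,
        }
        scraped := []string{
            "node_cpu_seconds_total",
            "node_entropy_available_bits", // scraped but unused -> dropped
            "kube_pod_status_phase",
        }
        fmt.Println(keepAllowlisted(scraped, allowlist))
    }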
For instance, Grafana says that if you have a Kubernetes cluster with three pods, this cluster exposes 40,000 active time series by default. And if you run mimirtool and apply the resulting allowlist via relabeling rules, this reduces the number of active time series from 40,000 to 8,000, which is five times less.
So what does it mean? It means that existing solutions like the kube-prometheus stack collect too many metrics, and most of them are unused. This chart shows that only 24% of the metrics collected by the kube-prometheus stack are actually used by alerts, recording rules and dashboards, while the remaining 76% are never used by the current monitoring setup. This means that you can reduce your monitoring expenses by 76% - more than four times.
Let's talk about monitoring standards. Unfortunately, there are no established standards for metrics at the moment. The community and different companies try to invent their own standards and promote them. For instance, Google promotes the four golden signals, Brendan Gregg promotes the USE method for monitoring, and Weaveworks promotes the RED method. Having so many different standards leads to the situation where nobody follows a single standard: everybody follows a different standard or doesn't follow any standard at all. This leads to big amounts of metrics in every application, and these metrics change over time. You can read many articles and opinions about the most essential metrics, but there is no single source of truth for monitoring.
This also leads to outdated Kubernetes dashboards for Grafana: for instance, the most popular Kubernetes dashboards on Grafana are now outdated. Kubernetes also pushes you towards a microservices architecture, and a microservices architecture has its own monitoring challenges.
Every microservice instance needs its own metrics, and users need to track and correlate events across multiple services. The ephemerality of services makes the situation even worse: ephemerality means that every microservice can be started, redeployed or stopped at any time. Because of this, new entities like distributed traces had to be invented and used in order to improve the observability situation. Microservices talk to each other over the network, so you also need to monitor the network. Services scheduled onto the same node can create a noisy neighbor problem, and this problem also needs to be dealt with. And a service mesh introduces yet another layer of complexity which needs to be monitored.
How does Kubernetes affect monitoring? As you can see from the previous slides, Kubernetes increases the complexity and metrics footprint. Current monitoring solutions such as Prometheus, VictoriaMetrics, Thanos and Cortex are busy overcoming the complexities introduced by Kubernetes. The biggest of these complexities are the churn rate of active time series generated by Kubernetes and the huge volumes of metrics at each layer, and the developers of current monitoring solutions spend big amounts of effort on adapting these tools to Kubernetes. Maybe if there were no Kubernetes, we wouldn't need distributed traces and exemplars, because distributed traces and exemplars are used almost solely for microservices and Kubernetes. And maybe if there were no Kubernetes, all the time spent on overcoming these difficulties in current monitoring solutions could be invested into more useful observability tools, such as automated analysis and correlation of metrics. Who knows?
How does Kubernetes itself deal with millions of metrics? The answer is that it doesn't really deal with them and doesn't provide a good solution: it provides only a couple of flags which can be used for disabling some metrics and some label values. That is not a good solution. How does Prometheus deal with the Kubernetes challenges?
Actually, Prometheus version 2 was created because of Kubernetes: it needed to solve the Kubernetes challenges of a high number of time series and a high churn rate. You can read the announcement of Prometheus version 2 to understand how the internal architecture of Prometheus was redesigned largely to address Kubernetes issues. But still, Kubernetes issues such as high churn rate and high cardinality aren't fully solved in Prometheus and other monitoring solutions.
VictoriaMetrics also has to deal with these Kubernetes challenges. VictoriaMetrics actually appeared as a system that solves the cardinality issues of Prometheus version 1: it is optimized to use lower amounts of memory and disk space when working with high-cardinality time series, and it provides optimizations to overcome the time series churn which is common in Kubernetes. But we at VictoriaMetrics also don't know how to reduce the number of time series: new versions of VictoriaMetrics increase the number of exported time series over time. You can see that the number of metric names exposed by VictoriaMetrics has grown around three times during the last four years, and only 30% of these metrics are actually used by VictoriaMetrics dashboards, alerts and recording rules.
How can we improve the situation? We believe that Kubernetes monitoring complexity must be reduced and the number of exposed metrics must be reduced. The number of histograms must be reduced, because histograms are the biggest cardinality offenders and generate many new time series.
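To make the histogram point concrete, here is a hedged Go sketch (the metric name, labels and counts are made up for illustration): a single histogram vector with the default buckets multiplies into thousands of series once a couple of labels are attached.

    package main

    import (
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
    )

    func main() {
        // One histogram with the default 11 buckets, partitioned by two labels.
        requestDuration := prometheus.NewHistogramVec(prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds", // illustrative name
            Help:    "Request duration in seconds.",
            Buckets: prometheus.DefBuckets, // 11 buckets
        }, []string{"path", "method"})
        prometheus.MustRegister(requestDuration)
        requestDuration.WithLabelValues("/api/v1/query", "GET").Observe(0.042)

        // Each (path, method) pair exposes 11 buckets + the +Inf bucket
        // + _sum + _count = 14 series. With, say, 50 paths and 4 methods:
        fmt.Println("series from one histogram:", 50*4*(len(prometheus.DefBuckets)+3))
    }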
The number of per-metric labels must be reduced. For instance, in Kubernetes it is common practice to put all the labels defined at the pod level onto all the metrics exposed by that pod, and this is probably not correct; we should change this situation. The time series churn rate must also be reduced: the most common sources of time series churn in Kubernetes are horizontal pod autoscaling and deployments, and we should think hard about how to reduce the churn rate coming from these sources.
And we believe that the community will come up with a standard for Kubernetes monitoring which will be much more lightweight and will require collecting a much lower number of metrics compared to the current state of Kubernetes monitoring. So let's do it together. Now you can ask questions.