Kubernetes monitoring - why it is difficult and how to improve it
            
            
            
              
              
            
            
           
          
            
              Abstract
            
The popularity of Kubernetes has changed the way people deploy and run software. It also brought additional complexity: Kubernetes itself, microservice architecture and short release cycles have all become a challenge for monitoring systems. The truth is, the adoption and popularity of Kubernetes has had a severe impact on the monitoring ecosystem, on its design and tradeoffs.
The talk covers the monitoring challenges of operating Kubernetes, such as increased metrics volume, service ephemerality, pod churn and distributed tracing, and
how modern monitoring solutions are designed specifically to address these challenges, and at what cost.
           
          
          
          
            
              Summary
            
            
              
              - 
Aliaksandr Valialkin: Kubernetes exposes huge amounts of metrics on itself. Many users of Kubernetes struggle with its complexity and with monitoring issues. He explains why monitoring Kubernetes is difficult and how to improve it.
              
 
              
              - 
There are no established standards for metrics at the moment. The community and different companies try to invent their own standards and promote them. This leads to big amounts of metrics in every application, and to outdated Kubernetes dashboards for Grafana. New entities like distributed traces need to be invented.
              
 
              
              - 
Kubernetes increases the complexity and metrics footprint of current monitoring solutions. The biggest complexities are the churn rate of active time series and the huge volume of metrics at each layer. VictoriaMetrics believes there must be a standard for Kubernetes monitoring.
              
 
              
            
           
          
            
              Transcript
            
            
              This transcript was autogenerated. To make changes, submit a PR.
            
            
            
            
Hello everybody. Today I will talk to you about Kubernetes monitoring: why it is difficult and how to improve it. Let's meet: I'm Aliaksandr Valialkin, VictoriaMetrics founder and core developer. I'm also known as a Go contributor and the author of popular Go libraries such as fasthttp, fastcache and quicktemplate. As you can see, these libraries start with "fast" and "quick" prefixes, which means that these libraries are quite fast. So I'm fond of performance optimizations.
            
            
            
What is VictoriaMetrics? It is a time series database and monitoring solution. It is open source, it is simple to set up and operate, it is cost efficient, highly scalable and it is cloud ready. We provide Helm charts and an operator for running VictoriaMetrics in Kubernetes.

According to recent surveys, the amount of monitoring data increases two to three times faster than the amount of actual application data, and this is not so good. For instance, some people on Twitter have also noticed this and say it is not so good because the cost of storing monitoring data grows much faster than the cost of storing application data.
            
            
            
According to the recent CNCF survey, many users of Kubernetes struggle with complexity and monitoring issues. As you can see, 27% of these users don't like the state of monitoring in Kubernetes.
            
            
            
So why is Kubernetes monitoring so challenging? The first thing is that Kubernetes exposes big amounts of metrics on itself. You can follow this link and see how many metrics the Kubernetes components expose, and the number of exposed metrics grows over time. Let's look at this graph: it shows that the number of unique metric names exposed by Kubernetes components has grown from 150 in 2018, in Kubernetes 1.10, to more than 500 in Kubernetes 1.24, which was released recently.
            
            
            
The number of unique metrics exposed by applications grows not only in Kubernetes services but in any application. For instance, node_exporter, a component commonly used in Kubernetes for monitoring hardware, also keeps increasing its number of unique metrics: it went from 100 to more than 400 over the last five years.
            
            
            
Every Kubernetes node exports at least 2500 time series, and this doesn't count application metrics. These series include node_exporter, kubelet and cAdvisor metrics. According to our study, the average number of such metrics per Kubernetes node is around 4000. So if you have 1000 nodes, your Kubernetes cluster will expose 4 million metrics which should be collected.
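The arithmetic above can be sketched in a few lines; the per-node figure is the average from our study, and the rest is illustrative:

```python
# Rough estimate of the metrics volume a Kubernetes cluster exposes.
# ~4000 active series per node is the average quoted above; it covers
# node_exporter, kubelet and cAdvisor metrics, not application metrics.
SERIES_PER_NODE = 4000

def cluster_series(nodes: int, series_per_node: int = SERIES_PER_NODE) -> int:
    """Total active time series the cluster exposes for collection."""
    return nodes * series_per_node

print(cluster_series(1000))  # 1000 nodes -> 4000000 series
```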
            
            
            
What is the source of such big amounts of metrics? It is the multilayer architecture of modern systems. Let's look at this picture: a hardware server contains virtual machines, each virtual machine contains pods, each pod contains containers, and each container runs an application. All these levels must have some observability, which means they need to expose metrics. And if you have multiple containers and multiple pods in Kubernetes, the number of exposed metrics grows with the number of pods and containers.
            
            
            
Let's look at a simple example. When you deploy a few Nginx replicas in Kubernetes, these replicas already generate more than 600 new time series according to cAdvisor. And this doesn't count application metrics, meaning the metrics exposed by Nginx itself.
            
            
            
Another issue with Kubernetes monitoring is time series churn, when old series are substituted by new ones. Monitoring solutions don't like a high churn rate because it leads to memory usage issues and CPU time issues.
            
            
            
Kubernetes tends to generate a high churn rate for active time series because of two things. The first is frequent deployments: when you roll out a new deployment, a new set of metrics is generated for it, because every such metric usually contains a pod label, and the pod name is usually generated automatically by Kubernetes. The other source of high churn rate is pod autoscaling events: when pods scale, new pod names appear, and the metrics for these new pods must be registered in the monitoring system, which again generates churn. The number of new metrics generated with each deployment or autoscaling event can be estimated as the number of container-level metrics per instance of the application, plus the number of application metrics, multiplied by the number of replicas of the deployment and by the number of deployments. As you can see, the churn rate grows with the number of deployments and the number of replicas. Do we need all these metrics?
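Before answering, the estimate above can be written down as a tiny sketch; the per-container and per-application numbers below are illustrative placeholders, not measured values:

```python
def new_series_per_rollout(container_series: int,
                           app_series: int,
                           replicas: int,
                           deployments: int = 1) -> int:
    """Estimate how many brand-new time series a rollout or autoscaling
    event registers: every replica gets a fresh pod name, so all of its
    series are new to the monitoring system."""
    return (container_series + app_series) * replicas * deployments

# Illustrative: ~200 container-level series, ~300 application series,
# 10 replicas -> 5000 new series on every rollout of this one deployment.
print(new_series_per_rollout(200, 300, 10))
```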
            
            
            
The answer is not so easy. As you see, some people say no, we don't need all these metrics, because our monitoring systems use only a small fraction of the collected metrics. But others say yes, we need to collect all these metrics because they may be used in the future.
            
            
            
How do we determine the exact set of needed metrics? There is the mimirtool utility from Grafana, which scans your recording and alerting rules, and also scans your dashboard queries, decides which metrics are used and which aren't, and then generates an allowlist of used metrics. For instance, Grafana says that a default Kubernetes cluster with three pods exposes 40,000 active time series, and if you run mimirtool and apply the allowlist via relabeling rules, this reduces the number of active time series from 40,000 to 8,000. This means more than five times less.
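The allowlisting idea itself is simple: keep only the series whose metric names are referenced by some dashboard query or rule. A minimal sketch, with made-up metric names for illustration:

```python
def apply_allowlist(series, used_names):
    """Keep only series whose metric name appears in at least one
    dashboard query, alerting rule or recording rule."""
    return [s for s in series if s["__name__"] in used_names]

scraped = [
    {"__name__": "node_cpu_seconds_total", "instance": "n1"},
    {"__name__": "node_memory_MemAvailable_bytes", "instance": "n1"},
    {"__name__": "node_scrape_collector_duration_seconds", "instance": "n1"},
]
# Suppose analysis found that only these names are actually queried:
used = {"node_cpu_seconds_total", "node_memory_MemAvailable_bytes"}
kept = apply_allowlist(scraped, used)
print(len(kept))  # 2 of 3 series survive
```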
            
            
            
So what does it mean? It means that existing solutions like the kube-prometheus stack collect too many metrics, and most of them are unused. This chart shows that only 24% of the metrics collected by the kube-prometheus stack are actually used by alerts, recording rules and dashboards, while 76% of the metrics are never used by the current monitoring setup. This means you could reduce your expenses on monitoring by 76%; that's more than four times.
            
            
            
Let's talk about monitoring standards. Unfortunately, there are no established standards for metrics at the moment. The community and different companies try to invent their own standards and promote them. For instance, Google promotes the four golden signals, Brendan Gregg promotes the USE method for monitoring, and Weaveworks promotes the RED method. But so many different standards lead to the situation where nobody follows a single standard: everybody follows different standards or doesn't follow any standard at all. This leads to big amounts of metrics in every application, and these metrics change over time. You can read many articles and opinions about the most essential metrics, and there is no single source of truth for monitoring. This also leads to outdated Kubernetes dashboards for Grafana: for instance, the most popular dashboards in Grafana are now outdated.
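As an illustration of how small such a standard could be, here is a toy RED-style (Rate, Errors, Duration) tracker. It is a sketch of the idea only, not a real client library:

```python
class REDTracker:
    """Toy RED-method tracker: a service exposes just three signals."""
    def __init__(self) -> None:
        self.requests = 0    # Rate: total requests served
        self.errors = 0      # Errors: failed requests
        self.duration = 0.0  # Duration: total seconds spent serving

    def observe(self, seconds: float, failed: bool = False) -> None:
        """Record one finished request."""
        self.requests += 1
        self.errors += int(failed)
        self.duration += seconds

red = REDTracker()
red.observe(0.050)
red.observe(0.200, failed=True)
print(red.requests, red.errors)  # 2 1
```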
            
            
            
Kubernetes also pushes you towards a microservices architecture, and microservices architecture has its own challenges. Every microservice instance needs its own metrics, and users need to track and correlate events across multiple services. Ephemeral services make the situation worse: ephemerality means that every microservice can be started, redeployed or stopped at any time. Because of this, new entities like distributed traces had to be invented and used in order to improve observability for microservices. Microservices talk to each other via the network, so you need to monitor networking. Services co-located on one node can create a noisy neighbor problem, which also needs to be resolved. And a service mesh introduces yet another layer of complexity which needs to be monitored.
            
            
            
How does Kubernetes affect monitoring? As you can see from the previous slides, Kubernetes increases the complexity and metrics footprint of current monitoring solutions. Solutions such as Prometheus, VictoriaMetrics, Thanos and Cortex are busy overcoming the complexities introduced by Kubernetes. The biggest complexities are the churn rate of active time series and the huge volumes of metrics at each layer, and the developers of current monitoring solutions spend big amounts of effort on adapting these tools for Kubernetes. Maybe if there were no Kubernetes, we wouldn't need distributed traces and exemplars, because they are used almost solely for microservices and Kubernetes. And maybe if there were no Kubernetes, all the time spent on overcoming these difficulties in current monitoring solutions could be invested into more useful observability tools, such as automated root cause analysis or metrics correlation. Who knows? How does Kubernetes itself deal with millions of metrics?
            
            
            
The answer is that Kubernetes doesn't provide a good solution. It provides only a couple of flags which can be used for blacklisting, i.e. disabling, some metrics and label values. That's not such a good solution. How does Prometheus deal with the Kubernetes challenges? Actually, Prometheus version 2 was created because of Kubernetes: it needed to solve the Kubernetes challenges of a high number of time series and a high churn rate. You can read the announcement of Prometheus version 2 to understand how its internal architecture was redesigned solely for solving Kubernetes issues.
            
            
            
But still, Kubernetes issues such as high churn rate and high cardinality aren't fully solved in Prometheus and other monitoring solutions. VictoriaMetrics also deals with the Kubernetes challenges: VictoriaMetrics appeared as a system which solves the cardinality issues seen with Prometheus. It is optimized for using lower amounts of memory and disk space when working with high-cardinality series, and it provides optimizations to overcome the time series churn which is common in Kubernetes. Still, we at VictoriaMetrics also don't know how to reduce the number of time series: new versions of VictoriaMetrics increase the number of exported time series over time. You can see that the number of unique metric names exposed by VictoriaMetrics grew around three times during the last four years, and only 30% of these metrics are actually used by VictoriaMetrics dashboards, alerts and recording rules.
            
            
            
How can we improve the situation? We believe that Kubernetes monitoring complexity must be reduced. The number of exposed metrics must be reduced. The number of histograms must be reduced, because histograms are the biggest cardinality offenders: each histogram generates many new time series. The number of per-metric labels must be reduced: for instance, in Kubernetes it is common practice to put all the labels defined at the pod level onto all the metrics exposed by that pod, and this is probably not correct; we should change this situation. The time series churn rate must be reduced: the most common sources of churn in Kubernetes are horizontal pod autoscaling and deployments, and we should think hard about how to reduce the churn rate from these sources.
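Until exposition improves, the practical lever is dropping at scrape time. Prometheus-compatible scrapers (Prometheus itself, vmagent) support `metric_relabel_configs`; the regex and label names below are illustrative examples, not recommendations:

```yaml
scrape_configs:
  - job_name: kubelet
    # ... kubernetes_sd_configs and auth omitted ...
    metric_relabel_configs:
      # Drop verbose metric families no dashboard or rule uses
      # (illustrative regex; derive the real list from your own queries).
      - source_labels: [__name__]
        regex: "kubelet_runtime_operations_duration_seconds_.*"
        action: drop
      # Remove pod-level labels that were copied onto every metric
      # (illustrative label names).
      - action: labeldrop
        regex: "pod_template_hash|controller_revision_hash"
```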
            
            
            
And we believe that the community will come up with a standard for Kubernetes monitoring which will be much more lightweight and will need to collect a much lower number of metrics compared to the current state of Kubernetes monitoring. So let's do it together.
            
            
            
              Now you can ask questions.