Transcript
Hello everyone, glad to be here at Cloud Native 2021.
And today I'd like to talk to you about monitoring microservices
the right way. But first, a word about myself. My name is Dotan Horovits. I'm a developer advocate at Logz.io. I've been a
developer, a solutions architect, a product manager. I'm an advocate
of open source software and open communities in general, and the CNCF, the Cloud Native Computing Foundation, in particular. I'm a
co-organizer of the local chapter of the CNCF in Israel,
CNCF Tel Aviv. So if you're around, do join us for the meetups.
I run a podcast on open source observability and DevOps called OpenObservability Talks,
and generally you can find me everywhere at @horovits. A quick
word about Logz.io: we provide a SaaS
platform for cloud native observability, which essentially means that you
can just send your logs, your metrics, your traces to a managed
service for storing, indexing,
analytics and visualization. And the nice thing
is that it's all based on popular open source such as Kibana, Elasticsearch, Prometheus, Jaeger, OpenTelemetry and so on. So if you're interested, you have the link here
for the conference. And with that, enough vendor talk. Let's talk about
monitoring microservices. But before we go there,
let's recap on how we used to do monitoring.
Until that point, the popular combination was StatsD and Graphite, two popular open source tools. We had simple apps that pushed metrics to StatsD over UDP. The StatsD server aggregated the metrics and sent them over TCP to the Graphite backend for storing and visualization. A metric would look something like what you can see here on the screen. It's Graphite's hierarchical dot notation, essentially representing the way that the app is deployed. So in this case, you have prod, the production environment; inside it, the web server; inside it, the HTML service; and the respective metrics for that service. And it all worked very fine and well for quite some time.
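To make the push model concrete, here is a minimal sketch in Python of what that app-side push could look like; the host, port and metric path are assumptions for illustration, not taken from the talk.

```python
import socket

# Hypothetical StatsD endpoint; 8125 is the conventional StatsD UDP port.
STATSD_HOST, STATSD_PORT = "localhost", 8125

def send_counter(metric_path: str, value: int = 1) -> None:
    # StatsD line protocol: "<metric>:<value>|c" marks a counter increment.
    payload = f"{metric_path}:{value}|c".encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (STATSD_HOST, STATSD_PORT))

# Graphite-style hierarchical name: environment.server.service.metric
send_counter("prod.webserver.html.request_count")
```

Notice that the environment, the server and the service all live inside the metric name itself; that detail is exactly what becomes limiting later on.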
But then came microservices and cloud native architectures, and things started getting messy. So I'd like to look at these new paradigm shifts and the new challenges that they brought to monitoring. The first, obviously,
is microservices, which brought an explosion
in the number of discrete entities that we need to monitor. Here on
the screen, you can see the famous death stars by Amazon
and Netflix. The microservice architecture diagrams with
thousands of interacting proprietary microservices. But even in much more modest deployments, what used to be a single
monolith is now dozens or more microservices,
each one a cluster with multiple instances of course, each one potentially
running its own programming language and databases,
each one independently deployed and scaled and upgraded,
which means a very dynamic and ephemeral environment.
So from a monitoring perspective, not only are we talking about an explosion
in the number of entities that we need to monitor, but also a very,
very diverse and dynamic environment to
monitor, which means of course high volume of metrics and
many dimensions. So microservices is one trend
and the other one, the other paradigm shift is the shift to cloud native
architecture. And with that, now we need to monitor applications spanning
multiple containers, multiple nodes on multiple
namespaces, deployment versions, potentially over fleets of clusters.
And this introduces many additional dimensions to
my metrics, what's called high cardinality metrics,
because now I can ask about my microservice performance per
pod, per node, per version and so on. In addition, the container runtime, Docker or another, and Kubernetes itself are additional critical systems that I now need to monitor. In this example, you can see various Kubernetes services in this diagram, like the kubelet and kube-proxy on the node itself, or, on the control plane, etcd
and the scheduler and so on. So each and every one of those I need
to now monitor as well. So cloud native is the second paradigm
shift and the next one that I would like to talk about is open
source and cloud services. Now it's not so new obviously,
and it's not specific to microservices and cloud native, but it definitely got a boost from them. And any typical system these days has multiple
libraries and tools and frameworks for building
the system itself end to end, including web servers and
databases and NoSQL databases and API gateways and message
brokers and queues and you name it. For example, here in this diagram you can see a typical data processing pipeline; just see how many potential third party tools, whether open source or cloud services, you need in order to implement this. And this is only the data processing part of the system, and we need to monitor all these third parties. So monitoring systems now need to provide integration with a large and dynamic ecosystem of third party platforms to provide complete observability. And let's face it, this tech stack keeps updating at an increasing pace. The chase after the latest tech stack is becoming crazy,
and so is keeping the monitoring up to speed with the latest and
greatest. So just to recap, we've seen several challenges.
First is the shift to microservices that introduced an explosion
in the number of entities, high volume of metrics with many dimensions.
Then cloud native introduced more layers,
more virtualization, adding many additional dimensions like
per pod, per node, per version and so on. And also new critical
systems to monitor. And thirdly, the large and dynamic ecosystem
of third party platforms that we use in our system, whether open source or cloud services, which we need to monitor. So new challenges obviously call for new capabilities.
And I would like now to look at the requirements that you'd
be looking for when designing your monitoring system to make
sure that it can meet these challenges. The first requirement would be
flexible querying. Now, we saw before this hierarchy notation for metrics in Graphite that worked well for static, machine-centric metrics.
But as we saw, these modern systems are much more
dynamic and much more distributed and have many, many more
dimensions, high cardinality. And it's becoming quite
restrictive to try and represent that with this hierarchy model.
For example, if we take the same example of this web server now,
these days, this is probably going to be a service deployed across multiple pods on multiple nodes. And then I
may want to look at this metric on a specific pod,
or maybe on all the pods on a specific node. Or then again I may
want to look at it across the service which spans the entire cluster.
And in the dot notation, introducing a new dimension
effectively creates a new metric which makes it difficult to expose highly
dimensional data, and it also makes it difficult to do query-based aggregations. So if I want to group by some dimension that I haven't pre-aggregated in advance, like, I don't know, getting all the 500 response codes out of a bunch of servers, how do I do that? With a hierarchy model, it's quite difficult.
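Just to illustrate that restriction, here is a hypothetical Python sketch of what the same counter turns into once a pod and environment dimension are baked into Graphite-style names; the paths and numbers are made up for illustration.

```python
# Every new dimension value becomes a brand new metric path, with a fixed position.
metrics = {
    "prod.webserver.pod1.html.http_500_count": 3,
    "prod.webserver.pod2.html.http_500_count": 7,
    "staging.webserver.pod1.html.http_500_count": 1,
}

# "All the 500s across production pods" becomes positional string matching,
# because that grouping wasn't pre-aggregated into its own path.
total_prod_500s = sum(
    count for path, count in metrics.items()
    if path.startswith("prod.webserver.") and path.endswith(".html.http_500_count")
)
print(total_prod_500s)  # 10
```

Any ad hoc grouping you didn't plan for up front ends up looking like this.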
So this restriction drove the shift to more flexible querying, namely from the hierarchy model to a labeling model. This was nicely introduced with the PromQL query language by the Prometheus open source project. Since then, by the way,
it's been adopted by many others of course. And if we look at the same
example, what it means is essentially that each metric comes
with a set of labels, a list of labels. Each label
is essentially a key value pair.
And from there I can just query on any dimension, any label, or any combination of labels that interests me. In this example, if I want per pod, I just specify server equals pod7. Or if I want the entire service, I'll say service equals nginx. And of course I can combine them. And this
is a pretty basic example. If you think about real
life web applications that have the web server software and the environment and the HTTP method and the error code or HTTP response code and the endpoints and so on, the number of ad hoc queries that I
may want to create is almost endless. So if I want to
ask questions like: what's the total number of requests per web server pod in production? Or what's the number of HTTP errors on the NGINX server for a specific endpoint in staging? Questions like that become easy to query with the labeling model.
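To give a feel for those ad hoc queries, here is a minimal sketch that runs label-based PromQL queries against the Prometheus HTTP API; the Prometheus address, metric name and label values are assumptions for illustration.

```python
import requests

# Hypothetical Prometheus server; /api/v1/query is the instant-query endpoint.
PROM_URL = "http://localhost:9090/api/v1/query"

queries = [
    # Request rate for one specific pod of the nginx service:
    'rate(http_requests_total{service="nginx", pod="pod7"}[5m])',
    # HTTP 5xx errors in production, grouped per pod across the whole service:
    'sum by (pod) (rate(http_requests_total{service="nginx", env="prod", code=~"5.."}[5m]))',
]

for promql in queries:
    result = requests.get(PROM_URL, params={"query": promql}).json()
    print(promql, "->", result["data"]["result"])
```

The point is that pod, env and code are just labels, so any combination of them can be filtered or grouped at query time, without pre-aggregating anything.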
So flexible querying is definitely the first requirement that we would like to have. Next up is metric scraping and auto discovery. As we've seen with StatsD, apps used to push metrics. But systems these days, as we talked about, have many microservices and many third party tools and frameworks, and how do you get each and every microservice, and especially third parties whose code you don't control, to push metrics to your monitoring back end?
It's rather impractical. What we would like in the new approach is
for the monitoring system itself to
discover the services or the components automatically and essentially
pull the metrics off of them. The Prometheus project actually introduced exactly this notion. Prometheus can detect the services automatically, called targets in Prometheus terms, and then pull the metrics off these targets, scrape the metrics in Prometheus terms. It can do that thanks to OpenMetrics. And OpenMetrics
is an open standard for transmitting metrics that has become the de facto standard for exposing metrics in the industry. It's in the CNCF sandbox, and it's based on the Prometheus exposition format.
It's proposed as a standard these days to the IETF.
But most importantly, it's widely adopted. Many common tools and frameworks expose metrics out of the box using this format. So whether you use Kafka or RabbitMQ or MongoDB or MySQL or Apache, NGINX, Jenkins, GitHub, cloud services, you name it, more likely than not they expose their metrics using this format. And by the way, it's very easy to check: you typically just go to the /metrics endpoint. You can even see it in the browser because it's textual
and it's very, very popular. And this large ecosystem is key when dealing with such diverse and dynamic systems, as we talked about.
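As a rough idea of what exposing such an endpoint looks like on the application side, here is a minimal sketch using the Python prometheus_client library; the metric name, labels and port are assumptions for illustration.

```python
import random
import time

from prometheus_client import Counter, start_http_server

# Counter with labels; it is rendered in the Prometheus text exposition format
# and exposed as http_requests_total.
REQUESTS = Counter("http_requests", "Total HTTP requests", ["service", "code"])

if __name__ == "__main__":
    start_http_server(8000)  # serves http://localhost:8000/metrics
    while True:
        # Simulate some traffic so there is something to scrape.
        REQUESTS.labels(service="nginx", code=random.choice(["200", "500"])).inc()
        time.sleep(1)
```

Opening http://localhost:8000/metrics in a browser then shows plain text lines like http_requests_total{code="200",service="nginx"} 42.0, which is exactly what the scraper pulls.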
Another nice thing about Prometheus is the native integration
with Kubernetes and the CNCF ecosystem, which is not surprising
because Prometheus is part of the CNCF.
It was the second project to graduate there, after Kubernetes. And Prometheus can pull from Kubernetes, both for discovery, so it can discover the services running on Kubernetes through the Kubernetes API, and also to fetch the metrics, with plugins for the Kubernetes API and Kubernetes metrics. It can also use Consul and others. So it's a very nice integration that makes it very seamless. If you run on cloud native and Kubernetes, it would be very
easy to start with Prometheus. Also, if you run on cloud services,
Prometheus can plug into the service discovery of
different cloud environments and pull from there.
So very nice integration and ecosystem there.
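As a rough sketch of what that discovery looks like in practice, a Prometheus scrape configuration might use kubernetes_sd_configs along these lines; the job name, role and annotation convention here are illustrative assumptions, not an exact config from the talk.

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod            # discover scrape targets from the Kubernetes API
    relabel_configs:
      # Keep only pods that opted in via the common prometheus.io/scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Prometheus then keeps the target list in sync as pods come and go, which is what makes it work in such a dynamic environment.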
And the next requirement is scalability.
As we've seen, systems emit massive amounts of
high cardinality metrics, which means high volume of
time series data that we need to store and query effectively.
Prometheus, with all its virtues that we've seen, is still
by design a single node installation, which means that it
doesn't scale horizontally. One option of scaling Prometheus is
to compose Prometheus instances into a federated
architecture, shard the metrics and so
on, but that can become quite complicated to manage.
Another option is to write to a long term storage backend.
Prometheus has a built in capability to remote write to a
backend store. And again, there is a very rich ecosystem
of time series databases that can serve
as long term storage for Prometheus, whether proprietary or open source tools, some of them also under the CNCF itself, like Thanos and Cortex. So you have quite a variety to choose from, both ones that are self managed and ones that are cloud services like Logz.io, with different trade-offs.
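For the long term storage route, the built-in remote write capability is configured roughly like this; the endpoint URL and credentials are placeholders, and the exact settings depend on the backend you choose.

```yaml
remote_write:
  - url: "https://metrics-backend.example.com/api/v1/write"   # hypothetical endpoint
    basic_auth:
      username: "prometheus"
      password: "<token>"
    queue_config:
      max_samples_per_send: 1000   # tune batching for your volume
```

Prometheus keeps doing the scraping locally and streams the samples out, so the heavy storage and querying can scale independently in the backend.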
So to summarize, we've seen that monitoring microservices and cloud native systems introduces challenges: massive amounts of instances, high volume and diversity, highly distributed systems, high dimensionality and cardinality, and many third parties involved. And in order to monitor that, you'd look for, first, flexible querying with high cardinality, with the labeling model being the prime model for these sorts of ad hoc queries; then efficient metric scraping with auto discovery, so we don't need to bother about each and every piece shipping its metrics to your system; and thirdly, the scalability to handle large volumes of metrics. As an open
source fan, I'm glad to say that open source is taking the lead here.
We talked about Prometheus as a popular tool
for that, OpenMetrics as the emerging open standard for that, many time series databases that can serve as long term storage to scale Prometheus, and very good integration with Kubernetes and the CNCF ecosystem.
And by the way, also beyond CNCF, the general ecosystem,
open source is definitely a driver here and is also a good starting point
for your journey if you're looking for solutions. If you're looking for
more information, I wrote a blog post about this topic in a bit more detail and with some examples. I also have an interesting episode of the OpenObservability Talks podcast talking about this topic, and obviously there are the open source forums themselves, whether Prometheus or OpenMetrics. And since these things change so quickly, the ongoing discussions on the CNCF Slack channels and the Gitter and the mailing lists are obviously more than welcome, to make sure that you're up to speed.
And of course, feel free to reach out to me with
any question or comment. I'd be more than happy to follow
up. I'm Dotan Horovits and thank you very much
for listening.