Transcript
Hello everyone, glad to be here at Cloud Native 2021.
And today I'd like to talk to you about monitoring microservices
the right way. But first, a word about myself. My name is Dotan Horovits. I'm a developer advocate at Logz.io. I've been a
developer, a solutions architect, a product manager. I'm an advocate
of open source software and open communities in general, and the CNCF, the Cloud Native Computing Foundation, in particular. I'm a
co-organizer of the local chapter of the CNCF in Israel,
CNCF Tel Aviv. So if you're around, do join us for the meetups.
I run a podcast on open source observability and DevOps called OpenObservability Talks,
and generally you can find me everywhere at @horovits. A quick
word about Logz.io: we provide a SaaS
platform for cloud native observability, which essentially means that you
can just send your logs, your metrics, your traces to a managed
service for storing, indexing,
analytics and visualization. And the nice thing
is that it's all based on popular open source such as Kibana, Elasticsearch, Prometheus, Jaeger, OpenTelemetry and so on. So if you're interested, you have the link here
for the conference. And with that, enough vendor talk. Let's talk about
monitoring microservices. But before we go there,
let's recap on how we used to do monitoring.
Until that point, the popular combination was StatsD and Graphite, two popular open source tools. We had simple apps that pushed metrics to StatsD over UDP. The StatsD server aggregated the metrics and sent them over TCP to the Graphite backend for storing and visualization. A metric would look something like what you can see here on the screen. It's Graphite's hierarchical dot notation, essentially representing the way that the app is deployed. So in this case, you have prod, the production environment; inside it, the web server; inside it, the HTML service; and the respective metrics for that service. And it all worked very fine and well for quite some time.
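To make the push model concrete, here is a minimal sketch in Python of what that app-side push could look like; the host, port and metric path are assumptions for illustration, not taken from the talk.

```python
import socket

# Hypothetical StatsD endpoint; 8125 is the conventional StatsD UDP port.
STATSD_HOST, STATSD_PORT = "localhost", 8125

def send_counter(metric_path: str, value: int = 1) -> None:
    # StatsD line protocol: "<metric>:<value>|c" marks a counter increment.
    payload = f"{metric_path}:{value}|c".encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (STATSD_HOST, STATSD_PORT))

# Graphite-style hierarchical name: environment.server.service.metric
send_counter("prod.webserver.html.request_count")
```

Notice that the environment, the server and the service all live inside the metric name itself; that detail is exactly what becomes limiting later on.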
But then came microservices and cloud native architectures, and things started getting messy. So I'd like to look at these new paradigm shifts and the new challenges that they brought to monitoring. The first, obviously,
is microservices, which brought an explosion
in the number of discrete entities that we need to monitor. Here on
the screen, you can see the famous death stars by Amazon
and Netflix. The microservice architecture diagrams with
thousands of interacting proprietary microservices. But even in much more modest deployments, what used to be a single
monolith is now dozens or more microservices,
each one a cluster with multiple instances of course, each one potentially
running its own programming language and databases,
each one independently deployed and scaled and upgraded,
which means a very dynamic and ephemeral environment.
So from a monitoring perspective, not only are we talking about an explosion
in the number of entities that we need to monitor, but also a very,
very diverse and dynamic environment to
monitor, which means of course high volume of metrics and
many dimensions. So microservices is one trend
and the other one, the other paradigm shift is the shift to cloud native
architecture. And with that, now we need to monitor applications spanning
multiple containers, multiple nodes on multiple
namespaces, deployment versions, potentially over fleets of clusters.
And this introduces many additional dimensions to
my metrics, what's called high cardinality metrics,
because now I can ask about my microservice performance per
pod, per node, per version and so on. In addition, the container runtime, Docker or another, and Kubernetes itself are additional critical systems that I now need to monitor. In this example, you can see various Kubernetes services in this diagram, like the kubelet and kube-proxy on the node itself, or, on the control plane, etcd
and the scheduler and so on. So each and every one of those I need
to now monitor as well. So cloud native is the second paradigm
shift and the next one that I would like to talk about is open
source and cloud services. Now it's not so new obviously,
and it's not specific to microservices and cloud native, but it definitely got a boost from them. And any typical system these days has multiple
libraries and tools and frameworks for building
the system itself end to end, including web servers and
databases and NoSQL databases and API gateways and message
brokers and queues and you name it. For example, here in this diagram you can see a typical data processing pipeline; just see how many potential third party tools, whether open source or cloud services, you need in order to implement this. And this is only the data processing part of the system, and we need to monitor all these third parties. So monitoring systems now need to provide integration with a large and dynamic ecosystem of third party platforms to provide complete observability. And let's face it, this tech stack keeps updating at an increasing pace. The chase after the latest tech stack is becoming crazy,
and so is keeping the monitoring up to speed with the latest and
greatest. So just to recap, we've seen several challenges.
First is the shift to microservices that introduced an explosion
in the number of entities, high volume of metrics with many dimensions.
Then cloud native introduced more layers,
more virtualization, adding many additional dimensions like
per pod, per node, per version and so on. And also new critical
systems to monitor. And thirdly, the large and dynamic ecosystem
of third party platforms that we use in our system, whether open source or cloud services, which we need to monitor. So new challenges obviously call for new capabilities.
And I would like now to look at the requirements that you'd
be looking for when designing your monitoring system to make
sure that it can meet these challenges. The first requirement would be
flexible querying. Now, we saw before this hierarchy notation for metrics in Graphite that worked well for static, machine-centric metrics.
But as we saw, these modern systems are much more
dynamic and much more distributed and have many, many more
dimensions, high cardinality. And it's becoming quite
restrictive to try and represent that with this hierarchy model.
For example, if we take the same example of this web server now,
these days, this is probably going to be a service deployed across multiple pods on multiple nodes. And then I
may want to look at this metric on a specific pod,
or maybe on all the pods on a specific node. Or then again I may
want to look at it across the service which spans the entire cluster.
And in the dot notation, introducing a new dimension
effectively creates a new metric which makes it difficult to expose highly
dimensional data, and it also makes it difficult to do query-based aggregations. So if I want to group by some dimension that I haven't pre-aggregated in advance, like, I don't know, getting all the 500 response codes out of a bunch of servers, how do I do that? With a hierarchy model, it's quite difficult.
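Just to illustrate that restriction, here is a hypothetical Python sketch of what the same counter turns into once a pod and environment dimension are baked into Graphite-style names; the paths and numbers are made up for illustration.

```python
# Every new dimension value becomes a brand new metric path, with a fixed position.
metrics = {
    "prod.webserver.pod1.html.http_500_count": 3,
    "prod.webserver.pod2.html.http_500_count": 7,
    "staging.webserver.pod1.html.http_500_count": 1,
}

# "All the 500s across production pods" becomes positional string matching,
# because that grouping wasn't pre-aggregated into its own path.
total_prod_500s = sum(
    count for path, count in metrics.items()
    if path.startswith("prod.webserver.") and path.endswith(".html.http_500_count")
)
print(total_prod_500s)  # 10
```

Any ad hoc grouping you didn't plan for up front ends up looking like this.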
So this restriction drove the shift to more flexible querying, namely from the hierarchy model to a labeling model. This was nicely introduced with the PromQL query language by the Prometheus open source project. Since then, by the way,
it's been adopted by many others of course. And if we look at the same
example, what it means is essentially that each metric comes
with a set of labels, a list of labels. Each label
is essentially a key value pair.
And from there I can just query on any dimension, any label, or any combination of labels that interests me. In this example, if I want per pod, I just specify server equals pod7. Or if I want the entire service, I'll say service equals nginx. And of course I can combine them. And this
is a pretty basic example. If you think about real
life web applications that have the web server software and the environment and the HTTP method and the error code or HTTP response code and the endpoints and so on, the number of ad hoc queries that I
may want to create is almost endless. So if I want to
ask questions like: what's the total number of requests per web server pod in production? Or what's the number of HTTP errors on the NGINX server for a specific endpoint in staging? Questions like that become easy to query with the labeling model.
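To give a feel for those ad hoc queries, here is a minimal sketch that runs label-based PromQL queries against the Prometheus HTTP API; the Prometheus address, metric name and label values are assumptions for illustration.

```python
import requests

# Hypothetical Prometheus server; /api/v1/query is the instant-query endpoint.
PROM_URL = "http://localhost:9090/api/v1/query"

queries = [
    # Request rate for one specific pod of the nginx service:
    'rate(http_requests_total{service="nginx", pod="pod7"}[5m])',
    # HTTP 5xx errors in production, grouped per pod across the whole service:
    'sum by (pod) (rate(http_requests_total{service="nginx", env="prod", code=~"5.."}[5m]))',
]

for promql in queries:
    result = requests.get(PROM_URL, params={"query": promql}).json()
    print(promql, "->", result["data"]["result"])
```

The point is that pod, env and code are just labels, so any combination of them can be filtered or grouped at query time, without pre-aggregating anything.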
So flexible querying is definitely the first requirement that we would like to have. Next up is metric scraping and auto discovery. As we've seen with StatsD, apps used to push metrics. But systems these days, as we talked about, have many microservices and many third party tools and frameworks, and how do you get each and every microservice, and especially third parties whose code you don't control, to push metrics to your monitoring back end?
It's rather impractical. What we would like in the new approach is
for the monitoring system itself to
discover the services or the components automatically and essentially
pull the metrics off of them. The Prometheus project actually introduced exactly this notion. Prometheus can detect the services automatically, called targets in Prometheus terms, and then pull the metrics off these targets, scrape the metrics in Prometheus terms. It can do that thanks to OpenMetrics. And OpenMetrics
is an open standard for transmitting metrics that has become the de facto standard for exposing metrics in the industry. It's in the CNCF sandbox, and it's based on the Prometheus exposition format.
It's proposed as a standard these days to the IETF.
But most importantly, it's widely adopted. Many common tools and frameworks expose metrics out of the box using this format. So whether you use Kafka or RabbitMQ or MongoDB or MySQL or Apache, NGINX, Jenkins, GitHub, cloud services, you name it, more likely than not they expose their metrics using this format. And by the way, it's very easy to check: you typically just go to the /metrics endpoint. You can even see it in the browser because it's textual
and it's very, very popular. And this large ecosystem is key when dealing with such diverse and dynamic systems, as we talked about.
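As a rough idea of what exposing such an endpoint looks like on the application side, here is a minimal sketch using the Python prometheus_client library; the metric name, labels and port are assumptions for illustration.

```python
import random
import time

from prometheus_client import Counter, start_http_server

# Counter with labels; it is rendered in the Prometheus text exposition format
# and exposed as http_requests_total.
REQUESTS = Counter("http_requests", "Total HTTP requests", ["service", "code"])

if __name__ == "__main__":
    start_http_server(8000)  # serves http://localhost:8000/metrics
    while True:
        # Simulate some traffic so there is something to scrape.
        REQUESTS.labels(service="nginx", code=random.choice(["200", "500"])).inc()
        time.sleep(1)
```

Opening http://localhost:8000/metrics in a browser then shows plain text lines like http_requests_total{code="200",service="nginx"} 42.0, which is exactly what the scraper pulls.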
Another nice thing about Prometheus is the native integration
with Kubernetes and the CNCF ecosystem, which is not surprising
because Prometheus is part of the CNCF.
It was the second project to graduate there, after Kubernetes. And Prometheus can pull from Kubernetes, both for discovery, so it can discover the services running on Kubernetes through the Kubernetes API, and also to fetch the metrics, with plugins for the Kubernetes API and Kubernetes metrics. It can also use Consul and others. So it's a very nice integration that makes it very seamless. If you run on cloud native and Kubernetes, it would be very
easy to start with Prometheus. Also, if you run on cloud services,
Prometheus can plug into the service discovery of
different cloud environments and pull from there.
So very nice integration and ecosystem there.
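As a rough sketch of what that discovery looks like in practice, a Prometheus scrape configuration might use kubernetes_sd_configs along these lines; the job name, role and annotation convention here are illustrative assumptions, not an exact config from the talk.

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod            # discover scrape targets from the Kubernetes API
    relabel_configs:
      # Keep only pods that opted in via the common prometheus.io/scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Prometheus then keeps the target list in sync as pods come and go, which is what makes it work in such a dynamic environment.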
And the next requirement is scalability.
As we've seen, systems emit massive amounts of
high cardinality metrics, which means high volume of
time series data that we need to store and query effectively.
Prometheus, with all its virtues that we've seen, is still
by design a single node installation, which means that it
doesn't scale horizontally. One option of scaling Prometheus is
to compose Prometheus instances into a federated
architecture, shard the metrics and so
on, but that can become quite complicated to manage.
Another option is to write to a long term storage backend.
Prometheus has a built in capability to remote write to a
backend store. And again, there is a very rich ecosystem
of time series databases that can serve
as long term storage for Prometheus, whether proprietary or open source tools, some of them also under the CNCF itself, like Thanos and Cortex. So you have quite a variety to choose from, both ones that are self managed and ones that are cloud services like Logz.io, with different trade-offs.
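For the long term storage route, the built-in remote write capability is configured roughly like this; the endpoint URL and credentials are placeholders, and the exact settings depend on the backend you choose.

```yaml
remote_write:
  - url: "https://metrics-backend.example.com/api/v1/write"   # hypothetical endpoint
    basic_auth:
      username: "prometheus"
      password: "<token>"
    queue_config:
      max_samples_per_send: 1000   # tune batching for your volume
```

Prometheus keeps doing the scraping locally and streams the samples out, so the heavy storage and querying can scale independently in the backend.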
So to summarize, we've seen that monitoring microservices and cloud native systems introduces challenges: massive amounts of instances, high volume and diversity, highly distributed systems, high dimensionality and cardinality, and many third parties involved. And in order to monitor that, you'd look for, first, flexible querying with high cardinality, with the labeling model being the prime model for these sorts of ad hoc queries; then efficient metric scraping with auto discovery, so we don't need to bother about each and every piece shipping its metrics to your system; and thirdly, the scalability to handle large volumes of metrics. As an open
source fan, I'm glad to say that open source is taking the lead here.
We talked about Prometheus as a popular tool
for that, OpenMetrics as the emerging open standard for that, many time series databases that can serve as long term storage to scale Prometheus, and very good integration with Kubernetes and the CNCF ecosystem.
And by the way, also beyond CNCF, the general ecosystem,
open source is definitely a driver here and is also a good starting point
for your journey if you're looking for solutions. If you're looking for
more information, I wrote a blog post about this topic in a bit more detail and with some examples. I also have an interesting episode of the OpenObservability Talks podcast talking about this topic, and obviously there are the open source forums themselves, whether Prometheus or OpenMetrics. And since these things change so quickly, the ongoing discussions on the CNCF Slack channels and the Gitter and the mailing lists are obviously more than welcome, to make sure that you're up to speed.
And of course, feel free to reach out to me with
any question or comment. I'd be more than happy to follow
up. I'm Dotan Horovits and thank you very much
for listening.