Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, my name is Robert Hodges and
I'd like to welcome you to my talk on fast, cheap, do-it-yourself
monitoring with open source analytics and visualization.
I'm presenting today at Conf42 DevSecOps 2023.
I'd like to thank the organizers for inviting me to talk and for
doing all the work to make this conference possible. Thanks a
bunch. It's a pleasure to be here. Let's do a few intros.
So, my name again, Robert Hodges. I've been working on databases
for over 30 years. Actually, this year it's 40.
And I've been heavily involved with open source, Kubernetes, security,
and other issues related to operational topics
around managing data, particularly in cloud and cloud
native environments. My day job: I run a company called
Altinity. We are a service provider
for ClickHouse, a very popular data warehouse. We'll be talking about
it for a good chunk of this talk.
We run a cloud, so we have hundreds of clusters that we run on
behalf of people. We also help a very large number of people run it themselves.
Among other things, we're the authors of the Kubernetes operator for ClickHouse.
So if you run ClickHouse using a cloud native approach,
running it on Kubernetes, there's a good chance you're using our software already.
And just a little bit about my colleagues who've helped put together
the information behind this talk. We have about 45 people in
the company, spread out over 16 countries. We are,
by and large, database geeks, with centuries of combined experience with databases
and applications, particularly analytic databases.
So let's jump right in. Monitoring. Why do we do it?
Well, it could be because we like looking at nice screens,
but really it's to answer questions. So when something
happens in your system, for example, users start to see performance problems.
You want to know why. And as you dig deeper, like when
do the performance problems start? How many users are affected?
Which of the services is at fault? These are questions that
require data, and moreover, they require a history of the systems
in order to answer. Now, in the
old days, we used to take a slightly different approach,
which leads to a question, what's the best way to answer those
questions I showed on the previous slide? Here's the old way.
Go into the system, lay hands on it, run vmstat, and kind of
watch the numbers until they become blurry. So would you like to
do it this way, or would you like to do
it visually? So chances are if you're in this
business, you already have monitoring like this. This is actually
Grafana, which we'll be talking about. But this type of visual
display is much easier to understand, interpret and
use. So, visual displays: well, there are a lot of systems
that will do this that come right off the shelf.
Now, there have been proprietary solutions developed in this space for years.
And in fact, if anything, over the last few years we've seen a blossoming
of systems to do observability in general and
system monitoring in particular. But perhaps they're
not for you. One simple reason is they can be very costly.
But another one is that you may have specialized needs
for monitoring that you need to cover. Perhaps your business is monitoring,
so you don't want to use somebody else's system; you're developing your own.
You may want to own the stack, you may want to control the data.
There's a bunch of reasons why you might
want to do it yourself. So let's look
at how to do that. The basic system that we're going to build
for monitoring consists of three parts. We have the source
data, so we need something that can collect that data and
ingest it. We need a place for it to live. And that's
what we call an analytic database. This is a database that's designed to
hold large amounts of data and answer questions on it
very quickly. And then finally we need a mechanism to
display it so that you end up with some nice graphical visualization
like what I showed you a couple of slides ago.
So let's look into how we would go about building that type
of system. So the first thing we want to do is pick
an open source analytic database. Open source
databases tend to be problem specific,
and as we're looking at them, there are several that you might consider.
So you might consider OpenSearch, which is the open source version of Elasticsearch.
That's great for full text search on unstructured
data, and it can be used for log analytics.
You could also use Presto, a very powerful query
engine that can do federated queries across many data sources
and data lakes. But for this
type of system, particularly observability, one of the best
choices on the market is ClickHouse,
which allows you to do real time analytics. By that I mean
being able to run queries and get answers back almost instantly.
And it can do this performantly
and easily on very, very large quantities of data.
It's in fact used for an enormous number of use cases,
ranging from web analytics to network flow logs,
observability of course, financial markets,
SIEM, and so on and so forth. It's super popular for this and
a great choice. So here's
a short list of reasons why ClickHouse has
turned out to be such a good choice for so many people.
So it is a SQL database. In fact, in many ways it has
the simplicity and the accessibility of a system like MySQL.
So it understands SQL, it runs practically anywhere.
You can literally run it on a phone. There was actually a demo of that
a few years back, all the way up to running
it in huge clusters containing hundreds of servers in the
cloud. It's also open source, in this particular case
Apache 2.0, which is super flexible and gives you the
ability to run it for any purpose. In addition,
it has very powerful analytic capabilities. It shares
a number of features that are standard for analytic databases.
They include storing data in columns; we'll show an example on
the following page of why
that's such an important feature. It can also parallelize execution very
well, so that the data is organized so it can be read quickly, and then
it can read from many locations in parallel, and it scales
to many petabytes. So these are all
good reasons why ClickHouse has become a core engine
for real time analytics across the use cases that I mentioned.
Let's look at some of the details that are relevant for observability
and monitoring. So as I mentioned,
ClickHouse is optimized for very fast response on large data
sets. If you go in and look at its on-disk storage,
you'll quickly see that each
column is stored separately, basically as an array.
When you look at the on-disk representation,
each column has a couple of files that
implement it. Within that column you have very
highly compressed data. Putting things into an array
like this makes it easier to compress; moreover, the data
is sorted, and then compression is applied to it.
Particularly in observability cases, we can often get compression levels
of 90 or even 95% on data. The second thing
is we have replication between nodes, so we can maintain
multiple nodes and then query across them.
And the third thing is we have parallelized query,
which can run across all nodes. In fact,
you can also divide data up into shards and
run parallel query across them. This allows you to apply
the power of multiple machines if you need a fast answer. And then
within single machines, ClickHouse is extremely efficient.
It uses what's called vectorized query, where we treat these columns
basically as array values, because that's how they're stored,
and can take advantage of things like SIMD instructions,
that is, single instruction, multiple data. We also get great performance
because this kind of data aligns well with the cache structure in
modern CPUs. So for all these reasons and
more, ClickHouse tends to be extremely fast.
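To get an intuition for why sorted, columnar data compresses so well, here's a small sketch of my own (using Python's zlib, not ClickHouse's actual codecs): a simulated metrics column whose values hover in a narrow band, compressed as-is versus sorted.

```python
import random
import zlib

# Simulate a monitoring column: 100,000 CPU-idle readings that hover
# in a narrow band, as real metrics usually do.
random.seed(42)
values = [90 + random.randint(0, 5) for _ in range(100_000)]

# In a columnar layout these values sit contiguously on disk, so the
# compressor sees the repetition directly. Each value fits in one byte.
column = bytes(values)
sorted_column = bytes(sorted(values))  # sorting groups equal values into runs

ratio_unsorted = len(zlib.compress(column)) / len(column)
ratio_sorted = len(zlib.compress(sorted_column)) / len(sorted_column)

print(f"unsorted: compressed to {ratio_unsorted:.1%} of original")
print(f"sorted:   compressed to {ratio_sorted:.1%} of original")
```

The sorted column compresses dramatically better, which is exactly why the sort order you declare on a ClickHouse table matters so much for monitoring data.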
Another thing that ClickHouse does that makes it very nice for monitoring data is
it has a huge number of what we call input formats. These are
things like CSV, which is one of the most widely used
formats in all of IT, but also CSV with names, where
you have the name of each column in the first row.
We can read JSON, and we can read what's called JSONEachRow, where
each record is a separate JSON document. Protobuf,
Parquet, tab separated, you name it; there are dozens of
these. And what this means is that there's a
pretty good chance that when data is emitted from your monitoring system,
ClickHouse just knows what it is and can read it and stick it in a
table. Finally, once you get it in the table,
ClickHouse has extremely good support for time ordered data.
And that's important because monitoring data
is fundamentally time series. It is a series of measurements
on things, for example hosts, that have particular properties
and then particular measurements associated
with a point in time on that host. So there
are three date types: regular Date,
which is pretty useless for high granularity data;
DateTime, which is your typical Unix timestamp;
and DateTime64, which gives you precision down to the nanosecond.
BI tools tend to like DateTime. And then there's a whole
raft of functions that allow you to process the data: for
example, to normalize a date to the nearest hour
or the start of the year, and so on, as well as a
bunch of conversion functions to turn it into a month
and so forth. So these are all great reasons
for using ClickHouse, and they make it particularly effective for this kind of
application. Speaking of Grafana,
ClickHouse pairs really well with Grafana when you're building
observability applications. In fact, there's a
pretty good chance that many of you who are listening to this talk already
use Grafana for this purpose. Why is Grafana good?
Well, first of all, it's built around display of time series data.
It's very simple to install. It has piles of data sources.
So we will be using a data source that can read ClickHouse data, but it
can also read Prometheus, it can read MySQL,
you name it. If there's a database,
Grafana can connect to it and use it. Moreover, for displaying
the data, it has a pile of plugins. This example on the right just shows
a few of them: time series,
heat maps, tabular displays.
And they're very easy to set up and apply to the data.
One of the things that makes it particularly strong for
monitoring is that it has very good zoom in and zoom out.
So the ability to look at different timescales, to look at different series
at a particular timescale: these are all things that you'd like to
have when you're trying to drill in,
understand the data, sift through what you're seeing, and then zero
in on a problem.
Taken together, this makes it great for monitoring dashboards. And then the
final thing which makes it a good match for ClickHouse is that it's open source,
in this case AGPL 3.0. So how
do we then go about building an
actual monitoring application?
What we're going to do is start with the vmstat
commands that I showed you a few slides ago, and we're actually going to
turn that into data in a table in ClickHouse
and then display it in Grafana. So let's dig in and show how
to do that. It's really not that hard. So the first thing is
we need to generate the vmstat data.
So here is a simple Python
script, about 14 lines, that
is going to run vmstat at one-second
intervals and then basically split the
results up and stick them in a JSON document.
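The script itself is on the slide rather than in this transcript; below is a minimal sketch of my own of what such a collector might look like. The column names and the default host name are illustrative assumptions, since real vmstat output varies by platform; check your own vmstat header row.

```python
import json
import subprocess
from datetime import datetime, timezone

# Column names for a typical Linux procps vmstat; adjust to match the
# header row your vmstat actually prints (this list is an assumption).
FIELDS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
          "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]

def parse_vmstat(line, host="myhost"):
    """Turn one vmstat data line into a dict of key/value pairs."""
    values = [int(v) for v in line.split()]
    doc = {"timestamp": datetime.now(timezone.utc).isoformat(), "host": host}
    doc.update(dict(zip(FIELDS, values)))
    return doc

def collect(interval=1):
    """Run vmstat forever, emitting one JSON document per sample."""
    proc = subprocess.Popen(["vmstat", str(interval)],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if line.lstrip()[:1].isdigit():   # skip the two header lines
            print(json.dumps(parse_vmstat(line)), flush=True)
```

Calling collect() streams one JSON document per second to stdout, ready to be piped to a consumer.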
If you look around, there are plenty of tools that will do this automatically.
I just wrote it myself because it's really simple to do.
Data collectors are just not that hard to write.
So you can read the code, and if you look at
it carefully, you can prove to yourself that it's essentially emitting JSON:
it's basically constructing a dictionary and then dumping it out
as JSON key value properties. To understand the data, it's a
little bit easier just to go look at it. So here's the output that you
get. In the key value pairs, you get a timestamp. That's really
important because that's your time ordering. And then you get a bunch of properties,
including the host and things like
that, and then actual measurements, like for example,
the idle time here, which is 98%. So this is
the data that we're going to be loading into ClickHouse.
So the next thing we need to do is design
a ClickHouse table to hold the data. ClickHouse,
unlike a database like Prometheus, for example, does require
data to be in a tabular format. But ClickHouse is
very, very tolerant of what it considers to be a table.
In this particular case, we're taking a pretty conventional approach.
So we're going to take things like the timestamp,
the day, the host, and we're going to consider those to be
dimensions. So these are the properties of the measurement,
and then what we have is the measurements themselves. So these
are just all the data that we get out of the vm
stat command. So the amount of free memory, amount of
buffer cache, the different amounts of percentage
of time, sort of ways that the CPU is using its
time, so on and so forth. One thing to notice, if you've used SQL
databases before, is down at the bottom. We have this
engine equals merge tree. So for MySQL
users, this will be familiar. This is a particular way of organizing
a table merge tree. In Clickhouse is the workhorse table for
large data or for big data, and it has
partitioning built into it. So you have to give clickhouse
a bit of a clue how you want the data broken up. In this case,
we're doing it by day. This would be appropriate if we're building a system,
for example, it holds a year of data, and then we also give ordering.
This is something in analytical databases that's critical.
You need to give a sort order to the data, and you need to do this
correctly. Here, we're sorting by the host followed
by the timestamp. This will order the data in such a way
that, among other things, the values between successive rows
will be very similar and will compress incredibly well.
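The slide's exact DDL isn't captured in this transcript, so here is a hedged sketch of what such a table might look like. The table name, the subset of columns, and the localhost URL are my illustrative assumptions; the real slide lists every vmstat measurement.

```python
import urllib.request

# Illustrative schema: dimensions first (timestamp, day, host), then a
# few of the vmstat measurements. MergeTree is the workhorse engine.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS vmstat (
    ts   DateTime,                -- measurement timestamp
    day  Date DEFAULT toDate(ts), -- partition key: break data up by day
    host String,                  -- dimension: which machine
    free UInt64,                  -- free memory
    buff UInt64,                  -- buffer cache
    us   UInt8,                   -- user CPU %
    sy   UInt8,                   -- system CPU %
    id   UInt8,                   -- idle CPU %
    wa   UInt8                    -- I/O wait %
)
ENGINE = MergeTree
PARTITION BY day
ORDER BY (host, ts)               -- sort order: successive rows compress well
"""

def run_statement(sql, url="http://localhost:8123/"):
    """POST a SQL statement to ClickHouse's HTTP interface."""
    req = urllib.request.Request(url, data=sql.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Running run_statement(CREATE_TABLE) against a local ClickHouse would create the table.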
So next step: we've got the table, we've got the
data. Let's get the data loaded into that table in ClickHouse.
The actual SQL INSERT command to do this,
if we've got the data, is really simple. This uses
the JSONEachRow format: every measurement results in a JSON document.
The top command is how we do that in SQL.
The actual command to get this done is a little bit different.
For example, we can go ahead and use curl to POST this data.
So this is an
actual command that loads a file containing this data,
and this is it. So literally two lines.
Well, you've got to construct the INSERT command with URL
encoding, but that's it. So it's very simple to get it loaded. You can of course
write a Python script; that's in fact what I did, because it's a little bit
easier to control than running it inside a shell.
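For reference, here's a hedged sketch of that loader in Python rather than curl, using only the standard library. The table name vmstat and the localhost:8123 endpoint are illustrative assumptions, not the talk's exact code.

```python
import json
import sys
import urllib.parse
import urllib.request

# URL-encode the INSERT statement into the query string, just as the
# curl version does. Table name and endpoint are illustrative.
QUERY = "INSERT INTO vmstat FORMAT JSONEachRow"
URL = "http://localhost:8123/?query=" + urllib.parse.quote(QUERY)

def load(lines, url=URL):
    """POST a batch of JSONEachRow documents to ClickHouse over HTTP."""
    req = urllib.request.Request(url, data="".join(lines).encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def consume(stream=sys.stdin, batch_size=100):
    """Read JSON documents from the collector and load them in batches."""
    batch = []
    for line in stream:
        json.loads(line)          # fail fast on malformed input
        batch.append(line)
        if len(batch) >= batch_size:
            load(batch)
            batch = []
    if batch:
        load(batch)               # flush the final partial batch
```

You'd pipe the collector's stdout into a small driver that calls consume(), which batches rows rather than inserting them one at a time.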
And then the final thing is you're going to want to construct
a Grafana dashboard. In this particular
case, and we'll see this in action
in just a few minutes, I've gone
ahead and constructed a simple display that shows me
my CPU (that's the top display), then a
more detailed CPU usage graph that
actually breaks down the components
of the CPU usage, and then memory usage. The bottom two are done
by host, so there's a little selector on top.
When you're using ClickHouse, there are a couple of plugins that
you can use for Grafana. I prefer to use the
one that we maintain, which is the Altinity plugin for ClickHouse.
It's been around for years; it's had about 12 million downloads at
this point. It's incredibly popular, used across thousands and thousands of
dashboards. So that's the one that's used to construct this
display. And then finally,
once you have this all set up, not only do you have the
display, so you can go and look at this
information directly and
play around with it as we'll do in a couple of minutes,
but you also have the full power of SQL and you can ask any question
you want. You can do this interactively
off the command line and turn it into further displays. For example, this is
a query that shows all
the hosts that had greater than 25% load
for at least a minute in the last 24 hours, and it also sums the
number of minutes like that. So this is a way of seeing which hosts
are running hot. So that's the system.
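The query itself isn't reproduced in the transcript; a hedged sketch of how such a "hot hosts" query might be written follows. I'm assuming a table named vmstat with a ts timestamp, a host dimension, and an id (CPU idle) column, and approximating "load above 25%" as average idle below 75% within a minute.

```python
# Illustrative "hot hosts" query: for each host, count the minutes in the
# last 24 hours where average CPU idle dropped below 75% (load above 25%).
# Table and column names (vmstat, ts, host, id) are assumptions.
HOT_HOSTS_QUERY = """
SELECT host, count() AS busy_minutes
FROM (
    SELECT host, toStartOfMinute(ts) AS minute, avg(id) AS avg_idle
    FROM vmstat
    WHERE ts >= now() - INTERVAL 24 HOUR
    GROUP BY host, minute
)
WHERE avg_idle < 75
GROUP BY host
ORDER BY busy_minutes DESC
"""
```

Note the use of toStartOfMinute, one of the time-normalization functions mentioned earlier; you could run this through clickhouse-client or the HTTP interface.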
It's really not very complicated. This is in total
about 100 lines of code, which was sufficient
to get all of this. So let's bounce out of
the slides and go have a look at the system at
ground level. So here we go.
This is the system that we saw a few minutes ago.
And you can see the way that this is set up.
It's monitoring a couple of hosts that I run in my home
office. They're called logos two and (I had
a temporary glitch in the audio there) logos three. So there are two hosts,
logos two and logos three, and I can select particular hosts.
You can see how this selector allows me to change them.
I can look at all of them, in which case I don't get
host-specific CPU and memory usage.
Let's go pick a specific one. We'll go pick logos two.
And I can also change the timescale. This is super easy
to do. So here we go. We can switch this to the last 30 minutes.
We can go see what's going on here. Let's go to last hour and
see if there's anything interesting. You can see there's been some activity on the
system, on this logos two.
And in fact you can distinguish the different
traces. So right here, without really doing anything special, you have a
lot of insight into the load levels on these systems and you
can basically drill in to get
much closer views. This is something I love about Grafana that I can actually
come in and I can just select a very small section
and then the display automatically zeroes in
on the part that I want to look at. Let's go ahead and get this
back to doing the last five minutes and
let's put some load on the system. Let's test this thing out.
So for that we have a couple of handy
commands. A great command to bash on the
CPU is sysbench. This
is something you can install with sudo apt install sysbench.
What I'm going to run is a CPU test. This is just going to beat
up on the CPUs, and we're going to let this run for a minute or
two, and we will basically
be able to see
this beating up on the system. Actually,
I made a slight mistake there. Let me run it
on the logos two host because that is
actually a more capable system. So we're going to go ahead and run the
CPU test. While that's running and the data is collecting, let me just
show you that all the code we're using here is actually available
in a project called clickhouse-sql-examples. Let's go to
the open source monitoring directory. So for example, the dashboard
that I'm showing you is there, as well as the little Python routines
we have here; these load
the data into ClickHouse. And then the
Python script which I showed you that actually generated
the data is all there. So if you want to go ahead and do this,
and as I say, it's about 100 lines of code total to
get this whole thing to run. And actually running it is very simple.
The collectors are as simple as
the following: just run the
collector, pipe it into the consumer, and up it goes to ClickHouse.
Great. So that test has been running in the background on logos two.
So let's see what we have. And actually,
we can see
the test going on and we can
see the effect on the CPU. We ran
it first on logos three, right there,
and then ran it here on logos
two. So we can see the CPU.
Now, what's a little bit more interesting is to bash on this a bit and
actually do some work to show the effect on the memory.
That's a little bit more fun. Let's run another program.
Let's kill the CPU test, and we're going to run a program called
stress. So here it is.
This is a program that can
beat up on your system, but it can also use a lot of memory.
This is basically spawning four threads, and they're each going to eat about four
gigabytes. And you have to love any performance
test program that calls its workers hogs. So off
they go. And we'll actually see these coming up in
the display. Let's go ahead and change the time window. Okay,
so we can see these actually starting to use resources.
There's the memory coming up. This is not actually putting enough
load on the system. Let's beat it up a little bit more. Let's go ahead
and add eight threads.
So go ahead and put that in and give
it a minute or two. And what we'll see
now is this will put very heavy load on the system.
So we'll start to see in this memory usage, we will start to see
this climbing very rapidly. Colors here are probably not the best,
but here we can see that it's actually putting heavy cpu
load on this system. We can also see that up here. It's basically pegged
at 100%. So this machine is just getting hammered.
In fact, what's happening right here is kind of interesting.
Let me get that back
to five minutes; it's zooming in too quickly. We've actually got gaps in collection.
And what that indicates is the machine is so loaded that the collector is
not even generating data. So that's the demo.
This is something that I put together. I'm having all kinds of fun with
this at the cost of about 100 lines of code. Of course,
if you want to productize it, you're going to end up also storing
the system configuration and managing Grafana and ClickHouse.
But the point is, you can build this system yourself
and basically monitor anything you want and collect practically
any kind of data you want. Okay,
so let's go back to the slides and dig
in further to some final notes.
So if you're going to build a monitoring system, you can use
Python, as I did, but you don't have to.
One of the things that's great about ClickHouse is that it's a very popular project,
certainly among the most popular analytic databases
across all of GitHub. It has a huge number of libraries
and software packages that work with it, everything from Kafka to
Airflow. And then for display, as we saw: Grafana,
Superset, Cube.js, plus a bunch of different client libraries.
We do a lot of work with Golang, but if you
like Java, if you like Python, the drivers are all there.
And then if you want to run it on Kubernetes, there's the Altinity operator
for ClickHouse, which I mentioned at the start of this talk. This allows you
to run ClickHouse very efficiently on
Kubernetes, which is turning out to be a really great
place to run data. But of course you can run it on anything you
want; ClickHouse runs great anywhere. You can run it on VMs, of course,
and use Ansible to manage it, and so on and so
forth. So where can you find out more?
Well, there are the official docs for both the ClickHouse project as well
as Grafana; these are shown here.
In the course of our work with ClickHouse,
as well as other products like Grafana, we write blog articles,
and we do a huge number of talks
that we post on YouTube concerning topics related
to running ClickHouse, as well as integration with other tools.
We have a knowledge base that you can use to learn more about how to
solve specific problems, particularly if you're operating at scale. And then there's just a
pile of other open source associated with ClickHouse. There's a
very large community around this.
We get thousands of contributions per year,
ranging from simple comments on issues
all the way up to things like PRs. Last year, about 392
unique people on GitHub
submitted PRs that were actually merged into ClickHouse.
So that is my talk. Thank you very much
and go out there and have fun. I'd like to thank the Conf42
folks once again for setting this conference up.
It's great to present here, and if you want to contact me, I'll be
hanging out on Discord as part of the conference.
But you can also get to me at Altinity.
You can just go to the website, do contact us. We have a slack channel
that you can join and you can just join that channel
and dm me, or you can find me on LinkedIn. And once again,
Altinity: we do Altinity.Cloud, which is a cloud for
ClickHouse, we do stable builds of ClickHouse, and the Altinity Kubernetes
operator. Those are just a few of the many things that we do to help
people operate Clickhouse at scale and build applications
like the one I just showed you. So thanks again. Have a great
day.