Conf42 Incident Management 2024 - Online

- premiere 5PM GMT

How to Monitor your Monitoring

Abstract

If your monitoring system falls over in the middle of the night, does your team get paged? I hope the answer is yes, but if it isn't, this talk will provide simple, cost-effective solutions to get started. And even if your monitoring is already monitored, I can provide tips to help improve your existing setup.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey, everyone. My name is Mathias Palmersheim, and today I want to talk to you about how to monitor your monitoring and why it's important.

A little bit about me. I'm a solutions engineer at VictoriaMetrics, and VictoriaMetrics makes an amazing time series database, also called VictoriaMetrics. We also make some utilities for getting your data not only into VictoriaMetrics, but into other time series databases as well. We are also starting to make a tool called VictoriaLogs for aggregating logs. When I'm not working on that at my day job, I like to take a lot of the utilities we make, as well as some others, and combine them into an easy-to-deploy monitoring and logging system, so that you don't need a giant Kubernetes cluster or an expensive cloud service to get started with observability.

If you couldn't tell by that intro, I'm a huge observability nerd. Actually, on my first date with my wife, I wouldn't shut up about observability, and there was a betting pool in the office to see if I would talk about observability the whole time.

The reason I became obsessed with observability is one of my first IT jobs. I was an IT manager at a factory, and when I got there, I didn't really have any monitoring system, so to speak. An application would go down because I had nothing to help me prevent that from happening in the first place, nothing to alert me that, hey, a hard drive is filling up. The hard drive would just fill up and an application would fail. My alerting system at the time was users coming into my office and informing me that, hey, this change that was made recently took down the application, and now the whole office can't work, and that's costing the company tens of thousands of dollars an hour. That caused a lot of stress.
So along with not having context as to why things were broken, I had to deal with the stress of having somebody there reminding me how expensive this problem was. So that I could get nice automated notifications instead of angry users in my office, I implemented a monitoring system that collected telemetry, and that telemetry could tell me that, hey, this resource is overused, or this thing is being slow, before it failed. Although I was angry that I now had alerts, things weren't going down as often. And when people came into my office after something broke, I would usually already be working on the problem, and that helped lower stress a ton.

I had a problem, though. A lot of the notifications I was getting from my monitoring system were coming through the same channels as things like status updates or general announcements: oh, hey, you need to park in the other parking lot today. That led to a lot of alert fatigue, because every time I got an email, I got the stress of wondering: is this just a simple update telling me to park somewhere else, or is this a tens-of-thousands-of-dollars-an-hour problem?

So I looked into noisy notification systems that were dedicated to critical alerts. It was super easy to set up a DND override on my phone. The other nice thing about these systems is that they make an individual responsible. Instead of just shouting out, hey, the system's broken, somebody should fix this, they assign somebody to the incident. The way I decide whether a notification is important enough to override the notification preferences on my phone is whether the problem costs life, liberty, or property.
That means a notification qualifies if the system being down could cause physical harm, if it causes compliance issues and the government's going to get involved, or if it can cause loss of property. Property could just mean that the outage reaches a certain cost threshold and is losing the company more money than you would like.

I was starting to feel pretty happy about my observability system. I was getting noisy notifications when things were breaking, instead of email. But then I had a philosophical question: what happens if the monitoring system fails? If my monitoring system fell over at 2 a.m., does anyone get paged? This is the problem I was running into: the monitoring system would fail, and I would think that everything was fine, whether or not the applications it was responsible for were up.

To solve this problem, you deploy a second monitoring system. You set up a monitoring of monitoring system that only monitors the primary monitoring system. And from the perspective of the primary monitoring system, it's just another application. It's just like the ERP system: it sends metrics, and you get alerted when things aren't going as expected.

There's a problem with this. The monitoring of monitoring system is also just another application that can fail, and multiple applications can fail at the same time for whatever reason. You could try to solve this by adding more and more monitoring systems, but you never quite get to 100 percent availability. You get more nines, but with more nines of availability you also get more cost, and it's harder to maintain the knowledge across all those monitoring systems. So usually the sweet spot is two. But you don't just deploy two applications to the same region and in the same infrastructure-as-code inventory. You need to make sure that the applications are deployed in a way that isolates them from shared failures as much as possible.
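To put rough numbers on the "more nines" point, here is a minimal sketch. It assumes each monitoring system fails independently of the others (which is exactly what the isolation advice is trying to approximate) and uses a hypothetical 99% availability per system; neither the figure nor the function names come from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_unavailability(availabilities):
    """Probability that every monitoring system is down at the same moment.

    Assumes failures are independent; shared regions, teams, or
    notification services make the real number worse.
    """
    p_all_down = 1.0
    for availability in availabilities:
        p_all_down *= 1.0 - availability
    return p_all_down


def blind_minutes_per_year(availabilities):
    """Expected minutes per year with no working monitoring at all."""
    return combined_unavailability(availabilities) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical: every system is individually 99% available.
    for count in (1, 2, 3):
        systems = [0.99] * count
        covered = 1.0 - combined_unavailability(systems)
        print(count, f"{covered:.6f}", round(blind_minutes_per_year(systems), 2))
```

Under these assumptions, going from one system to two cuts the expected blind time by a factor of 100, and a third system cuts it by another factor of 100 while adding a full extra system to pay for and maintain, which is one way to see why two is usually the sweet spot.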
Before I get into all the ways that this can go wrong: I'm not implying that you and your teammates aren't smart. I'm saying that humans aren't perfect. We make mistakes, but usually different groups of humans don't make mistakes at the same time. So, as much as possible, you want to make a different group of humans responsible for your monitoring of monitoring. Another thing you need to do is make sure that your change management processes are aware that these two systems shouldn't be updated at the same time, because upgrades and configuration changes are a frequent source of outages. If you're allowed to touch both systems at the same time, then there's a high likelihood both fail at the same time.

The technical ways we can prevent both systems from failing start with using different notification services, because, again, notification services are apps, and apps fail sometimes. You also want to make sure that the systems are in different infrastructure providers, because different infrastructure providers usually don't have simultaneous outages. If you can't get it into a different infrastructure provider or a different cloud service, at least try to get it into a different region and a separate deployment within the same cloud service.

So in summary, your monitoring of monitoring is another monitoring system, but it's only responsible for the primary monitoring system, because it gets really confusing if you have different monitoring systems responsible for certain business applications. It can be harder to test the system too, because not only do you have to get the okay from your boss to test and intentionally fail a system, you also have to get approval for a test that could have impacts on other teams. And the monitoring of monitoring system is just treated by the primary monitoring system as another application.
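One way to make the isolation checklist above concrete is to write down each system's failure domains and look for overlap. This is only an illustrative sketch: the `Deployment` fields and the example values are hypothetical and are not taken from any real tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Deployment:
    """Failure domains of one monitoring system (hypothetical fields)."""
    provider: str
    region: str
    notification_service: str
    owning_team: str


def shared_failure_domains(primary, secondary):
    """Return the names of the failure domains both systems share."""
    return [
        f.name
        for f in fields(primary)
        if getattr(primary, f.name) == getattr(secondary, f.name)
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    primary = Deployment("cloud-a", "eu-west", "pager-x", "observability")
    secondary = Deployment("cloud-a", "us-east", "pager-y", "platform")
    print(shared_failure_domains(primary, secondary))
```

An empty list is the goal. Anything that shows up is a place where a single outage, or a single human mistake, could take out both systems at once.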
The minimum requirement is to do the most important thing in observability, which is to make sure that your applications are available. This is going to be cheaper to store and easier to configure, because in most cases you're just configuring a connection between two things, or feeding a list of URLs to some service that performs health checks. But the downside is that you don't get a lot of preventative alerts, except maybe that the application is being slow. Without that context, when the application goes down you have a much higher mean time to resolution, because you have to figure out what went wrong as well as fix it. And again, this system is responsible for keeping the most important applications in your business online. So you really should treat it like any other application and give it those preventative measures: really nice dashboards, runbooks, and all the other things that every other application in your business gets. And even if you are treating it like another application, your applications should have health checks that alert you if they fail.

So what approaches can we take? I'm going to get into some quick and dirty approaches first; if you can't get a full-blown dedicated monitoring of monitoring system, they're better than nothing. The first is a heartbeat. A heartbeat is just one system communicating with another, and if that communication doesn't happen for a certain period of time, an alert fires. It's the simplest up/down check you could get: has communication between these two systems happened? The downside is that the heartbeat is usually a dedicated health check, so it's not as accurate a representation as some of the other health checks we'll go through later in the talk. A good example would be the aNag app used with Nagios.
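The heartbeat idea can be sketched in a few lines. This is a generic illustration, not how aNag, Nagios, or any other product implements it; the class name and the 60-second timeout are assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat and reports when one is overdue."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_beat = clock()

    def beat(self):
        """Record that the monitored system checked in."""
        self._last_beat = self._clock()

    def is_missing(self):
        """True when no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_beat > self.timeout_seconds


if __name__ == "__main__":
    # Simulated clock so the demo runs instantly.
    now = [0.0]
    monitor = HeartbeatMonitor(timeout_seconds=60, clock=lambda: now[0])
    now[0] = 30.0
    print(monitor.is_missing())  # still within the timeout
    now[0] = 61.0
    print(monitor.is_missing())  # heartbeat overdue, an alert should fire
```

In a real setup the side that evaluates `is_missing()` has to live outside the system being watched; otherwise the thing that fails takes the check down with it.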
With aNag, if the application hasn't contacted the server in a certain amount of time, an alert fires. There's obviously a false positive risk with that, because your phone could be having issues connecting to LTE or something like that. The other problem is that it's just shouting out to everyone, hey, this is broken, rather than doing what they teach in CPR training, which is to point to somebody and say, you, call an ambulance, or you, get a defibrillator.

Another example of this is Grafana OnCall. If you self-host Grafana OnCall, you can sign up for a cloud account, and if the self-hosted version doesn't talk to the cloud account for a certain amount of time, you get a notification that there's a missing heartbeat. But this only covers the notification system and not everything as a whole. So this kind of works, but you should probably look into something better.

If you're using a cloud vendor for your monitoring, they usually have a status page, and hopefully they're not self-hosting it. If a vendor self-hosts its own status page, there's a higher likelihood that both the application and the status page fail at the same time, because there could be overlap in the infrastructure and overlap in the humans that causes both systems to fail. Status pages are really easy to set up, though. If you search online for your cloud vendor's status page, they'll give you point-and-click instructions for getting updates into email or Slack. The downside is that those aren't noisy notification systems. People commonly mute them, and it's really tricky to get the settings just right so that people asking for status updates don't bother you after hours but the cloud vendor going down does. And this usually doesn't work for self-hosted solutions.

A somewhat better version of this is health check services.
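As a rough sketch of what a health check service does for each URL: it fetches the page and checks not only the status code but also that the body contains an expected string, since a 200 response alone does not prove the application is healthy. The function names, the URL in the usage comment, and the expected text are all hypothetical; only Python's standard `urllib` is used.

```python
import urllib.request


def is_healthy(status_code, body, expected_text):
    """Healthy only if the status is 200 AND the body contains the expected text."""
    return status_code == 200 and expected_text in body


def check_url(url, expected_text, timeout=5.0):
    """Fetch a URL and apply is_healthy; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy(response.status, body, expected_text)
    except OSError:
        # URLError, HTTPError (non-2xx responses), and timeouts are all OSError subclasses.
        return False


# Example call (hypothetical endpoint and text):
# check_url("https://monitoring.example.internal/health", "OK")
```

A real service would also record response time and route failures to a noisy notification channel; this sketch covers only the up/down decision.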
If you're self-hosting, you could self-host something like Uptime Kuma, but getting another team in your organization to host it is going to be tricky, because you have to convince people that, hey, I know I'm on the observability team, but this team that isn't the observability team should manage the service that does observability things. There are also cloud options for this, but if your monitoring is self-hosted this can be really tricky too: either you have to allow external access to your monitoring application, which can be a security risk, and getting the security team on board could be a problem, or you have to manage an agent inside your infrastructure that beacons information out to that service.

But these services do hook into noisy notifications. They do require a bit of extra configuration, because it's not just subscribing to a feed or setting up a simple heartbeat. You have to give a list of URLs, and if possible you should specify the expected responses, because sometimes you get a 200 HTTP code and a valid SSL certificate, but the web page or response body says, hey, even though I'm serving HTTP traffic, I'm not healthy. So if you can configure the check to look for a string in the response, definitely set that up. And again, all of these give minimal context: it's just a simple is-the-thing-working-or-not, and maybe a response time.

The first approach that I would say is an adequate monitoring of monitoring system is to deploy two independent monitoring systems. This has the widest range of quality. You could do it the lazy way and deploy a smaller version of the exact same tooling, at the exact same version, in the same infrastructure-as-code inventory, with no change management controls. That obviously isn't the best solution, because something inside that region or inside that Kubernetes cluster could fail.
You could also find out that there was a regression in an update that takes out both systems at the same time. And if you use the same notification system for both and that fails, it's really hard to tell. Doing all those things right is tricky from a bureaucratic perspective too, because you have to convince your boss that, hey, we need to deploy an application in a new region. You also have to justify setting up change management processes, which can be difficult, and it can be hard to get people to follow them.

If you take this approach, you should definitely test it. One way to test it is to break your monitoring of monitoring system and then make sure that your primary monitoring system sends notifications. When you do this, make sure that somebody is periodically refreshing dashboards or running some tests to confirm that the primary monitoring system is still working as well. That way you can figure out whether there are any interdependencies between the two systems.

Another thing you can do if you're self-hosting is purchase a cloud service. And if your primary monitoring is already a purchased cloud service, you can either purchase a different cloud service that monitors the first one, or set up an on-prem system that monitors the cloud service. The upside is that now you definitely have different humans managing the systems, and those humans are probably using different upstream vendors. A lot of the time you can control where your cloud service is deployed, so the two systems can be geographically isolated. I know cloud monitoring bills have a bad rap for being super expensive, but because it's one application, it's not that big of a deal. And that application is usually managed by the observability experts in your organization, so they're aware of things that can lower costs, like relabeling rules or streaming aggregation. You can still misconfigure this one.
But the risk is lower, because it's a shared responsibility model rather than a "you're responsible" model.

The best version of this is to purchase a dedicated monitoring of monitoring solution. VictoriaMetrics offers this as part of some of our enterprise plans. The downside is that it's the most expensive, because along with paying for a separate monitoring system, that is, a separate instance of VictoriaMetrics to monitor your on-prem VictoriaMetrics, you're paying for really smart humans to enrich the already rich notifications. But that's going to be the best experience by far, because along with a really well supported observability system, you get the smart people behind it who can help you work through issues as well.

So in summary, this doesn't have to be super expensive. It doesn't have to be super difficult. If you can't get permission to deploy a separate monitoring of monitoring system, there are options that are better than nothing. You can mix and match the approaches to fit your availability requirements or your use case. And most importantly, we get to answer the age-old philosophical question: if your monitoring system falls over in the middle of the night, does somebody get paged at 2 a.m.? After this talk, the answer should be yes.
...

Mathias Palmersheim

Solution Engineer @ VictoriaMetrics


