Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, everyone.
My name is Matthias Palmersheim, and today I want to talk to you about how to monitor
your monitoring and why it's important.
A little bit about me.
I'm a solutions engineer at VictoriaMetrics, and VictoriaMetrics
makes an amazing time series database called VictoriaMetrics.
We also make some utilities for getting your data not only into VictoriaMetrics,
but other time series databases as well.
We are also starting to make a tool called VictoriaLogs for aggregating logs.
And when I'm not working on that at my day job, I like to take a lot of the
utilities we make, as well as some others, and aggregate them together into an
easy-to-deploy monitoring and logging system, so that you don't have to have a
giant Kubernetes cluster or an expensive cloud service to get started with
observability.
So if you couldn't tell by that intro, I'm a huge observability nerd.
And actually, on my first date with my wife, I wouldn't
shut up about observability.
And there was a betting pool in the office to see if I would talk
about observability the whole time.
And the reason why I became obsessed with observability is one of my first IT jobs.
I was an IT manager at a factory.
And when I got there, I didn't really have any monitoring system, so to speak.
And what would happen is an application would go down, because I didn't have
a monitoring system to help me prevent that from happening in the first place
and get alerts for things like, hey, a hard drive's filling up.
The hard drive would just fill up and an application would fail.
My alerting system at the time was users coming into
my office and informing me that, hey, this change that was made recently took
down the application, and now the whole office can't work, and that's costing the
company tens of thousands of dollars an hour. And that caused a lot of stress.
So along with not having context as to why things were broken, I had to deal with the
stress of having somebody there reminding me how expensive this problem was.
So that I could get nice automated notifications instead of
angry users in my office,
I implemented a monitoring system that collected telemetry, and that
telemetry could tell me that, hey, this resource is overused, or this
thing is being slow, before it failed.
And although I was angry that now I had alerts, things
weren't going down as often.
And when people would come into my office after things broke, I would usually
already be working on the problem, and that helped lower stress a ton.
But I had a problem, though.
A lot of the notifications I was getting from my monitoring system
were coming through the same channels as things like status updates or
just general announcements, like, oh, hey, you need to
park in the other parking lot today.
And so that led to a lot of alert fatigue because every time I got
an email, I got the stress of, is this just a simple update that
I need to park somewhere else,
or is this a tens-of-thousands-of-dollars-an-hour problem?
So I looked into noisy notification systems that were
dedicated to critical alerts.
It was super easy to set up a
Do Not Disturb (DND) override on my phone.
And the other nice thing about this is that it would make
an individual responsible.
So instead of just shouting out, hey, the system's broken, somebody should fix this,
it would assign somebody to the incident.
And the way I decide if something is an important
notification that should be able to override the notification
preferences on my phone is whether it costs life, liberty, or property:
if it could cause physical harm when the system is down, if it causes compliance
issues and the government's going to get involved, or if it can cause loss
of property. And property could just mean that it reaches a certain cost threshold
and it's losing the company more money than you would like for an outage.
So I was starting to feel pretty happy about my existing observability system.
I was able to get noisy notifications when things were breaking instead of email.
But then I had a philosophical question:
what happens if the monitoring system fails?
If my monitoring system fell over at 2 a.m.,
does anyone get paged?
So this is the problem that I was running into where the
monitoring system would fail.
And I would think that everything's fine, whether or not the applications
that the monitoring system was responsible for were up.
And to solve this problem, you just deploy a second monitoring system.
So you set up a monitoring of monitoring system that only monitors
the primary monitoring system.
And then from the perspective of the primary monitoring system,
it's just another application.
So it's just like the ERP system.
It sends metrics and it alerts you when things aren't going as expected.
There's a problem with this.
The monitoring of monitoring system is also just another application that can
fail, and multiple applications can fail at the same time for whatever reason.
So this problem could
be solved by adding more and more monitoring systems, but you never quite
get to 100 percent availability. You get more nines, but with more nines of
availability you also get more costs, and it's harder to maintain the knowledge
if you have all those monitoring systems. So usually the sweet spot is two. But
you don't just deploy two applications to the same region and in the same
infrastructure as code inventory. You need to make sure that the applications are
deployed in a way that they're isolated from failures as much as possible.
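To put rough numbers on the "more nines" point, here is a minimal Python sketch. It assumes each monitoring system fails independently of the others, which is exactly what the isolation advice is trying to get close to, and the 99 percent figure is a made-up input, not a measurement. The function names are hypothetical.

```python
import math


def combined_availability(availabilities):
    """Chance that at least one system is up, assuming independent failures."""
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down


def nines(availability):
    """Rough count of nines, e.g. 0.999 is about 3. Undefined at exactly 1.0."""
    return -math.log10(1.0 - availability)


# Hypothetical input: every system is 99% available on its own.
for count in range(1, 4):
    total = combined_availability([0.99] * count)
    print(f"{count} system(s): {total:.6f} available, about {nines(total):.1f} nines")
```

Each extra system adds nines but never reaches 100 percent, and if the systems share a region or a change window the independence assumption breaks and the real number is worse.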
And before I get into all the ways that this can go wrong, I'm not implying that
you and your teammates aren't smart.
I'm saying that humans aren't perfect.
We make mistakes, but usually different groups of humans don't
make mistakes at the same time.
So you want to make a different group of humans responsible for your monitoring
of monitoring, as much as possible.
And another thing you need to do is to make sure that your change management
processes are aware that these two systems shouldn't be updated at the same
time, because upgrades or configuration changes are a frequent source of outages.
So if you're allowed to touch both systems
at the same time, then there's a high likelihood both fail at the same time.
So the different technical ways that
we can prevent both systems from failing at the same time
are, first, to use different notification services, because again, notification
services are apps, and apps fail sometimes.
You also want to make sure that they're in different infrastructure providers,
because usually different infrastructure providers don't have simultaneous outages.
If you can't get it inside of a different infrastructure provider or a different
cloud service, at least try to get it inside of different regions and separate
deployments within the same cloud service.
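The isolation points so far (different humans, different notification services, different providers or at least regions) can be sketched as a simple comparison. This is only an illustration: the `Deployment` fields and all the values are hypothetical, and a real inventory would have more dimensions than this.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Deployment:
    # Hypothetical descriptors for one monitoring system.
    provider: str
    region: str
    notifier: str
    owning_team: str


def shared_failure_domains(primary, secondary):
    """Return the names of the fields where both systems have the same value."""
    return [
        field.name
        for field in fields(primary)
        if getattr(primary, field.name) == getattr(secondary, field.name)
    ]


primary = Deployment("cloud-a", "us-east", "pager-x", "observability")
secondary = Deployment("cloud-a", "eu-west", "pager-y", "platform")
print(shared_failure_domains(primary, secondary))
```

Anything the function returns is a place where one outage or one mistake could take out both systems at once.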
So in summary, your monitoring of monitoring is another monitoring
system, but it's only responsible for the primary monitoring system, because
it gets really confusing if you have different monitoring systems available
for certain business applications.
It can be harder to test the system too, because not only do you have to
get the okay from your boss to test and intentionally fail a system, you have to
get approval for a test that could have impacts on other teams as well.
And the monitoring of monitoring system is just treated by the monitoring
system as another application.
So the minimum requirement is to do the most important thing in
observability, which is make sure that your applications are available.
This is going to be cheaper to store and easier to configure, because in
most cases you're just configuring a connection between two things or
feeding a list of URLs to some service and it's performing health checks.
But the downside is that you don't get a
lot of preventative alerts, unless it's something like the application being slow.
And what happens if you don't get those contextual things is that when
the application goes down, you have a much higher mean time to resolution,
because you have to figure out what went wrong as well as fix what went wrong.
And again, this is responsible for keeping the most important
applications in your business online.
So you really should
treat it like another application and give it those preventative measures, give it
really nice dashboards, have runbooks and all those other things that every
other application in your business gets.
And even if you are treating it like another application, your
applications should have health checks that alert you if they fail.
So what are the approaches we can use to do this?
I'm going to get into some quick and dirty approaches that, if you
can't get a full-blown dedicated monitoring of monitoring system,
are better than nothing. So the first instance of that is going to be a heartbeat.
A heartbeat is just one system communicating with another, and if that
communication doesn't happen for a certain period of time, an alert will fire.
Again, it's the simplest up/down check that you could get: has communication
between these two systems happened?
And the downside is that the heartbeat is usually a dedicated health check.
It's not as accurate of a representation as some of the other
health checks we're going to go through a little bit later in the talk.
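As a rough illustration of the heartbeat idea, here is a minimal Python sketch. It is not how any particular product implements it: the `HeartbeatMonitor` class, its method names, and the 60-second timeout are all hypothetical. The receiving side records the last time it heard from the other system and reports a missing heartbeat once the timeout has passed.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat and reports when it has been missing too long."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the example can fake time
        self._last_beat = clock()

    def beat(self):
        """Call this whenever the monitored system checks in."""
        self._last_beat = self._clock()

    def is_missing(self):
        """True once no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_beat > self.timeout_seconds


# Usage with a fake clock, so there is no real waiting.
now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=60, clock=lambda: now[0])
now[0] = 30.0
print(monitor.is_missing())  # still inside the 60 second window
now[0] = 61.0
print(monitor.is_missing())  # past the window, so this is where an alert would fire
```

Notice that all it knows is whether the other side checked in, which is why the context is so limited.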
So a good example would be the aNag app with Nagios.
And that's just, if the aNag application hasn't contacted the server in a
certain amount of time, it will fire.
There's obviously false positive risks with that, because your phone
could be having issues connecting to LTE or something like that.
And the other problem is that it's just shouting into an area, saying, hey,
this is broken, rather than doing what they do in CPR training, which is
point to somebody and say, you call an
ambulance, or you get a defibrillator.
Another example of this is Grafana OnCall.
If you self-host Grafana OnCall, you can sign up for a cloud account.
And then if the self hosted version doesn't talk to the cloud
account for a certain amount of time, then you get a notification
that there's a missing heartbeat.
But this only covers the notification system and not everything as a whole.
So this kind of works, but you should probably look into something better.
If you're using a cloud vendor for your monitoring, then usually they have a
status page, and hopefully they're not self-hosting it, because if you are
self-hosting your own status page, then there's a higher likelihood that both
the application and the status page fail at the same time, because there could be
overlap in the infrastructure or overlap in the humans that causes both systems to
fail. But they're really easy to set up.
If you just search online for your cloud vendor's status page, they'll
give you this information and give you point-and-click instructions
for getting this into email or Slack. But the downside is those
aren't noisy notification systems.
People commonly will mute things, and it's really tricky to get the
settings just right so that status updates, or people asking for status
updates, don't bother you after hours, but the cloud vendor going down would.
And this usually doesn't work for self-hosted solutions.
A bit of a better version of this is going to be health check services.
So if you're self-hosting, you could self-host something like Uptime Kuma, but
getting another team to host this inside of your organization is going to be tricky,
because you have to convince people
that, hey, I know I'm on the observability team, but this team that isn't the
observability team should manage the service that does observability things.
There are also cloud options for this, but again, if you're self-hosting,
this can be really tricky, because you either have to allow access to your
monitoring application, which can be a security risk, and getting the security
team on board with this could be a problem, or you have to manage an
agent inside of your infrastructure
to beacon out information to that service as well.
But these do hook into noisy notifications.
And they do require a bit of extra configuration, because it's
not just subscribing to a feed or setting up a simple heartbeat.
You do have to give a list of URLs and, if possible, you should give the correct
responses, because sometimes you get a 200 HTTP code and a valid SSL certificate,
but on that web page or in the response, it says, hey, even though I'm
serving HTTP traffic, I'm not healthy.
So if you can configure it to look for a string in a response, definitely set that up.
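Here is a minimal Python sketch of that kind of check, using only the standard library. It is an illustration, not how any health check service actually implements it: the function names are hypothetical, and the expected string is whatever your own health endpoint returns when it is healthy.

```python
import http.client
import urllib.request


def is_healthy(status_code, body, expected_text):
    """A 200 alone is not enough; the body must also contain the expected text."""
    return status_code == 200 and expected_text in body


def check_url(url, expected_text, timeout_seconds=5.0):
    """Fetch the URL and apply is_healthy. Any connection problem counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy(response.status, body, expected_text)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (non-2xx responses) and timeouts are all OSError subclasses.
        return False


# The page is served fine, but the content says the service is not healthy.
print(is_healthy(200, "status: degraded", "status: ok"))
```

The last line is the case from the talk: the status code is 200, but the check still fails because the string it was told to look for is not in the response.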
And again, all of these have minimal context too. It's just a simple, is the thing
working or not, and maybe a response time.
So the first system that I would recommend,
that I would say is an adequate monitoring of monitoring system, is to deploy
two independent monitoring systems.
This has the widest range of quality. You could do it the lazy way and deploy
a smaller version of the exact same tooling with the exact same version in
the same infrastructure as code inventory with no change management controls.
And this obviously isn't the best solution, because
something inside of that region or inside of that Kubernetes cluster could fail.
You could find out that there was a regression in an update and take
out both systems at the same time.
And if you use the same notification system and that fails,
it's really hard to tell.
Doing all those things right is also tricky from a bureaucratic
perspective, because you have to convince your boss that, hey, we need to
deploy an application in a new region.
Another thing is you have to justify
setting up change management
processes, which can be difficult, and it can be hard to get people
to follow the instructions on that.
If you do this approach, you should definitely test it.
So a way you could test it is to break your monitoring of
monitoring system and then make sure that your primary monitoring
system is sending notifications.
And when you do this, make sure that somebody's periodically
refreshing dashboards or running some tests to make sure that the primary
monitoring system is working as well.
That way you can figure out if there are any interdependencies
between the two systems.
Another thing you can do if you're self-hosting is purchase a cloud service.
And if you're already using a cloud service, you can either purchase a
different cloud service that monitors the first cloud service, or you
could set up an on-prem system that monitors the cloud service.
The upside of this is now you're definitely having different humans
manage the systems, and those humans are probably using different upstream vendors.
A lot of the time you can control where your cloud service is deployed,
so that way they're geographically isolated.
And I know cloud monitoring bills have a bad rap for being super expensive,
but because it's one application, it's not that big of a deal.
And that application is usually managed by the observability
experts in your organization,
so they're aware of things that can lower costs, like
relabeling rules or streaming aggregation.
And this one, you can still misconfigure it,
but the risk is lower because it's a shared responsibility model rather
than a you're-responsible model.
And the best version of this is to purchase a dedicated
monitoring of monitoring solution.
So VictoriaMetrics offers this as a part of some of our enterprise plans.
But the downside is it's the most expensive, because along with paying
for a separate monitoring system,
so a separate instance of VictoriaMetrics to monitor your on-prem VictoriaMetrics,
you're paying for the really smart humans to enrich the already rich notifications.
But again, that's going to be the best experience by far,
because along with a really well supported observability system,
you get the smart people behind it that can help you work
through the issues as well.
So in summary, this doesn't have to be super expensive.
It doesn't have to be super difficult.
And if you can't get permission to deploy a separate monitoring of
monitoring system, there are options that are better than nothing.
And you can mix and match the approaches to fit your availability
requirements or fit your use case.
And the most important thing is we get to answer the age old philosophical question.
If your monitoring system falls over in the middle of the night,
does somebody get paged at 2 AM?
After this talk, the answer should be yes.