Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. Welcome to this talk. Today we'll be talking about
reliable monitoring and alerts, everything you need to
know about how to set up your dashboards and
alerts for your cloud applications. My name is
Israel, I'm Brazilian, and I've been living in the UK for the past two and a half years.
This is a picture of my family: I'm married, and we have a son who is six years old right now.
As for hobbies, I love playing games and traveling;
this is a picture of us traveling across Europe.
Currently I'm a software engineer at Meta.
Here's a brief outline of what we'll be talking about today.
First I want to touch on why we want to talk about alerts,
dashboards, metrics, and related things. Then we'll
go a bit into metrics, how to set up dashboards,
what is interesting for alerts, and how to learn from
the failures that will eventually happen. So first,
why is it important?
We have questions like: are the systems up right now?
Is everything all right? Is everything working as expected?
Are our metrics stable enough?
Or at least, are we keeping the expected trends? If you are expecting
to grow every month, is it growing? Ultimately,
can I sleep in peace? Can I rest assured that
the system is healthy while I'm
taking a vacation, while I'm sleeping, and be sure that I will be alerted
if anything goes wrong?
Well, metrics are all about what we want to monitor.
It will be different depending on your scenario,
depending on what you want to do. But look
at this engine room of a ship. I don't know which ship it is, but we can see lots of
different things to check:
the ship is going well if the engine has the right pressure,
the right tuning for everything. And this is important
because when you are on a ship, you want to make sure the ship is
working properly. And similarly, that's what we want for our cloud
applications and our systems. So we
have two types of metrics. First, I would say the business
metrics: the KPIs, goals,
user behaviors, everything that we want
the product to be, everything that shows us
what's affecting our business.
And on the other side, not detached
from it, we have the technical metrics. Here we are talking about response time,
error rate, machine constraints like
CPU, memory, I/O, network,
things like that. And for this presentation,
we'll be focusing more on the technical
examples, but it's valid for all of them.
So you can set up dashboards and alerts for both.
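To make the technical metrics a bit more concrete, here is a minimal sketch of how a service might record request counts, errors, and latency using the prometheus_client Python library. The metric names, the port, and the simulated handler are made-up examples, not anything from the talk.

```python
# Minimal sketch: instrumenting technical metrics with prometheus_client.
# Metric names and the simulated handler below are hypothetical examples.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
ERRORS = Counter("app_request_errors_total", "Total failed requests")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    """Pretend to serve one request and record the technical metrics."""
    REQUESTS.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # simulated work
        if random.random() < 0.001:            # simulated ~0.1% error rate
            raise RuntimeError("simulated failure")
    except RuntimeError:
        ERRORS.inc()
    finally:
        LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request()
```

A dashboard tool would then scrape these values and plot them over time.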
Now let's dive into dashboards: all we need to know at
a glance. So what are dashboards?
They are a visual representation of key
metrics, indicators, and trends that we
want to observe. They give insights into
the current state of the system
and provide us with continuous monitoring
and analysis. Usually we want to see them over a
period of time, basically the past X days,
hours, weeks, the past month or
fortnight, and things like that. Let's
take a look at an example to better understand what a
dashboard is and how we want to look at it.
I took this from the New Relic
website, from the introduction part of it; New Relic
is one of the most widely used tools. This is not sponsored,
it's just an example. And here
we can see lots of information in the
dashboard. It can be a bit overwhelming,
but I want to highlight two things here. First,
we have different widgets looking at different things. I already mentioned
error rate, throughput, transaction time, Apdex score,
and we have many other things that could be observed.
We also usually have a period
selection, here 30 minutes. But sometimes you want to see
if something is stable based on different seasonality.
So, for example, depending on your user behavior
or your system's traffic distribution, it might be worth
checking weekly. You would then see
the daily pattern going up and down,
depending on whether it's day or night. And that
could help identify whether the system behavior
corresponds to what's expected for that seasonality.
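As a rough sketch of that kind of seasonality check, the snippet below compares a metric with its value at the same moment one week earlier. The fetch_metric function is a hypothetical placeholder for a query against whatever monitoring backend you use.

```python
# Sketch: compare a metric with the same point in the previous week,
# so normal daily/weekly seasonality is not mistaken for a problem.
# fetch_metric is a hypothetical placeholder for your monitoring backend.
from datetime import datetime, timedelta, timezone

def fetch_metric(name: str, at: datetime) -> float:
    """Placeholder: return the value of the metric at the given time."""
    raise NotImplementedError("query your monitoring backend here")

def week_over_week_change(name: str, now: datetime) -> float:
    """Relative change versus the same moment one week ago."""
    current = fetch_metric(name, now)
    last_week = fetch_metric(name, now - timedelta(days=7))
    if last_week == 0:
        return float("inf") if current > 0 else 0.0
    return (current - last_week) / last_week

# Example: flag if throughput is more than 30% below the usual level
# for this time of the week.
# change = week_over_week_change("requests_per_minute", datetime.now(timezone.utc))
# if change < -0.30:
#     print("Throughput is well below the usual level for this time of week")
```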
Okay, now that we saw a dashboard, let's talk about
some useful visualizations for dashboards. The first
one, already mentioned, is the time series. We use
time series to display trends over time,
and they make it easy to spot
spikes and seasonality,
whether weekly, daily, hourly, or anything else.
Most importantly, when you look at one,
you can clearly see when it's going up or down,
compare it with the historical data, and check whether it's
something that's expected or something that's weird
and not expected. For example, let's say your error rate
is usually stable at 0.1% and
suddenly it jumps to 1%. That's a ten
times jump, a significant jump, and you definitely want to take a look at
it, especially if you know, for example, that a new version has
just been rolled out: it could be that this version is
introducing new errors in your system, and maybe it's even time
to stop the rollout.
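A very rough sketch of that kind of check could look like this; the baseline window, the ten-times multiplier, and the example numbers just mirror the scenario above and would need tuning for a real system.

```python
# Sketch: flag a sudden error-rate spike relative to a recent baseline.
# The ten-times multiplier mirrors the example above; tune it for your system.
from statistics import mean

def is_error_rate_spike(recent_rates, current_rate, multiplier=10.0):
    """True if the current error rate is `multiplier` times the recent baseline."""
    baseline = mean(recent_rates)
    if baseline == 0:
        return current_rate > 0
    return current_rate >= multiplier * baseline

# Baseline around 0.1% (as a fraction) and a current reading of 1%.
baseline_window = [0.0010, 0.0011, 0.0009, 0.0010]
print(is_error_rate_spike(baseline_window, 0.010))  # True -> investigate, maybe halt the rollout
```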
Another one, usually also a time series, is the stacked area
chart, which you use when you want to see the cumulative contribution to
a total value. Let's say, for example, you want to monitor the
percentage of users using your application on
different platforms: mobile Android,
iOS, the website,
or the mobile website, if you want to make that distinction.
You can then see over time whether each one is decreasing or increasing.
For example, if you had a critical bug in your Android application,
you would see the Android percentage
in the chart drop,
or even disappear, or something like that.
There are many others: bar charts, pie charts,
heat maps, scatter plots, cloud charts, histograms, waterfall charts.
But don't get overwhelmed by them.
We need to know how to read those charts, we need to know what they
mean, and most importantly, what action we
want to take from them. If you have a chart that's showing
a deviation, something weird that you can't interpret,
it's probably useless. So keep it simple.
I highlighted two of the main ones, but keep it simple.
Remember, we need to define which metrics we want to observe,
and some of these tools will probably already give you
predefined metrics to follow. As we
get used to them, as we think, hey, I should have observability
for this specific thing, it might be worth checking
other chart types. For example, I usually use bar charts to
see the contribution of a specific error
to the error rate. Let's say we have an increase in
the error rate: I quickly look at a bar
chart for the past hour to see which error is
the most frequent, and we can quickly find
the issue, or at least find it quicker.
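As a tiny sketch of that "which error was most frequent in the past hour" question, assuming the error events are available as simple records (the structure here is made up):

```python
# Sketch: tally the most frequent error types in the past hour,
# the same question a bar chart widget answers at a glance.
# The error_events structure is a hypothetical example.
from collections import Counter
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
error_events = [
    {"type": "TimeoutError", "at": now - timedelta(minutes=5)},
    {"type": "TimeoutError", "at": now - timedelta(minutes=12)},
    {"type": "ValidationError", "at": now - timedelta(minutes=30)},
    {"type": "TimeoutError", "at": now - timedelta(hours=3)},  # outside the window
]

one_hour_ago = now - timedelta(hours=1)
recent = [e["type"] for e in error_events if e["at"] >= one_hour_ago]
for error_type, count in Counter(recent).most_common(3):
    print(f"{error_type}: {count}")
```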
Something else that's important about dashboards is to plot
useful guidelines. Sometimes you know what value is expected;
let's say, for example, you expect the error rate to
be between 0.1% and 0.2%. Maybe you
want those two lines plotted in the dashboard
to help you see whether you're getting
close to one of the thresholds or not.
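As an illustration, here is how those guideline lines could be drawn with matplotlib; most dashboard tools have an equivalent threshold or guideline option, and the 0.1%-0.2% band and the data below are just the example values.

```python
# Sketch: plot an error-rate time series with the expected band as guidelines.
# The 0.1%-0.2% band follows the example above; the data is illustrative.
import matplotlib.pyplot as plt

minutes = list(range(60))
error_rate = [0.12 + 0.02 * ((m % 10) / 10) for m in minutes]  # fake data, in percent

plt.plot(minutes, error_rate, label="error rate (%)")
plt.axhline(0.1, color="green", linestyle="--", label="expected lower bound")
plt.axhline(0.2, color="red", linestyle="--", label="expected upper bound")
plt.xlabel("minutes")
plt.ylabel("error rate (%)")
plt.legend()
plt.show()
```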
Also plot useful limits.
We'll talk about alerts
later, but think about anything you want to be alerted
on. Maybe it's worth also bringing that to a dashboard.
Also have handy filters: you
might want to filter by specific surfaces or dimensions.
Let's say you identified through the dashboard that the
Android application has an increased error rate.
You might want to filter: okay, now show me only
events in the dashboard coming from the Android application.
Then you can check what is happening
and when it started, and find
your way through it quicker. Another very important
thing is that dashboards need to be actionable.
Let's keep the same example: the error rate is up.
Okay, why? Which errors are happening?
Okay, show me the stack trace. I found that this is the
most common error, so just click through it, go to the stack
trace, go to the bugs too. It's important to
have easy access to drill down into the
specific metrics and widgets we want.
But be careful: ideally you don't
build a dashboard that's too overwhelming, because
again, this is something we want to look at with
a quick glance to understand the status of the system,
and if it's too complex, too overwhelming, too much information,
we'll end up missing important things instead of
having them clearly jump out at us.
That's it about dashboards. Let's talk a bit about alerts.
Alerts, I like to say, exist because we
all deserve peace of mind, and they help with that. So what
are alerts? A good example is the cuckoo clock:
at specific times it just says, hey,
it's 05:00 p.m. It's 04:00 p.m. It's 03:00 p.m. Or something
like that. I don't know why it got reversed, but you get the idea. So alerts
are basically automated notifications. But of course you don't want to
be notified because it's 05:00 p.m., but you probably want to be notified that
the error rate spiked by ten times. So
we basically define what we want to observe,
which metrics, and what the thresholds are.
You define: okay, above this, alert me,
send a message to my phone; or if it's way above that,
call me, or something like that.
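Here is a minimal sketch of that kind of threshold rule with two severity levels, one that sends a message and one that pages. The threshold values and the notification functions are placeholders for whatever values and channels make sense in your setup.

```python
# Sketch: a threshold-based alert with two severity levels.
# notify_chat and page_on_call are placeholders for your real channels.

WARN_THRESHOLD = 0.005      # 0.5% error rate -> send a message
CRITICAL_THRESHOLD = 0.02   # 2% error rate -> page whoever is on call

def notify_chat(message: str) -> None:
    print(f"[chat] {message}")   # e.g. post to a team channel

def page_on_call(message: str) -> None:
    print(f"[page] {message}")   # e.g. trigger the on-call paging tool

def evaluate_error_rate_alert(error_rate: float) -> None:
    """Route the alert by severity: page for critical, message for warning."""
    if error_rate >= CRITICAL_THRESHOLD:
        page_on_call(f"CRITICAL: error rate at {error_rate:.2%}")
    elif error_rate >= WARN_THRESHOLD:
        notify_chat(f"WARNING: error rate at {error_rate:.2%}")

evaluate_error_rate_alert(0.01)   # warning: message only
evaluate_error_rate_alert(0.05)   # critical: page the on-call
```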
The idea of alerts is that they prompt immediate attention, and they're usually targeted
at specific individuals or teams. Some companies I
know have a team responsible for monitoring;
they are usually the first level of monitoring.
They receive the alerts, look at the dashboards, identify which team
or individual the issue should be escalated
to, and then involve them. But most companies don't have that:
alerts are simply tied to specific people or specific teams, depending on what
the alert is, and then we can act on them. So alerts
are very important for keeping us
up to date on anything that seems to be
going wrong, and not only things going wrong;
I will explain that, too. So: alerts for peace of mind.
If you want to relax like this person in
the picture, usually we want to be alerted
only on critical metrics. If the
alerts are too noisy, and by noisy I mean
you get alerted every five or ten minutes, every hour, or
even every day, you will probably end up ignoring them, and that
completely misses the point of the alert. So we want to be alerted on
the critical metrics, be they business metrics or technical
metrics. We want to have clear severity
levels. If it's a critical failure, let's say
the system is not responding anymore or your main page
is down and just returning 404,
we need to know that,
and ideally we have different levels
of alerting for that too. But we also want alerts
set up in a way that gives us early visibility
of possible issues. I remember once at
a company we almost had the database
go down because the disk was slowly getting full;
I mean, not too slowly, but not too fast either.
TL;DR: the disk was getting full, and the problem
is that it was filling up at a pace that
avoided the alerts for sudden movements,
but it was still quick enough that it
would run out over a weekend. On a Friday,
people found out, okay, we have a problem.
But if we had had an alert based on the trend, something like,
at this rate we'll be out of storage next week,
it could have saved us some headaches.
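A trend-based alert like that could be sketched roughly as below: fit the recent growth in disk usage and estimate how many days are left before the disk is full. The sampling format and the numbers are illustrative assumptions.

```python
# Sketch: estimate days until the disk is full from recent usage samples,
# so a slow but steady trend raises an alert before the weekend.
# Samples are (day, used_fraction) pairs, oldest to newest; data is illustrative.

def days_until_full(samples: list[tuple[float, float]]) -> float:
    """Linear fit of used_fraction over time; days from the latest sample to 100%."""
    n = len(samples)
    xs = [day for day, _ in samples]
    ys = [used for _, used in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    if slope <= 0:
        return float("inf")  # usage is flat or shrinking
    return (1.0 - ys[-1]) / slope

# Usage grows about 2% a day: full in roughly 6 days -> alert before the weekend.
samples = [(0, 0.80), (1, 0.82), (2, 0.84), (3, 0.86), (4, 0.88)]
if days_until_full(samples) < 7:
    print("Disk is projected to be full within a week")
```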
Similar to dashboards, we want alerts to be actionable and clear.
If you receive an alert, you should know exactly what
the alert means and what you should do. What you should do could be:
okay, let's take a look at a specific dashboard,
understand it better, and see what problem is going on.
But also: is there an escalation path? Is there someone
I should call, or someone who should be
working to solve this issue?
And on the other hand,
if no alert has fired, that should
also be a signal that everything is right. Of course,
we sometimes miss observing
something; we'll talk about that later. But in an ideal
alert setup, everything critical is monitored, and if
no alert is open,
it should be fair to assume everything is all right.
Okay, moving on then to the last part: how to learn
from failures. We can set up dashboards, we can set up alerts,
but things will break, things will happen.
And this is the art of asking why problems
happen. This is a very simple example. I tried
to look for pictures of accidents and things like that, but I thought they
looked too scary and might trigger something for someone.
So, okay, let's take a simple example: just a flat tire.
Problems happen, and maybe this
is fine. Well, no, it's probably not fine in the
sense that we need to act; it's fine in the sense that we
don't need to panic. But we do need to act. If you keep
the flat tire, you can't continue with the car,
at least not for a long distance.
Okay, how can we learn from failures? How can we investigate,
ask the right questions, and improve our
system, improve our observability,
improve our alerts, dashboards and monitoring?
So first we have to ask, why did this incident
happen? And usually it's not a matter of
asking why once, receiving one answer, and being
happy with it. I have an example that I love.
Let's say the server was shut down,
and let's step away from the cloud here, because it
makes this example more complex. Say you have a
data center and suddenly a server shuts down.
So you check it and, oh,
actually the power is out. Someone unplugged it,
or it was unplugged. So you just plug it back in and the server is back.
Okay, that's fine. Problem solved, right? Yeah,
problem solved for that moment. But we also have to
ask, why was it unplugged?
Did someone pass by and accidentally unplug
it? Did something happen and someone
unplugged it in a panic? Or maybe the
janitor, not knowing what was running
there, unplugged it to clean and forgot to plug it
back in. There could be several reasons, and we need to keep asking
why this happened until we get to the root cause. This is
the first part. Then we have to ask: what could
have prevented it? Is there any alert that could
have triggered, any alert that
actually showed us the problem? Is there any alert that could have triggered
earlier? In the case of the server, of course
not. But maybe there could have been an alert in the sense of:
hey, someone got access to the server room, is that okay,
is it an authorized person? Okay, I'm probably extrapolating
the example here, but I'm just trying to give some hints and ideas.
We also have to understand: were we caught off guard?
How can we detect this earlier? And so on. Every
error, everything that happens that is significant enough to
be a problem, should, I think, be investigated,
and a plan should be made to improve
things. We can, of course, improve the system itself:
fix the bug, improve the scalability, and
so on. But we also need to
improve the dashboards, improve the alerts,
and make sure that if we get close
to having the same issue again, we catch it as
early as possible. As I said,
any missing dashboards, any missing alerts, anything that could have
been better. So, to wrap up: we talked about metrics,
why we should select metrics, and some types of
metrics to select. We looked at dashboards, some examples,
what to look at, and how to set up useful visualizations.
We talked about alerts and good practices
for setting up alerts and escalations. And last, we talked about
learning from failures, with some tips for investigating failures and
improving. Thank you. That was
it. I hope it was useful and that you enjoyed it.
Feel free to connect with me on LinkedIn.