Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello. I'm here today to talk about exposing
log metrics to Prometheus with the best practices possible, the
industry standard. There are a couple of ways to go about
moving log metrics to Prometheus, but we're going
to highlight a couple of best practices today, and I really
do hope you enjoy the conversation. By the
way, welcome to Conf42.
A bit about me: I'm an infrastructure engineer. I work with
O(1) Labs on the velocity team, so I
am involved in ensuring that the code developers write
gets to production as fast as possible. The name of our product is Mina Protocol.
That's my Twitter handle and that is my email, so you can reach out
to me anytime. Well, let's get to what
we have to do today. This is an introduction. In this
age of fast-growing advancement in the cloud, and of
debugging applications and managing services in the cloud,
there is a very serious need for us to understand how logs
and metrics work. The software engineering field
is not built one way. It's not built for you to just do garbage
in, garbage out. Some things have to happen in the process for
you to get to where you need to get to. I'll give you an example.
Assuming you need to build a house,
there will be a need for you to do some testing, there will be a need for you to
do load testing, there will be a need for you to ensure that the grade of
concrete required for a certain spot is actually used in that spot.
So it's the same thing in software
engineering. It's not a straightforward thing. At some point
you have to understand why certain things fail and
how you could correct those issues. And that's where log metrics come
into play. They are a huge part of observability and
a huge part of the success of any running application.
So we're going to talk about different things: understanding your system,
creating postmortem analyses from what you study in the logs and
metrics, and, yeah, several other functions.
There are tons of ways to ship logs and metrics to Prometheus, but today our
case study is Vector. Vector (vector.dev) is a very interesting
product, and by the time I'm done breaking down certain things,
you will see the importance of using Vector. What this
picture is explaining is the server, the pipeline,
and then Prometheus. So we have the server,
which is the blue part, then you have the pipeline, where
the application logs actually flow through, and then Prometheus. So in this case, we're talking about
the logs and metrics in the pipeline before they get to
Prometheus.
Some of the real-life use cases concerned with what we're talking about
today: one is reduction of total observability cost.
Well, you could say it's an advantage, but I like to see it as
a use case. When it comes to revamping
or following up on observability cost, I think what we are
discussing today is very important. Secondly, if you
want to improve observability performance and reliability overall:
reliability in the sense that if you
have a good view of how your application runs, the logs,
the communication between the services, then at some point you'll be able to tell,
this is where your infrastructure is and this is where you want to get it
to. So a study of the logs at this point gives you the
opportunity to take it to where you want to get it to. So yeah,
this is another real-life use case of our conversation today.
Another thing is transitioning vendors without disrupting
workflows. Workflows in this sense means what happens in
a system. We've got different things that are involved in running our system
and ensuring it gets where it's supposed to get to. But observability,
or the study of log metrics, gives you an opportunity
to transition between vendors without disrupting
the real workflow of the application or the service.
So you could change, say, the services that receive
the logs, you could change different things, you could change different options
on the system, just because of the logs. You've seen the
logs, you've seen exactly what is happening, and you can tell: okay, this application does
not seem to do what we want. Can we change it without disrupting the
real key workflow or the key function of the system as a service?
Another use case is enhancing data quality
and improving insights.
Observability is about insights; it gives an outlook
of your application, and if you have insights, you'll be able to tell where your
application is going, where it is right now, and where it used to be before.
With this you can produce a document, an Excel
spreadsheet, a postmortem analysis, and all that.
This one is about consolidating agents and eliminating agent fatigue.
An agent in this case is not a human being; of course, I expect everyone
to understand that. An agent here just means something that represents
a certain service. I'll give you an example, say Ansible:
you have to install an agent for Ansible on the servers
to be able to ensure that Ansible can
communicate with the server, you understand. So in this case we're
going to be talking about how to eliminate agent fatigue, so you don't have to
stress any agent of some sort in the service or on
the server. You'll be able to look
at different options as far as observability is concerned, but this is very
important for the logs and metrics that we're going to
get from the application. Okay.
So coming down to it, first of all, let's understand the relevance of logs in
SRE. I mean, you understand what SRE means.
SRE is just about maintaining the production status of any application in the cloud;
in some companies they're called production engineers.
So we are saying, first understand the relevance
of logs in SRE. Log data
contains stories. Log data contains information
such as memory exceptions and errors. This is very helpful in
identifying the why behind a problem, either one that a user has brought to
our attention or one that we have uncovered. "We" here in this case is
the engineering team. What this is saying is that the
logs have it in themselves to show you
the communication between services on a server.
Logs can also tell you if there are memory leaks, if there are HTTP
errors, if we have memory
exceptions, if we need to increase
the memory, if we need to review how
certain calls, say cURL requests, are made over the network,
or if we need to look at the time something failed.
And in some cases, you may run a certain server
and it crashes at some point. When it crashes, you can't even get into
the server to check what is wrong. So you start up the server
again, make sure you get into the server while it's running, and
then you watch the logs until it crashes. The last line before the crash
could tell you exactly what is wrong with that service or server.
It could be the image it's running, it could be the server configuration itself,
it could be memory, it could be whatever. But that last line
gives you an insight into what is happening with the server
and why it's crashing. So that is the core relevance of logs in SRE.
Brian Redman said this: being an expert is more than understanding how a
system is supposed to work. Expertise is gained by investigating why
a system doesn't work. Everything around what Brian just said
is tied to logs, to understanding log data. So, yeah.
Viewing metrics in SRE: each exposed metric
should serve a purpose. Resist the temptation of exporting a handful
of metrics just because they are easy to generate. Exactly. This is
just saying that when you're dealing with metrics, you don't just ship
everything in the application or on the server. You have
to ship what is relevant; you have to check what the team needs.
Instead of giving us updates on how
the servers are operating every time, give us updates on when one is down,
and probably why it is down. This
conversation could extend into alerting and all that, but
just understand the point that when you're viewing metrics, it has to be exactly
what needs to be viewed, what is going to help you get a better
handle on the infrastructure. Metrics are just one part of observability, and
that's why you have the picture there: you have traces, you have logs. All
these combine to make observability a very successful practice.
I said that we're going to talk about Vector, and here it is. When you
hear Vector, what does it mean? I mean, I could have called it
vector.dev, but the general name is Vector. What does
it mean to utilize Vector to expose metrics?
Off the top of my head, Vector is just a
product that makes it easy for you to ship logs to
Prometheus. But catch this: Prometheus does
not accept logs, it accepts metrics.
So Vector provides that
pipeline to convert logs to log metrics and then
take them to Prometheus easily, with just a couple of things exposed on the servers and
all that. So that is a summary of what Vector does. You could
read about them; they have very interesting documentation. I've used them a couple
of times in the past as well.
So some of the best practices we are going to be dissecting:
these are steps in the implementation, but I'm just going to point out
a couple of tiny things that are going to help you when you're initializing Vector
to expose metrics. One of them is that you need to set
up the web server configuration in vector.toml (TOML).
You can get this from the documentation; it's pretty easy. Five or six lines
should get this running for you.
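To make that concrete, here is a minimal sketch of what those few lines could look like. This is not taken from the talk's slides; the file paths and names are my own assumptions, and it simply enables Vector's API endpoint and reads raw application logs from a file:

```toml
# vector.toml -- minimal skeleton (hypothetical paths and names)

# Optional: Vector's own API/web server, handy for health checks
[api]
enabled = true
address = "127.0.0.1:8686"

# Read raw application logs from disk
[sources.app_logs]
type = "file"
include = ["/var/log/myapp/*.log"]   # assumed log location
read_from = "beginning"
```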
The next thing is parsing logs before the transform
to metrics. Before Vector
does the transformation to metrics, you need to find a way to parse
the logs.
Remember, Prometheus does not accept logs, only metrics.
So if the logs are not parsed properly, Vector won't be able to transform them.
You need to consider this as well, because sometimes I have seen scenarios where
people just want to jump straight from logs to metrics. The logs need to be
parsed first.
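As a rough example of what that parsing step can look like, and assuming the application emits JSON log lines (an assumption, not something stated in the talk), a remap transform could turn each raw line into structured fields:

```toml
# Parse each raw log line into structured fields before any metric conversion.
# Assumes JSON-formatted logs; adjust the VRL program for your own format.
[transforms.parse_logs]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(string!(.message))
'''
```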
Then the next thing we're going to talk about is effectively counting log
components and strings. These are the
responses we can imagine collecting for any observability process.
Now, the logs that are going to be converted to metrics
will contain so much information, but we need to understand what we want
to see. Do we want to see the request status? Do we want to see the
service status? Is it a 200? Is it a 404? Is it a 308?
What exactly in the log do we want to see? So this is about effectively
counting log components and strings. Vector makes this very easy. You could
get a counter that counts
certain components. How many times did the service fail? How many times did
it return a 200, and all those
kinds of responses? This is about filtering whatever
is sent to Prometheus so it gives you solid information on what you need.
You don't have to ship everything; if you just do that, basically everything will
go in, but I think it will be difficult to decipher from
the system, or decipher from the logs, exactly what you need.
So this is about effectively collecting logs, and it will
give you proper visibility on exactly what happens.
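Here is one hedged sketch of that idea using Vector's log_to_metric transform. The field names (status, method) and the metric name are assumptions for illustration; they would have to match whatever fields your parsed logs actually contain:

```toml
# Count responses by HTTP status instead of shipping every log field.
[transforms.to_metrics]
type = "log_to_metric"
inputs = ["parse_logs"]

[[transforms.to_metrics.metrics]]
type = "counter"
field = "status"                  # assumed field produced by the parser
name = "http_responses_total"     # hypothetical metric name
tags = { status = "{{status}}", method = "{{method}}" }
```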
Explore the Prometheus exporter. Of course, this is general, whether Vector or
not; Prometheus has exporters that you've got to use.
Now you can bring Prometheus into the equation: after the URL has
been exposed by Vector for scraping, we can use the Prometheus exporter sink
feature provided by Vector.
That slide sentence is incomplete, but okay. Vector has a prometheus_exporter
sink that you can use to work with Prometheus and all that.
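A minimal sketch of that sink, assuming the default port, looks something like this; the input name matches the hypothetical transform above:

```toml
# Expose the converted metrics on an HTTP endpoint Prometheus can scrape.
[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["to_metrics"]
address = "0.0.0.0:9598"   # prometheus_exporter's default port
```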
Explore the Prometheus scrape to view it on the dashboard.
I don't have to talk a lot about this. I mean, this is Prometheus scraping.
It's just about scraping the metrics that have been converted. And then,
yeah, go ahead.
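For illustration only, a scrape job pointing Prometheus at the address Vector exposes could look like this; the hostname is an assumption:

```yaml
# prometheus.yml -- scrape the endpoint exposed by Vector
scrape_configs:
  - job_name: "vector"
    static_configs:
      - targets: ["vector-host:9598"]   # assumed host; matches the sink above
```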
So the next thing
I'm going to talk about, which is the last thing
now, is that you need to set actionable alerts.
A well-defined alerting strategy can help you achieve effective performance
monitoring. Now, I have touched on this slightly
before, but I'm going to talk about it more right now.
Setting up a threshold when you are sending an alert is very important.
I mean, you're not just going to send things to Prometheus, right? From Prometheus,
I have to be able to be alerted, maybe on Slack with some sort of
webhook, or on Discord, or on WhatsApp, or by email,
whatever the case may be; I have to be alerted
about what is happening in Prometheus. Do you understand?
But you can't be alerted on every behavior of the system, because
anything could happen. Maybe we have a restart procedure, or a
ReplicaSet if it's Kubernetes we're talking about, and then the
node or the pod can restart itself. You'll be
bugging me if I get an alert every time it restarts or every time it needs to
scale out. It will be so
much, it will be overwhelming.
Google calls it alert fatigue. So what
we try to encourage, or what I'm encouraging, is that you set actionable alerts,
alerts that require action. If we have a
service down, that's something to be aware of. If we have
a CrashLoopBackOff error, that's something to be aware of.
Things like this that cannot easily be
handled automatically by the system, let the people, let the
engineering team be aware of. So set actionable alerts, alerts that will require
your attention, not alerts that are just going to tell you, oh, this is happening.
So, yeah, you should ensure that notifications are properly configured to reach
the appropriate team in a timely manner. In some teams
you can have on-call engineers that can help get this handled while
they are on call. I think that's the last thing.
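To show what an actionable, thresholded alert could look like, here is a hedged Prometheus alerting rule. The metric name matches the hypothetical counter from the Vector sketch earlier, and the threshold and durations are placeholders you would tune for your own service:

```yaml
# alert-rules.yml -- page only on conditions that need a human
groups:
  - name: service-alerts
    rules:
      - alert: HighServerErrorRate
        expr: rate(http_responses_total{status=~"5.."}[5m]) > 1
        for: 10m                 # sustained, not a one-off blip
        labels:
          severity: page
        annotations:
          summary: "Sustained 5xx responses, needs human attention"
```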
There are lots of things to say, but I don't want to bog you down with
so much information. I just wanted to give you the five concise ways that you could get
this running and be at the top of your game when it comes to shipping
logs, or converting logs to metrics, to Prometheus using vector.dev.
Now, in conclusion, a good monitoring system pays
dividends. It's well worth the investment to put substantial
thought into what solution best meets your needs and to iterate until
you get it right. The success of a good monitoring system,
the success of observability, is tied to how well the team
can sort through logs. I have been on teams where we don't have the very
best engineers, but they are so good at debugging and
postmortem analysis, and logs and metrics kind
of give them a hand, and you think, oh, they are senior engineers, because they
can read logs and interpret
what happens to a system, that kind of thing. So we
must understand the best practices when it comes to monitoring the system.
Just like this conclusion says, a good monitoring system pays
dividends. I think I've come to the end of my talk.
Oh, yeah, gratitude to my co-researcher, Edima Mark, and to my company,
O(1) Labs, where we have the opportunity to do these things in
real life, and then, of course, to the Conf42 organizers.
I really do appreciate the time, and I really hope that
I've been able to teach someone a thing or two. Of course,
if you have any questions, you can reach out to me and, yeah,
let us know what we can do.
You can always reach out to me, like I said here:
you can tweet at me, send me a message, or send me
an email. I'm always available. I use WhatsApp as well,
but I thought, why should I add that contact here? Anyway,
it was really good speaking to everyone, and I really do
hope we have an even more interesting time listening to the other speakers at
this conference. Thank you.