Transcript
This transcript was autogenerated. To make changes, submit a PR.
Thank you for joining my talk, "Getting More Out of Your Logs at 3:00 a.m." Before we begin, let me introduce myself. I am Shankma Kazaria, working as a software engineer at Zendot, an incident management platform that helps teams respond to incidents and resolve them quickly. I have several years of experience in software development, and today I'm excited to share insights on effective logging practices that help you resolve issues quickly and efficiently. So let's start the talk.
So, what are we going to discuss in this talk? We will cover the what and why of logging and why it is important, the common logging scenarios that should be logged, the common mistakes people usually make while implementing logging, and the logging best practices you can follow. And at the end, once you have a better logging implementation, I will show how those logs can actually be useful at 3 a.m.
So let's start with the basics: what is logging? Logging is the act of capturing and storing the events and messages generated by an application during its operation. It gives you a trace of what happened while the application was running and which events occurred.
Next, why is logging important? Logging matters in scenarios like debugging an ongoing production issue; most of you, as on-call engineers, know how stressful and frustrating it is to be debugging at night. Logging also helps you monitor application health and performance, for example how a particular endpoint is responding to a particular request. Then it is useful for security auditing, which is essentially mandatory for online systems: it helps you stay compliant and shows who can access what. And it helps you understand user behavior, how users are using your application and how it is growing, so you can optimize it or plan a better way to improve the user experience.
As I mentioned, when you are troubleshooting a production issue at night, or whenever there is an ongoing incident, logs are the only source of truth that tells you what happened during a particular request: which operations ran, which functions were called, which code the request actually executed. That is really helpful; you get a much better picture of what happened and what to do next. Then there is monitoring application health: logs tell you how effectively your system is working, what the latencies are, what your p99 is, the 99th percentile of your response time, so you can compare what things looked like before and after an optimization. And security auditing is a very critical part of any online application.
Now that we know why logging is important, let's look at the components a logging framework consists of, and then see how to implement it and what the common logging scenarios are. A logging framework has four main components: loggers, handlers, filters, and formatters. The logger is the entry point; it captures the log event. Then the handler comes into the picture; it is responsible for sending that log event to its destination. There are multiple kinds of handlers, like a file handler or a stream handler, which write to a particular file or stream. Then come the filters, which filter events: if it is an error event you might want to attach more contextual data, and if you don't need a particular type of event you can drop it at the filter stage itself so it never reaches the log file. Finally the formatter shapes the log data into the structure you require; say you want the log data in JSON, it converts the raw record into JSON. A typical flow of a logging framework looks like this: your Python program, or whatever language you are using, generates the log message; it goes to the logger, the handler decides where to write it, the filter keeps or drops the event as per your logic, the formatter renders it in the required format, and it lands in the log file.
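Here is a minimal sketch of those four components using Python's standard logging module; the logger name, the file name, and the health-check filter are only illustrative.

```python
import logging

class DropHealthChecks(logging.Filter):
    """Filter: decides which events pass through."""
    def filter(self, record):
        return "healthcheck" not in record.getMessage()

logger = logging.getLogger("payments")            # logger: entry point that captures events
logger.setLevel(logging.DEBUG)

handler = logging.FileHandler("app.log")          # handler: sends events to their destination
handler.addFilter(DropHealthChecks())             # filter attached to the handler
handler.setFormatter(logging.Formatter(           # formatter: shapes the final record
    "%(asctime)s %(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)

logger.info("payment captured")                   # logger -> handler -> filter -> formatter -> app.log
logger.debug("healthcheck ping")                  # dropped by the filter, never written
```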
Now that we know what logging is and what a logging framework looks like, let's move on to the common logging scenarios. As we discussed, there are a few purposes you use logging for, so make sure you log the right events for each. For troubleshooting and debugging, you log that a particular event has happened; in the example this is logger.debug, a debugging message written at the appropriate point based on what the function is doing. Then there is error handling: where you expect an error or exception in a particular code block, you wrap it in try/except, and when you catch the error you write a log saying that the operation failed and attach the relevant contextual information, which is useful at debugging time. The next one is performance monitoring: when you want to measure a particular endpoint or code block, you record a start time, perform the operation, record an end time, and log both, so you know how long that block takes. And the last one is security auditing: as I mentioned earlier, it is very important for maintaining security compliance, so you log that this user logged in from this IP address, or that this specific user performed a particular action. With security-audit logging you adhere to compliance, and during an audit you can properly show who did what.
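To make those four scenarios concrete, here is a rough sketch in Python; charge_card, the field names, and the values are purely illustrative, and the extra fields only show up in the output if your formatter (for example a JSON formatter) includes them.

```python
import logging
import time

logger = logging.getLogger("payments")

def charge_card(user_id, amount):
    """Stand-in for the real payment call."""
    pass

def charge(user_id, amount):
    logger.debug("charge started", extra={"user_id": user_id, "amount": amount})    # debugging

    start = time.monotonic()
    try:
        charge_card(user_id, amount)
    except Exception:  # in real code, catch the specific errors you expect (more on this later)
        # error handling: log the failure with its context and the stack trace
        logger.exception("charge failed", extra={"user_id": user_id, "amount": amount})
        raise

    elapsed_ms = int((time.monotonic() - start) * 1000)
    logger.info("charge succeeded", extra={"user_id": user_id, "duration_ms": elapsed_ms})      # performance
    logger.info("audit: charge performed", extra={"user_id": user_id, "action": "charge"})      # security audit
```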
Now that we know the common logging scenarios, let's move on to the common logging mistakes, the ones developers usually make while implementing logging. One of the biggest is ignoring the context in the log message. You write a log like "this operation has failed." At implementation time you are in a frame of mind where you know the logic and what the function is about, so you tend to leave out the contextual data; you think it is a generic message anyone can understand. But once the change has been in production for ten or fifteen days and you start debugging an issue, you don't know what this message is for, which request it is linked to, or which user faced the problem. It is just taking up your storage. Without the contextual data, the log is of no use.
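As a small sketch, here is the same failure logged without and then with context; the field names and values are only illustrative.

```python
import logging

logger = logging.getLogger(__name__)

# at 3 a.m. this tells you nothing:
logger.error("operation failed")

# with context you know which request, which user, and what actually failed:
logger.error(
    "payment capture failed: gateway timeout",
    extra={"request_id": "req-81f3", "user_id": 42, "order_id": 1017},
)
```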
Another common mistake is not setting a proper log level. People often write all the logs at the same level; as in the example, everything is written as an error. That is confusing: is this a debugging event, or did a particular operation actually complete successfully? Successful operations should be info or debug, not error, otherwise it creates confusion while debugging, because you see lots of events coming in as error logs and you cannot filter out the real error events. The opposite also happens: a log gets marked as a warning when it should really be an error. This is something a code review should catch, and the developer has to keep in mind what is happening before and after that particular code block and choose the proper log level.
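A quick sketch of levels matched to what actually happened; the messages are illustrative.

```python
import logging

logger = logging.getLogger(__name__)

logger.debug("cache miss for user 42, falling back to the database")   # developer-level detail
logger.info("invoice 1017 generated")                                   # normal, successful operation
logger.warning("webhook delivery retrying (attempt 2 of 5)")            # degraded but handled
logger.error("webhook delivery failed after 5 attempts")                # needs attention
# writing everything as error, or a real error as warning, makes the error filter useless later
```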
Another mistake we often make is logging contextual data, but in an unstructured way. What I mean by unstructured is that some log lines carry, say, the function name, user ID, and IP address, but in the very next log, or later in the code, that information is missing or placed somewhere else. In the example, for the first operation I put the function, user ID, and IP in the log context; for the second operation I mention the file name and line number in the message but not the IP or user ID. That unstructured data is misleading. Say I want to find all the operations performed by user ID 1: when I search for user_id 1 I only get operation one, not operation two, because there is no user_id field in that second log line. With unstructured data you will not get all the events or operations performed by a particular user, or whatever else you are querying for. So logging should be structured, and all the agreed details should be present in every place.
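One way to keep the structure consistent is to route every event through a small helper so each line carries the same keys; the field list here is just an assumed example.

```python
import logging

logger = logging.getLogger(__name__)
FIELDS = ("request_id", "user_id", "ip", "operation")

def log_event(message, **context):
    # every record gets the same shape; missing keys are filled with None
    logger.info(message, extra={field: context.get(field) for field in FIELDS})

log_event("profile updated", request_id="req-91", user_id=1, ip="10.0.0.7", operation="update_profile")
log_event("avatar uploaded", request_id="req-91", user_id=1, ip="10.0.0.7", operation="upload_avatar")
# a search for user_id=1 now returns both operations, not just the first one
```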
The next mistake is a value type mismatch for the same key. This is also a mistake that shows up repeatedly in logging. In the example I have proper contextual data with proper keys, but the type of the value behind a given key differs. Say the count value in the first operation is the integer 15, but in the second it is the string "15"; similarly, the first log has a dictionary or JSON value for the data field, but in the second operation it is a string. What is wrong with this? Whatever software or open source tool you are using for ingestion and visualization will tend to reject the second record, stating that it is not the expected type for that field: some records arrive as integers, some as strings, so the mismatched records get dropped. Then, while visualizing or fetching the logs, those events are simply missing from the list, which is misleading: you start thinking the request never touched that particular code block, which is not true. So while logging, be careful about the type of each value.
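A small sketch of keeping the type of each field stable; the keys and values are illustrative.

```python
import json
import logging

logger = logging.getLogger(__name__)

logger.info("items fetched",  extra={"count": 15, "data": json.dumps({"page": 1})})
logger.info("items archived", extra={"count": 3,  "data": json.dumps({"page": 2})})
# avoid mixing types for the same key elsewhere, e.g. {"count": "15"} or {"data": {"page": 1}}:
# the ingestion pipeline may reject those records and the events silently disappear
```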
Now that we have discussed the common logging mistakes, let's look at the best practices, so you don't miss out on anything in your logs. First, decide what to log: write a proper, meaningful message that someone else can make sense of. People generally write vague messages like "request successful" or "operation failed." Write the message properly, so that someone else can understand from it what went wrong or what succeeded.
Also include all the necessary metadata, meaning the contextual data: the function name where the error occurred, the line number, the exact error, the name of the user making the request, their IP address, and so on, so you can get the most out of the logs. I also recommend using a trace ID or correlation ID: whenever you write a log for a particular request, include the same trace ID in every log written during that request's cycle. Then, whenever you want to see everything that happened during a request, you just need its trace ID; search for it and you get the list of all the events for that particular request. This correlation key is very useful because it also helps you correlate logs with metrics and traces; observability consists of three main things, metrics, logs, and traces. And if you have a microservice architecture, it is very useful for tracing a particular request: which microservices it went through and what operations happened in each, so you can link those events together and get the full picture.
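One way to do this in Python, as a sketch, is a contextvar plus a logging filter so every record in a request carries the same trace ID; generating a fresh UUID here stands in for reusing an incoming correlation header.

```python
import logging
import uuid
from contextvars import ContextVar

trace_id_var = ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = trace_id_var.get()    # attach the current trace id to every record
        return True

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s")
logger = logging.getLogger(__name__)
logger.addFilter(TraceIdFilter())

def handle_request():
    trace_id_var.set(str(uuid.uuid4()))         # or reuse the incoming correlation header
    logger.info("otp verification started")
    logger.info("otp verification finished")    # both lines share the same trace_id

handle_request()
```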
As we said, include all the necessary data, but that does not mean you log absolutely everything; you still need to decide what should be logged and what should not. In particular, keep in mind that you don't log PII, things like credit card numbers or someone's passwords. While writing each log, choose the proper level and keep in mind the context of that particular code block. And, as we discussed, people often keep their logs unstructured, so define the structure of your logs and stick to it: include all the details you have decided on, in every place, and it will be much easier for you to debug later.
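Here is a rough sketch of keeping PII out of the logs by masking sensitive fields before they are logged; which keys count as sensitive is an assumption you would replace with your own compliance rules.

```python
import logging

logger = logging.getLogger(__name__)
SENSITIVE_KEYS = {"password", "card_number", "cvv", "otp"}

def scrub(context: dict) -> dict:
    """Mask sensitive values instead of logging them raw."""
    return {k: ("***" if k in SENSITIVE_KEYS else v) for k, v in context.items()}

payload = {"user_id": 42, "card_number": "4111111111111111", "amount": 1999}
logger.info("payment requested", extra=scrub(payload))   # card number is masked, user_id is kept
```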
And what are the benefits of structured logs? Once you have structured logging, you can visualize all the events much better: you can see how many events occurred in a given time frame, say one hour, as a line graph, a histogram, or a heat map. Once you have good visualization and proper logging, you can see patterns in the logs and set proper alerting on top of those panels, for example "alert me if this particular event happens this many times in this time frame." Then, if particular events are failing continuously, you get to know about it, find out why, and can debug and fix it as soon as possible. When you get timely alerts and have proper logging, your incident response time improves, because you are able to handle the incident quickly and efficiently. And proper logging doesn't only help the tech team deal with issues faster; it also improves collaboration with other teams. Say the growth or business team wants to know how many events are occurring in a particular time frame. Let me give you an example from my own company.
As I mentioned, we are an incident management tool, so it is useful for us to know how many incidents are created in an hour or in thirty minutes, and how many incidents are created for a particular user or customer. We can publish this data for the sales or growth team to look at, and from there they can make their analytical decisions on how to go forward, how to run campaigns, or what else to watch out for based on those metrics.
Then, once we have structured data, we need to decide which scenarios to log. You are not going to log each and every scenario with each and every message; that is information overload. Too many logs are also a bad thing, because you will get lost in them while debugging. So decide which scenarios you are going to log. In the example, when I make a request I catch the timeout error and write an appropriate message for that specific error; then I catch the connection error and write an appropriate message for that one. If instead I catch all the errors in a single blanket exception and just log "request failed" with the data, then at debugging time I cannot tell why the request failed.
why request failed. So instead of
writing like this and one of like what I recommend,
it's like don't,
don't, don't try to catch the like get
exception, see what the
code block about. Then try to think of
all the exception that that particular core that can
occur and that particular code block and then catch
each, each individually write the appropriate message.
And if you don't know what can't go wrong
in that particular code block, it's better to leave it,
don't catch that particular assessment, let it fail. Let your application
monitoring generate
another, generate an incident and let you know this particular things
went wrong and then you come and fix a particular issue. Otherwise once
you write the blanket accession you don't know what is going
wrong in particular situation and you kind of overlook
the potential,
potential bug or something which impact your
company technically or financially. Anything can be
possible. Yeah. So and one
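As a sketch of that advice, here is the request example again with the expected failures caught individually; the URL and the use of the requests library are just for illustration.

```python
import logging
import requests

logger = logging.getLogger(__name__)

def fetch_invoice(invoice_id):
    url = f"https://api.example.com/invoices/{invoice_id}"   # hypothetical endpoint
    try:
        return requests.get(url, timeout=5).json()
    except requests.Timeout:
        logger.error("invoice fetch timed out", extra={"invoice_id": invoice_id, "url": url})
    except requests.ConnectionError:
        logger.error("invoice service unreachable", extra={"invoice_id": invoice_id, "url": url})
    # anything we did not anticipate is left to fail loudly and raise an incident,
    # instead of being swallowed by a blanket "except Exception"
```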
One of the other most important things is how you store the logs. Log storage matters because it should be neither too little nor too much. Say you are storing logs for the past seven or ten days; if you later need data from fifteen or twenty days back, you have to restore the older logs first. So set the log retention period as per your requirement, based on how many past days of data you actually need at hand, and after that period move the logs into cold storage; that helps save cost as well. Also, make a note of which keys or fields you use when searching for a particular log, and add indexes on those fields so that searching is faster. And take backups of the data, so you don't lose the logs in case of a disaster.
And then alerting. Now that we have a proper logging setup, you can set up alert rules, alerting on a particular event. Say event A occurred ten or fifteen times: the on-call engineer gets notified with an alert saying this particular event keeps failing or erroring while performing a particular operation, so they can look at it and quickly fix the issue. And once you have the logging and can visualize it well, you can also write alerts for anomaly detection. Say your traffic is usually 30 to 40 requests per second, and then suddenly at some point there are 150 to 200 requests per second; that is an anomaly, maybe a DDoS someone is running against your website. For that you can set up anomaly-detection alert rules based on your requirements, and they also help you reduce alert fatigue. Say you keep the threshold at 50 requests per second, and then your application grows well, you are onboarding more customers, and now your normal traffic is 50 to 60 requests per second. Then you update the threshold, because that is normal traffic and not a DDoS, and you keep getting meaningful alerts.
So now we have proper alerting set up, with proper logging, proper contextual data, and proper visualization: how can it be useful at 3 a.m.? Mostly, once the logging structure is in place and the visualizations are there, most of the scenarios you anticipate can go wrong are visualized properly, so you get an idea of the situation, or find out why it is happening, from the visualization itself. You don't need to go into the logs and search through each and every request; basically you can spot the anomaly from the visualization itself. And if a particular error occurs outside the error bounds you expect, or you are seeing data you did not expect, or you want to debug a specific request,
then you can use the trace ID stored in the logs. As I mentioned previously, the trace ID or correlation ID should be logged as part of the contextual data in every log line. Say you want to see everything that happened during a particular request, where it went wrong and so on: you take the trace ID of that request and search for it in your application logs, you get all the events related to that trace ID, and then you know what went wrong. Here is an example of what I mean. In one of our flows we verify a contact number by sending an OTP, and we log all the events in order as they occur: from the start, where the API is invoked; then whether the number needs to be verified or is already verified; whether the user is on the block list, for example due to some issue with the telecom provider; then we check with the telecom provider whether sending messages or phone calls to that number is supported, and whether that particular country is allowed from our side and from the network provider's side. Once all the checks pass, we send the OTP verification code. From this kind of logging you can see which events happened successfully, and if something goes wrong we will have logged which particular check failed. So when a customer comes saying "I'm not getting the verification code," we quickly check that particular request and can tell them which check is failing and what needs to be done from their side or from ours, or we can reach out to the network provider if it is failing on their side.
Now, once you have this logging set up, a lot of logs are being written and stored, and if you have too many logs you can get lost in them. So what big organizations generally do is run two parallel streams: one for info-level logs and one for debug-level logs. The info stream consists of the info, warning, and error or critical logs, while the debug stream holds all the logs at debug-level verbosity. Generally developers have access to the info logs, and the debug-level stream is accessible to the SREs or DevOps folks, whoever is managing operations. So when an issue comes in, the developer first tries to debug with the info logs; if they cannot make sense of it from those, they ask for debug access to figure out what the issue is and fix it. Similarly, the example shows that you can send a particular log level to a particular file, or configure the logger per module; it is up to you how you want to log. But be careful that you don't get lost in the logs: don't log every piece of information, don't log each and every event, log only the necessary events.
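A minimal sketch of that "two parallel streams" idea with Python's logging module; the file names and levels are illustrative, and in practice each stream would usually go to a separate pipeline rather than a local file.

```python
import logging

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)

info_handler = logging.FileHandler("app-info.log")      # what developers normally read
info_handler.setLevel(logging.INFO)

debug_handler = logging.FileHandler("app-debug.log")    # full verbosity for SRE/ops when needed
debug_handler.setLevel(logging.DEBUG)

formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
for handler in (info_handler, debug_handler):
    handler.setFormatter(formatter)
    logger.addHandler(handler)

logger.debug("cache key computed")     # only in app-debug.log
logger.info("order 1017 created")      # in both files
logger.error("payment gateway 5xx")    # in both files
```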
And how can you achieve all of this logging? There are many tools available, for example New Relic or OpenTelemetry, and you can use any one of them; they each have their own pros and cons. I have listed three of the most widely used open source stacks. One is ELK, which consists of three open source tools: Elasticsearch, Logstash, and Kibana. EFK is the same, except that instead of Logstash you use Fluentd or Fluent Bit. And then there is the Grafana Loki stack. So what is EFK? Elasticsearch, Fluentd, and Kibana: Elasticsearch is the log storage, where you can search for things quickly; Fluentd is the log collector, which collects the logs and pushes them to Elasticsearch; and Kibana sits on top and gives you the UI, the visualization on top of your logs. The Grafana Loki stack works the same way, only the components change: instead of Fluent Bit you use Promtail, Loki is your log store, and Grafana is your visualization layer. You can also use Alertmanager on top of it to set up alerting.
So that is the end from my side. I hope you got an idea of how to set up proper logging and how it can be useful during your incidents. If you have any questions, do let me know. I have shared my LinkedIn QR code on my last slide; you can connect with me and we can take it from there. Yeah, thank you, guys. Thank you for being around during my talk. Thank you.