Conf42 Observability 2024 - Online

Getting the most out of your logs at 3 AM

Abstract

Logs: we have all found ourselves knee-deep in them at 3 AM for issues with questionable customer impact, annoyed yet aware that deep down they somewhere hold the solution to our fleeting problem.

Join us as we discuss the complexities of asynchronous logs for streamlined analyses, showcase how adding context and trace-IDs to your logs can immediately halve your debugging time, and explore best practices for log analysis when you get a pesky ‘critical’ alert at 3 AM.


Summary

  • Shyamkumar Kalariya is a software engineer at Zenduty, an incident management platform. "Today I'm excited to share insights on effective logging practices. We will discuss the what and why of logging, why it's important, and the logging best practices you can follow."
  • Logging is the act of capturing and storing the events and messages generated by an application during its operation. Logging helps you monitor application health and performance, and it is useful for security auditing. It is a critical part of any online application.
  • A logging framework has four main components: loggers, handlers, filters, and formatters. The logger is the entry point that captures the log event. Filters drop or enrich events, and the formatter at the end formats all the log data into a proper structure.
  • Make sure you log the events that matter. For troubleshooting or debugging purposes, you log that a particular event has happened. The other key scenario is security auditing, which is very important for maintaining security compliance.
  • One of the biggest mistakes is ignoring the context in the log message. Another is not setting a proper log level: without the contextual data in mind, you mark a particular log as a warning when it should be an error.
  • While logging, be careful about the type of each value. I recommend using a trace ID or some other correlation ID. Once you have structured logging, you can visualize all the events much better. Proper logging also improves collaboration with other teams.
  • One of the most important things is how you store the logs. After the retention period, move them into cold storage. With proper logging, proper contextual data, and proper visualization, your logs become genuinely useful at 3 AM.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Thank you for joining my talk, "Getting the most out of your logs at 3 AM." Before we begin, let me introduce myself. I am Shyamkumar Kalariya, a software engineer at Zenduty, an incident management platform that helps teams respond to incidents and resolve them quickly. I have several years of experience in software development, and today I'm excited to share insights on effective logging practices to help you resolve issues quickly and efficiently. So let's start the talk. What are we going to discuss? We will cover the what and why of logging and why it's important, the common logging scenarios that need to be logged, the common mistakes people usually make while implementing logging, and the logging best practices you can follow. At the end, once you have a better logging implementation, we will see how those logs can be useful at 3 AM; we'll get to that in the later part of the talk. Let's start with the basics: what is logging? Logging is the act of capturing and storing the events and messages generated by an application during its operation. It gives you a trace of what happened while the application was running, which events occurred. Next, why is logging important? Logging matters when you are troubleshooting ongoing production issues: anyone who has been the on-call engineer knows how stressful and frustrating it is to be debugging in the middle of the night. Logging also helps you monitor application health and performance, for example how a particular endpoint is responding to a particular request. It is useful for security auditing, which is essentially mandatory for online systems; it helps you stay compliant and shows who can access what. And it helps you understand user behavior, how users are using your application and how it's growing, so you can optimize it or plan a better way to improve the user experience. As I mentioned, when you are troubleshooting a production issue at night, or during any ongoing incident, logs are the only source of truth that tell you what happened during a particular request: which operations ran, which functions were called, which code was executed by the request. That gives you a much better idea of what happened. For monitoring application health, logs tell you how effectively your system is working: what the latencies are, what your p99 (the 99th percentile of your response time) looks like, so you can compare what it was before and after an optimization. And security auditing is a critical part of any online application.
Now that we know why logging is important, let's look at the components of a logging framework, how we can implement it, and what the common logging scenarios are. In a logging framework there are four main components: loggers, handlers, filters, and formatters. The logger is the entry point that captures the log event. Then the handler comes into the picture; it is responsible for sending that log event to its destination. There are multiple kinds of handlers, such as a file handler or a stream handler, which write to a particular file or stream. Next come the filters, which filter events. You can drop particular types of events, or, say, for an error event you might want to attach some additional contextual data; filters are useful for that, and an event dropped at the filter stage never reaches the log file. Finally, the formatter formats the log data into whatever structure you require; if you want the log data in JSON, it converts the raw record into JSON format. That is the sample flow of a logging framework: your Python program, or whatever language you are using, generates a log message; it goes to the logger, then to the handler, which writes the logs to the file; the filter drops events according to your logic and needs; the formatter renders the record in the required format; and then it lands in the log file. Once we know what logging is and what the logging framework looks like, we can move on to the common logging scenarios, the purposes for which you use logging. Make sure you log the particular events you care about. For troubleshooting or debugging, you log that a particular event has happened; in the example, logger.debug writes a debugging message, an appropriate message at the appropriate time based on what the function is doing. Then there is error handling: where you expect some error or exception in a particular code block, you put it inside a try/catch, and when you catch the error you write a log saying this operation has failed and attach the contextual information, which is useful at debugging time. The next one is performance monitoring: where you want to measure a particular endpoint or function block, you record a start time, perform the operation, record an end time, and log them both, so you know how much time that code block is taking. And the last one is security auditing, which, as I mentioned earlier, is very important for maintaining security compliance.
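As a rough illustration of the debugging, error-handling, and performance-monitoring scenarios just mentioned, here is a minimal sketch using Python's standard logging module; the module name, the charge_card and call_payment_gateway functions, and the field names are my own placeholders, not code shown in the talk.

```python
import logging
import time

# Logger -> handler -> formatter: the basic pipeline described above.
logger = logging.getLogger("payments")            # hypothetical module name
handler = logging.StreamHandler()                 # could also be a FileHandler
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

def call_payment_gateway(user_id, amount):
    # Stand-in for a real downstream call, so the sketch runs on its own.
    return {"status": "ok"}

def charge_card(user_id, amount):
    # Debugging scenario: record that a particular event happened.
    logger.debug("charge_card invoked user_id=%s amount=%s", user_id, amount)

    # Performance-monitoring scenario: measure and log how long the block took.
    start = time.monotonic()
    try:
        # Error-handling scenario: wrap the risky block and log with context.
        result = call_payment_gateway(user_id, amount)
    except TimeoutError:
        logger.error("payment gateway timed out user_id=%s amount=%s",
                     user_id, amount)
        raise
    elapsed = time.monotonic() - start
    logger.info("charge completed user_id=%s amount=%s duration_s=%.3f",
                user_id, amount, elapsed)
    return result

charge_card(user_id=1, amount=49.99)
```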
In security auditing you monitor things like: this particular user logged in from this IP address, or this specific user performed this particular action. With security-audit logging you adhere to compliance, and during an audit you can properly show which user did what. Now that we know the common logging scenarios, let's move on to the common logging mistakes, the basic mistakes developers usually make while implementing logging. One of the biggest mistakes is ignoring the context in the log message. You write a message like "this operation has failed." At implementation time you are in a frame of mind where you know the logic and what the function is about, so you tend to miss the contextual data; you think it's a generic message anyone can understand. But once the change is in production, after ten or fifteen days, when you start debugging a particular issue, you don't know what this message is for, which request it is linked to, which user faced the issue. It's just taking up your storage; without the contextual data the log is of no use. Another mistake is not setting a proper log level. What often happens is that all the logs get written with the same level; as in the example, everything is written as error. That is confusing: is it a debugging event, or did a particular operation complete successfully? Those should be info or debug, not error, otherwise they create confusion while debugging, because you are getting multiple events as error logs and you can't filter out the real error events. The opposite also happens: by mistake you mark a particular log as a warning when it should be an error. During code review someone needs to point that out; the developer has to keep in mind what operations are performed before and after that code block and write the proper log level. One more common mistake: we do log contextual data, but that contextual data is unstructured. What I mean by unstructured is that some logs have, say, a function name, a user ID, and an IP address, but in the very next log, or in a later part of the code, those fields are missing. In the example, when I write the log for the first operation I mention the function, user ID, and IP in the log context; then in the second operation I mention the file name and line number in the message, but not the IP and user ID. That inconsistency makes the data misleading.
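One way to keep those context fields consistent on every log line is a small helper that requires them, sketched below with Python's standard logging module; the field names (user_id, ip, op) are just an example of a schema you would agree on, not something prescribed in the talk.

```python
import logging

# The format string expects the same context keys on every record.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s "
           "user_id=%(user_id)s ip=%(ip)s op=%(op)s",
    level=logging.INFO,
)
logger = logging.getLogger("orders")   # hypothetical module name

def log_event(message, *, user_id, ip, op, level=logging.INFO):
    # One helper that refuses to emit a record unless every agreed-upon
    # context field is supplied, so no log line is missing user_id or ip.
    logger.log(level, message, extra={"user_id": user_id, "ip": ip, "op": op})

log_event("operation one succeeded", user_id=1, ip="203.0.113.7", op="create_order")
log_event("operation two failed", user_id=1, ip="203.0.113.7", op="charge_card",
          level=logging.ERROR)
```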
So like when rmb, let's say I want to find what are the operation performed by the user id one. So when I search with the user id one, I'll get only the operation one. I won't get the operation to because there is no user id field in the on load. So kind of unstructured data, you won't get the, all the, all the events or all the operation performed at the particular user id or what are the, as per your requirement. So on logging should be in the structured way, all the details should be there in the old place value mismatch for SNK. So this is also kind of like a repetitive mistake that generally found out one login. So let's find the example. I have the proper sub contextual data, like with the proper key, but in the value of that particular key is type of that value is different. So let's say in the line, uh, key, uh, value in the first operation is 15 in the integer, but in the second and 15 is in the string. And similarly in the data, the first log has the dictionary order JSON value, but in the later, like in the second operation it's the uh, string value. So um, what's wrong in this is like uh, whenever like whatever the software you are using, what are the open source application or whatever the things you are using for visualization code will reject the second record stating that it's not the appropriate, uh uh, like appropriate type of value. So in the sum record is coming as integer, some record is coming in the string. So it will tend to reject the mismatched type of logs. So you like, while visualization or while fetching out the logs, this particular engine in our like list of logs. And it's kind of, kind of misleading that you tend to think that this particular code block is like particular request hasn't touched this particular code block. So, which is kind of not true. So while logging, be careful about the value, type of tech type of the value. And now since we have like discussed what are the common logging like mistakes like, we'll see what are the best practices. So you don't miss out any dogs. So I will like first decide what to know. So you decide what are, what's the proper necessary or message, which someone else can make sense. So generally people write the like vague messages like request successful operation, successful operation fail. It should be writing the proper logging method so someone else can make sense out of that message and they can get the idea what has happened wrong or what has happened successfully. Like that and include all the necessary metadata. So in the necessary metadata, meaning the contextual data, what the, what's the function name where this particular error occurred, the line number and what is the exact error occurred and all, like what's the username, what's the like, users is a name who's like making this particular request, what's this IP address and all this. And so you can get, you can get the best out of the logs and I recommend to use one trace id or any correlation id. So whenever you write a log for that particular request, include the trace id to all the logs during that request cycle. So like whenever you want to fetch, whatever events happen during the trigger cycle, you just need to get the trace id. And once you search with the test id, get the list of all the events are for that particular request. So that the correlation key, it's kind of very useful because it will help you correlate the matrix log. So in observability there are like, it's consist of three main things, matrix logs and traces, right? 
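A common way to attach such a correlation ID in Python is a context variable plus a logging filter, roughly as below; the names trace_id_var and TraceIdFilter are my own, not from the talk, and in a real service the ID would usually come from an incoming request header.

```python
import contextvars
import logging
import uuid

# Holds the trace id for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    # Copies the current trace id onto every record that passes through.
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(payload):
    # Set once at the edge; every log written while this request runs
    # carries the same trace_id, so one search returns the whole story.
    trace_id_var.set(uuid.uuid4().hex)
    logger.info("request received")
    logger.info("request processed")

handle_request({"user_id": 1})
```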
If you have a microservice architecture, correlation is especially useful: you can get the traces of a particular request, which microservices it went through and which operations happened in each microservice, then link those events and get the most detail out of that particular request. And, as we talked about, include all the necessary data, but that does not mean you include every piece of data you can access. You also need to decide what data should and should not be logged. Just keep in mind that you don't log PII data, like a credit card number or someone's ID and password. While writing a log, define the proper log level, keeping in mind the context of that particular code block. And, since we discussed how people end up with unstructured context, keep the structure of the log defined and stick to it, so you include all the details you have decided on in all the places; it will be easy for you to debug the situation. What is the benefit of structured logs? Once you have structured logging, you can visualize all the events much better: how many events occurred in a time frame, say over one hour, as a line graph, a histogram, or a heat map, so you get a better picture. Once you have that visualization and proper logging, you can see the log patterns and set proper alerting on top of them. For example: alert me if this particular event happens this many times in this time frame. Then you get to know if particular events are failing continuously and why, and you can debug and fix it as soon as possible. When you are getting timely alerts and you have proper logging, your incident response time improves, because you are able to handle the incident quickly and efficiently. Proper logging doesn't only help the tech team deal with issues fast; it also improves collaboration with other teams. Say the growth or business team wants to know how many events are occurring in a particular time frame. I'll give an example from my own company: as I mentioned, we are an incident management tool, so it's useful for us to know how many incidents are getting created in one hour or thirty minutes, and how many incidents are getting created for a particular user or customer. We can publish this data to the sales or growth team, and from there they can make their analytical decisions on how to go forward, how to run a campaign, or what else to watch out for based on those metrics. Then, once we have structured data, we need to decide which scenarios we are going to log. It's not that you log each and every scenario with each and every message; that's information overload, and too many logs are also a bad thing.
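To get logs into a shape that dashboards and alert rules can aggregate, one stdlib-only option is a small JSON formatter like the sketch below; JsonFormatter and the chosen field names are assumptions for illustration, not a specific library named in the talk.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # Renders every record as one JSON object per line, so log pipelines
    # and dashboards can parse and aggregate the fields reliably.
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry over any structured context passed via `extra=...`.
        for key in ("user_id", "trace_id", "op"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("incidents")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("incident created", extra={"user_id": 42, "op": "create_incident"})
```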
Too many logs means you will get lost while debugging, so decide which scenarios you're going to log. In the example: when I am making a request, I catch the timeout error and write an appropriate message for that particular error; then I catch the connection error and write the appropriate message for that one. You could instead catch all the errors in a single exception handler and log one message saying "request failed", but then you can't tell why the request failed. So instead of writing it like that, what I recommend is: don't catch the generic exception. Look at what the code block is about, think of all the exceptions that can occur in that particular code block, catch each one individually, and write an appropriate message for each. And if you don't know what can go wrong in that code block, it's better to leave it: don't catch that exception, let it fail, let your application monitoring generate an incident and tell you that this particular thing went wrong, and then you come and fix that particular issue. Otherwise, once you write a blanket exception handler, you don't know what is going wrong in a particular situation, and you can overlook a potential bug or something that impacts your company technically or financially; anything is possible. One of the other most important things is how you store the logs. Log storage matters because retention shouldn't be too short or too long. Say you are storing logs for the past seven or ten days; if you then need data from the past fifteen or twenty days, you have to restore the older logs. So set the log retention period as per your requirement, based on how many past days of data you need at a time, and after that period move the logs into cold storage; it helps save cost as well. Also, note which keys, which pieces of information, you use to search for a particular log, and add indexing on those keys so the searching will be faster. And take backups of the data, so you don't lose it in case of any disaster. Then alerting: now that we have a proper logging setup, you can set up alert rules for a particular event. Say event A occurred ten or fifteen times; the on-call engineer will be notified by that alert, saying this particular event is failing or erroring while performing a particular operation, so they can look and quickly fix the issue. And once you have proper logging and better visualization on top of it, you can write alerts for anomaly detection. Say your traffic is usually around 30 to 40 requests per second, and then suddenly, at some point in time, there are 150 to 200 requests per second: that is an anomaly.
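Going back to the exception-handling point for a moment, here is roughly what catching specific failure modes (rather than a blanket except) looks like; this assumes the third-party requests library, and the endpoint, function names, and field names are placeholders of my own, not code shown in the talk.

```python
import logging

import requests

logger = logging.getLogger("verification")
logging.basicConfig(level=logging.INFO)

def send_otp(phone_number):
    url = "https://sms-provider.example/send"   # placeholder endpoint
    try:
        response = requests.post(url, json={"to": phone_number}, timeout=5)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        # Specific failure mode, specific message: we know *why* it failed.
        logger.error("OTP send timed out phone=%s provider=%s", phone_number, url)
        raise
    except requests.exceptions.ConnectionError:
        logger.error("could not reach SMS provider phone=%s provider=%s",
                     phone_number, url)
        raise
    except requests.exceptions.HTTPError as exc:
        logger.error("SMS provider rejected request phone=%s status=%s",
                     phone_number, exc.response.status_code)
        raise
    # Anything not anticipated is deliberately left uncaught, so application
    # monitoring surfaces it as an incident instead of hiding it.
    logger.info("OTP sent phone=%s", phone_number)
```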
A spike like that might be a DDoS, someone hammering your website, so you can set up anomaly detection alert rules based on your requirements, and it also helps you reduce alert fatigue. Say you keep the threshold at 50 requests per second, and then your application has grown well, you are onboarding more customers, and now your normal traffic is 50 to 60 requests per second. Then you should update the threshold, because that traffic is normal growth, not a DDoS, and that way you keep getting proper alerting. Now that we have a proper alerting setup, with proper logging, proper contextual data, and proper visualization, how is it useful at 3 AM? Mostly, once the proper logging structure and all the visualizations are there, most of the scenarios you anticipate can go wrong are visualized properly, so you get an idea of the situation, and you get to know why it's happening, from the visualization itself. You don't need to go to the logs and search for each and every request; basically you can spot the anomaly from the visualization itself. If a particular error occurs outside your expected error bounds, or you are seeing data you didn't expect, or you want to debug a specific request, then you can use the trace ID stored in the logs. As I mentioned previously, the trace ID or correlation ID should be logged as part of the contextual data in every log. Say you want to look at a particular request, what happened during its request cycle and where it went wrong: you get the trace ID of that request, search it in your application logs, and you get all the events related to that trace ID, and then you get to know what went wrong. Here is the example I'll walk through. In one of our features we verify the contact number by sending an OTP, and we have logged all the events as they occur: from the start where the API is invoked; whether the number needs to be verified or is already verified; whether the user is in the block list or not (due to some issue with the telecom provider); then the checks with the telecom provider on whether sending messages or phone calls to that particular number is allowed, and whether that particular country is allowed from our side or from the network provider's side; and once all the checks pass, we send the OTP verification code. From this kind of logging you get to know which events happened successfully; if something goes wrong, we log which particular check failed. Then, when the customer comes and says "I'm not getting the verification code," we quickly check that particular request and tell them which check is failing and what needs to be done from their side or from ours, or based on that we can reach out to the network provider if it's failing on their side.
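The OTP flow described above might log its checkpoints along these lines; the check names, the stub implementations, and the way the trace ID is generated here are my own reconstruction of the example for illustration, not the speaker's actual code.

```python
import logging
import uuid

logging.basicConfig(
    format="%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("otp")

def verify_contact_number(phone_number, country):
    ctx = {"trace_id": uuid.uuid4().hex}   # same id on every event of this request

    logger.info("verification API invoked phone=%s", phone_number, extra=ctx)

    if already_verified(phone_number):
        logger.info("number already verified, skipping OTP", extra=ctx)
        return True
    if in_block_list(phone_number):
        logger.warning("check failed: number in block list", extra=ctx)
        return False
    if not country_allowed(country):
        logger.warning("check failed: country not allowed country=%s", country,
                       extra=ctx)
        return False

    logger.info("all checks passed, sending OTP", extra=ctx)
    return True

# Stand-in checks so the sketch runs; real ones would call provider APIs.
def already_verified(phone): return False
def in_block_list(phone): return False
def country_allowed(country): return True

verify_contact_number("+15550100", "US")
```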
Now, with a full logging setup in place, a lot of logs end up being emitted and stored, and if you have too many logs you can get lost in them. What big organizations generally do is run two parallel streams: one for the info-level logs and one for the debug-level logs. The info stream consists of the info, warning, error, and critical logs, and the debug stream holds all the logs with debug-level verbosity. Generally, developers have access to the info logs, while access to the debug level goes to the SRE or DevOps folks, or whoever is managing operations. When an issue comes in, the developer first tries to debug with the info logs; if they can't make enough sense of it from those, they ask for debug access to figure out and fix the issue. Similarly, the example shows that you can write a particular log level to a particular file, or configure the logger per module. It's up to you how you want to log, but be careful that you don't get lost in the logs: don't log every piece of information, don't log each and every event, log the necessary events. And how can you achieve this kind of logging? There are many applications available, for example New Relic and OpenTelemetry, and they all have their own pros and cons. I have listed the three most widely used open-source stacks. One is ELK, which consists of three open-source tools: Elasticsearch, Logstash, and Kibana. Then there is EFK, where instead of Logstash you use Fluentd or Fluent Bit. And the third is the Grafana stack. So what is EFK? Elasticsearch, Fluentd, Kibana: Elasticsearch is the log storage, where you can search for particular things quickly; Fluentd is the log collector, which collects the logs and pushes them to Elasticsearch; and Kibana, on top of that, visualizes the logs and gives you a UI over your log data. The Grafana stack works the same way, only the components change: instead of Fluent Bit you use Promtail, Loki is your log storage, and Grafana is your visualization layer. You can also use Alertmanager to set up alerts on top of it. That's the end from my side. I hope you got an idea of how to set up proper logging and how it can be useful during an incident. If you have any questions, do let me know; my LinkedIn QR code is on the slide, you can connect with me and we can go from there. Thank you, guys. Thank you for being around during my talk.
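The two-streams idea could be wired up in Python roughly like this, with one handler per verbosity level; the file names are placeholders I chose for the sketch.

```python
import logging

root = logging.getLogger()
root.setLevel(logging.DEBUG)   # let the handlers decide what to keep

# Stream 1: info and above, the day-to-day view developers look at first.
info_handler = logging.FileHandler("app-info.log")       # placeholder path
info_handler.setLevel(logging.INFO)

# Stream 2: everything, including debug verbosity, for SRE/ops deep dives.
debug_handler = logging.FileHandler("app-debug.log")     # placeholder path
debug_handler.setLevel(logging.DEBUG)

formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
for handler in (info_handler, debug_handler):
    handler.setFormatter(formatter)
    root.addHandler(handler)

# Per-module loggers inherit the root handlers, so each module's records
# land in the right stream(s) according to their level.
logging.getLogger("billing").debug("only visible in the debug stream")
logging.getLogger("billing").info("visible in both streams")
```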
...

Shyamkumar Kalariya

Software Engineer @ Zenduty

Shyamkumar Kalariya's LinkedIn account


