Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, Amir here.
I'm the CEO and co-founder at Senser.
It's a pleasure to be here as part of the Conf42 Kube Native conference.
What we're going to talk about is the deficiencies and the problems of working solely with logs when trying to solve Kubernetes incidents.
In terms of what we're going to cover: the trouble with logs in isolation and what's missing in them, what additional context looks like and why it is important (a lot of you are probably familiar with many of these elements), and then a short story, a short real-life example of an investigation. Eventually we'll wrap it all up, explain the pitfalls, and explain how to get started with more advanced techniques.
If we start by speaking about logs, in Kubernetes in particular but really in general, they're of course one of the most helpful tools in troubleshooting. But as we all know, they're not ideal, and most of the time they're not enough. A good way to think about it is that the problem with logs is that they usually require more logs. And the problem, especially with regard to Kubernetes or any distributed system, is that the rest of the logs you need are usually not placed in the same location. You need to look at another node, another pod, another element.
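As a rough illustration of what that means in practice, here is a minimal sketch using the Kubernetes Python client that pulls recent logs from every pod behind a single label selector; the `app=frontend` label and the `default` namespace are assumptions for illustration.

```python
# Minimal sketch: pull recent logs from every pod behind one label selector,
# since the logs you need rarely live on a single node or pod.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="default", label_selector="app=frontend")
for pod in pods.items:
    print(f"--- {pod.metadata.name} on node {pod.spec.node_name} ---")
    print(v1.read_namespaced_pod_log(
        name=pod.metadata.name,
        namespace="default",
        tail_lines=50,  # just the most recent lines from each pod
    ))
```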
And individual logs lack context for a few reasons. One, which is probably clear: a human wrote them, and humans are not very good at predicting the future. A log, if you think about it, is very similar to a line of code, because it is a line of code, but there's more to it. Logs exist on the basis of experience, what have I seen in the past, or on assumption, what kind of corner case am I anticipating my code flow will go through. And looking at it at the process level, even though we're sometimes very good at reviewing code changes or any kind of code addition, we're not very good at accompanying that with reviewing the logs: what context do they bring in, what is their meaning, and what are the different paths the code or the program could take?
Another aspect is that other issues are happening at the same time, whereas a log is an encapsulation of a single moment, most likely a single point in time for a single event. Other things are happening in tandem, and they matter a lot when we're trying to analyze the issue. The same goes for similar prior incidents: if we have the ability to correlate, or the ability to look at past incidents and capture that knowledge with regard to a specific log, that would be very beneficial.
And, of course, the connection to topology. If I have some way or some technique to know something about the environment, about both the infrastructure layer and the application layer, and about which elements are interconnected, that of course is not something that appears inside the log, but it's a very important aspect of doing the analysis. If I know where something happened and how it is connected to the other elements in the system, I can start to build the flow of analysis.
Essentially, my goal is to give you a little bit of inspiration, or a little bit of a methodical way of thinking about how to do this in real life. We're going to take a real example. In this example we're looking at part of our system, but it could be done in any other observability system, or aligned, from a technique perspective, with the tools you're familiar with.
What we're starting with is a customer-affecting issue. In this case, the front end, the element that is responding to or interacting with the user, is essentially returning a server error to the user. Of course that is bad, because we're not fulfilling some sort of business flow, but that is where we are starting. And the first thing we will do, if we're relying on logs, is try to find errors that have some sort of correlation to that.
We'll deep dive into the log soon, but from the customer's perspective, we've impacted our mean time to detect and our mean time to recover. There's going to be a long investigation if we're relying solely on logs. And eventually even the SREs, our people, even though they're sometimes required to do a lot of crazy things, their sanity is of course in danger: not only because the logs were not written by them, but because this entire process is pressuring to begin with, and now there are a lot of other elements within it.
So what do I mean when I say context? Context is how I take the essence of a log, which is a single moment, a single element, a single point in time, and tie it in with the fact that the system is much larger. When we're trying to put this into context, we're essentially enriching the log. The log by itself is not enough. What other elements, what other views of the system, are important for that analysis?
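To make "enriching the log" concrete, here is a minimal sketch of what an enriched log record could carry alongside the raw message: topology, metrics, and deployment context. The field names are illustrative assumptions, not a standard schema.

```python
# A minimal sketch of an "enriched" log record: the raw line plus the
# surrounding views of the system that the analysis actually needs.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class EnrichedLogEvent:
    timestamp: datetime
    message: str                    # the raw log line itself
    service: str                    # which service / pod emitted it
    node: str                       # where it ran
    upstream_dependencies: list = field(default_factory=list)  # topology view
    metrics_snapshot: dict = field(default_factory=dict)       # CPU, latency, error rate at that moment
    recent_deployments: list = field(default_factory=list)     # CI/CD changes around that time
```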
So we brought three examples here. Each and every one of them is important, and each and every one of them works together as part of the analysis.

First of all, topology. If I'm a subject matter expert, which is hard, I have a mental view, if you will, of how the system is really deployed, of what the system looks like. That's very hard to maintain: in reality we're speaking about auto-scaling systems, systems with ephemeral behavior. So it's hard to do in real life, or even as a thought experiment, but topology is a very important part. If I have a concrete, up-to-date, real-time, high-level overview of how the system is structured, which infrastructure elements are being used, like the type of nodes, where the nodes are located, what infrastructure and what type of services are being used, and how they're interconnected, I start to have a picture that allows me to understand the technical flow, and from that to move on to the next step.
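As a very rough sketch of getting a "what runs where" view, you can at least walk Services and their Endpoints through the Kubernetes API. This only covers Kubernetes-level wiring, not live traffic between services, and the namespace is an assumption.

```python
# Rough sketch: map each Service to the pods and nodes that back it,
# using only the Kubernetes API objects (Services and Endpoints).
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for svc in v1.list_namespaced_service(namespace="default").items:
    endpoints = v1.read_namespaced_endpoints(svc.metadata.name, "default")
    backends = []
    for subset in endpoints.subsets or []:
        for addr in subset.addresses or []:
            target = addr.target_ref.name if addr.target_ref else addr.ip
            backends.append(f"{target} (node {addr.node_name})")
    print(svc.metadata.name, "->", backends)
```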
Another important part is incorporating metrics into this process.
Essentially, I have a log; something could have gone wrong, and we'll speak about an example, but what happened at the metric level? Was there a performance issue in tandem with that log appearing? Were resources being consumed, or lacking, like CPU or memory? Was there a network latency issue, or an error rate that stems from a specific session or a specific API? Any of these can give me some insight into how to move forward in my decision tree, my path toward where the root cause is.
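If your metrics live in Prometheus, a sketch like the one below could pull a couple of "golden" signals around the moment the log appeared. The Prometheus URL and the metric expressions are assumptions; adjust them to whatever your stack actually exposes.

```python
# Sketch: ask Prometheus what CPU and error rate looked like around the
# timestamp of the suspicious log.
import time
import requests

PROM_URL = "http://prometheus:9090"   # assumed in-cluster address
log_ts = time.time() - 300            # e.g. the log appeared 5 minutes ago

queries = {
    "cpu":        'sum(rate(container_cpu_usage_seconds_total{pod=~"frontend.*"}[5m]))',
    "error_rate": 'sum(rate(http_requests_total{status=~"5..",service="frontend"}[5m]))',
}
for name, expr in queries.items():
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": expr, "start": log_ts - 600, "end": log_ts + 600, "step": 60},
    )
    print(name, resp.json()["data"]["result"])
```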
Another important element is the deployment history. We all know it, but one of the most important things to enrich a log with is what happened from a deployment perspective: which elements were recently deployed by the CI/CD pipeline. A lot of the time there is a strong correlation between an error in the system and a recent deployment. It could be a deployment in the form of a new image coming into the production environment, meaning a new piece of software, which makes sense, or it could be a configuration change, which is also deployed by the pipeline. All of these are elements we want to enrich our logs, or our log view, with.
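One lightweight way to surface recent rollouts is to list ReplicaSets, since each Deployment rollout creates one, and sort them by creation time. The six-hour window and the namespace are assumptions for illustration.

```python
# Sketch: what rolled out recently? Each Deployment rollout creates a new
# ReplicaSet, so recent ReplicaSets are a decent proxy for deployment history.
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

cutoff = datetime.now(timezone.utc) - timedelta(hours=6)
recent = [
    rs for rs in apps.list_namespaced_replica_set(namespace="default").items
    if rs.metadata.creation_timestamp > cutoff
]
for rs in sorted(recent, key=lambda r: r.metadata.creation_timestamp, reverse=True):
    owner = rs.metadata.owner_references[0].name if rs.metadata.owner_references else "?"
    image = rs.spec.template.spec.containers[0].image
    print(rs.metadata.creation_timestamp, owner, "->", image)
```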
Why is it hard to get this kind of context? The way I've decided to look at it is through maybe the two common ways of observing the system. There's a lot of overlap between them, but just for the benefit of the discussion: the commercial observability type of solution, and the open-source observability stack.

So it could be any one of the incumbent solutions; it involves being able to deploy the observability agents, take the logs, and retain the logs within the analysis system. There are a lot of common challenges with doing that at scale. Of course the cost, or more specifically the cost to get good coverage: sometimes you see people employing only infrastructure monitoring and part of the logs, and employing APM only in case of need. Since you don't have all of these verticals at the same time, it's hard to get to the solution.
And of course, configuring and maintaining dashboards. How am I building, to begin with, dashboards that are good enough, so that they encapsulate enough valuable data for me to go from a behavioral symptom, like a user error, to a complete analysis and being able to solve the issue?
Open-source observability stacks, and we gave a few examples here, could involve the equivalent of deploying an agent or sometimes even manually instrumenting the production environment, which is hard, not only at scale, but especially if you have various versions of software in your production environment, legacy environments, and so on. And again, the challenges here would be the coverage gaps, the blind spots, the places you weren't able to completely cover with instrumentation, and the instrumentation overhead, both the effort of covering everything and the production impact.
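To give a taste of what "manual instrumentation" means in practice, here is a minimal OpenTelemetry sketch: every flow you care about needs code like this, which is why partial coverage and blind spots are so common. The tracer name, the function, and the console exporter are assumptions for illustration.

```python
# Every flow you want visibility into needs explicit spans like this one.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-flow")  # name is an assumption

def add_to_cart(user_id: str, item_id: str):
    # each such function needs its own span if you want end-to-end coverage
    with tracer.start_as_current_span("add_to_cart") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("item.id", item_id)
        # ... actual business logic would go here ...
```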
So, spoiler: what kinds of things could help you here? We are, of course, working a lot around automated topology. Not only in our context, but in general it's a very important part of enriching the analysis environment.
So let's take a look. In the system I showed you in the beginning, we had a front-end service. That front-end service exposes a lot of services to the user. What we saw in the example, to begin with, is that we're returning an error, a server error. If we look into the log, and logs always tend to be messy and tend to be big, what we'll essentially see here is that the API that was exposed to the user and is returning an error works with the cart service in this e-commerce environment. What the log is essentially telling us is that there is a connection failure between the front-end service and the service that represents the cart. We can go and deep dive into it and get some clues about where the problem could be, but it's not very straightforward, and most likely it's never very clear.
The second thing we will do right now is try to learn a little bit about the environment. I'm going to show it at a greater zoom, but essentially, on opening the environment, we would see the services, including the front end, and we would see the rest of the services they're interacting with. What I'm trying to do first is get a glance: the front end needs the cart service, so which elements are implementing the cart service? Is there an actual database in play? Is there another layer of business logic, a business service, that encapsulates the cart behavior? If there is, and if I can understand what those elements were doing at the same time, I will for sure get closer and closer to understanding the issue.
Relying on the log alone, what would the troubleshooting look like, or what should it look like? To get started, I need to look at the log, as I just did, and I will need to assume that the issue is local, that it is happening on the node that showed that error. Then I'll need to go and scrub through the logs: which logs happened just prior to that event, and which logs happened immediately after it. The concept here is essentially proximity in time. Elements that happened just before might tell me something about the reason. Elements that happened immediately after can add data for me to move forward with the analysis. There will be a chain of events triggered immediately after, but they add information that is valuable from an analysis perspective: it could be an application error coming right along with it, a resource issue, a network problem, or whatever the challenge is.
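Here is a small sketch of the "proximity in time" idea: given the timestamp of the suspicious log, keep only entries shortly before and after it. The record shape (timestamp, message) is an assumption about your log pipeline.

```python
# Keep only log entries within a short window around the event of interest.
from datetime import datetime, timedelta

def around_event(entries, event_time: datetime,
                 before=timedelta(minutes=2), after=timedelta(minutes=2)):
    """entries: iterable of (timestamp, message) tuples from any source."""
    window_start, window_end = event_time - before, event_time + after
    return [(ts, msg) for ts, msg in entries if window_start <= ts <= window_end]

# Usage: earlier lines hint at the cause, later lines show the chain of effects.
# context = around_event(all_log_entries, datetime(2024, 5, 1, 12, 30, 0))
```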
Now I need to go and make sense out of it in the different layers. What happened from an infrastructure perspective? Was there a resource issue? What happened from an application perspective? Was there some sort of authentication problem? Was there a network error? And so on. I essentially need to work my way up from the bottom, which is the infrastructure and all the elements I control to a certain extent, all the way up to the application. Essentially, it's a process of elimination: ruling out that it could be some sort of infrastructure issue, hardware infrastructure, software infrastructure, all the way up to the application level.
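A sketch of that bottom-up elimination might look like the following: rule out node-level problems, then pod-level problems, before blaming the application. The namespace is an assumption.

```python
# Bottom-up elimination: infrastructure first, then workloads, then the app.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Layer 1: infrastructure -- any node not Ready, or under pressure?
for node in v1.list_node().items:
    for cond in node.status.conditions:
        bad = (cond.type == "Ready" and cond.status != "True") or \
              (cond.type != "Ready" and cond.status == "True")
        if bad:
            print("node issue:", node.metadata.name, cond.type, cond.status)

# Layer 2: workloads -- pods restarting or not ready?
for pod in v1.list_namespaced_pod(namespace="default").items:
    for cs in pod.status.container_statuses or []:
        if cs.restart_count > 0 or not cs.ready:
            print("pod issue:", pod.metadata.name, cs.name,
                  "restarts:", cs.restart_count)

# Layer 3: only then dig into application logs for the remaining suspects.
```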
And eventually, the outcome is that this type of troubleshooting is very cumbersome, and it's also very time-consuming. I need to go and test these cases; I need to verify my thesis against what I'm seeing in real life. In the end, it's very similar to triangulating: I have various points in the plane, and I'm trying to triangulate where it might have started. And of course, most of the time, especially in a Kubernetes-based, distributed system, the problem I'm seeing is just the after-effect. The first element I'm seeing is not the source of the problem; it's already stage two, if you will. I'm experiencing an issue, but there is a neighboring node or a neighboring service that is the source of all evil.
Adding context: what type of contextual enrichment could be done here? We said the first stage was understanding the environment. So we looked at the topology, a real-time topology, and we understood what is connected and what is not. The second stage would be starting to understand what the connectivity looks like and what I can learn from it in order to move myself forward. What I chose to do in this case: if I'm the front end and I'm returning an error to the user for a cart API, I most likely use some sort of internal service that implements the cart. So let's jump forward and understand how to do that, assuming in this case that the issue stems from a neighboring service. Let's try to understand how we're connecting to it.
So what I've done here: I quickly filtered, using an ability here to look at the APIs the front end is a client of, which services the front end is using internally. And here is the list of APIs. I know from the log that there was a cart issue, so here is the cart service. And if this is the cart service, it's interesting, because I can start to understand which services are working with me, and I can now look at these APIs and try to understand whether they have errors as well, which, as you'll see here, they do. This was a very important step, because essentially two things happened: I've cleared the front end from being the source of the issue, and I've moved myself to the next element in the chain. I know where to focus now. Once I've made this jump, I've moved to the cart service.
Another interesting aspect of incorporating topology is that if the topology is built out to the point we're seeing here, where we also have live feedback about the connectivity and the liveness of the various layers between the elements, I can already stop here, because we can see that the front end is going through that gRPC transaction, the one we saw the log of, to the cart service. We can also immediately see that the cart service lost its connection, in this case to the Redis cart store.
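Without a topology view, a quick sanity check of that last hop might look like the sketch below: can the Redis instance backing the cart be reached at all? The `redis-cart` hostname and port 6379 are assumptions; adjust them to your environment.

```python
# Quick connectivity check against the Redis backing store of the cart.
import redis

try:
    r = redis.Redis(host="redis-cart", port=6379, socket_timeout=2)
    r.ping()
    print("Redis reachable -- look elsewhere for the fault")
except redis.exceptions.ConnectionError as err:
    print("Redis unreachable from here:", err)
```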
At that point, we're already 90 percent of the way to solving the issue. So the important part here was how I incorporate various different views of the system, not only to move myself forward, but also to reason about what is probably happening. I started from the front end and looked at the cart service. We know there's an error there now. OK, that completes the picture: I understand where the loss of connectivity, whether at the application level, the networking level, or the infrastructure level, has taken place.
Common pitfalls and how to avoid them. We spoke about a lot of different strategies we can take, but first, incomplete instrumentation. I want to be super clear: OpenTelemetry and distributed tracing are a very important part of the tool set. It's also, I would say, a very complete, or at least very beneficial, strategy for understanding problems at the level of a technical flow, even at the message level or the transaction level. But a lot of the time what I see, and I believe most of you are seeing as well, is that environments are only partially instrumented. There will be elements, there will be flows, that are instrumented, but there are a lot of blind spots. And when there is a blind spot in the instrumentation, it's very hard to fill the gap. Here, eBPF will be, and already is, a very important element, because of the ability to auto-instrument the environment, to attach to the existing tech stack and do that instead of going and manually putting instrumentation inside of the code.
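As a minimal taste of the eBPF idea, the sketch below uses BCC to attach a kprobe to tcp_v4_connect and watch outbound TCP connection attempts on the node, with no change to application code. It requires root and BCC installed, and it's a toy: real tools and products build far richer views on the same mechanism.

```python
# Toy eBPF example with BCC: trace every outbound TCP connect attempt.
from bcc import BPF

prog = r"""
int kprobe__tcp_v4_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("tcp_v4_connect by pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=prog)
b.trace_print()   # stream the events; Ctrl-C to stop
```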
Another aspect that is usually a pitfall is the reliance on tribal knowledge. Essentially, a system that relies only on what is already encoded in its code, its history, and its automation is a way of blinding yourself. So you need to make sure that the analysis takes previous incidents into account, but that it also has enough degrees of freedom, or enough new data, for you to cope with new situations.
The last one is incorrect heuristics. Essentially, if you're looking at a subset of the picture, looking, I would say, through a small hole into the system, you're lacking the wider system context. Troubleshooting is often focused on a single point, on a single node or a single service in which the issue was observed, but in most cases that's not what's happening in real life in a distributed system. You will almost always jump to the next node, to the neighboring service, and the root cause will most probably be there. It could be a node or a service under your responsibility, or it could be a third party, an external service, or a cloud service. So that has to be the assumption, and it has to be something you keep in mind and focus on, because that's most likely where things stem from.

Tips for getting started.
As I said, regardless of the specific system, try to have a very strict process for thinking about and incorporating topology: how topology comes into play, how you're using it, not only to ramp up or onboard new people, new engineers, and educate them, but also as part of the analysis process. How am I using topology? How am I gaining some sort of reasoning, some sort of logic, around how what I'm seeing is connected to the system flow? Then the use of metrics: how to choose the right metrics, what the golden metrics are that I'm applying to the issue analysis. And deployment: as we said, a lot of the time a configuration change or a software change is going to be, either directly or indirectly, the reason for the fault.
So that's a very important part of enriching the log with context. Adopt the layered approach and try to go bottom-up: is it an infrastructure, node-level issue? Is it a specific service issue? If not, use the elements we spoke about before: what is the surrounding environment, what is the high-level insight I could get and understand? Don't try to solve the issue at that stage; instead, ask what would be the most logical next step to take in order to move forward and continue the analysis.
And, as I said, avoid the context-building pitfalls. Consider approaches that give complete coverage, or services that automate your topology, to avoid manual instrumentation and the pitfall of not having end-to-end instrumentation in the environment. And of course, as you've seen with the elements we're working on and developing, AIOps in general: what kind of artificial intelligence elements could be added to the analysis part, at the metric level, the topology level, or the deployment level, to help automate getting to the root cause.
I hope you enjoyed this. It was a very enjoyable and wonderful event for me. I wish you all a great day, and hopefully you'll spend more time deploying new production environments and new production capabilities, and less time analyzing.