Conf42 DevOps 2025 - Online

- premiere 5PM GMT

Kubernetes Minus the Blind Spots


Abstract

In this session, we’ll share practical strategies for combining topology insights with existing observability tools to pinpoint issues faster and resolve them with confidence.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Well, hi, everyone. Amir here from Senser, I'm the co-founder and chief product officer. I'm very happy to be part of this session at Conf42 DevOps 2025. The name of this session is Kubernetes Minus the Blind Spots. We're going to focus on how to solve our Kubernetes observability challenges, and how that becomes a little bit easier when you can incorporate real-time topology.

So what are we going to cover today? First of all, most of you are Kubernetes users, so you know it's great, but it's also very hard to troubleshoot. We will delve into why it's hard to troubleshoot. Going a little further, we'll connect that with the challenges of the traditional observability toolkit, meaning the classical logs, metrics and traces, and why that alone is not enough in modern dynamic environments. We're going to cover the concept of real-time topology: how it differs from dependency mapping, what to do with it, why, and how to implement it in real life. We'll give a few real-world use cases and try to tie the concept of topology into one of the tools in your toolkit with regard to the troubleshooting process. We will see why eBPF is a very important enabler in that respect, and why the use of eBPF-based tools and open source projects can yield very high benefits for implementing real-time topology. And eventually, a few tips on how to get started, whether with a commercial tool or with an open source one.

So, as we know, Kubernetes is great. We use it, we understand why we use it, and sometimes we don't. But it brings a lot of interesting challenges with regard to troubleshooting. There are a few reasons for that.

First and foremost, the distributed nature. There's no need to explain much: when we used to implement monoliths, they had a lot of problems from an operational perspective, and a lot of problems around managing these big pieces of code, but in other aspects they were closed and easier to debug. With the distributed nature of Kubernetes, a lot of things happen. First of all, we have many more moving parts, and Kubernetes is uniquely situated because it's a software layer, but it's also a kind of software infrastructure layer. In reality, when there's an issue in a Kubernetes-based deployment, it takes time, and it's not that easy, to understand whether it is an application issue or a Kubernetes issue, meaning how do you deal with this blurred line between application and infrastructure issues?

The second one would be the dynamics, the ephemeral nature of Kubernetes environments: pods are coming up, pods are going down, CI/CD pipelines are causing configuration changes and code changes. It's of course not that easy to track all of that, which also adds more complexity when we're trying to cope with troubleshooting.

Complex networking. A lot of what Kubernetes does is create abstractions upon abstractions upon abstractions. You can deploy two pods on the same server, or deploy two pods on adjacent servers, and Kubernetes will try its best to make it seem like they're all part of the same application. In real life, that can create a lot of troubleshooting headaches.
So elements like DNS, proxies, or the CNI plugin, which eventually create a flat networking model between pods, also add to the troubleshooting complexity. Eventually it's not that easy to sniff the network, or look at each and every packet and understand what is going on, simply because of the sheer abstraction layers that Kubernetes imposes on the deployment.

Shared resources and noisy neighbors. If we were in a world in which I have complete control over the infrastructure, then when I'm looking at a pod, or at an application as a whole, I wouldn't need to account for what the infrastructure is doing, whether directly or because of another deployment that is situated on the same node and by itself can cause metrics and other telemetry to be very noisy. All of that adds to the aggregate complexity.

Black boxes, if you will: things that I don't really control. These could be infrastructure pieces, or third-party pods and containers that I'm using. And automation complexity by itself. We touched on it, and we've had other sessions in the past about how the CI/CD pipeline by itself is essentially a noise generator when it comes to troubleshooting. I'm not only trying to shoot at the target; the target is, all the while, also moving.

So in very high-level terms, that should give you a glance at the different elements which are the cause of complexity with regard to troubleshooting. We gave an example of logs on the right-hand side. I think logs are a very good example when speaking about distributed environments, because you're looking at a single point in time in a single element. And I always say it: the problem with logs is that they require more logs, and rarely in the same place.

Now, the challenges with the observability toolkits as they exist today. A lot of these observability toolkits were actually born, raised and nurtured for the world of the monolith. A lot of the things that started there, like metrics, logs and traces, really come from a world in which you had a complete view over your application or your environment, and as we touched on before, that is something which massively changes when looking at a distributed environment.

Metrics. Everybody knows the concept: we're looking at some sort of quantity or number, usually there to track some performance telemetry aspect. They allow you to track trends or see absolute values, but eventually they're just numbers. You can't do much with them if you don't have some higher context for what they mean at the current time or in the current situation. For example, what's causing this spike in CPU usage? There are a lot of other questions you need to answer that will help you understand whether the spike in CPU usage is something worrisome or just normal behavior of the application.
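To make that gap concrete, here is a minimal, hedged sketch: pull the raw number behind a "CPU spike" from Prometheus, then ask Kubernetes what actually happened to that pod around the same time. The Prometheus URL, namespace and pod name are placeholders, and it assumes the standard cAdvisor metric plus the official requests and kubernetes Python packages; it's an illustration of the missing context, not a recommended tool.

```python
# Hedged sketch: a metric alone vs. the Kubernetes context around it.
# Assumes a reachable Prometheus and hypothetical pod/namespace names.
import requests
from kubernetes import client, config

PROM = "http://prometheus.monitoring:9090"   # placeholder Prometheus endpoint
NS, POD = "shop", "checkout-7d9f5"           # hypothetical workload

# The "just a number" part: CPU usage rate for one pod (standard cAdvisor metric).
query = f'rate(container_cpu_usage_seconds_total{{namespace="{NS}",pod="{POD}"}}[5m])'
cpu = requests.get(f"{PROM}/api/v1/query", params={"query": query}).json()
print("cpu samples:", cpu["data"]["result"])

# The missing context: recent events on the same pod (restarts, probe failures, scheduling).
config.load_kube_config()
events = client.CoreV1Api().list_namespaced_event(
    NS, field_selector=f"involvedObject.name={POD}"
)
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)
```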
Logs, for their part, can provide you some granular detail about an issue, but they're fragmented. You see bits and pieces of logs from various elements in the system, and there's no one other than an engineer who can tie together whether log A at point A and log B at point B have anything to do with one another or, generally, whether they are correlated to the same issue. For example, I have an issue in a pod: what would that mean for a downstream element? For sure we'll see logs there, but how do I make the logical connection between them? A good example would be losing a connection to a database from a client pod, and then seeing the corresponding log on the database pod. There is a connection to be made there.

Traces, which are often used to tackle a lot of these types of challenges, show service interactions. The real question is how you put them in place, and what the effort is to put them in place, meaning essentially instrumenting the environment. It could be easy for some elements which already support it, but for legacy code, or for other types of code bases, it could be a challenge by itself. eBPF, of course, is a technology that can help a lot with that. But achieving really traceable code in a high-scale environment is not that easy; it's very similar to getting good coverage on code. So essentially, what we see in a lot of environments is that people want to instrument the environment, but the gap to get the environment fully instrumented is very, very large.

Moving forward, let's speak about the concept of real-time topology. Just for a basic understanding: all of us always have some sort of a chart. It could be something we drew, it could be an architecture slide or an architecture chart of the environment. The reason behind that is to give us the overall context. There are a lot of moving parts, but if I want to capture at a glance what the system is doing and how it's built in high-level terms, the concept of using a chart is very, very natural for all of us.

Another concept which evolved throughout the years is the dependency map: instead of a static chart, a dynamically generated chart, updated at some slow time interval, which creates a view that mimics what the environment looks like, at least over the last couple of minutes or hours. Real-time topology, as we will touch on it, is the evolution of that into something which is fully dynamic.

Real-time topology in that sense should provide us the context required to eliminate blind spots. I want to generate more bits and pieces of context, and help myself understand how the metrics, traces and logs are really connected in real life, and what should come to mind. For that, real-time topology is one of the most important tools. What is it? If we try to summarize it, it's a map of system relationships. In the Kubernetes case it will encompass all the cluster layers and the namespace layer, going down to the more granular elements like pods and services. If it's a complex enough and smart enough real-time topology, it will also incorporate the infrastructure elements each workload is running on.
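As a rough illustration of what such a map might look like as data, here is a minimal sketch of a layered topology model. These are just plain Python dataclasses with made-up node kinds, names and edge attributes, not any particular product's schema; the edge detail (protocol, observed endpoint) anticipates the interaction level discussed next.

```python
# Hedged sketch of a real-time topology data model: layered nodes plus
# edges that carry protocol/API detail. All names and kinds are illustrative.
from dataclasses import dataclass, field

@dataclass
class Node:
    uid: str
    kind: str                    # "cluster" | "namespace" | "deployment" | "pod" | "service" | "external"
    name: str
    parent: str | None = None    # hierarchy: pod -> namespace -> cluster

@dataclass
class Edge:
    src: str                     # uid of the caller
    dst: str                     # uid of the callee
    protocol: str                # e.g. "HTTP", "gRPC", "PostgreSQL"
    endpoint: str | None = None  # e.g. "GET /checkout", the observed API

@dataclass
class Topology:
    nodes: dict[str, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def upstream_of(self, uid: str) -> list[Node]:
        """Everything the given workload calls (its direct dependencies)."""
        return [self.nodes[e.dst] for e in self.edges if e.src == uid]

# Tiny example: a checkout pod talking to a database pod in the same namespace.
topo = Topology()
topo.nodes["ns/shop"] = Node("ns/shop", "namespace", "shop", parent="cluster/prod")
topo.nodes["pod/checkout"] = Node("pod/checkout", "pod", "checkout-7d9f5", parent="ns/shop")
topo.nodes["pod/db"] = Node("pod/db", "pod", "postgres-0", parent="ns/shop")
topo.edges.append(Edge("pod/checkout", "pod/db", "PostgreSQL", "SELECT orders"))
print([n.name for n in topo.upstream_of("pod/checkout")])
```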
It can go even further and show the interactions, and not only the interaction as a line from here to there, but the actual types of APIs, information and transactions that are traversing between the elements.

Why should it matter to us? The ability to make this logical connection gives us something we would have been doing in the troubleshooting process anyway. If I have an issue, let's say an API which now returns an error, what I would do as an engineer is start to backtrace it and try to follow a logical line. My API is returning an error to the user, so it's probably experiencing some sort of issue by itself. What would be the cause of that issue? Let's look at the logs, let's look at the metrics. If I understand that, let's backtrace one step before that: could it be that something like a database transaction couldn't be completed, or couldn't retrieve the value, and so on and so forth. We're essentially starting to build a path on a graph, a path on the topology. The concept should be clear, because we're mimicking the natural way of thinking for engineers, which is backtracing. That allows us to connect individual points into some sort of fuller view.

How does it work? Essentially, we're continuously updating the real-time topology to reflect the real-time changes: new services, new deployments, new interactions. That allows us to grasp this massive amount of data in a way which is easy to present.

Let's give an example. What we see in the following example is a real-time topology, in this case generated by eBPF, which allows us to track external services, things which are not part of our deployment, and also to deep dive into the structure of Kubernetes: looking at namespaces, at the deployments and the pods, and even at the interactions between them down to the protocol or API level. So if something happened here, being able to project metric evidence, log evidence and trace evidence onto the topology could explain a lot of things or, as part of the elimination process in troubleshooting, allow us to focus on the right things.

Let's take an example: a Kubernetes pod which is failing on readiness. That could happen for a lot of reasons, and let's say it's failing due to I/O, due to storage. It could be local storage, and it could, by the way, be external storage: a database in that aspect is storage, and any sort of external repository would be storage in that case. Being able to see that on a graph, being able to connect the dots, understanding that there is a correlation between how this pod's readiness probe is behaving and an element it is connected to, is much easier to deduce when you have a real-time map: I'm connected to this database, this storage element, and so on, so it's logical for me to look at that. Mapping the real topology, or mapping the pod's dependencies, allows us to start to shrink the number of elements that we're looking at, and in that case it allows us to understand that we need to focus on certain aspects.
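For the readiness example above, here is a rough sketch of the "one hop upstream" check in code: ask why the pod is not Ready and then look at the storage it mounts. It assumes the official kubernetes Python client and a working kubeconfig, and the namespace and pod name are hypothetical; it only covers PersistentVolumeClaims, not external databases.

```python
# Hedged sketch: when a readiness probe fails, look one hop upstream at the
# storage the pod depends on. Names are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NS, POD = "shop", "checkout-7d9f5"            # hypothetical workload
pod = core.read_namespaced_pod(POD, NS)

# Why is the pod not Ready?
for cond in pod.status.conditions or []:
    if cond.type == "Ready":
        print("Ready:", cond.status, cond.reason, cond.message)

# Which storage does it depend on? (one class of upstream dependency)
for vol in pod.spec.volumes or []:
    if vol.persistent_volume_claim:
        pvc = core.read_namespaced_persistent_volume_claim(
            vol.persistent_volume_claim.claim_name, NS
        )
        print("PVC", pvc.metadata.name, "phase:", pvc.status.phase)
```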
In this example, that means zooming in on the storage element, which is really the root cause, and then eventually deciding what to do: either adjust the probe configuration, change other configuration, or move to a different storage setup.

Another example, which a lot of you have faced in the past, is throttling, especially throttling when there are changes in the dynamics of the deployment, scaling up or down. In this scenario we're speaking about an application which is under high traffic. In a lot of deployments there will be a connection between the amount of traffic and the size of the deployment, through a load balancer or through other means, but essentially what we're looking at here is the triggering of an HPA, a Horizontal Pod Autoscaler. Eventually, a lot of the time, the deployment will scale, but that doesn't mean the infrastructure scales at the same pace, or is ready enough to absorb the scaling. A lot of the time in this situation we see things like throttling: the number of pods has changed, but we're looking at throttling from the infrastructure perspective, and even though we scaled, we get into a scenario with malfunctions, or the specific pods are not able to handle the load.

How will real-time topology help here? Of course, by being able to visually see that: seeing the different interactions, visualizing the resource allocation, whether more nodes or node groups are being allocated as part of the pod scaling. That would enable teams, together with the other observability telemetry data, to either adjust the resources, limit the bandwidth, or do whatever it takes to solve it. But just the ability to understand that this is an infrastructure issue and not an application issue is of course already at least 50 percent of the way towards the solution.

Misconfigured network policies, something that we've all seen and suffered from. I think it's a very easy case for understanding how topology can help, because real-time topology is in many aspects very similar to a network topology map. And as we spoke about before, real-time topology is not only logical connectivity; it could also be infrastructure connectivity, or any other kind of connectivity, but I think this one is very easy to comprehend. So, for example, we have a pod or a microservice deployed, and it fails to communicate with another service. It doesn't even have to happen consistently; it could happen sporadically, and it could happen because of a configuration change. In real life, the ways of getting into this scenario could be almost infinite. If it's happening because of a network policy, it would be very hard, without real-time topology, to either comprehend that that is the issue or, more than that, understand which adjacent services I should look at. It could also be very hard to debug only with something like logs: there are a lot of sessions going on, and it could be any single failing session that gives us the issue here. So the lack of detailed logs, or visibility, or traces, if that specific path is not instrumented in the deployment, makes it a very hard scenario to diagnose.
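One small, hedged sketch of the network policy case: given a pod that suddenly can't talk to a peer, list which NetworkPolicies actually select it. It uses the official kubernetes Python client, the names are placeholders, and for brevity it only checks matchLabels, not matchExpressions.

```python
# Hedged sketch: find which NetworkPolicies select a given pod.
from kubernetes import client, config

config.load_kube_config()
core, net = client.CoreV1Api(), client.NetworkingV1Api()

NS, POD = "shop", "checkout-7d9f5"            # hypothetical workload
pod_labels = core.read_namespaced_pod(POD, NS).metadata.labels or {}

for pol in net.list_namespaced_network_policy(NS).items:
    selector = pol.spec.pod_selector.match_labels or {}
    # A policy applies to the pod if every selector label matches the pod's labels.
    # (matchExpressions are ignored here to keep the sketch short.)
    if all(pod_labels.get(k) == v for k, v in selector.items()):
        print(f"policy {pol.metadata.name} selects {POD}; "
              f"policy types: {pol.spec.policy_types}")
```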
But having the service-to-service mapping here, looking at the APIs, maybe even marking on the topology itself when there's an issue on one of these paths, would be another great way to handle it. The ability to show that, or even to realize the issue might be here because I see logical connectivity, or I thought there should be logical connectivity but there is none, means this could really be a network policy issue. It's a great example of the use of topology, and the amount of automation you can put around it can expose a lot of these cases.

So what enables real-time topology? Essentially, there are a few ways to implement it. It could be done with things like distributed traces, if the environment is instrumented: following the distributed trace paths can be a way to create real-time topology. The one we're focusing on here is, of course, the use of eBPF, and with it the ability to dynamically create a real-time topology.

eBPF: I think most of you already understand it. It's a piece of technology natively supported by the Linux kernel, extended Berkeley Packet Filter. Essentially, the way to think about it is as a small VM, or a feature within the kernel, that allows us to safely run low-level instrumentation code, bits and pieces of code which on one hand instrument the kernel and on the other hand bring data to user space, kind of benefiting from both worlds.

Main uses: of course, traffic analysis. That was the original use of eBPF, the ability to trace network requests, messages and packets, to sniff them. That's the most classical use, and the importance for topology is the ability to track APIs and, essentially, any kind of session. Then infrastructure, application and performance monitoring: it's not only the ability to hook into the networking stack, but also the ability to hook into kernel calls or application calls. These are very important because they allow eBPF to create a more granular level of instrumentation. Which kernel calls are being made could be important for a lot of use cases. Which application calls are being issued could be important for things like opening encrypted streams in HTTP/2, or the use of open source libraries, and tracing, which is essentially how to auto-instrument pieces of code: if I can find those function calls in memory, I can add the trace point and enable that instrumentation. And all of these things which are very relevant to observability are naturally very important for security as well, with a change of use case: networking through enforced network policies, and infrastructure and application calls to detect misbehavior with regard to things like open source usage, kernel usage or resource usage. A lot of similar building blocks, but the use cases on top are very different.

So why does it matter? eBPF enables deep observability, and it's also native in Linux. From my perspective, that's maybe one of the most important things: it's the native tool put there for people implementing this type of mechanism. It's invaluable for complex distributed environments, from both a security and an observability perspective. So why is eBPF ideal for distributed systems? There are a lot of building blocks to that.
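To make the raw eBPF signal concrete, here is a minimal sketch using the BCC Python bindings (assumed installed, run as root on a node) that hooks the kernel's tcp_connect and prints every outbound connection attempt. A topology builder would then map those addresses back to pods via the CNI's IP allocations; this is a toy, not a production collector.

```python
# Hedged sketch: observe outbound TCP connections on a node with eBPF (BCC).
from bcc import BPF
import socket, struct

prog = r"""
#include <uapi/linux/ptrace.h>
#include <net/sock.h>

// Fires on every outbound TCP connection attempt on this node.
int trace_connect(struct pt_regs *ctx, struct sock *sk) {
    u32 daddr = sk->__sk_common.skc_daddr;   // destination IPv4, network byte order
    u16 dport = sk->__sk_common.skc_dport;   // destination port, network byte order
    bpf_trace_printk("conn daddr=%u dport=%u\n", daddr, dport);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_connect", fn_name="trace_connect")

# Stream events; each line is one observed connection that a topology
# builder could join against pod IPs to draw edges.
while True:
    try:
        _, _, _, _, _, msg = b.trace_fields()
        fields = dict(kv.split("=") for kv in msg.decode().split() if "=" in kv)
        ip = socket.inet_ntoa(struct.pack("I", int(fields["daddr"])))
        port = socket.ntohs(int(fields["dport"]))
        print(f"outbound connection to {ip}:{port}")
    except KeyboardInterrupt:
        break
```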
Let me put a little bit of a caveat here: it is also very dependent on the implementation itself, but at the forefront, if you're doing it right, there is the ability to implement low-latency monitoring. Small pieces of eBPF code, bringing only the relevant data, capturing the right elements from the network packets, can allow us to discover the environment and understand the connectivity without creating latency overhead or other effects which would impact the overall performance of the application.

Enhanced visibility: the ability to go and discover, or instrument, specific elements which are important for us. It could be distributed queues, it could be database transactions, it could be whatever is important enough for us to deep dive into and for which we don't necessarily have an alternative way to monitor at a fine, granular level.

Improved efficiency: as a building block for security, observability, what have you, it can reduce the need for other, specific diagnostic tools. Being able to use eBPF as a building block to build the rest of the chain makes a lot of sense. We already see it, by the way, not only from a visibility perspective: eBPF is being used on the platform side to implement many things, whether load balancers, service meshes, observability tools or security tools.

And scalability: this being a natively supported Linux technology, basically automatically enabled on each and every node, is super important. On a different note, the ability to implement it not only for physical nodes but, for example, for other ephemeral technologies, with sidecars or things like that, can allow us to create a more consistent observability approach.

Auto-discovery: here we're going one layer above. It's not only auto-discovering the data, it's how we connect this data to a data model, something which takes it one layer up and would allow us to start speaking about things like hierarchies or structure, a more formal way of defining it. And of course, with that, things like the use of machine learning or pattern matching, in order to take the data from Kubernetes and the data that we get from eBPF and match or mesh them together, creating not only a dynamic graph but putting this data modeling on top of it, essentially creating something which is repeatable: this is a pod, this is a new pod or a new service, we have a node group, or we have identified a part of our application which is a database or business logic. And then, again and again, even when the data changes, we still have it in the same format, without requiring any manual updates.

Common pitfalls. Even though the concept is relatively simple, there are still a lot of ways to go in the wrong direction. One: incomplete instrumentation. Eventually we're trying to create a graph, or look at a chart, and the chart will only be as good as the data behind it. Missing critical data like logs, APIs or events will create an incomplete view and it will not be used. So it's important to think not only about what data should go in there, but also about which parts are the most important to have there to begin with; eBPF can, of course, help with collecting it.
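On that question of what data to feed the map, the Kubernetes metadata discussed in the tips below (labels, owners, placement, IPs) is usually the first thing to join against the eBPF-observed connections. Here is a rough sketch of gathering that side of the join with the official kubernetes Python client; the field choices are illustrative, not a fixed schema.

```python
# Hedged sketch: collect the Kubernetes metadata used to enrich topology nodes.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

nodes = {}
for pod in v1.list_pod_for_all_namespaces().items:
    owner = pod.metadata.owner_references[0].name if pod.metadata.owner_references else None
    nodes[pod.metadata.uid] = {
        "name": pod.metadata.name,
        "namespace": pod.metadata.namespace,
        "labels": pod.metadata.labels or {},
        "owner": owner,               # e.g. the ReplicaSet/StatefulSet behind the pod
        "node": pod.spec.node_name,   # physical placement, for infrastructure context
        "pod_ip": pod.status.pod_ip,  # join key against eBPF-observed connections
    }

print(f"discovered {len(nodes)} pods with metadata for topology enrichment")
```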
Stale topology maps. In many aspects, dependency maps suffer from exactly that problem. Kubernetes environments, or any distributed environments, are constantly changing. If the topology updates are not automatic, or are not happening at a quick enough cadence, then the risk is, of course, making decisions today on yesterday's data.

Assuming the topology can solve everything. Maybe the most trivial one: topology is very powerful for troubleshooting, but it's rarely a smoking gun. What it gives you is the extra context you need to go in the direction of the root cause, but by itself it is not the root cause; it's a way to formalize or model the observability data. A layer above that is the entire concept of AIOps: assuming I have that, and assuming I can extract this context, can I now go and automate the root cause process, or at least get to a process which lowers the amount of alerts I have and points me in the direction of the solution?

Tips for getting started, whether you want to do it by yourself, with a combination of open source tools, or with commercial tools. Remember the metadata. You can choose whatever strategy you like for creating the dynamic topology, but the Kubernetes metadata, things like labels, names, config maps or events, is eventually what populates the real-time topology with the most important context: I have the structure, and now I also have everything on top of it.

Automate the topology map. When you're looking and scrubbing for solutions that dynamically update the topology, think about these elements: how fast are they updating? Do they take into consideration the Kubernetes data and the infrastructure data? How quickly do they respond to frequent state changes?

And eventually, avoid the context-modeling pitfalls. Environments are large, environments are complex. It might be beneficial not to try to model everything, and not to try to put everything in there to begin with. It's our role as engineers to apply the Pareto rule: what are the 80 percent of things that are important enough for me to look at, and then expand the process. It could be good enough to start with something relatively small but sufficient to give me the view that I need, and then go in and enlarge it as the system evolves or as I have more resources to do so.

I hope you enjoyed that. If there are any questions, just shoot us an email. I really enjoyed having this session with you. Thank you.

Amir Krayden

Co-Founder & CPO @ Senser



