Transcript
Hello everyone, I'm Shai Almog. Today we'll talk about
debugging at scale in production, specifically about kubernetes
debugging. I'm assuming everyone here knows the basics
of kubernetes, so I will dive right into a basic
problem description. But first, let me say a couple of things about me.
I wrote a few books, including one about debugging, which covers
all of the topics we'll discuss here. I've worked in
this industry for decades, in many companies,
and as a consultant. You can contact me over my socials
listed here and follow me on LinkedIn, Twitter,
etc. My DMs are open, but please try just tweeting at me, and if I don't answer, I just missed it with all the flood of messages I keep getting, so please try again. Also, I'm on Mastodon, et cetera, so you can reach me there too. This is my Apress book, titled
Practical Debugging at Scale. I've put a lot of work into
this book and I can guarantee there's nothing out there like
it. I hope you check it out.
Everything in today's talk is there, and a
lot more. I gave a version of this talk in
the past. After I was done, they generated this absolutely
spectacular mind map of my presentation.
This is about an older version of this talk, but I
still like this a lot. I wish I could draw stuff like this. I still
have it because it's pretty close to the new talk.
One of the first things I hear when I talk to some developers
is why debug production?
Isn't that the problem of the DevOps team?
It's like they're a part of a completely different
universe. We write the software and they
make it run. We live in silos where production
isn't our problem. It's that other monkey
that's responsible for that, not me.
What we're building here isn't perfectly
running unit tests. It's a product that's running
in, you guessed it, production.
The thing is that when we say failure,
we aren't talking about a crash. This is one
of those things that we seem to associate with production bugs,
even though that's a small part of the problem.
When a bug gets into production, it's often the worst kind of bug: one that only happens to user X when the moonlight hits the water at 37.5 degrees on a Tuesday.
To make it worse, I don't think I'm breaking any news by
saying that kubernetes debugging is hard. I'd argue
this is an understatement, at least for some of
the problems we run into. We can just use exec to log in to a container and see what the hell is going on.
This is like working with my local shell,
which is pretty great, but it has some problems.
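For reference, a typical invocation looks something like this; the pod and container names here are placeholders, not from the talk:

    kubectl exec -it my-pod -c my-container -- /bin/sh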
If I'm debugging a local install of kubernetes,
that's okay, but even if I'm tracking an issue
on staging, I don't want to disrupt other developers working
on this. I also want to use the best practices, because to me staging is really a dress rehearsal for production. So we want to take the best practices we have there and use them as much as possible. The most obvious problem
with exec is that when we log into a
server we don't have various tools. If I want
a particular tool, I need to install it. This can work to
some degree, but it creates baggage: dependencies that I fetch might break the processes running on the container and disrupt everything. That might be okay in staging,
but in production we really should never do that.
That is, if the container is even something I can physically work with. It can be a skeleton containing nothing, to keep it as minimal as possible. That's a great practice to conserve resources, but it makes it pretty hard for an old Unix guy like me to track an issue. How can I run strace now? When a container crashes, I have nothing. How can I log in to that and work from there? This is problematic.
We can use kubectl debug in much the same way as we use kubectl exec. It effectively opens a command prompt on a container for us, but it isn't our container. The ephemeral container is the Keyser Söze of containers. It wasn't invented by kubectl debug. We had them around before, but they were a pain to deal with. kubectl debug makes it easy to spin up a pod
next to our pod that lets us reproduce the issues and track the
problems. We can install and preinstall anything we
want. It will be gone completely once the pod
is destroyed. We can share process space and
file system with the pod, which means we can look into
a crashed container and see what happened there.
Since we're technically not inside the container,
we can peek without damaging. We can use various
base images which already have the software that we want preinstalled. This makes
it easier for us to debug quickly.
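As a rough sketch of what spinning up such a debug container can look like, with placeholder pod, container, and image names:

    kubectl debug -it my-pod --image=ubuntu --target=my-container

The --target flag runs the ephemeral container in the target container's process namespace, which is what lets us peek at its processes without touching the container itself.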
We might be looking at a container running Alpine, but run Ubuntu on our own container. When I worked at Lightrun, we launched the KoolKits project, which was an uber container with everything you might need to debug a container; you might find that useful. So here's the first problem
we run into with a debugger. We can't just start using it; we need to relaunch the app with debugging enabled. That means killing the existing process and running it over again. That might not be something you can just do. Furthermore, running in this mode is a big security risk.
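As a hedged sketch of that relaunch, the JVM is typically started with the JDWP agent enabled, roughly like this (the jar name is a placeholder):

    java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=127.0.0.1:5005 -jar myapp.jar

Binding the address to 127.0.0.1 is what limits the debugging port to the local server.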
Notice I limit access to only the local server,
but still, it's a big risk. Leaving a remote
debugging port enabled in deployed server code is considered a
huge security vulnerability. If hackers can piggyback on a different vulnerability, they might be in a position to leverage this from the local system. Still, if we
do this for a short period of time, this might not
be a big deal, right?
In a different window, I need to find the process ID for the application I just ran so I can connect to it. I can now pass it to the jdb command, and now I'm connected with a debugger.
I can add a breakpoint using the stop at command.
Naturally, I need to know the name of the class and
the line number so I can set the breakpoint. Once stopped,
I can step over like I can with a regular debugger.
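In the talk this is done by process ID; an equivalent flow that attaches to the JDWP address looks roughly like this (the port, class name, and line number are placeholders):

    jps -l                            # list running JVMs with their process IDs
    jdb -attach localhost:5005        # attach to the JDWP agent
    stop at com.example.PrimeMain:42  # breakpoint at a class and line number
    next                              # step over after the breakpoint hits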
However, this is a pretty problematic notion on
multiple fronts. First off, I'm literally stopping
the threads accessing this application. That's probably not okay
on any container you have in the cloud. There are ways around
that, but they aren't trivial. The second problem is
different. I'm old and a Unix geek, so people automatically assume I love the command line, and I do to some degree, but I love GUIs more. When I started programming, there was no option. We didn't have IDEs on a Sinclair, an Apple II, or a PDP-11.
But now that we have all of those things,
I don't want to go back. I've programmed in Java since the first beta, and this was actually the first time I used JDB. I'll use command line tools when they give me power, but debugging via typed commands, I just can't.
The obvious answer is JDWP.
We have a remote debug protocol that's supposed to
solve this exact problem, right?
But this is problematic. If we open
the server to remote access with JDWP, we might as
well hand the keys to the office to hackers.
A better approach is tunneling. During the age of VPSes, we could just use SSH tunneling like this: we'd connect to a remote host and forward the port where the debugger was running to our local machine. Notice that in this sample I used port 9000 to mislead hackers scanning for port 5005, although it wouldn't matter because it's SSH.
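A sketch of such a tunnel, assuming the remote debugger listens on port 9000 and we want it on local port 5005 (host name and ports are placeholders):

    ssh -L 5005:localhost:9000 user@remote-host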
We can do the exact same thing with Kubernetes using the port forward
command to redirect a remote JDWP
connection to localhost. Port forwarding
opens a secure tunnel between your machine and the remote machine on
the given port, so when I connect to localhost on the forwarded
port, it seamlessly and securely connects to the remote
machine. Once we do that, I can just open IntelliJ IDEA and add a run configuration for remote debugging, which already exists as a template and is preconfigured with defaults such as port 5005.
I can give the new run configuration a name
and we're ready to go with debugging the app.
Notice I'm debugging on localhost even though my
pod is remote. That's because I'm port forwarding
everything. I make sure the right
run configuration is selected, which it is.
We can now just press debug to instantly connect
to the running process once it is done.
This feels and acts like any debugger instance launched from within the IDE. I can set breakpoints and step over and get this wonderful debug GUI I'm pretty much used to. This is perfect, right?
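To recap the flow just shown with placeholder names (the pod name is an assumption, not from the demo):

    kubectl port-forward pod/my-app-pod 5005:5005

With that tunnel open, the IntelliJ IDEA remote debug configuration simply attaches to localhost:5005.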
The first thing we saw when debugging was the need to restart the process. That isn't practical in production. Unfortunately, we can't leave JDWP just running to debug a real-world project. I usually say that JDWP isn't secure, but even that isn't accurate: it's a wide-open door. There's no security to speak of.
Then there's breakpoints: they break. I heard a story years ago about a guy who debugged a rail system, and it literally fell into the ocean while he was stopped on a breakpoint because it didn't get the stop command. I don't know if that's an urban legend, but it's totally plausible. Remote debugging APIs
are a stability risk. How many times did a debugged process crash on you? Imagine adding a
breakpoint condition that's too expensive or incorrect.
It might destroy your app in production. Just loading
the variables in the watch can cause unforeseen problems
in some cases, but this is the
absolute worst. Say your
app has a place where it takes user credentials
for login, maybe a third party library
you use does that. You can
still place a breakpoint there and
steal user login details.
Worse, you can elevate the permissions
of your own user account because everything
is writable. Just set the value of a variable. This goes against privacy laws, and it's very likely to happen: roughly 60% of security breaches in organizations happen because of an employee. There's often a
misconception of security that targets the
outside world only. This just
isn't true. This isn't the
big picture though. These are all problems
that we would potentially have when
running one server or our own hardware
connected to the Internet. Kubernetes brings
scale, and with scale we have additional problems that
we don't normally face. People complain about debugging multiple threads; try doing that while some of the threads are on the other side of the world.
So this is a typical Kubernetes high-level architecture. Our code as developers is this tiny area that isn't even covered by Kubernetes.
We have this huge infrastructure to run our code,
but once it's running somewhere, and by somewhere I mean
literally I have no idea where,
then I'm on my own. Kubernetes will help with running it and with guarantees, but it won't help me debug the code, and it adds a lot of layers between me and my code.
Observability is the new debugging, which I hope is pretty obvious to most of us. But it has many pitfalls and many limitations when compared to real debugging. The cost of observability is one of the things that we're just starting to understand.
Logs alone can be 30% or more
of your total cloud spend. That's absolutely insane.
The cloud was supposed to reduce costs.
Instead, the reverse is true. The most problematic aspect is that most of the observability technology we work with is geared towards ops and less towards R&D. It might say that a particular endpoint is experiencing issues, but it won't point at a line of code or give us the means to debug it. If we have a bug like a specific user that sees wrong information, which is common and can happen because of problematic flushing of caches, how do we debug something like that?
We need to open the observability dashboards to the R&D team; involvement in the day-to-day observability tasks is a must. Reading production logs shouldn't be segregated to a separate team. I'm not saying that SRE shouldn't exist. I'm saying that we need vertical teams where the SRE is embedded within them. We should shift in both directions and have a wider scope of responsibilities that covers quality and production.
Debugging shouldn't stop at the boundaries of
the CI process. Developer observability
stands for a new generation of tools geared
to bring observability changes to the developer community.
What makes them different is that they
offer source based solutions. That means we
work with developer terminologies like
line numbers and source code.
I'll demonstrate Lightrun because I'm familiar with it, but there are other tools in the market. I used to work for Lightrun and wrote a lot of what you'll see, but I no longer work for them. There are plenty of other solutions in the market with
different degrees of functionality. I didn't compare them because it would
be futile. The market changes too quickly.
I hope you'll get a sense of what's possible thanks to this
demo. On the left side is IntelliJ IDEA, which is my favorite IDE. Some developer observability tools integrate directly into the IDE, which is convenient for developers, as that's where we spend our time. Other tools have web-based interfaces, et cetera.
On the right side I have an application that counts
the prime numbers running on a remote server. We can see
the console of that demo. The application doesn't
print any logs as it does the counting,
which makes it hard to debug if something didn't work
there. In the side panel of the IDE we can see the currently running agents, which are the server instances. We also see
the tags above them. Tags let us apply an action to
a group of server processes. If we have 1000 servers,
we can assign the tag production to 500
of them and then perform an operation on all 500
by performing it on a tag. A server can
have multiple tag designations such as East Coast,
Ubuntu 20, Green, et cetera. This effectively
solves the scale problem typical debuggers have.
We can apply observability operations to multiple servers.
Here I have only one tag and one server process because
this is a demo and I didn't want to crowd it.
Another important aspect to notice is the
fact that we don't see the actual server.
This all goes to management servers,
so production can be segregated behind a
firewall and I don't have
direct access to the physical production server. This is important.
We don't want R&D developers to have actual access to production. I can see that a server process is running.
I can get some information about it, but I have no direct line to the
server. I can't SSH in and I can't change anything in it. I can add a new log by right-clicking a line and adding it. I ask it to log the value of the variable i, and it will just print it to the application logs.
This will fit in order with the other logs, so if I have a log in the code, my added log will appear as if it was written in the code next to it. They will get ingested into services like Elastic seamlessly, or you can pipe them locally to the IDE. So this plays very nicely with existing observability while solving the fact that traditional observability isn't dynamic enough. The tools complement each other; they don't replace one another. Notice I can include complex expressions like method invocations, et cetera, but Lightrun enforces them all to be read-only.
Some developer observability tools do that while others don't, but the thing I want to focus on is this. Notice the log took too much CPU, and Lightrun pauses logging for a bit so it won't destroy the server performance. Logging is restored automatically a bit later when we're sure CPU isn't depleted.
Snapshots are breakpoints that don't stop. They include
the stack, the variable values, and all the stuff we need.
We can use conditions on snapshots, on logs, and on metrics, just like we can with a conditional breakpoint in a regular debugger. We can apply everything to a tag, which will place it on multiple servers at once. We can then inspect the resulting data like we would with any breakpoint.
We can click on methods and view variables. We can step
over, but our app will never get stuck.
The demo was a bit simplistic. Here is
a different demo that's a bit more elaborate. In this
demo I will add a new snapshot, and in it I have an
option to define quite a lot of things. I won't
even discuss the advanced version of this dialog in this session,
but I will add a condition for the
snapshot. This is a really trivial
condition. We already have a simple security
utility class that I can use to query the current user id,
so I just make use of that and compare the response
to the id of the user that's experiencing a problem.
Notice I use the fully qualified name of the class. I could have just written Security, and it's very possible it would have worked, but it isn't guaranteed: names can clash on the agent, and the agent side isn't aware of all the things we have in the IDE. As such, it's often a good practice to be more specific.
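Purely as an illustration (the package, class, and user ID here are hypothetical, not from the demo), such a condition could look something like:

    com.example.security.Security.getCurrentUserId().equals("9cf21a4e")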
After pressing ok, we see a special version of the
snapshot icon with a question mark on it. This indicates
that this action has a condition on it. Now it's just a
waiting game for the user to hit that snapshot. This is
the point where normally you can go make yourself a cup of coffee
or even just go home and check this out the next day.
That's the beauty of this sort of instrumentation.
I can set the snapshot with a timeout of one week and just wait for users to reproduce
the problem at their convenience. It's a pretty cool
way to debug. In this case, we won't
wait long. The snapshot gets hit by the right user despite
other users coming in.
This specific request is from the right user id.
We can now review the stack information and fix a
user-specific bug. The counter metric lets us count the number of times a line of code is reached.
You can use conditions to count things like how many people
from a specific country reached a method.
If you're thinking about deprecating a
production API, this is the metric you need to keep your eye
on. Notice that metrics are logged by default, but can be piped to StatsD and Prometheus and then visualized in Grafana. There are also metrics such as tic-toc, method duration, et cetera, but I don't have time to cover all of that.
Thanks for bearing with me. Hope you enjoyed the
presentation. Please feel free to ask any
questions and also feel free to write to me. Also, check out debugagent.com,
my book and my YouTube channel where I have many tutorials on
these sorts of subjects. Thank you.