Abstract
Brian Kernighan said: “Debugging is twice as hard as writing the code in the first place.”
In fact, debugging in a modern production environment is even harder: orchestrators spinning containers up and down, and the networking wizardry that keeps everything glued together, make understanding systems much more difficult than it used to be.
And, while k8s is well understood by DevOps people by now, it remains a nut that developers are still trying to crack. Where do you start when there’s a production problem? How do you get the tools you’re used to in the remote container? How do you understand what is running where and what is its current state?
In this talk, we will walk through debugging a production application deployed to a Kubernetes cluster and review kubectl debug - a new feature from the Kubernetes sig-cli team. In addition, we'll look at the open source KoolKits project, which offers a set of (opinionated) tools for kubectl debug.
KoolKits builds on top of kubectl debug by adding everything you need right into the image. When logging into a container, we’re often hit with the scarcity of tools at our disposal. No vim (for better or worse), no DB clients, no htop, no debuggers, etc… KoolKits adds all the tools you need right out of the box and lets you inspect a production container easily without resorting to endless installation and configuration cycles for each needed package.
We’ll finish the talk by delving into how to get better at debugging on a real-world scale. Specifically, we’ll talk about how to be disciplined in our continuous observability efforts by using tools that are built for k8s scale and can run well in those environments, while remaining ergonomic for day to day use.
This session will go back and forth between explanation slides and demonstration of the topic at hand.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Shai Almog. Today we'll talk about debugging
at scale in production, specifically about kubernetes
debugging. I'm assuming everyone here knows the basics of kubernetes,
so I will dive right into the basic problem description and
then into the three tools I will show today. But first,
a few things about me: I was a consultant for over a decade.
I worked at Sun, founded a couple of companies, wrote a couple of books,
wrote a lot of open source code, and currently work
as a developer advocate for Lightrun. My email and Twitter
accounts are listed here, so feel free to write to me. I have a
blog that talks about debugging and production issues at talktotheduck.dev.
It would be great if you check it out and let me
know what you think. I also have a series of videos on
my Twitter account called 140 Second Ducklings,
where I teach complex things in 140 seconds.
The current series is based on debugging and it covers a lot of
things most developers don't know and would find helpful.
Containers and orchestrators revolutionized development and production,
no doubt. But in a way, Kubernetes made
debugging production issues harder than it previously was.
In the past we had physical servers we could just work with, or even a VPS. Now we face much greater difficulty due to three big challenges. The massive scale enabled by Kubernetes is a huge boon, but it also makes debugging remarkably difficult. We need new tools to deal with that scale.
We now have multiple layers of abstraction in the deployment, where failures can happen in the orchestrator, container, or code layer. Each failure requires a different set of skills and solutions. Tracking the cause to the right layer isn't necessarily trivial. Finally, there's the bare bones, or lean deployment, problem. This is the first problem I want to focus on when debugging; we'll get to the other two soon enough. It's the problem of the bare, naked container.
We can connect to a bare bones container, but we have nothing to do inside it. Nothing is installed. We can inspect logs, but that relies on luck. Furthermore, if your logs are already ingested by a solution like Elastic, you probably don't have anything valuable to do within a bare bones container.
Kubectl debug solves these problems and can work
even with a crashed container or a bare bones image.
The Kubectl debug command adds ephemeral
containers to a running pod. An ephemeral
container is a temporary element that would vanish once
the pod is destroyed. With it we can inspect
everything that we need in the pod. Most changes we make in it don't matter; they won't impact the pod after we're gone. It works with bare bones containers. The way it does this is with a separate image,
so we can have an image that includes everything
in it. The container spun from the image is
ephemeral and can include a proper distro
and the set of tools we need. Kubectl debug was introduced in version 1.23, so if you're still on an older version you will need to wait for that. If you use hosted Kubernetes, you need to check the version it uses. Let's start with a simple demo. As you can see, we have a few pods here.
We're experiencing an issue and would like to
increase the logging level so we can better see what's going on.
I can use exec and log in directly to the live pod with a proper bash shell.
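For reference, the flow looks roughly like this; the pod name and config file here are hypothetical stand-ins, not taken from the actual demo:

```shell
# Open a shell in a running pod (hypothetical pod name):
#   kubectl exec -it my-app-pod -- bash
#
# Inside the container we only have basic commands such as cat
# and grep. Simulate checking a logging level on a sample file:
mkdir -p /tmp/demo-app
cat > /tmp/demo-app/logging.properties <<'EOF'
handlers=java.util.logging.ConsoleHandler
.level=INFO
EOF
grep level /tmp/demo-app/logging.properties
```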
I'm sure most of you have done that in the past, as it's pretty easy. Here I can just use standard commands like cat and grep to check the logging level. This is all good; we can see the current level is at info.
Unfortunately, I don't even have vim if I want to edit this file. Now, when the container has apt or apk, and when the pod isn't crashed, I could in theory just apt-get install vim, but that has its own problems and is a painful process.
We don't want people in production just installing packages
left and right, even if it's cleaned up. The risks are sometimes too high; pods shouldn't be touched in production. Once deployed, all the information and state of a pod should be described in its manifest, unless it's strictly a stateful pod, like a database, et cetera. So installing packages like this is problematic.
So let's exit for a moment and try connecting again with kubectl debug. I'll use the busybox image in this case and we'll see how that works.
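The invocation itself looks something like the following sketch; the pod and container names are hypothetical, and the command is only printed here since running it needs a live cluster:

```shell
# Attach an ephemeral debug container to a running pod.
# --target joins the process namespace of the named container.
POD=my-app-pod      # hypothetical pod name
TARGET=my-app       # hypothetical target container name
debug_cmd="kubectl debug -it $POD --image=busybox --target=$TARGET -- sh"
# Printed rather than executed, since it requires a live cluster:
echo "$debug_cmd"
```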
Notice that this is the image referenced in the kubectl debug docs. So I'm
connected to the pod again, or so it seems.
Technically I'm connected to a new ephemeral
container and not to the pod directly.
This is an important distinction, but as you
can see again, I don't have vim or really any of the tools I would expect, like VisualVM, traceroute, et cetera. I can fix that. I can
create an image that includes everything I need and packages all of the tools I use, then pass that image to kubectl debug and just use these tools. But here's the thing: I'm not unique. We're all pretty much the same. We all expect the same things in our debugging sessions, and the image I would use is probably the same image you would use. So why not have one generic image?
This is where KoolKits comes in. KoolKits is an open source project that includes a set of opinionated, curated, platform-specific tools for kubectl debug,
so you can have everything you might need
at your fingertips while debugging.
So what does this mean? When you use kubectl debug to spin up an ephemeral container, it's built using a KoolKits image.
Currently there are four standard images: a Go image that includes tools such as Delve, pprof, go-callvis, and many others; the JVM toolkit, which includes tools such as SDKMAN, jmxterm, honest-profiler, VisualVM, and much more; the Node version, which includes nvm, ndb, 0x, vtop, and much more; and finally the Python version, which includes pyenv, IPDB, IPython, and much more.
But this is just the tip of the iceberg as all the versions
include the many tools you would expect in any proper debugging
session, such as vim and htop. We also have lots of networking tools like traceroute and nmap, database clients for Postgres, MySQL, and Redis, and again, much more.
So let's continue from where we left off in the demo. We can disconnect from the current session and then
spin up a new session with the Koolkits image.
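As a sketch, swapping the debug image for a KoolKits kit looks like this; the pod and container names are hypothetical, and the image tags follow the KoolKits README at the time of writing:

```shell
# Spin up an ephemeral container from a KoolKits image instead of busybox.
POD=my-app-pod   # hypothetical pod name
TARGET=my-app    # hypothetical target container name
KIT=jvm          # jvm, golang, node, or python, per the KoolKits README
kk_cmd="kubectl debug -it $POD --image=lightruncom/koolkits:$KIT --target=$TARGET"
# Printed rather than executed, since it requires a live cluster:
echo "$kk_cmd"
```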
Notice we can also use the shorthand kk command for many operations; I don't use it here, but you can see the syntax in the KoolKits docs. In this case, notice I use the JVM version of KoolKits, which I chose because I'm a Java guy. But if you're using a different environment, you can use what fits there. In KoolKits, pretty much every tool I want is already preinstalled as part of the image by default. This means we can just connect and everything is already there. Since we're all very similar in our needs, KoolKits includes the common things most of us need based on the platform or language. It has sensible defaults and comes with Ubuntu as the
distribution. This is important: you have a full distribution like you would have on a desktop or a regular server. This is very helpful for debugging, so you get everything you need even when debugging a bare bones container. Notice that, thanks to kubectl debug, we have full access to the main application container's file system and the pod's process namespace, so we can do everything there while residing in a more convenient environment, having our cake and eating it too.
So to finish the story from before, I can just use vim to edit the file and change the logging level to error, which I can then confirm using cat and grep. I can also do a lot of other things, such as profile using a profiler, debug with GDB or JDB, or use jmxterm to perform JMX operations, which lets you configure the way the JVM behaves at runtime, and pretty much anything I can do on a local machine.
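The same edit can also be done non-interactively; here is a minimal sketch against a hypothetical logging file, using sed in place of vim:

```shell
# Hypothetical logging config, mirroring the demo:
mkdir -p /tmp/demo-app
printf '.level=INFO\n' > /tmp/demo-app/logging.properties
# Flip the level from INFO to ERROR, then confirm with grep:
sed -i 's/^\.level=INFO$/.level=ERROR/' /tmp/demo-app/logging.properties
grep level /tmp/demo-app/logging.properties
```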
To give you a sense of what KoolKits installs, this is the list of packages for the JVM kit, and it is bound to grow, as you can all submit pull requests with your favorite packages. This is just the JVM-specific image. The other images contain similar tools at a similar scale, and you can get all that thanks to kubectl debug and KoolKits.
I didn't forget about this slide. How do we debug
issues of massive scale and massive depth?
Instead of talking theory, let's talk about a real-world example. What if what we're tracking seems to be an application bug? This is a common occurrence for sure. We might not know it at this stage, but that might be the place we want to investigate. We can use one of the debuggers in KoolKits to track it, but that would only work if we know the server where the issue manifests.
It's also remarkably risky. Connecting a debugger to a production environment can lead to multiple problems: stopping on a breakpoint by accident, using a conditional statement that grinds the system to a halt, or exposing a security vulnerability. JDWP itself is literally an open invitation for hacking. We can try using logs, and probably should start there, but more often than not, the information we need isn't logged.
We can try using various observability tools. They're great, but not for application-level issues. They rule for big-picture analysis and container-level problems, not application-level problems.
We used to call this continuous observability, but developer observability makes more sense. It's a newer set of tools designed to solve this exact problem. Observability is defined as the ability to understand how your system works on the inside without shipping new code. The "without shipping new code" portion is key. But what's developer observability? With developer observability, we don't ship new code either. We can ask questions about the code. Normal observability works by instrumenting
everything and receiving the information. With developer
observability, we flip that: we ask questions and then instrument based on those questions. So how does
that work in practice? In practice, we add an agent
to every pod. This lets us debug the source code
directly from the IDE, almost like debugging a
local project. To start,
we need to sign up for a free Lightrun account at lightrun.com/free. Notice that Lightrun has a very generous free tier you can use to experiment with the product. Pretty much everything I show here can be accomplished with a free account. I'll skip the actual setup since it's covered there and we don't have much time. You can check out the Lightrun docs for more detailed instructions on setting up Lightrun in Docker, minikube, etc. Unlike kubectl debug,
we need to install the agent before the problem occurs, so if we do run into a problem, we will be able to jump right in. Let's skip right ahead to a simplified demo in the IDE, once the agent is set up. This is the prime main app in Kotlin. It simply loops over numbers and checks if they are a prime number. It sleeps for ten milliseconds, so it won't completely demolish the CPU. But other than that, it's a pretty simple application. It just counts the number of primes it finds along the way and prints the result at the end.
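The app itself isn't shown in the transcript; a rough shell equivalent of the loop it describes (with the ten millisecond sleep omitted so it finishes instantly) might look like:

```shell
# Count primes up to a limit by trial division, like the demo app:
limit=20
count=0
for ((i = 2; i <= limit; i++)); do
  prime=1
  for ((j = 2; j * j <= i; j++)); do
    if (( i % j == 0 )); then
      prime=0
      break
    fi
  done
  count=$(( count + prime ))
done
echo "Found $count primes up to $limit"   # → Found 8 primes up to 20
```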
We use this code a lot when debugging, since it's CPU-intensive and very simple.
In this case, we would like to observe the variable I,
which is the value we're evaluating here, and print
out CNT, which represents the number of primes we found
so far. The simplest tool we have is the ability to inject a log into the application. We can also inject a snapshot or add metrics.
I'll discuss all of those soon enough.
Selecting log opens the UI to enter a new log. I can write more than just text: within curly braces, I can include any expression I want, such as the values of variables. I can also invoke methods and do that sort of thing. But here's the thing: if I invoke a method that's too computationally intensive, or a method that changes the application state, the log won't be added. I'll get an error.
After clicking OK, we see the log appearing above the line in the IDE. Notice that this behavior is specific to IntelliJ and other JetBrains IDEs; in Visual Studio Code, it will show a marker on the side. Once the log is hit, we'll see logs appear in batches. Notice I chose to pipe logs to the IDE for convenience, but there's more I can do with them. For now, the thing I want to focus on is the last line: notice that the log point is paused due to a high call rate. This means the additional logs won't show for a short period of time, since logging exceeded the threshold of CPU usage. This can happen quickly or slowly depending on what you're observing.
Let's move on to a different demo. This is a Node.js project that implements the initial backend of a microservice
architecture. This is the method that gets invoked when we
click a movie and want to see the details.
This time I'll add a snapshot. Some other developer observability tools call this a capture or a non-breaking breakpoint, which to me sounds weird, but the idea is usually the same. Once I press OK, a camera icon appears on the left, indicating the location of the snapshot, like you would see with a regular IDE breakpoint. Now I'll just access the production front end, which triggers this code. We wait a second and the snapshot is hit.
So what is a snapshot? It gives us a stack trace and variables, just like a regular breakpoint we all know and love. But it doesn't stop at that point, so your server won't be stuck waiting for a step over. Now, obviously, you can't step over the code, so you need to work with individual snapshots. But this is a huge benefit, especially in production scenarios. And it gets much better. This was relatively simple in terms of observability. Let's up the
ante a bit and talk about user-specific problems. So here I have a problem with a request: one specific user is complaining that the list on his machine doesn't match the list for his peers. The problem is that if I put a snapshot here, I'll get a lot of noise because there are many users reloading all at the same time. The solution to this is to use conditional snapshots, just like you can with a regular debugger. Notice that you can define a condition for a log and for metrics as well. This is one of the key features of continuous observability. I add a new snapshot,
and in it I have an option to define quite a lot of things.
I won't even discuss the advanced version of this dialog in this session. This is a really trivial
condition. We already have a simple security utility class that
I can use to query the current user ID, so I
just make use of that and compare the response
to the ID of the user that's experiencing a problem.
Notice I use the fully qualified name of the class.
I could have just written Security, and it's very possible that it would have worked, but it isn't guaranteed: names can clash on the agent, and the agent side isn't aware of things we have in the IDE. As such, it's often good practice to be more specific. After pressing OK, we see a special version of the snapshot icon with
a question mark on it. This indicates that this action has
a condition on it. Now it's just a waiting game for the
user to hit the snapshot. This is the point where normally you go make yourself a cup of coffee, or just go home and check this out the next day. That's the beauty of
this sort of instrumentation. In this case, I won't
make you wait long. The snapshot gets hit by the
right user despite other users coming in. This specific
request is from the right user id. We can now review
the stack information and fix the user specific bug.
The next thing I want to talk about is metrics.
APMs give us large-scale performance information, but they don't tell us fine-grained details. Here we can count the number of times a line of code was reached using a counter. We can even use a condition to qualify that, so we can do something like count the number of times a specific user reached that line of code.
We also have a method duration, which tells us how long
a method took to execute.
We can even measure the time it takes to perform a code block
using a tictoc. That lets us narrow down the performance
impact of a larger method to a specific problematic segment.
In this case, I'll just use the method duration. Measurements typically have a name under which we can pipe them or log them, so I'll just give this method duration a clear name. In this case, I'm just printing it out to the console.
But all these measurements can be piped to statsd and Prometheus. I'm pretty awful at DevOps, so I really don't want to demo that here, but it does work if you know how to use these tools. As you can see, the duration
information is now piped into the logs and provides us
with some information on the current performance of the
method. The last thing I want to talk about
brings this all together and that's tags.
We can define tags to group agents together, such as production, green, blue, Ubuntu, et cetera.
Every pod can be a part of multiple tags.
Every action we discuss today can be applied to a tag
and as such can run on multiple machines simultaneously
and asynchronously. This solves the scale problem
when debugging. So in closing, I'd like to
review some of the things we discussed today.
Kubectl debug made debugging crashed pods possible. It also made it possible to debug a pod based on a bare bones image. KoolKits made kubectl debug easier to use with preinstalled tools. Lightrun makes secure, read-only, real-time debugging at scale easy. Thanks for bearing with me.
I hope you enjoyed the presentation. Please feel free to
ask any questions and also feel free to write to me.
Also, please check out talktotheduck.dev, where I talk about debugging in depth, and check out lightrun.com, which I think you will like a lot. If you have any questions, my email is listed here and I'll be happy to help. Thanks for watching.