Transcript
Hello everyone, I'm Shai Almog. Today we'll talk about
debugging at scale in production, specifically about kubernetes
debugging. I'm assuming everyone here knows the basics
of kubernetes, so I will dive right into a basic
problem description. But first, let me say a couple of things about me.
I wrote a few books, including one about debugging, which covers
all of the topics we'll discuss here. I've worked in
this industry for decades, in many companies,
and as a consultant. You can contact me over my socials
listed here and follow me on LinkedIn, Twitter,
etc. My DMs are open, but please try just tweeting at me, and if I don't answer, I just missed it with all the flood of messages I keep getting, so please try again. Also, I'm on Mastodon, et cetera, so you can reach me there too. This is my Apress book, titled
Practical Debugging at Scale. I've put a lot of work into
this book and I can guarantee there's nothing out there like
it. I hope you check it out.
Everything in today's talk is there, and a
lot more. I gave a version of this talk in
the past. After I was done, they generated this absolutely
spectacular mind map of my presentation.
This is about an older version of this talk, but I
still like this a lot. I wish I could draw stuff like this. I still
have it because it's pretty close to the new talk.
One of the first things I hear when I talk to some developers
is why debug production?
Isn't that the problem of the DevOps team?
It's like they're a part of a completely different
universe. We write the software and they
make it run. We live in silos where production
isn't our problem. It's that other monkey
that's responsible for that, not me.
What we're building here isn't perfectly
running unit tests. It's a product that's running
in, you guessed it, production.
The thing is that when we say failure,
we aren't talking about a crash. This is one
of those things that we seem to associate with production bugs,
even though that's a small part of the problem.
When a bug gets into production, it's often the worst kind of bug: one that only happens to user X when the moonlight hits the water at 37.5 degrees on a Tuesday.
To make it worse, I don't think I'm breaking any news by
saying that kubernetes debugging is hard. I'd argue
this is an understatement, at least for some of
the problems we run into. We can just use exec to log in to a container and see what the hell is going on.
This is like working with my local shell,
which is pretty great, but it has some problems.
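For reference, a typical invocation looks something like this; the pod and container names here are placeholders, not from the talk:

    kubectl exec -it my-pod -c my-container -- /bin/sh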
If I'm debugging a local install of kubernetes,
that's okay, but even if I'm tracking an issue
on staging, I don't want to disrupt other developers working
on this. I also want to use the best practices, because to me staging is really a dress rehearsal for production. So we want to take the best practices we have there and use them as much as possible. The most obvious problem
with exec is that when we log into a
server we don't have various tools. If I want
a particular tool, I need to install it. This can work to
some degree, but it creates baggage: dependencies that I fetch might break the processes running on the container and disrupt everything. That might be okay in staging,
but in production we really should never do that.
That is, if the container is even something I can physically work with. It can be a skeleton containing nothing, to keep it as minimal as possible. That's a great practice to conserve resources, but it makes it pretty hard for an old Unix guy like me to track an issue. How can I run strace now? When a container crashes, I have nothing. How can I log in to that and work from there? This is problematic.
We can use kubectl debug in much the same way as we use kubectl exec. It effectively opens a command prompt on a container for us, but it isn't our container. The ephemeral container is the Keyser Söze of containers. It wasn't invented by kubectl debug. We had them around before, but they were a pain to deal with. kubectl debug makes it easy to spin up a pod
next to our pod that lets us reproduce the issues and track the
problems. We can install and preinstall anything we
want. It will be gone completely once the pod
is destroyed. We can share process space and
file system with the pod, which means we can look into
a crashed container and see what happened there.
Since we're technically not inside the container,
we can peek without damaging. We can use various
base images which already have the software that we want preinstalled. This makes
it easier for us to debug quickly.
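As a rough sketch of what spinning up such a debug container can look like, with placeholder pod, container, and image names:

    kubectl debug -it my-pod --image=ubuntu --target=my-container

The --target flag runs the ephemeral container in the target container's process namespace, which is what lets us peek at its processes without touching the container itself.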
We might be looking at a container running Alpine, but run Ubuntu on our own container. When I worked at Lightrun, we launched the KoolKits project, which was an uber container with everything you might need to debug a container; you might find that useful. So here's the first problem
we run into with a debugger. We can't just start using it; we need to relaunch the app with debugging enabled. That means killing the existing process and running it over again. That might not be something you can just do. Furthermore, running in this mode is a big security risk.
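As a hedged sketch of that relaunch, the JVM is typically started with the JDWP agent enabled, roughly like this (the jar name is a placeholder):

    java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=127.0.0.1:5005 -jar myapp.jar

Binding the address to 127.0.0.1 is what limits the debugging port to the local server.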
Notice I limit access to only the local server,
but still, it's a big risk. Leaving a remote
debugging port enabled in deployed server code is considered a
huge security vulnerability. If hackers can piggyback on a different vulnerability, they might be in a position to leverage this from the local system. Still, if we
do this for a short period of time, this might not
be a big deal, right?
In a different window, I need to find the process ID for the application I just ran so I can connect to it. I can now pass it to the jdb command, and now I'm connected with a debugger.
I can add a breakpoint using the stop at command.
Naturally, I need to know the name of the class and
the line number so I can set the breakpoint. Once stopped,
I can step over like I can with a regular debugger.
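In the talk this is done by process ID; an equivalent flow that attaches to the JDWP address looks roughly like this (the port, class name, and line number are placeholders):

    jps -l                            # list running JVMs with their process IDs
    jdb -attach localhost:5005        # attach to the JDWP agent
    stop at com.example.PrimeMain:42  # breakpoint at a class and line number
    next                              # step over after the breakpoint hits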
However, this is a pretty problematic notion on
multiple fronts. First off, I'm literally stopping
the threads accessing this application. That's probably not okay
on any container you have in the cloud. There are ways around
that, but they aren't trivial. The second problem is
different. I'm old and a Unix geek, so people automatically assume I love the command line, and I do to some degree, but I love GUIs more. When I started programming, there was no option. We didn't have IDEs on a Sinclair, an Apple II, or a PDP-11.
But now that we have all of those things,
I don't want to go back. I've programmed in Java since the first beta, and this was actually the first time I used JDB. I'll use command line tools when they give me power, but debugging via typed commands, I just can't.
The obvious answer is JDWP.
We have a remote debug protocol that's supposed to
solve this exact problem, right?
But this is problematic. If we open
the server to remote access with JDWP, we might as
well hand the keys to the office to hackers.
A better approach is tunneling. During the age of VPSes, we could just use SSH tunneling like this: we'd connect to a remote host and forward the port where the debugger was running to our local machine. Notice that in this sample I used port 9000 to mislead hackers scanning for port 5005, although it wouldn't matter because it's SSH.
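A sketch of such a tunnel, assuming the remote debugger listens on port 9000 and we want it on local port 5005 (host name and ports are placeholders):

    ssh -L 5005:localhost:9000 user@remote-host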
We can do the exact same thing with Kubernetes using the port forward
command to redirect a remote JDWP
connection to localhost. Port forwarding
opens a secure tunnel between your machine and the remote machine on
the given port, so when I connect to localhost on the forwarded
port, it seamlessly and securely connects to the remote
machine. Once we do that, I can just open IntelliJ IDEA and add a run configuration for remote debugging, which already exists as a template and is preconfigured with defaults such as port 5005.
I can give the new run configuration a name
and we're ready to go with debugging the app.
Notice I'm debugging on localhost even though my
pod is remote. That's because I'm port forwarding
everything. I make sure the right
run configuration is selected, which it is.
We can now just press debug to instantly connect
to the running process once it is done.
This feels and acts like any debugger instance launched from within the IDE. I can set breakpoints and step over and get this wonderful debug GUI I'm pretty much used to. This is perfect, right?
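To recap the flow just shown with placeholder names (the pod name is an assumption, not from the demo):

    kubectl port-forward pod/my-app-pod 5005:5005

With that tunnel open, the IntelliJ IDEA remote debug configuration simply attaches to localhost:5005.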
The first thing we saw when debugging was the need to restart the process. That isn't practical in production. Unfortunately, we can't leave JDWP just running to debug a real-world project. I usually say that JDWP isn't secure, but even that isn't accurate: it's a wide-open door. There's no security to speak of.
Then there's breakpoints: they break. I heard a story years ago about a guy who debugged a rail system, and it literally fell into the ocean while he was stopped on a breakpoint because it didn't get the stop command. I don't know if that's an urban legend, but it's totally plausible. Remote debugging APIs
are a stability risk. How many times did a debugged process crash on you? Imagine adding a
breakpoint condition that's too expensive or incorrect.
It might destroy your app in production. Just loading
the variables in the watch can cause unforeseen problems
in some cases, but this is the
absolute worst. Say your
app has a place where it takes user credentials
for login, maybe a third party library
you use does that. You can
still place a breakpoint there and
steal user login details.
Worse, you can elevate the permissions
of your own user account because everything
is writable. Just set the value of a variable. This goes against privacy laws, and it's very likely to happen: roughly 60% of security breaches in organizations happen because of an employee. There's often a
misconception of security that targets the
outside world only. This just
isn't true. This isn't the
big picture though. These are all problems
that we would potentially have when
running one server or our own hardware
connected to the Internet. Kubernetes brings
scale, and with scale we have additional problems that
we don't normally face. People complain about debugging multiple threads; try doing that while some of the threads are on the other side of the world.
So this is a typical Kubernetes high-level architecture. Our code as developers is this tiny area that isn't even covered by Kubernetes.
We have this huge infrastructure to run our code,
but once it's running somewhere, and by somewhere I mean
literally I have no idea where,
then I'm on my own. Kubernetes will help with running it and with guarantees, but it won't help me debug the code, and it adds a lot of layers between me and my code.
Observability is the new debugging, which I hope is pretty obvious to most of us. But it has many pitfalls and many limitations when compared to real debugging. The cost of observability is one of the things that we're just starting to understand.
Logs alone can be 30% or more
of your total cloud spend. That's absolutely insane.
The cloud was supposed to reduce costs.
Instead, the reverse is true. The most problematic aspect is that most of the observability technology we work with is geared towards ops and less towards R&D. It might say that a particular endpoint is experiencing issues, but it won't point at a line of code or give us the means to debug it. If we have a bug like a specific user that sees wrong information, which is common and can happen because of problematic flushing of caches, how do we debug something like that?
We need to open the observability dashboards to the R&D team; involvement in the day-to-day observability tasks is a must. Reading production logs shouldn't be segregated to a separate team. I'm not saying that SRE shouldn't exist. I'm saying that we need vertical teams where the SRE is embedded within them. We should shift in both directions and have a wider scope of responsibilities that covers quality and production.
Debugging shouldn't stop at the boundaries of
the CI process. Developer observability
stands for a new generation of tools geared
to bring observability changes to the developer community.
What makes them different is that they
offer source based solutions. That means we
work with developer terminologies like
line numbers and source code.
I'll demonstrate Lightrun because I'm familiar with it, but there are other tools in the market. I used to work for Lightrun and wrote a lot of what you'll see, but I no longer work for them. There are plenty of other solutions in the market with
different degrees of functionality. I didn't compare them because it would
be futile. The market changes too quickly.
I hope you'll get a sense of what's possible thanks to this
demo. On the left side is IntelliJ IDEA, which is my favorite IDE. Some developer observability tools integrate directly into the IDE, which is convenient for developers, as that's where we spend our time. Other tools have web-based interfaces, et cetera.
On the right side I have an application that counts
the prime numbers running on a remote server. We can see
the console of that demo. The application doesn't
print any logs as it does the counting,
which makes it hard to debug if something didn't work
there. In the side panel of the IDE we can see the currently running agents, which are the server instances. We also see
the tags above them. Tags let us apply an action to
a group of server processes. If we have 1000 servers,
we can assign the tag production to 500
of them and then perform an operation on all 500
by performing it on a tag. A server can
have multiple tag designations such as East Coast,
Ubuntu 20, Green, et cetera. This effectively
solves the scale problem typical debuggers have.
We can apply observability operations to multiple servers.
Here I have only one tag and one server process because
this is a demo and I didn't want to crowd it.
Another important aspect to notice is the
fact that we don't see the actual server.
This all goes to management servers,
so production can be segregated behind a
firewall and I don't have
direct access to the physical production server. This is important.
We don't want R&D developers to have actual access to production. I can see that a server process is running.
I can get some information about it, but I have no direct line to the
server. I can't SSH in and I can't change anything in it. I can add a new log by right-clicking a line and adding it. I ask it to log the value of the variable i, and it will just print it to the application logs.
This will fit in order with the other logs, so if I have a log in the code, my added log will appear as if it was written in the code next to it. They will get ingested into services like Elastic seamlessly, or you can pipe them locally to the IDE. So this plays very nicely with existing observability while solving the fact that traditional observability isn't dynamic enough. The tools complement each other; they don't replace one another. Notice I can include complex expressions like method invocations, et cetera, but Lightrun enforces them all to be read-only.
Some developer observability tools do that while others don't, but the thing I want to focus on is this. Notice the log took too much CPU, and Lightrun pauses logging for a bit so it won't destroy the server performance. Logging is restored automatically a bit later when we're sure CPU isn't depleted.
Snapshots are breakpoints that don't stop. They include
the stack, the variable values, and all the stuff we need.
We can use conditions on snapshots, on logs, and on metrics, just like we can with a conditional breakpoint in a regular debugger. We can apply everything to a tag, which will place it on multiple servers at once. We can then inspect the resulting data like we would with any breakpoint.
We can click on methods and view variables. We can step
over, but our app will never get stuck.
The demo was a bit simplistic. Here is
a different demo that's a bit more elaborate. In this
demo I will add a new snapshot, and in it I have an
option to define quite a lot of things. I won't
even discuss the advanced version of this dialog in this session,
but I will add a condition for the
snapshot. This is a really trivial
condition. We already have a simple security
utility class that I can use to query the current user id,
so I just make use of that and compare the response
to the id of the user that's experiencing a problem.
Notice I use the fully qualified name of the class. I could have just written Security, and it's very possible it would have worked, but it isn't guaranteed: names can clash on the agent, and the agent side isn't aware of all the things we have in the IDE. As such, it's often a good practice to be more specific.
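Purely as an illustration (the package, class, and user ID here are hypothetical, not from the demo), such a condition could look something like:

    com.example.security.Security.getCurrentUserId().equals("9cf21a4e")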
After pressing ok, we see a special version of the
snapshot icon with a question mark on it. This indicates
that this action has a condition on it. Now it's just a
waiting game for the user to hit that snapshot. This is
the point where normally you can go make yourself a cup of coffee
or even just go home and check this out the next day.
That's the beauty of this sort of instrumentation.
I can set the snapshot with a timeout of one week and just wait for users to reproduce
the problem at their convenience. It's a pretty cool
way to debug. In this case, we won't
wait long. The snapshot gets hit by the right user despite
other users coming in.
This specific request is from the right user id.
We can now review the stack information and fix a
user-specific bug. The counter metric lets us count the number of times a line of code is reached.
You can use conditions to count things like how many people
from a specific country reached a method.
If you're thinking about deprecating a
production API, this is the metric you need to keep your eye
on. Notice that metrics are logged by default, but can be piped to StatsD and Prometheus and then visualized in Grafana. There are also metrics such as tic-toc, method duration, et cetera, but I don't have time to cover all of that.
Thanks for bearing with me. Hope you enjoyed the
presentation. Please feel free to ask any
questions and also feel free to write to me. Also, check out debugagent.com,
my book and my YouTube channel where I have many tutorials on
these sorts of subjects. Thank you.