Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. Today we're going to talk about polyglot, cloud-native debugging beyond APMs.
We don't have much time, so I'll get right to it. But first, a few
things about me. I was a consultant for over a decade.
I worked at Sun, founded a couple of companies, wrote a couple of books.
I wrote a lot of open source code, and currently work as
a developer advocate for Lightrun. My email and Twitter
account are listed here, so feel free to write to me.
I have a blog that talks about debugging and production issues at talktotheduck.dev.
It would be great if you check it out and let me
know what you think. I love APMs. They are
absolutely wonderful. I'm old enough to remember a time
when they weren't around, and I'm so happy we moved past that.
This is absolutely amazing. The dashboards and
the details. You get this great dashboard with just everything you
need. Amazing. We're truly at a golden age of monitoring.
Hell, when I started, we used to monitor the server by kicking it and listening
to see if the hard drive was spinning properly. Today,
with Kubernetes, deployments have scaled to such a level that
we need tools like this to get some insight into production.
Without an APM, we're, well, not as
blind as a bat, but it's pretty close. A lot of the
issues we run into start when we notice an anomaly
in the dashboard. We see a spike in failures or
something that performs a bit too slowly. The APM
is amazing at showing those hiccups, but this is where it stops.
It can tell us that a web service performed badly or failed.
It can't tell us why. It can't point us at a
line of code. So let's stop for
a second and talk about a different line. This line. On the
one side we have developers, on the other side we have the ops
or DevOps. This is a line we've had for
a long time. It's something we drew out of necessity, because when
developers were given access to production,
well, I don't want to be too dramatic, but when developers got
access to production, it didn't end well. This was literally
the situation not too long ago. Yes, we had
sysadmins, but the whole process used to be a mess.
That was no good. We need a better
solution than this hard separation because the ops guys
don't necessarily know how to solve problems made by the developers.
They know how to solve ops problems. So when a
container has a problem and the DevOps don't know how to fix it,
well, it starts a problematic feedback loop of
test, redeploy, rinse, repeat. That isn't ideal.
Monitoring tools are like the bat signal. They come up
and we, the developers, we're Batman or Batwoman or bat
person. All of us heroes,
we step up to deal with the bugs. We're the last
line of defense against their, well, villainy.
Well, we're coder bat-people. It's kind of the same
thing without the six-pack abs, too much baked goods,
you know, in the company kitchen. Here, coder Batman needs
to know where the crime, or bugs, are happening in the code.
So these dashboards, they point us toward the
crime we have to fight in our system. But here's
where things get hard. We start digging into the logs,
trying to find the problem. The dashboard sent us in a general
direction, like a performance problem or high error rates.
But now we need to jump into logs and hope that
we can find something there that will somehow explain the problem we're
seeing. That's like going from a jet engine back
to stone age tools. There are many log processing
platforms that do an amazing job at processing these logs and
finding the gold within them. But even then it's a needle
in a haystack. That's the good outcome
where a log is already there waiting for us. But obviously we
can't have logging all over the place. Our billing will go through the roof,
and our performance, well, it will suffer.
We're stuck in the loop of add a new log, go through
CI/CD, which includes the QA cycle and everything.
This can take hours. Then reproduce the issue on the production
server with your fingers crossed, and try to analyze what
went on. Hopefully you found the issue, because if not, it's effectively
rinse and repeat for the whole process. In the meantime,
you still have a bug in production and developers are wasting their
time. There just has to be a better way.
It's 2022, and logs are the way
we solve bugs in this day and age. Don't get me
wrong, I love logs, and today's logs are
totally different from what we had even ten years ago.
But you need to know about your problems in advance for
a log to work. The problem is, I'm not clairvoyant.
When I write code, I can't tell what bugs or problems
the code will have before the code is written. I'm in the same boat
as you are. The bugs don't exist yet. So I'm
faced with a dilemma of whether to log something. This is a bit
like the dilemma of writing comments: does it make the code look
noisy and stupid? Or will I find this useful at 2:00 a.m., when everything
isn't working and I want to rip out the few strands of hair I still
have left because of this damn production issue.
Debuggers are amazing. They let us set breakpoints,
see call stacks, inspect variables, and more.
If only we could do the same for production
problems. But debuggers weren't
designed for this. They're very insecure when debugging
remotely. They can block your server while sending debugger commands remotely.
A small mistake, such as an expensive condition, can
literally destroy your server. I might be repeating an urban
legend here, but 20 or so years ago, I heard a story about
a guy who was debugging a rail system located on a
cliff. He stopped at a breakpoint during debugging,
and the multimillion-dollar hardware fell into the sea
because it didn't receive the stop command. Again,
I don't know if it's a true story, but it's plausible.
Debuggers weren't really designed for these situations either.
Debuggers are limited to one server.
If you have a cluster with multiple machines, the problem can manifest on
one machine always, or might manifest on a
random machine. We can't rely on pure luck.
If I have multiple servers with multiple languages and platforms, crossing
from one to another with a debugger, well, it's possible in theory,
but I can't even imagine it in reality.
I also want to revisit this slide, because I do
love having APMs, and looking at their dashboard gives me that type
of joy you get from seeing the result of your work plotted out
as a graph. I feel there should be a German word to describe
that sort of enjoyment. But here's the thing.
APMs aren't free. The more you instrument,
the more runtime overhead you have. The more runtime
overhead you have, the more hosts you need to handle the same amount of
tasks. The more hosts you have, the more problems you have,
and they become more complex. I feel Schrödinger should
deliver this next line. By observing, we effectively change some
of the outcome. An APM needs to
observe everything. An APM doesn't know what it's
looking for. Like I said before, it's a bat signal or a check
engine light. It's got sensors all over
the place, and it needs to receive information from
these sensors. Some sensors have almost no overhead,
but some can impact the observed application noticeably.
Some people use that as an excuse to avoid APMs,
which I feel is like throwing away the baby with the bathwater. We need
APMs; we can't manage at scale without them,
but we need to tune them. And observing everything isn't
an option. Thankfully, pretty much every APM
vendor knows that, and they all let us tune
the ratio of observability to performance so we
can get a good result. Unfortunately, that means we get
less data. Couple that with the reduced logging that
we do for the same reason, and the bad problems we
have in production just got a whole lot worse.
So let's take the Batman metaphor all the way.
We need a team-up. We need some help on the
servers, especially in a clustered polyglot environment, where the
issue can appear in one container and move to the next, et cetera.
So you remember this slide. We need some way to get through that
line, not to remove it. We like that line.
We need a way to connect with the server and debug it.
Now, I'm a developer, so I try to stay away
from management buzzwords, but the word for this is shift left.
It essentially means we're letting the developers and QA
get back some of the access we used to have into
ops, without demolishing the gains
we've made in security and stability. We love the ops people,
and we need them. So this is about helping
them keep everything running smoothly in production without stepping
on their toes or blowing up their deployment.
This leads us here. What if you could connect your
server to a debugger agent that would make
sure you don't overload the server and don't make a mistake,
like setting a breakpoint or something like that?
That's what continuous observability does.
Continuous observability is complementary to the APM.
It works very differently. Before we
go on, I'd like to explain what continuous observability is.
Observability is defined as the ability to understand how
your system works on the inside, without shipping
new code. The "without shipping new code" portion
is key. But what's continuous observability?
With continuous observability, we don't ship new code either,
but we can ask questions about the code. Normal observability
works by instrumenting everything and receiving the information.
With continuous observability, we flip that: we ask
questions, and then instrument based on the questions.
So how does that work in practice?
Each tool in this field is different. I'll explain the Lightrun
architecture, since that's what I'm familiar with, and I'll
try to qualify where it differs from other tools.
In Lightrun, we use a native IDE plugin for VS Code or JetBrains
IDEs, such as IntelliJ, PyCharm,
WebStorm, etc. You can also use a command-line tool;
other tools sometimes have a web interface or a CLI
only. This client lets us interact with
the Lightrun server. This is an important piece of the
architecture that hides the actual production environment.
Developers don't get access to the production area,
which is still the purview of DevOps.
We can insert an action, which can be a log, a
snapshot, or a metric. I'll show all of these soon enough.
This talk will get to the code portions soon.
Notice that the Lightrun server can be installed in
the cloud as a SaaS or on premises, and managed
by Ops. The management server sends
everything to the agent which is installed on your
production or staging servers. This is pretty standard
for continuous observability solutions; I don't know exactly how
other solutions work, but I assume they are pretty similar. This means
there's a clear separation between the developer and production.
As you can see, the DevOps still have that guarding line
we were talking about. They need to connect the agent
to the management server, and that's where their job ends.
Developers don't have direct access to production, only through
the management server. That means no danger to the running
production servers from a careless developer, well, like myself.
The agent is just a small runtime you add to your production or staging server.
It's very low overhead, and it implements the debugging logic.
Finally, everything is piped through the server back to your IDE directly.
So as a developer, you can keep working in the IDE without
leaving your comfort zone. Okay,
that should raise the vendor alert right here.
I've heard that bullshit line before, right?
APMs have been around forever and have been optimized.
How can a new tool claim to have lower overhead than an established and
proven solution? As I said before,
APMs look at everything. A continuous observability
tool is surgical. That means that when
an APM raises an interesting
issue, we can then look at a very specific thing, like a
line of code. When a continuous observability solution isn't
running, its overhead is almost nothing. It literally
does nothing other than check whether we need it. It doesn't
report anything and is practically idle. When
we do need it, we need more data than the APM does,
but we get it from one specific area of the code. So there is
an overhead, but because it only impacts one area
of the code, it's tiny. This is
the obvious question. What if I look at code
that gets invoked a lot? As I said, continuous observability
gets even more data than an APM does. This can
bring down the system and, well, we could end up
here. So this is where continuous observability
tools differ. Some tools provide the ability to throttle
expensive actions and only show you some of the information.
This is a big deal in high volume requests.
I'm going to show you two demos that highlight what we can do,
and the first is a simple hello world Flask server.
So this is a simple hello world Flask app, which
is running in PyCharm. I'll demonstrate VS Code soon.
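For context, here's a minimal sketch of what such an app might look like (the transcript doesn't include the actual demo code, so the route and the name variable are assumptions):

```python
# Hypothetical reconstruction of the demo's hello world Flask app.
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def hello():
    # 'name' is the variable we'll log dynamically, without redeploying.
    name = request.args.get("name", "world")
    return f"Hello, {name}!"

if __name__ == "__main__":
    app.run()
```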
First, I right-click and select the Log option in the menu. A log
lets me inject a new log into the code at
runtime, without restarting the server.
But there is more. See here: I can log any
expression or variable from the currently running
app. In this case, I am logging the value of
name. Logs can appear in the console below,
or they can appear with the rest of the logs from the code.
Let's press the OK button, which inserts
the new log. We can now see the dynamic log appearing just
above the line, as if it were a line we added to the
code. Now let's go to the browser window and hit refresh.
Then we go back to the IDE, and within a
matter of seconds we can see the log. Notice you can send
the log to the IDE, or have it integrated
with the other logs from your app.
Let's delete the log and select a snapshot instead.
A snapshot is kind of like a breakpoint you have in a regular debugger,
but it has one major difference: it doesn't break.
It doesn't stop the threads. When it's hit, it grabs the stack
information, variable values, et cetera, but doesn't stop
the thread. So a user in production isn't
stuck because you're debugging.
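To make the non-blocking idea concrete, here's a toy Python sketch of what a snapshot conceptually captures (an illustration only, not Lightrun's actual mechanism):

```python
# Illustration only: capture the stack and local variables at a point
# in the code without suspending the thread.
import inspect
import traceback

def capture_snapshot():
    frame = inspect.currentframe().f_back  # the caller's frame
    return {
        "stack": traceback.format_stack(frame),  # call stack at the hit
        "locals": dict(frame.f_locals),          # variable values at the hit
    }

def handle_request(name):
    greeting = f"Hello, {name}!"
    snap = capture_snapshot()  # the thread keeps running; nothing blocks
    return greeting, snap
```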
Let's go back to the web UI and hit the refresh button to see the snapshot in action.
Then we can go back to the IDE and wait for the snapshot result to
arrive. Below, you can see the result of the snapshot,
following JetBrains IDE conventions. You can
walk the stack like you can with any breakpoint,
inspect variable values like you could with any debugger,
and all of that doesn't bother any live user in the
system. I skipped a lot of interesting features here, such
as the ability to define conditional logs or snapshots,
which let you do things like define a snapshot
that's only hit when a specific user is in the system.
That means that if a user somewhere has a bug,
you can literally get information that's specific only to that user
and no one else. That's pretty cool.
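As an example, a condition for such a snapshot is just an expression evaluated in the running code's context; something along these lines (the variable name here is hypothetical):

```python
# Hypothetical snapshot condition: only capture data for one user.
user_id == "affected-user@example.com"
```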
Airflow lets you write workflows with Python
and execute them at scale. There are many
frameworks with similar concepts, such as Spark,
etc. Lots of them have different
use cases and target demographics, but they have
one core concept in common: they launch workers that
are short-lived. They perform a task and return a
response. In the case of Airflow, this is commonly used
for processing data at scale. A good
example of this is tagging or classifying
images. Here we have multiple independent
processes that can take pieces of data,
process them, and return a result. One of
the nice things about these tools is that we can create chains
of dependencies, where results get passed from one process
to another to use computing resources
in the most effective way.
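For reference, a minimal sketch of an Airflow DAG along these lines (the task name and classifier are made up; the demo's real code isn't in the transcript):

```python
# A toy image-classification DAG: one short-lived worker per task.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def classify_image(path):
    # Placeholder for a real model: returns a label for a single image.
    return "cat" if "cat" in path else "unknown"

with DAG("classify_images", start_date=datetime(2022, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(
        task_id="classify",
        python_callable=classify_image,
        op_args=["/data/img_001.jpg"],
    )
```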
But here's the problem: this thing is nearly impossible to debug.
It's so bad that companies just let major bugs
live in production and accept a pretty terrible error
rate, because they just can't debug this thing.
They have logs, but with the ephemeral processes,
they lose the context very quickly.
This is a perfect use case for continuous observability
tools that can deliver more.
Airflow lets you break down huge tasks, like
classifying a large set of images, into distributed
workers that can run on multiple machines and use
available resources intelligently. This makes
it nearly impossible to debug. Your worker might
run somewhere, and all you have is a log after
the fact, which you would need to dig through to check for
a bug or a fix.
This time, I'll use VS Code to demonstrate this functionality.
This is a simple Airflow demo that classifies images.
The problem with Airflow is that we don't have a long-running
agent or server on which our code is running.
An agent can come up, process, and
then go away. By the time we set the snapshot
in place, it will be gone. This is where
tags come in. Tags let us apply an action, such
as a log or a snapshot, to a group of agents.
That means that every agent that comes up with the given tag
will implicitly get the actions of that tag.
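Conceptually, tags work something like this sketch (purely illustrative; the tag and action names are made up):

```python
# Actions are attached to a tag; any agent that comes up with that tag
# picks them up, even if it only lives for a few seconds.
tag_actions = {"airflow-workers": ["snapshot @ classify_image, line 42"]}

def agent_starts(tags):
    actions = [a for t in tags for a in tag_actions.get(t, [])]
    print("applying actions:", actions)

agent_starts(["airflow-workers"])  # the short-lived worker gets the snapshot
```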
By the way, notice that in VS Code we need to add
actions from the left pane. The UI
is a bit different here. Adding an action
to a tag is pretty similar to adding it to an agent.
We just add it, and it looks the same so
far. Now that it's added,
let's move to the agents view and wait for the agent to
come online and trigger the snapshot. By the
way, notice that the UI for all of this is
similar in spirit to the one in PyCharm.
Now we have an agent that's running, and we got a notification
that the snapshot was hit. Let's go into the snapshots
tab and click the snapshot. Unlike PyCharm,
we need to open the snapshot manually, and it looks like
a VS Code breakpoint, which is good, as it's native to
the IDE. But the core idea, the
UI of the snapshot with the stack, variables, etc., that's the
same as it was in PyCharm.
The title of this talk refers to polyglot debugging.
Because of time constraints, I can't show the full polyglot
demo, but let's look at this simple Kotlin prime number calculator.
It simply loops over numbers and checks if they are prime.
It sleeps for ten milliseconds, so it won't completely
demolish the CPU, but other than that, it's a pretty
simple application. It just counts
the number of primes it finds along the way and prints
the result at the end. We use this code
a lot when debugging, since it's CPU-intensive and yet
very simple. In this case, we would like to observe
the variable i, which is the value we're evaluating
here, and print out cnt, which represents the number
of primes we found so far.
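The demo itself is Kotlin; since the transcript doesn't include the code, here is a Python rendering of the same structure, keeping the talk's i and cnt names:

```python
import time

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

cnt = 0                       # number of primes found so far
for i in range(2, 10_000):    # i is the value we're evaluating
    if is_prime(i):
        cnt += 1
    time.sleep(0.01)          # sleep 10 ms so we don't demolish the CPU
print(f"Found {cnt} primes")
```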
The simplest tool we have is the ability to inject a log into an application.
We can also inject a snapshot or add metrics. I'll discuss
all of those soon. Selecting Log
opens the UI to enter a new log.
I can write more than just text: in the curly braces,
I can include any expression I want, such as the
values of the variables included
in this expression. I can also invoke methods and do all
sorts of things. But here's the thing. If I invoke a
method that's too computationally intensive,
or if I invoke a method that changes the application state,
the log won't be added. I'll get an error.
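For example, a dynamic log template for this demo might look like this (hypothetical, but matching the demo's variables):

```python
"Evaluating i={i}, primes found so far: {cnt}"
```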
After clicking OK, we see the log appearing above
the line in the IDE. Notice that this behavior is specific to
IntelliJ and other JetBrains IDEs. In Visual Studio
Code, it will show a marker on the side.
Once the log is hit, we'll see logs
appear in batches, like before. You'll notice that the experience
is pretty much identical to the one we had
with Python. The next thing
I want to talk about is metrics,
and this is a different demo that I usually use to
show metrics. It's based on Java, actually fitting
the polyglot theme.
APMs give us large-scale performance information, but they
don't tell us fine-grained details.
Here we can count the number of times a line of code was reached
using a counter. We can even use a condition
to qualify that, so we can do
something like count the number of times a specific user
reached that line of code. We also have a
method duration, which tells us how long
a method took to execute. We can even measure the
time it takes to perform a code block using a
TicToc. This lets us narrow down the performance
impact of a larger method to a specific problematic segment.
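Conceptually, a method duration metric behaves like wrapping the method with a timer, as in this illustrative Python sketch (not Lightrun's actual mechanism):

```python
import time
from functools import wraps

def method_duration(name):
    def wrap(fn):
        @wraps(fn)
        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                ms = (time.perf_counter() - start) * 1000
                print(f"{name}: {ms:.2f} ms")  # or pipe to statsd/Prometheus
        return timed
    return wrap

@method_duration("process_request")  # the clear name for the measurement
def process_request():
    time.sleep(0.05)

process_request()  # prints something like: process_request: 50.31 ms
```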
In this case, I'll just use the method duration.
Measurements typically have a name
under which you can pipe them or log them,
so I'll just give this method duration a clear name.
In this case, I'm just printing it out to the console,
but all of these measurements can be piped to statsd and
Prometheus.
I'm a pretty awful DevOps, so I really
don't want to demo that in this case, but it does work if
you know how to use those tools. As you
can see, the duration information is now piped into the logs and
provides us some information on the current performance of
this method.
In closing, I'd like to go over what we discussed here,
and a few things we didn't have time for.
Lightrun supports JVM languages like Java,
Kotlin, Scala, etc. Every JVM language
is supported. It supports Node,
for both JavaScript and TypeScript code, and of course
Python, even complex stuff like Airflow.
We're working hard on adding new platforms, and we're
going to be doing that really fast. So if
you want a new platform I didn't mention here,
just write to me and I'll connect you with the product team.
You can become a beta tester for the new platform and have an impact on
the direction we take. When we add actions,
conditions run within a sandbox, so they don't take
up CPU or crash the system. This all happens
without networking, so something like a networking hiccup won't
crash the server.
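The gist of the sandbox idea, as a rough Python illustration (a real sandbox also bounds CPU time; this only shows the blocking of side effects):

```python
# Evaluate a user-supplied condition against the code's variables with
# no builtins, so the expression can't touch files, sockets, or state.
def eval_condition(expr, variables):
    return eval(expr, {"__builtins__": {}}, dict(variables))

print(eval_condition("user == 'alice' and cnt > 10",
                     {"user": "alice", "cnt": 42}))  # True
```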
Security is especially crucial with solutions like this. One of the core concepts is
that the server queries information,
not the other way around, as you would see with solutions such as JDWP.
This means operations are atomic, and the server can be hidden behind
a firewall, even from the rest of the organization.
PII redaction lets us define conditions that would obscure
patterns in the logs. So if a user could print
out a credit card number by mistake, you can
define a rule that would block that. This way, the bad
data won't make it into your logs and won't expose you to liability.
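As a rough illustration of the idea (not Lightrun's implementation), a redaction rule could be a pattern applied to every log line before it's emitted:

```python
import re

# Mask anything that looks like a credit card number before logging.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(message):
    return CARD_RE.sub("[REDACTED]", message)

print(redact("charged card 4111 1111 1111 1111"))  # charged card [REDACTED]
```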
Blacklisting lets us block specific classes, methods, or
files. This means you can block developers in your organization
from debugging specific files, so a developer won't
be able to put a snapshot or a log in a place
where a password might be available, to steal user credentials or stuff
like that. This is hugely important in large organizations.
Besides the sandbox, I'd like to also mention that Lightrun is very
efficient: in our benchmarks, it has almost no runtime impact
when it isn't used, and a very small impact
even with multiple actions in place. Finally,
Lightrun can be used from the cloud or via an on-premise install.
It works with any deployment you might have, whether cloud
based, container based, on premise, microservice, serverless, etc.
Thanks for bearing with me. I hope you enjoyed the presentation.
Please feel free to ask any questions, and feel free to write to
me. Also, please check out talktotheduck.dev, where I talk
about debugging in depth.