Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Today we're going to talk about debugging Node.js
applications in production with Lightrun.
But first, a few things about me: I was a consultant for over
a decade, I worked at Sun, founded a couple of companies,
and wrote a couple of books. I wrote a lot of open source
code, and I currently work as a developer advocate for Lightrun.
My email and Twitter accounts are listed here, so feel free to
write to me. I have a new blog that talks about
debugging and production issues at talktotheduck.dev.
It would be great if you check it out and let me know what you
think. I want to put my heart out there.
I love APMs. They are absolutely wonderful.
I'm old enough to remember a time when they weren't around,
and I'm just so happy that we moved past that.
They are absolutely amazing, the dashboards and the details.
You get these great dashboards with just everything you need.
Amazing. We're truly in the golden age of monitoring.
Hell, when I started, we used to literally monitor the server
by kicking it and listening to see if the hard
drive was spinning properly. These vendors
provide amazing dashboards that let you see every stat
that matters in your app. Monitoring is absolutely
amazing. And if you don't use these tools on your production servers,
well, you're missing out and you might have issues you're
unaware of. In fact, a lot of these issues we run
into start when we notice an anomaly in the dashboard.
We see a spike in failures or something
that performs a bit too slowly. The APM
is amazing at showing those hiccups, but this is where
it stops. It can tell us that a web service performed
badly or failed. It can't tell us why.
It can't point at the line of code we need.
So let's stop for a second and talk about a
different line altogether. On one side of this line
we have developers; on the other, we have ops or DevOps.
This is a line we've had for a long time.
It's something we drew out of necessity because when developers
were given access to production, well,
I don't want to be too dramatic, but when developers
got access to production, it didn't end well.
This was literally the situation not too long ago.
Yes, we had sysadmins, but the whole
process used to be a mess. That was no good.
But we need a better solution than this hard separation,
because the ops folks don't necessarily know how
to solve the problems made by the developers.
They know how to solve ops problems; that's really
what those monitoring tools are all about.
They're the bat signal. It comes up, and
we're Batman, or Batwoman, or Bat-person.
All of us are damn heroes in these cases.
We step up to deal with the bugs, and we're the last
line of defense against their villainy.
DevOps have enough villains of their own
that they need to deal with, so for this case they need us.
Well, we're coders, but it's kind of the same thing,
just without the six-pack abs; too many baked goods
in the company kitchen. Over here, coder Batman needs
to know where the crime, the bugs, is happening in the code.
So those dashboards point us in the general direction
of where the crime is, but not at
the exact location where we need to fight it.
And here's where things get hard. We start digging into
logs, trying to find the problem. The dashboard sent us in a
general direction, like a performance problem or high error rates.
But now we need to jump into logs and hope that we can
find something there that will somehow explain the problem we're
seeing. That's like going from a jet
engine back to Stone Age tools. Now, there are
many log processing platforms that do an amazing job at
processing these logs and finding the gold within them.
But even then it's a needle in a haystack. And that's
the good outcome, where a log is already
there waiting for us. But obviously we can't have logging
all over the place. Our billing would go through the roof and our performance will.
Well, you know. We're stuck
in this loop: add a new log, go through CI/CD,
which includes the QA cycle and everything. This can
take hours. Then reproduce the issue on the production
server with your fingers crossed and try to analyze
what went on. Hopefully you found the issue, because
if not, it's effectively rinse and repeat for the whole
process. In the meantime, you still have a bug in production and
developers are wasting their time. This is so
bad. Some developers even advocate extreme solutions:
just build everything in the cloud as serverless functions to
avoid that inconsistency between local and remote environments.
Gareth, who's quoted here, puts a lot of faith
in isolated unit tests for his functions.
Having written a lot of code, I have very limited faith
in my tests. Then there's the other
extreme that sees even microservices as
a way to get more money. I don't even need to ask Vlad what
he thinks about lambda. My personal opinion on this is
somewhere in the middle, but closer to Vlad's, probably because I'm old.
I do think we need microservices and serverless in the right circumstances,
though. But regardless of your opinions here,
debugging in production is a hard and risky
process, even with fast deployments,
especially if you have a team larger than one person.
And Gareth is correct when he says that production and local
development can never be quite the same.
There are just too many variables and too much noise. We can't
possibly avoid production only bugs.
There just has to be a better way. It's the 2020s,
and logs are still the way we solve bugs in this day
and age. Don't get me wrong, I love logs, and today's
logs are totally different from what we had even ten years ago.
But you need to know about your problems in advance for a log to
work. The problem is, I'm not clairvoyant.
When I write code, I can't tell what bugs or
problems the code will have before the code is written.
I'm in the same boat as you are. The bug doesn't
exist yet, so I'm faced with the dilemma of whether
to log something. This is a bit like the
dilemma of writing comments. Does it make the code look
noisy and stupid, or will I find it
useful at 2:00 a.m., when everything isn't working and I want to rip out
the few strands of hair I still have left because of this damn
production problem? I'm sure you've all been in that
situation. I've been through it quite a lot,
as you can see on my head.
Debuggers are amazing. They let
us set breakpoints, see call stacks, inspect variables,
and more. If only we could do the same for production
systems. But debuggers weren't designed for this.
They're very insecure when debugging remotely, and they can
block your server while sending debug commands remotely.
A small mistake, such as an expensive condition, can literally
destroy your server. I might be repeating an urban legend
here, but 20 or so years ago I heard a story about a guy
who was debugging a rail system located on a cliff.
He stopped at a breakpoint during debugging, and the multimillion-dollar
hardware fell into the sea because it didn't
receive the stop command. Again, I don't know if it's a true
story, but it's plausible. Debuggers weren't
really designed for those situations.
Worse, debuggers are limited to one server. If you
have a cluster with multiple machines, the problem can manifest on one
machine always, or it might manifest on a random machine.
We can't rely on pure luck.
If I have multiple servers with multiple languages and platforms,
crossing from one to another with the debugger?
Well, it's possible in theory,
but I can't even imagine it in reality.
Let's take the Batman metaphor all the way. We need
a team up. We need some help on the servers,
especially in a clustered polyglot environment where the issues
can appear on one server, move to the next, etc.
So, you remember this slide. We need some way to get
through that line. We need a way to connect to the server
and debug it. What if you could
connect your servers to a debugger agent
that would make sure you don't overload the server and don't make
a mistake, like setting a blocking breakpoint or something like that?
That's what Lightrun does. It lets you connect securely
through your own server so you physically don't
have access to production. That means that lovely
line we have with DevOps is maintained.
DevOps have their control and know you can't
break production. Developers still get insight and can investigate
an issue to avoid the redeploy cycle.
There's a safety net in place protecting production
from harm, intentional or otherwise.
So how does this thing work? Well, we install the plugin
in our IDE, and it lets us interact
with the Lightrun server. We can insert an action,
which can be a log or a snapshot, which I'll show
soon enough. Right now Lightrun supports WebStorm,
IntelliJ IDEA, etc., but there's a Visual Studio
Code plugin that will look very similar to what you're seeing here.
It will launch soon. Notice there's also a
command line option, which I won't discuss here. The management
server sends everything to the agent, which is
installed on your server. That means there's a clean separation
between developer and production.
DevOps still has that guarding line we were talking about,
and we don't have direct access to production. That means
no danger to the running production servers from careless developers,
well, like myself. The agent
is just a small runtime you add
to your servers. It's very low overhead, and it implements
the debugging logic. Finally, everything is piped through
the server back to your IDE directly, so as a
developer you can keep working in the IDE without
leaving your comfort zone. So let's go over a simple
hello world demo so we're all on the same page.
This is a simple Express TypeScript hello world app.
It should be very familiar to anyone using Node.js.
I opened it here in WebStorm, and I want to debug
that app in the same way I would debug a real-world app
in production. To do that, we need to sign
up on Lightrun.com and set up an agent and
IDE plugin, which will let us debug this application.
Integrating Lightrun requires two steps. The first
is very easy: just run npm install
lightrun to add Lightrun from npm.
The second stage is adding this code
to the top of your main file. Notice that you
need the custom code from Lightrun, which includes the correct Lightrun
secret value. Now we just log in from the
Lightrun IDE plugin and we're in business.
Back in the Express TypeScript hello world app,
we can add the code snippet to the top of the file and
run the server. Here's the hello world
demo running. Let's go to the code and add a
snapshot for the main page request. This snapshot
will be hit when we load the main page.
Now let's add a log to the method covering
the hello URL. This is injected without
restarting the server or anything like that.
A snapshot is like a breakpoint. It gives us the
stack traces and variable values, but it doesn't stop
execution in place, so a production server
doesn't block on a snapshot. The camera icon we
see on the left is on the line where
we set up the snapshot.
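This behavior can be sketched in a few lines. To be clear, this is not Lightrun's implementation; the takeSnapshot name and shape are my own invention to illustrate the idea of capturing state without pausing:

```javascript
// Conceptual sketch only (not the Lightrun API): a snapshot grabs the
// current stack trace and a copy of selected variables, then execution
// continues immediately; nothing blocks, unlike a classic breakpoint.
function takeSnapshot(label, vars) {
  return {
    label,
    vars: { ...vars },          // shallow copy of the captured variables
    stack: new Error().stack,   // the call stack at the capture point
    capturedAt: new Date().toISOString(),
  };
}

// e.g. inside the main-page request handler:
const snap = takeSnapshot("main-page", { path: "/", user: "anonymous" });
console.log(snap.label, Object.keys(snap.vars));
```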
Let's reload. Now, back in the IDE, I
can see the stack trace popped up.
We only have one method on this call stack, which isn't
as valuable as usual, but we can see the variables and we
can see that the snapshot was hit.
Let's go back to the browser and check the hello URL.
Notice it printed world.
Now in the IDE we can see the new log we injected,
instantly. Notice the response
in the code is world, which is what we indeed saw
in the browser. I wanted a better demo,
but trying to run and explain a truly complex environment
in a conference session will probably confuse all
of us more than it will teach, myself included.
So I'm doing a relatively simple demo.
In this app, all the front end logic is handled from the web
down to the start of the backend, with Node.js
and JavaScript. In the previous sample I showed off support
for TypeScript, so using JavaScript here makes sense.
Java is pretty common in the enterprise backend, so in this scenario that
makes sense too. We're storing the data using Spring
Boot, and the front end and a portion
of the backend are done in Node.js.
So now that we've got this out of the way, let's go to
the simple to-do app used by so many demos in the
past. Now, the funny thing is that I was going to
demonstrate something completely different and ran into this
bug by accident as I was working on the code.
Well, opportunity knocks. I can't clear completed
to-do items in the app. Well,
it's time to pull out the debugger and figure out what's going
on. The only problem is that these
are two servers running in remote containers in production.
I can't just debug them. So let's look at
the code and see if I can spot the problem.
This is the method that implements clearCompleted in the node
server. I assume you all know how to debug the client,
so I didn't really show that part. But even if we don't,
the technique would instantly show whether the problem
is in the client or not. You'll notice
that we don't have any logging here. That would have been really nice,
but obviously we can't put logs everywhere.
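The transcript doesn't include the method's source, so as a stand-in, here is a hypothetical sketch of what a clear-completed implementation on the node server might look like. The todos array and the clearCompleted name are my assumptions for illustration, not the demo's actual code:

```javascript
// Hypothetical in-memory to-do store; illustrative only, not the demo's code.
let todos = [
  { id: 1, title: "write slides", completed: false },
  { id: 2, title: "record demo", completed: true },
];

// Drop every item marked completed and return what remains,
// roughly what a clear-completed endpoint would do.
function clearCompleted() {
  todos = todos.filter((todo) => !todo.completed);
  return todos;
}
```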
Besides the huge cost of logs at
scale, they also cost performance on the individual
machine, so I'm not a huge fan
of overlogging. So, I already installed Lightrun
on both servers here. I also have Lightrun
installed in my IDE, so I can just add a snapshot
to the method to see if it's even reached.
As you may recall, a snapshot is like a breakpoint:
it gives us the stack trace and the variable values, but it
doesn't stop execution in place.
But we also want to see if the end
of the method was reached: did we
return from this method? If only there was a log.
Well, we can inject new logs at runtime.
We can just add a new log, type in the details we want in
that log, and bam, we can see that log
right here. The cool thing is that if you
have built-in logs, this log will be integrated
with the rest of those logs, so you can see it in context.
In this case, I selected that the log will pipe
into the IDE console too, which is more convenient for
demos and quick tests. But when you do this in real life,
viewing the log as if it was written in
the code is a game changer
that can literally save you millions. I know
we say that about a lot of features, but the ability to
see these things in context can seriously shorten the amount of
time spent debugging. It also means you can eliminate
a lot of your redundant debugging overhead and
reduce your cloud log storage costs significantly.
The savings in a large company can be very
significant. Now let's go back into the browser
and press that dysfunctional button. Then go back into
the IDE and wait a couple of seconds for the
whole loop of events to reach us.
Now we can see the snapshot was hit.
That means the call reached that method,
which already eliminates a lot of potential problems.
Now we can look at the stack trace on the
left to see who invoked that method, or look at the
variables on the right to see if there is an interesting variable
we missed. We should also check out the log console,
where we can see that the method reached the log line,
so we didn't get an exception or some
other problem. This is odd. Let's check the Spring Boot
server. This is the Spring Boot
server code. Notice this is Java code representing a
REST request for the clear-completed API.
It deletes all the elements that are marked completed, then returns
the to-do items that remain. I want to add a log
here, just like I did in the Node.js code, but in this case
I'm going to do something a bit different: I'm going to add
an expression to the log. I'm going to print out the number of items
in the response variable, which is of type List.
That means I'll literally invoke the size method on
that object, within the curly braces here.
Notice you could do the same exact thing in the Node.js project,
using JavaScript code for the expression.
Now, you might be thinking, why not print the whole
thing? Why not print the response object?
Well, it won't work well.
There are two reasons. First, it might output too much text
and get cropped. But the more important aspect is evaluation
speed. When we're running in production environments,
changes can be very dangerous. If we
evaluate these expressions dynamically, it might take too much
CPU and crash our server. So Lightrun limits the
amount of operations you can do in an expression.
If I print that list, Lightrun might just eliminate my log because
of CPU usage. So the code here needs to
be very efficient.
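The difference between a safe expression and a risky one is easy to see in code. This comparison is my own illustration of the principle, not Lightrun's actual cost model:

```javascript
// Cheap: constant time, safe to evaluate on every single hit.
function cheapExpression(response) {
  return response.length; // just the element count, like list.size() in Java
}

// Expensive: walks and serializes the entire structure. On a large
// response this burns CPU, which is exactly the kind of work a
// sandboxed expression evaluator would refuse to run.
function expensiveExpression(response) {
  return JSON.stringify(response);
}
```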
So let's press OK to add the new
log. We'll then go to the browser to trigger
the event. I'll press clear completed again, and
then we'll go back to the server. Back in
the server, we can see the log showing, and that's weird.
The API is showing that we're returning two elements.
That doesn't make sense, but this is the core idea of
debugging: validate your assumptions.
I don't get that. I want to check the values,
so for that I need to edit the log entry, and instead of
listing the size of the list, I want to print out the content of the
first element in the list using a get(0) method call.
Pressing OK deletes the current log and creates a new log
object in its place. We now need
to go back to the web app and clear the completed elements again,
and after a short wait the log prints out the result.
It seems that this is the bug.
Looking at the values of the object, it's obvious that the completed
field is false on the server. That's why
it wasn't cleared. With that in hand,
I looked at the code that toggles the completed field on the Node.js
server, and oops: it seems that
code was never written. The completed field
was only set on initialization. So that's
the bug. That's very cool, but that's
just one type of production problem. For a different production
problem, let's go back to the slides and look at one
of my first slides. This is
the dashboard I found on Google. It isn't mine,
but you can probably see a couple of interesting things here.
I mean, what's this? What the
hell happened here? If you use a dashboard, you probably see
stuff like that all the time and ask yourself the exact
same question. Now, APMs are pretty sweet.
They will usually tell you which entry point caused a performance issue,
which CPU worked too hard, where memory was used up,
and all of those things. But what they won't tell
you is which block of code took too long.
Let's say we have an entry point that takes too long to
perform. Now what?
We need to review deep call chains all over the place to
try and pinpoint a specific operation.
This might be something we can't reproduce locally.
Let's go back into the IDE briefly and go over
the options here in the metrics menu.
We can add measurements that give us fine-grained benchmarks
at the code level. We can count the number of
times a line of code was hit. We can measure the duration of
a block of code or a method. We can even add more complex
metrics based on custom expression evaluation. The resulting
metrics can be periodically logged or piped
into Prometheus, StatsD, etc. This provides
that missing piece that we need to see beyond what the APM
provides us. That way we won't spend ages optimizing the
wrong area because the production information didn't include
all of the lower-level information. Please notice
that at this moment measurements are only available for JVM languages,
but they will be coming to Node.js and Python in an update
that's coming soon. In closing,
I'd like to go over what we discussed here and a few things we
didn't have time for. Lightrun supports JVM
languages like Java, Kotlin, Scala, etc. It supports
Node.js, for both JavaScript and TypeScript, and
Python, even complex stuff like Airflow.
When we add actions, conditions run within a sandbox,
so they don't take up CPU or crash the system.
This all happens without networking, so something like a networking
hiccup won't crash the server. Security is
especially crucial with solutions like this.
One of the core concepts is that the server
queries information, rather than the other way around as
you would see with solutions such as JDWP, etc.
This means operations are atomic, and the server can
be hidden behind firewalls, even from the rest of the organization.
PII redaction lets us define conditions
that obscure patterns in the logs. So if a user could
print out a credit card number by mistake, you can define a
rule that would block that. This way the bad
data won't make it into your logs and won't expose you
to liability. Blacklisting lets us block
specific classes, methods, or files. This means
you can block developers in your organization from debugging
specific files, so a developer won't be able
to put a snapshot or a log
in a place where a password, for instance, might be
available, to steal user credentials or stuff like that.
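The redaction idea can be sketched as a pattern-plus-mask rule. The regex and the mask below are my own illustration, not Lightrun's actual rule syntax:

```javascript
// Illustrative redaction rule: match card-number-like digit runs
// (13 to 16 digits, optionally separated by spaces or dashes) and mask
// them before the line ever reaches the logs.
const CARD_PATTERN = /\b\d(?:[ -]?\d){12,15}\b/g;

function redact(line) {
  return line.replace(CARD_PATTERN, "[REDACTED]");
}
```

A real rule engine would apply such patterns to every log action automatically, so the bad data is masked no matter which developer added the log.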
Besides the sandbox, I'd also like to mention that Lightrun
is very efficient: in our benchmarks it has almost no runtime
impact when it isn't used, and it has a very small
impact even with multiple actions in place.
Finally, Lightrun can be used from the cloud or via
an on-premise install. It works with any deployment
you might have, whether cloud based, container based, on premise,
microservices, serverless, etc.
Thanks for bearing with me. I hope you enjoyed the presentation.
Please feel free to ask questions and also feel free to write to
me. Also, please check out talktotheduck.dev,
where I talk about debugging in depth. And check out
lightrun.com, which I think you will like a lot.
If you have any questions, my email is listed here and I'll be happy
to help. Thank you.