Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. Today we're going to talk about debugging Node.js
applications in production with Lightrun.
But first, a few things about me: I was a consultant for over
a decade, I worked at Sun, founded a couple of companies,
and wrote a couple of books. I wrote a lot of open source
code, and I currently work as a developer advocate for Lightrun.
My email and Twitter accounts are listed here, so feel free to
write to me. I have a new blog that talks about
debugging and production issues at talktotheduck.dev.
It would be great if you check it out and let me know what you
think. I want to put my heart out there.
I love APMs. They are absolutely wonderful.
I'm old enough to remember a time when they weren't around,
and I'm just so happy that we moved past that.
They are absolutely amazing, the dashboards and the details.
You get these great dashboards with just everything you need.
Amazing. We're truly in the golden age of monitoring.
Hell, when I started, we used to literally monitor the server
by kicking it and listening to see if the hard
drive was spinning properly. These vendors
provide amazing dashboards that let you see every stat
that matters in your app. Monitoring is absolutely
amazing. And if you don't use these tools on your production servers,
well, you're missing out and you might have issues you're
unaware of. In fact, a lot of these issues we run
into start when we notice an anomaly in the dashboard.
We see a spike in failures or something
that performs a bit too slowly. The APM
is amazing at showing those hiccups, but this is where
it stops. It can tell us that a web service performed
badly or failed. It can't tell us why.
It can't point at the line of code we need.
So let's stop for a second and talk about a
different line altogether. On one side of this line
we have developers; on the other, we have ops or DevOps.
This is a line we've had for a long time.
It's something we drew out of necessity because when developers
were given access to production, well,
I don't want to be too dramatic, but when developers
got access to production, it didn't end well.
This was literally the situation not too long ago.
Yes, we had sysadmins, but the whole
process used to be a mess. That was no good.
But we need a better solution than this hard separation,
because the ops folks don't necessarily know how
to solve the problems made by the developers.
They know how to solve ops problems; that's really
what those monitoring tools are all about.
They're the bat signal. It comes up, and
we're Batman, or Batwoman, or Bat-person.
All of us are damn heroes in these cases.
We step up to deal with the bugs, and we're the last
line of defense against their villainy.
DevOps have enough villains of their own
that they need to deal with, so for this case they need us.
Well, we're coders, but it's kind of the same thing,
just without the six-pack abs; too many baked goods
in the company kitchen. Over here, coder Batman needs
to know where the crime, the bugs, is happening in the code.
So those dashboards point us in the general direction
of where the crime is, but not at
the exact location where we need to fight it.
And here's where things get hard. We start digging into
logs, trying to find the problem. The dashboard sent us in a
general direction, like a performance problem or high error rates.
But now we need to jump into logs and hope that we can
find something there that will somehow explain the problem we're
seeing. That's like going from a jet
engine back to Stone Age tools. Now, there are
many log processing platforms that do an amazing job at
processing these logs and finding the gold within them.
But even then it's a needle in a haystack. And that's
the good outcome, where a log is already
there waiting for us. But obviously we can't have logging
all over the place. Our billing would go through the roof and our performance will.
Well, you know. We're stuck
in this loop: add a new log, go through CI/CD,
which includes the QA cycle and everything. This can
take hours. Then reproduce the issue on the production
server with your fingers crossed and try to analyze
what went on. Hopefully you found the issue, because
if not, it's effectively rinse and repeat for the whole
process. In the meantime, you still have a bug in production and
developers are wasting their time. This is so
bad. Some developers even advocate extreme solutions:
just build everything in the cloud as serverless functions to
avoid that inconsistency between local and remote environments.
Gareth, who's quoted here, puts a lot of faith
in isolated unit tests for his functions.
Having written a lot of code, I have very limited faith
in my tests. Then there's the other
extreme that sees even microservices as
a way to get more money. I don't even need to ask Vlad what
he thinks about lambda. My personal opinion on this is
somewhere in the middle, but closer to Vlad's, probably because I'm old.
I do think we need microservices and serverless in the right circumstances,
though. But regardless of your opinions here,
debugging in production is a hard and risky
process, even with fast deployments,
especially if you have a team larger than one person.
And Gareth is correct when he says that production and local
development can never be quite the same.
There are just too many variables and too much noise. We can't
possibly avoid production only bugs.
There just has to be a better way. It's the 2020s,
and logs are still the way we solve bugs in this day
and age. Don't get me wrong, I love logs, and today's
logs are totally different from what we had even ten years ago.
But you need to know about your problems in advance for a log to
work. The problem is, I'm not clairvoyant.
When I write code, I can't tell what bugs or
problems the code will have before the code is written.
I'm in the same boat as you are. The bug doesn't
exist yet, so I'm faced with the dilemma of whether
to log something. This is a bit like the
dilemma of writing comments. Does it make the code look
noisy and stupid, or will I find it
useful at 2:00 a.m., when everything isn't working and I want to rip out
the few strands of hair I still have left because of this damn
production problem? I'm sure you've all been in that
situation. I've been through it quite a lot,
as you can see on my head.
Debuggers are amazing. They let
us set breakpoints, see call stacks, inspect variables,
and more. If only we could do the same for production
systems. But debuggers weren't designed for this.
They're very insecure when debugging remotely, and they can
block your server while sending debug commands remotely.
A small mistake, such as an expensive condition, can literally
destroy your server. I might be repeating an urban legend
here, but 20 or so years ago I heard a story about a guy
who was debugging a rail system located on a cliff.
He stopped at a breakpoint during debugging, and the multimillion-dollar
hardware fell into the sea because it didn't
receive the stop command. Again, I don't know if it's a true
story, but it's plausible. Debuggers weren't
really designed for those situations.
Worse, debuggers are limited to one server. If you
have a cluster with multiple machines, the problem can manifest on one
machine always, or it might manifest on a random machine.
We can't rely on pure luck.
If I have multiple servers with multiple languages and platforms,
crossing from one to another with the debugger?
Well, it's possible in theory,
but I can't even imagine it in reality.
Let's take the Batman metaphor all the way. We need
a team up. We need some help on the servers,
especially in a clustered polyglot environment where the issues
can appear on one server, move to the next, etc.
So, you remember this slide. We need some way to get
through that line. We need a way to connect to the server
and debug it. What if you could
connect your servers to a debugger agent
that would make sure you don't overload the server and don't make
a mistake, like setting a blocking breakpoint or something like that?
That's what Lightrun does. It lets you connect securely
through your own server so you physically don't
have access to production. That means that lovely
line we have with DevOps is maintained.
DevOps have their control and know you can't
break production. Developers still get insight and can investigate
an issue to avoid the redeploy cycle.
There's a safety net in place protecting production
from harm, intentional or otherwise.
So how does this thing work? Well, we install the plugin
in our IDE, and it lets us interact
with the Lightrun server. We can insert an action,
which can be a log or a snapshot, which I'll show
soon enough. Right now Lightrun supports WebStorm,
IntelliJ IDEA, etc., but there's a Visual Studio
Code plugin that will look very similar to what you're seeing here.
It will launch soon. Notice there's also a
command line option, which I won't discuss here. The management
server sends everything to the agent, which is
installed on your server. That means there's a clean separation
between developer and production.
DevOps still has that guarding line we were talking about,
and we don't have direct access to production. That means
no danger to the running production servers from careless developers,
well, like myself. The agent
is just a small runtime you add
to your servers. It's very low overhead, and it implements
the debugging logic. Finally, everything is piped through
the server back to your IDE directly, so as a
developer you can keep working in the IDE without
leaving your comfort zone. So let's go over a simple
hello world demo so we're all on the same page.
This is a simple Express TypeScript hello world app.
It should be very familiar to anyone using Node.js.
I opened it here in WebStorm, and I want to debug
that app in the same way I would debug a real-world app
in production. To do that, we need to sign
up on Lightrun.com and set up an agent and
IDE plugin, which will let us debug this application.
Integrating Lightrun requires two steps. The first
is very easy: just run npm install
lightrun to add Lightrun from npm.
The second stage is adding this code
to the top of your main file. Notice that you
need the custom code from Lightrun, which includes the correct Lightrun
secret value. Now we just log in from the
Lightrun IDE plugin and we're in business.
Back in the Express TypeScript hello world app,
we can add the code snippet to the top of the file and
run the server. Here's the hello world
demo running. Let's go to the code and add a
snapshot for the main page request. This snapshot
will be hit when we load the main page.
Now let's add a log to the method covering
the hello URL. This is injected without
restarting the server or anything like that.
A snapshot is like a breakpoint. It gives us the
stack traces and variable values, but it doesn't stop
execution in place, so a production server
doesn't block on a snapshot. The camera icon we
see on the left is on the line where
we set up the snapshot.
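This behavior can be sketched in a few lines. To be clear, this is not Lightrun's implementation; the takeSnapshot name and shape are my own invention to illustrate the idea of capturing state without pausing:

```javascript
// Conceptual sketch only (not the Lightrun API): a snapshot grabs the
// current stack trace and a copy of selected variables, then execution
// continues immediately; nothing blocks, unlike a classic breakpoint.
function takeSnapshot(label, vars) {
  return {
    label,
    vars: { ...vars },          // shallow copy of the captured variables
    stack: new Error().stack,   // the call stack at the capture point
    capturedAt: new Date().toISOString(),
  };
}

// e.g. inside the main-page request handler:
const snap = takeSnapshot("main-page", { path: "/", user: "anonymous" });
console.log(snap.label, Object.keys(snap.vars));
```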
Let's reload. Now, back in the IDE, I
can see the stack trace popped up.
We only have one method on this call stack, which isn't
as valuable as usual, but we can see the variables and we
can see that the snapshot was hit.
Let's go back to the browser and check the hello URL.
Notice it printed world.
Now in the IDE we can see the new log we injected,
instantly. Notice the response
in the code is world, which is what we indeed saw
in the browser. I wanted a better demo,
but trying to run and explain a truly complex environment
in a conference session will probably confuse all
of us more than it will teach, myself included.
So I'm doing a relatively simple demo.
In this app, all the front end logic is handled from the web
down to the start of the backend, with Node.js
and JavaScript. In the previous sample I showed off support
for TypeScript, so using JavaScript here makes sense.
Java is pretty common in the enterprise backend, so in this scenario that
makes sense too. We're storing the data using Spring
Boot, and the front end and a portion
of the backend are done in Node.js.
So now that we've got this out of the way, let's go to
the simple to-do app used by so many demos in the
past. Now, the funny thing is that I was going to
demonstrate something completely different and ran into this
bug by accident as I was working on the code.
Well, opportunity knocks. I can't clear completed
to-do items in the app. Well,
it's time to pull out the debugger and figure out what's going
on. The only problem is that these
are two servers running in remote containers in production.
I can't just debug them. So let's look at
the code and see if I can spot the problem.
This is the method that implements clearCompleted in the node
server. I assume you all know how to debug the client,
so I didn't really show that part. But even if we don't,
the technique would instantly show whether the problem
is in the client or not. You'll notice
that we don't have any logging here. That would have been really nice,
but obviously we can't put logs everywhere.
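The transcript doesn't include the method's source, so as a stand-in, here is a hypothetical sketch of what a clear-completed implementation on the node server might look like. The todos array and the clearCompleted name are my assumptions for illustration, not the demo's actual code:

```javascript
// Hypothetical in-memory to-do store; illustrative only, not the demo's code.
let todos = [
  { id: 1, title: "write slides", completed: false },
  { id: 2, title: "record demo", completed: true },
];

// Drop every item marked completed and return what remains,
// roughly what a clear-completed endpoint would do.
function clearCompleted() {
  todos = todos.filter((todo) => !todo.completed);
  return todos;
}
```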
Besides the huge cost of logs at
scale, they also cost performance on the individual
machine, so I'm not a huge fan
of overlogging. So, I already installed Lightrun
on both servers here. I also have Lightrun
installed in my IDE, so I can just add a snapshot
to the method to see if it's even reached.
As you may recall, a snapshot is like a breakpoint:
it gives us the stack trace and the variable values, but it
doesn't stop execution in place.
But we also want to see if the end
of the method was reached: did we
return from this method? If only there was a log.
Well, we can inject new logs at runtime.
We can just add a new log, type in the details we want in
that log, and bam, we can see that log
right here. The cool thing is that if you
have built-in logs, this log will be integrated
with the rest of those logs, so you can see it in context.
In this case, I selected that the log will pipe
into the IDE console too, which is more convenient for
demos and quick tests. But when you do this in real life,
viewing the log as if it was written in
the code is a game changer
that can literally save you millions. I know
we say that about a lot of features, but the ability to
see these things in context can seriously shorten the amount of
time spent debugging. It also means you can eliminate
a lot of your redundant debugging overhead and
reduce your cloud log storage costs significantly.
The savings in a large company can be very
significant. Now let's go back into the browser
and press that dysfunctional button. Then go back into
the IDE and wait a couple of seconds for the
whole loop of events to reach us.
Now we can see the snapshot was hit.
That means the call reached that method,
which already eliminates a lot of potential problems.
Now we can look at the stack trace on the
left to see who invoked that method, or look at the
variables on the right to see if there is an interesting variable
we missed. We should also check out the log console,
where we can see that the method reached the log line,
so we didn't get an exception or some
other problem. This is odd. Let's check the Spring Boot
server. This is the Spring Boot
server code. Notice this is Java code representing a
REST request for the clear-completed API.
It deletes all the elements that are marked completed, then returns
the to-do items that remain. I want to add a log
here, just like I did in the Node.js code, but in this case
I'm going to do something a bit different: I'm going to add
an expression to the log. I'm going to print out the number of items
in the response variable, which is of type List.
That means I'll literally invoke the size method on
that object, within the curly braces here.
Notice you could do the same exact thing in the Node.js project,
using JavaScript code for the expression.
Now, you might be thinking, why not print the whole
thing? Why not print the response object?
Well, it won't work well.
There are two reasons. First, it might output too much text
and get cropped. But the more important aspect is evaluation
speed. When we're running in production environments,
changes can be very dangerous. If we
evaluate these expressions dynamically, it might take too much
CPU and crash our server. So Lightrun limits the
amount of operations you can do in an expression.
If I print that list, Lightrun might just eliminate my log because
of CPU usage. So the code here needs to
be very efficient.
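The difference between a safe expression and a risky one is easy to see in code. This comparison is my own illustration of the principle, not Lightrun's actual cost model:

```javascript
// Cheap: constant time, safe to evaluate on every single hit.
function cheapExpression(response) {
  return response.length; // just the element count, like list.size() in Java
}

// Expensive: walks and serializes the entire structure. On a large
// response this burns CPU, which is exactly the kind of work a
// sandboxed expression evaluator would refuse to run.
function expensiveExpression(response) {
  return JSON.stringify(response);
}
```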
So let's press OK to add the new
log. We'll then go to the browser to trigger
the event. I'll press clear completed again, and
then we'll go back to the server. Back in
the server, we can see the log showing, and that's weird.
The API is showing that we're returning two elements.
That doesn't make sense, but this is the core idea of
debugging: validate your assumptions.
I don't get that. I want to check the values,
so for that I need to edit the log entry, and instead of
listing the size of the list, I want to print out the content of the
first element in the list using a get(0) method call.
Pressing OK deletes the current log and creates a new log
object in its place. We now need
to go back to the web app and clear the completed elements again,
and after a short wait the log prints out the result.
It seems that this is the bug.
Looking at the values of the object, it's obvious that the completed
field is false on the server. That's why
it wasn't cleared. With that in hand,
I looked at the code that toggles the completed field on the Node.js
server, and oops: it seems that
code was never written. The completed field
was only set on initialization. So that's
the bug. That's very cool, but that's
just one type of production problem. For a different production
problem, let's go back to the slides and look at one
of my first slides. This is
the dashboard I found on Google. It isn't mine,
but you can probably see a couple of interesting things here.
I mean, what's this? What the
hell happened here? If you use a dashboard, you probably see
stuff like that all the time and ask yourself the exact
same question. Now, APMs are pretty sweet.
They will usually tell you which entry point caused a performance issue,
which CPU worked too hard, where memory was used up,
and all of those things. But what they won't tell
you is which block of code took too long.
Let's say we have an entry point that takes too long to
perform. Now what?
We need to review deep call chains all over the place to
try and pinpoint a specific operation.
This might be something we can't reproduce locally.
Let's go back into the IDE briefly and go over
the options here in the metrics menu.
We can add measurements that give us fine-grained benchmarks
at the code level. We can count the number of
times a line of code was hit. We can measure the duration of
a block of code or a method. We can even add more complex
metrics based on custom expression evaluation. The resulting
metrics can be periodically logged or piped
into Prometheus, StatsD, etc. This provides
that missing piece that we need to see beyond what the APM
provides us. That way we won't spend ages optimizing the
wrong area because the production information didn't include
all of the lower-level information. Please notice
that at this moment measurements are only available for JVM languages,
but they will be coming to Node.js and Python in an update
that's coming soon. In closing,
I'd like to go over what we discussed here and a few things we
didn't have time for. Lightrun supports JVM
languages like Java, Kotlin, Scala, etc. It supports
Node.js, for both JavaScript and TypeScript, and
Python, even complex stuff like Airflow.
When we add actions, conditions run within a sandbox,
so they don't take up CPU or crash the system.
This all happens without networking, so something like a networking
hiccup won't crash the server. Security is
especially crucial with solutions like this.
One of the core concepts is that the server
queries information, rather than the other way around as
you would see with solutions such as JDWP, etc.
This means operations are atomic, and the server can
be hidden behind firewalls, even from the rest of the organization.
PII redaction lets us define conditions
that obscure patterns in the logs. So if a user could
print out a credit card number by mistake, you can define a
rule that would block that. This way the bad
data won't make it into your logs and won't expose you
to liability. Blacklisting lets us block
specific classes, methods, or files. This means
you can block developers in your organization from debugging
specific files, so a developer won't be able
to put a snapshot or a log
in a place where a password, for instance, might be
available, to steal user credentials or stuff like that.
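The redaction idea can be sketched as a pattern-plus-mask rule. The regex and the mask below are my own illustration, not Lightrun's actual rule syntax:

```javascript
// Illustrative redaction rule: match card-number-like digit runs
// (13 to 16 digits, optionally separated by spaces or dashes) and mask
// them before the line ever reaches the logs.
const CARD_PATTERN = /\b\d(?:[ -]?\d){12,15}\b/g;

function redact(line) {
  return line.replace(CARD_PATTERN, "[REDACTED]");
}
```

A real rule engine would apply such patterns to every log action automatically, so the bad data is masked no matter which developer added the log.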
Besides the sandbox, I'd also like to mention that Lightrun
is very efficient: in our benchmarks it has almost no runtime
impact when it isn't used, and it has a very small
impact even with multiple actions in place.
Finally, Lightrun can be used from the cloud or via
an on-premise install. It works with any deployment
you might have, whether cloud based, container based, on premise,
microservices, serverless, etc.
Thanks for bearing with me. I hope you enjoyed the presentation.
Please feel free to ask questions and also feel free to write to
me. Also, please check out talktotheduck.dev,
where I talk about debugging in depth. And check out
lightrun.com, which I think you will like a lot.
If you have any questions, my email is listed here and I'll be happy
to help. Thank you.