Transcript
Hello, my name is Tom and this is my talk, getting back to sleep as soon as possible for the on-call developer, or how to win at on call. This is me, Tom Granot. I'm a developer advocate working for Lightrun, and previously I was a site reliability engineer working for a company called Actix that does distributed systems in Munich, Germany.
If you want to reach out to me, you can do so over Twitter or using my email, tom.granot@lightrun.com.
Let's go. I'd like to introduce you to Jamie. Jamie is a Java developer with about five years of experience, working for an enterprise application company that produces Oasis, a very simple, run-of-the-mill enterprise application. Specifically, Jamie works on the transaction system inside the application, which is in charge of allowing users to purchase various things inside the application. Now, Oasis, being a system that is used by many corporations worldwide, is active 24/7, so there's always one developer on call. Tonight it's Jamie, and there's also a DevOps engineer and one or two support engineers who are there for more minor escalations. In addition, Oasis-wide, there is a NOC team that is in charge of monitoring all the different systems. And I think before we actually set
the scene and dive into exactly what happened in the current incident,
it's worth explaining what the stack for the transaction system looks like. So it's a microservices-based
application. There are different services, a credit service, a fraud service,
an external billing API, a gateway, and so forth, that are in charge
of the proper operation of the application. And each service, or more
correctly, each system inside the Oasis application has its
own monitoring stack, propped up by the DevOps engineers in order to separate the concerns. Whether or not this was the right decision is something for the past; Jamie wasn't there when it was decided. It's just
the current situation. But the monitoring stack is something Jamie knows
and appreciates: the ELK stack (Elasticsearch, Logstash and Kibana), which is a very comfortable way of understanding what's going on inside your application.
Now, the problem with on call is that, and Jamie has been doing it for a long while, so it's ingrained in the brain, on call always brings new challenges. It's either something completely new that you've never seen before, or something that you've seen before, or another team member has seen before, but nobody documented. It might be something that is completely outside of your scope, some external system failing. But there's always something. And to properly understand why so many things happen in production and in these on-call sessions, it's important to understand the players, the people actually doing the work during these sessions.
So the DevOps engineers are usually confined to the infrastructure. They know how the infrastructure works, and they can mitigate any blocks or timeouts or problems at the infrastructure level. The support engineers are more playbook followers. They see a set of circumstances happening, and they know how to operate: twist knobs, turn switches on and off, in order to solve various problems. But just being immersed in the environment, just understanding it completely, just being part of it, is very hard due to the noise in the system. There are so many things going on at any given point in time. There are logs coming in, and support tickets, and phone calls, and customers messaging you on Slack, and other teammates messaging on Slack. It's basically lots of noise all around.
And whenever there's a problem that's remotely complex that none of the above people
can solve, developers are called, in this case,
Jamie. And in our specific case tonight, a wild problem
has appeared. Some transaction is failing. The support engineers are looking
and they're trying to handle the situation, trying to kind of make progress with it,
and they're not making headway. They get stuck. The playbooks are not working.
It's really hard to understand what's going on. And so they turn to the DevOps
team to see whether it's maybe an infrastructure problem, something not in the application logic,
but something actually inside the infrastructure itself. And the DevOps team
look at the monitoring systems, and nothing is apparent. So, as I mentioned before, when something like this happens, some remotely complex application logic problem, you call the developers, in this case Jamie. It's really important
to understand what it is that a system tells you
about itself during production. The absolute best situation
with a running system is a situation of observability. It's a buzzword; it's been thrown around many times lately. The bottom line is that it can be defined as the ability to understand how your systems work on the inside without shipping new code, and "without shipping new code" is the core part here. It's how we differentiate a system that is just monitored from a system that is fully observable. Instead of relying on adding hotfixes and adding more and more instrumentation and logs to the running application in order to understand what's going on, the system, in the way it was originally instrumented, tells you the whole story and tells you what is wrong with it, by allowing you to explore in real time what is going on. And I think it's also important to differentiate
what a monitoring and observability stack, the current tooling that we have inside our on-call situation, gives you, and what it doesn't. To be completely honest, many monitoring stacks today are really good. They tell you a lot of information, and it's easy to understand many of the problems that used to be really hard to understand. The first primitive, if you will, in this situation is the log line. Logs that are being emitted out of the application in real time are the things that tell you what is happening at any given time in the application. It's a timestamped event log of what is going on inside the application. But this is only on the application level. The application is not running in a void. It's running on a machine, or multiple machines, that have their own health situation: the amount of CPU they use, the amount of memory they consume, and so forth. And if the machine beneath the application is not feeling okay, if it's overloaded, then the application will show it. But if we go even one step further, these are two kinds of pieces of information that are a bit further apart from one another.
So the logs tell you this timestamped event log of the application, and the machine metrics tell you how the machines beneath feel. But in order to understand how the actual service, which is a piece of software that is intended to bring value to the customer, looks from the customer's perspective, there's also another set of metrics: service metrics, business-level metrics, SLA metrics, depending on the organization; they're called different things in different places. These metrics relate more to the purported value the system gives to the users. This could be things such as how good the information the system emits is, how much time it takes to emit this information, how many outages there are, and so forth. This is the value as it is perceived by the customer from the software.
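As a rough sketch of what such a service-level metric can look like in code (this is an illustration, not from the Oasis codebase, and it assumes a Micrometer-style metrics library, which the talk doesn't name):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class PurchaseMetrics {
    private final MeterRegistry registry = new SimpleMeterRegistry();

    // Business-level metric: how long an end-to-end purchase takes.
    private final Timer purchaseLatency = Timer.builder("purchase.latency")
            .description("End-to-end time to complete a purchase")
            .register(registry);

    public void recordPurchase(Runnable purchaseFlow) {
        // Wrap the purchase flow so every call is timed and exported.
        purchaseLatency.record(purchaseFlow);
    }
}
```

Dashboards and SLA alerts would then be built on top of metrics like this one.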
But this is not granular enough information. Sometimes you want to understand what's going on after you call a specific endpoint, and that's what distributed tracers are for, stuff like Zipkin and Jaeger; there are many of those tools. They allow you to inspect an endpoint and say, all right, this endpoint takes x time to respond, and it calls three downstream API endpoints, each of them taking another third of that time to respond. And this allows you to more surgically go inside and better understand how an API endpoint performs in real life.
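As a minimal sketch of the kind of instrumentation a tracer relies on (an illustration only; in practice services often use auto-instrumentation, and the OpenTelemetry API shown here is my assumption, since the talk only names Zipkin and Jaeger):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PurchaseTracing {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");

    public void purchase(String userId) {
        // Each request gets a span; downstream calls become child spans,
        // so the trace shows where the time is actually spent.
        Span span = tracer.spanBuilder("POST /purchase").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("user.id", userId); // e.g. the misbehaving user 2911
            // ... call the transaction service, credit service, and so on ...
        } finally {
            span.end();
        }
    }
}
```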
But even that is not always enough. And sometimes, when the application itself is constructed above a virtual machine, like Java is over the JVM, you have a set of tools called profilers. It's not only limited to tools running over virtual machines; there are many different types of tools called profilers that work at different levels of the stack. What I'm referring to is stuff like JProfiler, things that profile the JVM and tell you how another layer of abstraction in the middle is faring: not the machines running beneath, not the application running above, but the JVM in the middle. And speaking of tools that we don't often see in production, profilers do incur a lot of overhead.
Debuggers are also another set of tooling; they allow you to explore a code path, basically step in and step out of different flows, and understand how a flow behaves in real life. There are remote debuggers that can be used in production, but usually a debugger is a tool used inside development to walk through different code paths.
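For reference (my example, not something shown in the talk), attaching a standard remote debugger to a JVM relies on the JDWP agent, enabled with a startup flag roughly like this; the port and jar name are illustrative, and on JDK 8 and older the address is just `5005`:

```
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar fraud-service.jar
```

An IDE like IntelliJ can then attach to port 5005, though breakpoints that suspend threads and an exposed debug port are part of why this is rarely done against production traffic.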
There's another subset of tooling called exception or error monitoring and handling tooling. These tools enable you to enrich the information you receive once something bad happens, so an error has happened or an exception was thrown, and they enable you to get a better grasp when things go bad, when things go out of whack. Not all exceptions are bad, of course.
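As a small illustration of what "enriching" an error report can look like (my sketch; the talk doesn't name a specific tool, I'm assuming Sentry's Java SDK here, and names like chargeCard are hypothetical):

```java
import io.sentry.Sentry;

public class BillingErrors {
    public void charge(String userId, long amountCents) {
        try {
            chargeCard(userId, amountCents);
        } catch (RuntimeException e) {
            // Attach context so the report shows the user and amount,
            // not just a bare stack trace.
            Sentry.configureScope(scope -> {
                scope.setTag("user.id", userId);
                scope.setTag("amount.cents", Long.toString(amountCents));
            });
            Sentry.captureException(e);
            throw e; // still let the normal error handling run
        }
    }

    private void chargeCard(String userId, long amountCents) {
        // hypothetical call to the external billing API
    }
}
```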
But I think the more interesting part is talking about what our tools don't tell you. And the
first and foremost thing that comes to mind is incorrect application state. This can come in so many forms: it can come as a request coming from product that was incorrectly implemented, or an incorrect request coming from product that was correctly implemented, or whatever other set of circumstances caused the application state to be incorrect, perhaps an edge case that wasn't tested for. And again, there are many different ways this could happen. This ends up being a detriment to the user's experience. The user using the application doesn't really care what we did wrong. The only thing the user cares about is that the application is not working, and our tools will not pinpoint it and say "this logic is wrong". Logic is logic. It's difficult to account for unless you tell the system exactly what it is supposed to do. And this is something that even modern monitoring
stacks don't really account for. Another thing that's hard to understand, but is more mechanical if you will, is quantitative data on the code level. So assume you're getting an OOM (OutOfMemoryError) for some reason, and there's a data structure exploding in memory, and that data structure exploding in memory is the thing that's causing the OOM. This is not something you would usually be able to determine unless you've instrumented the metrics beforehand. Understanding the size, visualizing a graph of how the data structure behaves over time: unless you actively ask for it, that is not information that will always be readily, or even at all, available to you.
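"Asking for it" usually means something like registering a gauge over the suspicious data structure ahead of time. A minimal sketch, again assuming a Micrometer-style registry and illustrative names:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PendingTransactionsMetrics {
    private final Map<String, Object> pendingTransactions = new ConcurrentHashMap<>();
    private final MeterRegistry registry = new SimpleMeterRegistry();

    public PendingTransactionsMetrics() {
        // Export the size of the map so its growth is visible on a dashboard
        // long before it turns into an OutOfMemoryError.
        Gauge.builder("transactions.pending.size", pendingTransactions, Map::size)
                .register(registry);
    }
}
```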
Another thing that is more related to developer mistakes, or perhaps not mistakes, but just things you don't always catch during development, is the swallowed exception: a situation in which an exception was thrown and caught and nothing was actually emitted during that. So some side effect that was supposed to happen did not happen because of the exception, or maybe some piece of information that was supposed to be emitted, not even a side effect, was not emitted. And when looking at the application, trying to observe it and understand what's going on inside of it, it will be very hard, because there's a piece of the logic you're not seeing. It's literally swallowed by the application. This is not only about developer-level things.
There are cases where you use some external API, not at all under your control, some external API completely abstracted away from you, that behaves in weird ways. Maybe the format is not the format you expected, it does not conform to the specification, it times out in ways you did not expect, the state after you made a request changes on the next request, and so forth. There are so many moving pieces in production that you can always kind of ask questions, always ask to dump the information, but that usually requires asking the question: adding a log line to dump that piece of information into the logs.
Sometimes this is actually not allowed, especially when PCI compliance or HIPAA is involved. Then you would need to not log the entire object, because there will be private information there, but instead find a way around it.
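"Finding a way around it" often just means logging a redacted view instead of dumping the whole object. A tiny illustration (my example, with made-up field names, not code from the talk):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentLogging {
    private static final Logger log = LoggerFactory.getLogger(PaymentLogging.class);

    public void logPaymentAttempt(String userId, String cardNumber) {
        // Keep only the last four digits so the log stays PCI-friendly.
        String maskedCard = cardNumber.replaceAll("\\d(?=\\d{4})", "*");
        log.info("Processing payment for user {} with card {}", userId, maskedCard);
    }
}
```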
And there's also a more nuanced problem, when the application state is
correct, but the user took the incorrect flow in order to get
where the user is currently at, and it's difficult to understand how to get the
user out of that situation. This happens, for example, when requests are coming in with the wrong request parameters and the user is receiving not the experience of the application they expected to receive. This can
be as simple as perhaps a menu not opening correctly,
and can be as bad as what recently
happened with GitHub when a user actually logged into the wrong
session. So this is the best example of unexpected user flow.
The other piece of the puzzle, I guess, things that are hard to reproduce and only happen in specific conditions, not for specific users but just under a specific set of conditions, are races. Those are shared resources where the order in which two parties reach them matters. It shouldn't, because everybody should get the same treatment, or everybody should get the correct flow of operations, but sometimes races do occur, and when they happen, they're hard to reproduce. You have to wait for the exact set of circumstances for them to happen. And even then, most monitoring tools will not tell you why the race happened, or even which set of circumstances, the full set of circumstances, is needed to reproduce the race. And therefore, many races end up being left as bugs time and time over; basically, we'll just see that bug again the next time it appears.
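As a small illustration of the kind of race that's easy to write and hard to see in production (my example, not from the talk; the names are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ItemReservations {
    private final ConcurrentMap<String, String> reservations = new ConcurrentHashMap<>();

    // Racy check-then-act: two requests can both see "no reservation" and both proceed.
    public boolean reserveRacy(String itemId, String userId) {
        if (!reservations.containsKey(itemId)) {
            reservations.put(itemId, userId);
            return true;
        }
        return false;
    }

    // One possible fix: make the check and the write a single atomic operation.
    public boolean reserveAtomic(String itemId, String userId) {
        return reservations.putIfAbsent(itemId, userId) == null;
    }
}
```

Under low traffic both versions appear to behave the same, which is exactly why the racy one tends to survive until an incident.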
And one important thing
to mention about these types of problems is what happens when you move away from developing locally and you move to deploying your applications into production, when there are a lot of replicas of your application, or, perhaps, if you're a monolith, when there are specific parts of the application that get more load than other parts and affect the rest of the application. There are all these considerations when we're working with either a large load on specific pieces of the application, perhaps pieces you didn't actually work on, that belong to somebody else, but that steal resources from yours, or when you take the same application and run it across multiple instances in order to provide a better service by allowing for multiple instances of the same piece of code. Weird things can happen in the in-betweens. And I just want to show you a demo to better explain exactly what I'm talking about, to show you what these types of problems look like in the wild. But before
that, it's important to understand the context that drove us
here, basically what we know. The support engineers received the ticket, which means they know the specific customer that's acting up, and they even know the specific ID for that user: it's 2911. The DevOps team also knows that the monitoring system is not showing any problems; the metrics look okay. And I just want to remind you that there are about six services inside the current Oasis transaction system, and there's a monitoring stack based on ELK.
Let's assume that Jamie is being called into the situation, and all he
knows is that the metrics look okay, and there's a specific user id that's acting
up. This is the Kibana dashboard, or the Kibana entry screen, and that Kibana board is used heavily by the DevOps team, and perhaps even the support engineers, in order to understand what's going on. But from Jamie's perspective, if the metrics are okay, and the support engineers said that the usual things don't work, then the first thing, or the second thing perhaps, that Jamie will do is go into the log trail. So you go into the log trail and you look for the log lines for the specific user, to see with your own eyes what is happening with it. I'm going to paste that number here, and we can see that the logs are actually very clear.
It looks like there's a step-by-step, service-by-service call into each one, following the length of the transaction. So I can see that the inventory resource inside the inventory service started to validate the transaction, and it looks like the same service also tries to process the purchase. Then there's a sequence of calls to the transaction service, then the credit service, then the fraud service, then the external billing service, culminating in a kind of back propagation of everything back down the line. So, looking at the logs, they're not telling you the full story: it doesn't look like something in the transaction is wrong. It looks like the transaction finished processing correctly.
We're basically stuck. There is no extra path to go down here. There is no more information the application tells me about itself out of the box that hasn't been checked by me or by other members of my team and that explains exactly what's going on. And it really depends; the playbooks differ here from on-call team to on-call team. There are some people that have specific tooling used to solve these types of problems, basically to get more information. Some people throw the same environment up over staging. Some people have a ready-made database with shadow traffic or shadow data that enables them to test things in a kind of safer environment. Some people just connect a remote debugger and walk through the flow piece by piece using that specific user's context, because it's allowed. It really depends on the exact situation that you have. And I
just want to mention and kind of focus
on this feeling of helplessness, if you will. Right.
It's not clear what's going on, and it's really annoying.
It's not something you accounted for. Something is happening,
and the application is not telling you what it is. But Jamie
is a cool headed, very down to earth person,
and the code itself is the next place to
go. So the next thing is indeed to go into the code, and I'm going
to open my IntelliJ, which is simulating Jamie's IntelliJ, and I'm going to start walking through the code in order to
understand what exactly is going on. And the first thing
is looking for the inventory resource, which is the thing that starts
all the things. I'm going to look for inventory resource, and you
can see there's a controller here and, as expected, it's making a call to another microservice, in this case using a Feign client. Looking in, it looks like it is indeed calling the transaction service.
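For readers who haven't used it, a Spring Cloud OpenFeign client is just an annotated interface. A minimal sketch of the kind of client the controller might be calling could look like this (illustrative names, not the actual Oasis code):

```java
import java.util.Map;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;

// Spring generates the HTTP call to the transaction service behind this interface.
@FeignClient(name = "transaction-service")
public interface TransactionClient {

    @PostMapping("/transactions")
    Map<String, Object> submitTransaction(@RequestBody Map<String, Object> request);
}
```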
Now I'm going to look for the same kind of resource inside the transaction service. So I'm going to go for the transaction resource and, again, unsurprisingly, the transaction resource is immediately calling another client, which is calling the credit service.
Now, at this point, Jamie is experienced, and knowing that these services kind of go one after the other, Jamie decides to go all the way to the end, to the final service that is actually doing the work, which in this case looks like the external billing service. There's an external billing resource here, so let's see whether the problem is at the actual end, basically the end of the road, as they say. And when Jamie is looking at the controller, it does appear as if the transaction is finished correctly. It doesn't look like there's some sort of problem here. So Jamie has a choice. It's possible to go back to the last service we looked at, which is the credit service, so we can go back from the end. Again, it's a matter of decision; in really long chains it might make sense to go a bit from the beginning and then a bit from the end and backwards, which is what Jamie decides to do. The service before the external billing service is the fraud service,
so let's look for the fraud resource, and immediately the problem becomes apparent. There is a check-for-fraud method here that throws an exception. And to those familiar with Eclipse (this is IntelliJ), this is what happens when you auto-generate a catch block in Eclipse: there is a try and a catch, and you're expected to fill in the catch, and the developer did not fill in the catch. What should probably happen here, I assume, is that the developer might, after checking for fraud and the exception being thrown, propagate some error downstream. There are a lot of ways to handle it, but the bottom line is that an exception was thrown and caught, and we were none the wiser.
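A hedged reconstruction of the pattern Jamie is looking at, with illustrative names rather than the real Oasis code, might be something like this:

```java
public class FraudResource {

    public void processTransaction(String transactionId) {
        try {
            checkForFraud(transactionId); // throws when the check fails
        } catch (Exception e) {
            // TODO Auto-generated catch block -- never filled in, so the failure
            // is swallowed: nothing is logged and no error is propagated.
        }
        // Execution continues as if the transaction were perfectly fine.
    }

    private void checkForFraud(String transactionId) throws Exception {
        throw new Exception("fraud suspected for transaction " + transactionId);
    }
}
```

At a minimum, the catch block should log the exception and propagate a failure downstream, so the broken transaction shows up in the log trail Jamie was reading earlier.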
And I think this is basically a good time to stop the demo and go back
into the problem at hand, which is how to
make a system observable. We talked about the fact that observability can be defined as the ability to understand how the systems work without shipping new code, and in this specific situation, you see the frustration. It would be really hard to understand what's going on without shipping new code to investigate. This was a very short, very simple application, and it took me, or rather Jamie in this situation, precious minutes of a customer complaining about the application to investigate. And this, again, is a small application, a small use case. This happens way more often in larger systems, with many different developers working on the system and the system being very complex. So perhaps there is a possibility to talk again about how we might be able to account for all these things our tools don't tell us. How can we make sure that during an on-call situation we would be able to ask more questions and get more answers, without resorting to diving through the code in order to understand exactly what is going on, but by getting real production information and understanding what we can do with it? And I think
the best way to define this new practice is continuous observability,
which is this streamlined process we can have for asking new questions
and getting immediate answers. And there are
a few tools in many different areas that can help you depending on
how deep in the stack you want to go, from which perspective you want to
observe. I work for Lightrun. We build a continuous observability toolbox.
You're more than welcome to look inside of it, but the
core of the issue here is that you should have some way of
exploring these situations in real time. And up until then,
I do have a few suggestions for you on how you can win in these
situations and make the experience not suck as much as it did for Jamie.
Basically, looking at the information, or looking at the application, and being very frustrated that they have to dive back into the code again to understand what's going on. And the first thing is, you've got to choose to love it. It's rough, and on call is not fun for anyone, but being angry and self-absorbed and being totally annoyed with the whole situation is not going to help. Step out of your shoes for a second. Remember that everybody's on call, including the DevOps engineer in front of you and the support engineer on the other end, and be a part of the team. Work together.
But having said that, as a developer, you have the obligation to be as prepared
as possible. You know how the application is supposed to behave,
so you should be able to answer their questions much faster, instead of trying to kind of rummage through your things to get an answer. The DevOps team lives inside the monitoring systems, they know them, and the NOC team knows the tooling very well, in and out, and they can answer your questions. So you should be prepared the same way.
And I think what really helps with preparing is documenting. So you see something in an on-call session and you think it's a singular occurrence and you will never see it again. You are incorrect. Sit down. Document it for your future self, right, for three, four, six months down the road, when it's going to come up again and you're going to regret that you haven't written it down. Write everything up, make sure it's prepared, and come with it in a searchable or indexable form for your next session. And speaking of
another set of tooling, I mentioned them before: a lot of teams have these more, let's call them case-specific tools, emergency debugging tools that are used for debugging specific problems that tend to happen in their own systems. They can take the form of tooling built internally, they can take the form of shell scripts, they can take many different forms. And it's on you to make sure that you have them available; having them enables you to solve the problems much quicker. And you won't know that you need these tools, or be able to have these tools there to help you, if you don't write proper postmortems. And that's not
just documenting the steps that you took; that's documenting why the situation occurred, what steps you took to solve it, and what steps are needed in order to make it never happen again. There's a full process here, and a good postmortem is composed of information from all the different teams. What is missing in the playbooks for the support engineers? What is missing in the dashboards of the DevOps team? What tools would the developer need to have that would make it easier for them to solve the problem? And having
said that, there's also this kind of humility
that developers should take here, because if you're running a
large enough system, there are people dedicated to just keeping it healthy. So if there's
something that relates to the underlying infrastructure, ask questions.
Go ahead, be part of a team, and know that you don't have to know everything. It's perfectly fine to not know all the things and to ask questions first, and also to be blunt about things that are just not under your control. If it's obvious that the application is behaving correctly but it's missing some resource, talk to the DevOps team. Explain to them that this resource is missing, this is why, please help me. I think this is pretty much it for this talk.
I'm really happy you guys joined me. Feel free again to reach out to me over Twitter or over GitHub. You can open an issue in one of my repos, but that's not a desirable method of communication; it's much better if you use Twitter or my email, tom at lightrun. Please make sure to check out Lightrun. If you find this talk interesting, I think you would be pleasantly intrigued by everything we have to offer. See you again soon.