Transcript
Hello, my name is Tom and this is my talk, getting back to sleep as soon as possible for the on-call developer, or how to win at on call. This is me, Tom Granot. I'm a developer advocate working for Lightrun, and previously I was a site reliability engineer working for a company called Actix that does distributed systems in Munich, Germany.
If you want to reach out to me, you can do so over Twitter or using my email, tom.granot@lightrun.com.
Let's go. I'd like to introduce you to Jamie. Jamie is a Java developer with about five years of experience, working for an enterprise application company that produces Oasis, a very simple, run-of-the-mill enterprise application. Specifically, Jamie works on the transaction system inside the application, which is in charge of allowing users to purchase various things inside the application. Now, Oasis, being a system that is used by many corporations worldwide, is active 24/7, so there's always one developer on call. Tonight it's Jamie, and there's also a DevOps engineer and one or two support engineers who are there for more minor escalations. In addition, Oasis-wide, there is a NOC team that is in charge of monitoring all the different systems. And I think before we actually set
the scene and dive into exactly what happened in the current incident,
it's worth explaining what the stack for the transaction system looks like. So it's a microservices-based
application. There are different services, a credit service, a fraud service,
an external billing API, a gateway, and so forth, that are in charge
of the proper operation of the application. And each service, or more
correctly, each system inside the Oasis application has its
own monitoring stack, propped up by the DevOps engineers in order to separate the concerns. Whether or not this was the right decision is something for the past; Jamie wasn't there when it was decided. It's just
the current situation. But the monitoring stack is something Jamie knows
and appreciates: the ELK stack (Elasticsearch, Logstash and Kibana), which is a very comfortable way of understanding what's going on inside your application.
Now, the problem with on call is that, and Jamie has been doing it for a long while, so it's ingrained in the brain, on call always brings new challenges. It's either something completely new that you've never seen before, or something that you've seen before, or another team member has seen before, but nobody documented. It might be something that is completely outside of your scope, some external system failing. But there's always something. And to properly understand why so many things happen in production and in these on-call sessions, it's important to understand the players, the people actually doing the work during these sessions.
So the DevOps engineers are usually confined to the infrastructure. They know how the infrastructure works, and they can mitigate any blocks or timeouts or problems at the infrastructure level. The support engineers are more playbook followers. They see a set of circumstances happening, and they know how to operate: twist knobs, turn switches on and off, in order to solve various problems. But just being immersed in the environment, just understanding it completely, just being part of it, is very hard due to the noise in the system. There are so many things going on at any given point in time. There are logs coming in, and support tickets, and phone calls, and customers messaging you on Slack, and other teammates messaging on Slack. It's basically lots of noise all around.
And whenever there's a problem that's remotely complex that none of the above people
can solve, developers are called, in this case,
Jamie. And in our specific case tonight, a wild problem
has appeared. Some transaction is failing. The support engineers are looking
and they're trying to handle the situation, trying to kind of make progress with it,
and they're not making headway. They get stuck. The playbooks are not working.
It's really hard to understand what's going on. And so they turn to the DevOps
team to see whether it's maybe an infrastructure problem, something not in the application logic,
but something actually inside the infrastructure itself. And the DevOps team
look at the monitoring systems, and nothing is apparent. So, as I mentioned before, when something like this happens, some remotely complex application logic problem, you call the developers, in this case Jamie. It's really important
to understand what it is that a system tells you
about itself during production. The absolute best situation
with a running system is a situation of observability. It's a buzzword; it's been thrown around many times lately. The bottom line is that it can be defined as the ability to understand how your systems work on the inside without shipping new code, and "without shipping new code" is the core part here. It's how we differentiate a system that is just monitored from a system that is fully observable. Instead of relying on adding hotfixes and adding more and more instrumentation and logs to the running application in order to understand what's going on, the system, in the way it was originally instrumented, tells you the whole story and tells you what is wrong with it, by allowing you to explore in real time what is going on. And I think it's also important to differentiate
what a monitoring and observability stack, the current tooling that we have inside our on-call situation, gives you, and what it doesn't. To be completely honest, many monitoring stacks today are really good. They tell you a lot of information, and it's easy to understand many of the problems that used to be really hard to understand. The first primitive, if you will, in this situation is the log line. Logs that are being emitted out of the application in real time are the things that tell you what is happening at any given time in the application. It's a timestamped event log of what is going on inside the application. But this is only on the application level. The application is not running in a void. It's running on a machine, or multiple machines, that have their own health situation: the amount of CPU they use, the amount of memory they consume, and so forth. And if the machine beneath the application is not feeling okay, if it's overloaded, then the application will show it. But if we go even one step further, these are two kinds of pieces of information that are a bit further apart from one another.
So the logs tell you this timestamped event log of the application, and the machine metrics tell you how the machines beneath feel. But in order to understand how the actual service, which is a piece of software that is intended to bring value to the customer, looks from the customer's perspective, there's also another set of metrics: service metrics, business-level metrics, SLA metrics, depending on the organization; they're called different things in different places. These metrics relate more to the purported value the system gives to the users. This could be things such as how good the information the system emits is, how much time it takes to emit this information, how many outages there are, and so forth. This is the value as it is perceived by the customer from the software.
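As a rough sketch of what such a service-level metric can look like in code (this is an illustration, not from the Oasis codebase, and it assumes a Micrometer-style metrics library, which the talk doesn't name):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class PurchaseMetrics {
    private final MeterRegistry registry = new SimpleMeterRegistry();

    // Business-level metric: how long an end-to-end purchase takes.
    private final Timer purchaseLatency = Timer.builder("purchase.latency")
            .description("End-to-end time to complete a purchase")
            .register(registry);

    public void recordPurchase(Runnable purchaseFlow) {
        // Wrap the purchase flow so every call is timed and exported.
        purchaseLatency.record(purchaseFlow);
    }
}
```

Dashboards and SLA alerts would then be built on top of metrics like this one.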
But this is not granular enough information. Sometimes you want to understand what's going on after you call a specific endpoint, and that's what distributed tracers are for, stuff like Zipkin and Jaeger; there are many of those tools. They allow you to inspect an endpoint and say, all right, this endpoint takes x time to respond, and it calls three downstream API endpoints, each of them taking another third of that time to respond. And this allows you to more surgically go inside and better understand how an API endpoint performs in real life.
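As a minimal sketch of the kind of instrumentation a tracer relies on (an illustration only; in practice services often use auto-instrumentation, and the OpenTelemetry API shown here is my assumption, since the talk only names Zipkin and Jaeger):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class PurchaseTracing {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("inventory-service");

    public void purchase(String userId) {
        // Each request gets a span; downstream calls become child spans,
        // so the trace shows where the time is actually spent.
        Span span = tracer.spanBuilder("POST /purchase").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("user.id", userId); // e.g. the misbehaving user 2911
            // ... call the transaction service, credit service, and so on ...
        } finally {
            span.end();
        }
    }
}
```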
But even that is not always enough. And sometimes, when the application itself is constructed above a virtual machine, like Java is over the JVM, you have a set of tools called profilers. It's not only limited to tools running over virtual machines; there are many different types of tools called profilers that work at different levels of the stack. What I'm referring to is stuff like JProfiler, things that profile the JVM and tell you how another layer of abstraction in the middle is faring: not the machines running beneath, not the application running above, but the JVM in the middle. And speaking of tools that we don't often see in production, profilers do incur a lot of overhead.
Debuggers are also another set of tooling; they allow you to explore a code path, basically step in and step out of different flows, and understand how a flow behaves in real life. There are remote debuggers that can be used in production, but usually a debugger is a tool used inside development to walk through different code paths.
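For reference (my example, not something shown in the talk), attaching a standard remote debugger to a JVM relies on the JDWP agent, enabled with a startup flag roughly like this; the port and jar name are illustrative, and on JDK 8 and older the address is just `5005`:

```
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar fraud-service.jar
```

An IDE like IntelliJ can then attach to port 5005, though breakpoints that suspend threads and an exposed debug port are part of why this is rarely done against production traffic.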
There's another subset of tooling called exception or error monitoring and handling tooling. These tools enable you to enrich the information you receive once something bad happens, so an error has happened or an exception was thrown, and they enable you to get a better grasp when things go bad, when things go out of whack. Not all exceptions are bad, of course.
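As a small illustration of what "enriching" an error report can look like (my sketch; the talk doesn't name a specific tool, I'm assuming Sentry's Java SDK here, and names like chargeCard are hypothetical):

```java
import io.sentry.Sentry;

public class BillingErrors {
    public void charge(String userId, long amountCents) {
        try {
            chargeCard(userId, amountCents);
        } catch (RuntimeException e) {
            // Attach context so the report shows the user and amount,
            // not just a bare stack trace.
            Sentry.configureScope(scope -> {
                scope.setTag("user.id", userId);
                scope.setTag("amount.cents", Long.toString(amountCents));
            });
            Sentry.captureException(e);
            throw e; // still let the normal error handling run
        }
    }

    private void chargeCard(String userId, long amountCents) {
        // hypothetical call to the external billing API
    }
}
```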
But I think the more interesting part is talking about what our tools don't tell you. And the
first and foremost thing that comes to mind is incorrect application state. This can come in so many forms: it can come as a request coming from product that was incorrectly implemented, or an incorrect request coming from product that was correctly implemented, or whatever other set of circumstances caused the application state to be incorrect, perhaps an edge case that wasn't tested for. And again, there are many different ways this could happen. This ends up being a detriment to the user's experience. The user using the application doesn't really care what we did wrong. The only thing the user cares about is that the application is not working, and our tools will not pinpoint it and say "this logic is wrong". Logic is logic. It's difficult to account for unless you tell the system exactly what it is supposed to do. And this is something that even modern monitoring
stacks don't really account for. Another thing that's hard to understand, but is more mechanical if you will, is quantitative data on the code level. So assume you're getting an OOM (OutOfMemoryError) for some reason, and there's a data structure exploding in memory, and that data structure exploding in memory is the thing that's causing the OOM. This is not something you would usually be able to determine unless you've instrumented the metrics beforehand. Understanding the size, visualizing a graph of how the data structure behaves over time: unless you actively ask for it, that is not information that will always be readily, or even at all, available to you.
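"Asking for it" usually means something like registering a gauge over the suspicious data structure ahead of time. A minimal sketch, again assuming a Micrometer-style registry and illustrative names:

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PendingTransactionsMetrics {
    private final Map<String, Object> pendingTransactions = new ConcurrentHashMap<>();
    private final MeterRegistry registry = new SimpleMeterRegistry();

    public PendingTransactionsMetrics() {
        // Export the size of the map so its growth is visible on a dashboard
        // long before it turns into an OutOfMemoryError.
        Gauge.builder("transactions.pending.size", pendingTransactions, Map::size)
                .register(registry);
    }
}
```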
Another thing that is more related to developer mistakes, or perhaps not mistakes, but just things you don't always catch during development, is the swallowed exception: a situation in which an exception was thrown and caught and nothing was actually emitted during that. So some side effect that was supposed to happen did not happen because of the exception, or maybe some piece of information that was supposed to be emitted, not even a side effect, was not emitted. And when looking at the application, trying to observe it and understand what's going on inside of it, it will be very hard, because there's a piece of the logic you're not seeing. It's literally swallowed by the application. This is not only about developer-level things.
There are cases where you use some external API, not at all under your control, some external API completely abstracted away from you, that behaves in weird ways. Maybe the format is not the format you expected, it does not conform to the specification, it times out in ways you did not expect, the state after you made a request changes on the next request, and so forth. There are so many moving pieces in production that you can always kind of ask questions, always ask to dump the information, but that usually requires asking the question: adding a log line to dump that piece of information into the logs.
Sometimes this is actually not allowed, especially when PCI compliance or HIPAA is involved. Then you would need to not log the entire object, because there will be private information there, but instead find a way around it.
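"Finding a way around it" often just means logging a redacted view instead of dumping the whole object. A tiny illustration (my example, with made-up field names, not code from the talk):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentLogging {
    private static final Logger log = LoggerFactory.getLogger(PaymentLogging.class);

    public void logPaymentAttempt(String userId, String cardNumber) {
        // Keep only the last four digits so the log stays PCI-friendly.
        String maskedCard = cardNumber.replaceAll("\\d(?=\\d{4})", "*");
        log.info("Processing payment for user {} with card {}", userId, maskedCard);
    }
}
```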
And there's also a more nuanced problem, when the application state is
correct, but the user took the incorrect flow in order to get
where the user is currently at, and it's difficult to understand how to get the
user out of that situation. This happens, for example, when requests are coming in with the wrong request parameters and the user is receiving not the experience of the application they expected to receive. This can
be as simple as perhaps a menu not opening correctly,
and can be as bad as what recently
happened with GitHub when a user actually logged into the wrong
session. So this is the best example of unexpected user flow.
The other piece of the puzzle, I guess, things that are hard to reproduce and only happen in specific conditions, not for specific users but just under a specific set of conditions, are races. Those are shared resources where the order in which two parties reach them matters. It shouldn't, because everybody should get the same treatment, or everybody should get the correct flow of operations, but sometimes races do occur, and when they happen, they're hard to reproduce. You have to wait for the exact set of circumstances for them to happen. And even then, most monitoring tools will not tell you why the race happened, or even which set of circumstances, the full set of circumstances, is needed to reproduce the race. And therefore, many races end up being left as bugs time and time over; basically, we'll just see that bug again the next time it appears.
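As a small illustration of the kind of race that's easy to write and hard to see in production (my example, not from the talk; the names are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ItemReservations {
    private final ConcurrentMap<String, String> reservations = new ConcurrentHashMap<>();

    // Racy check-then-act: two requests can both see "no reservation" and both proceed.
    public boolean reserveRacy(String itemId, String userId) {
        if (!reservations.containsKey(itemId)) {
            reservations.put(itemId, userId);
            return true;
        }
        return false;
    }

    // One possible fix: make the check and the write a single atomic operation.
    public boolean reserveAtomic(String itemId, String userId) {
        return reservations.putIfAbsent(itemId, userId) == null;
    }
}
```

Under low traffic both versions appear to behave the same, which is exactly why the racy one tends to survive until an incident.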
And one important thing
to mention about these types of problems is what happens when you move away from developing locally and you move to deploying your applications into production, when there are a lot of replicas of your application, or, perhaps, if you're a monolith, when there are specific parts of the application that get more load than other parts and affect the rest of the application. There are all these considerations when we're working with either a large load on specific pieces of the application, perhaps pieces you didn't actually work on, that belong to somebody else, but that steal resources from yours, or when you take the same application and run it across multiple instances in order to provide a better service by allowing for multiple instances of the same piece of code. Weird things can happen in the in-betweens. And I just want to show you a demo to better explain exactly what I'm talking about, to show you what these types of problems look like in the wild. But before
that, it's important to understand the context that drove us
here, basically what we know. The support engineers received the ticket, which means they know the specific customer that's acting up, and they even know the specific ID for that user: it's 2911. The DevOps team also knows that the monitoring system is not showing any problems; the metrics look okay. And I just want to remind you that there are about six services inside the current Oasis transaction system, and there's a monitoring stack based on ELK.
Let's assume that Jamie is being called into the situation, and all he
knows is that the metrics look okay, and there's a specific user id that's acting
up. This is the Kibana dashboard, or the Kibana entry screen, and that Kibana board is used heavily by the DevOps team, and perhaps even the support engineers, in order to understand what's going on. But from Jamie's perspective, if the metrics are okay, and the support engineers said that the usual things don't work, then the first thing, or the second thing perhaps, that Jamie will do is go into the log trail. So you go into the log trail and you look for the log lines for the specific user, to see with your own eyes what is happening with it. I'm going to paste that number here, and we can see that the logs are actually very clear.
It looks like there's a step-by-step, service-by-service call into each one, following the length of the transaction. So I can see that the inventory resource inside the inventory service started to validate the transaction, and it looks like the same service also tries to process the purchase. Then there's a sequence of calls to the transaction service, then the credit service, then the fraud service, then the external billing service, culminating in a kind of back propagation of everything back down the line. So, looking at the logs, they're not telling you the full story: it doesn't look like something in the transaction is wrong. It looks like the transaction finished processing correctly.
We're basically stuck. There is no extra path to go down here. There is no more information the application tells me about itself out of the box that hasn't been checked by me or by other members of my team and that explains exactly what's going on. And it really depends; the playbooks differ here from on-call team to on-call team. There are some people that have specific tooling used to solve these types of problems, basically to get more information. Some people throw the same environment up over staging. Some people have a ready-made database with shadow traffic or shadow data that enables them to test things in a kind of safer environment. Some people just connect a remote debugger and walk through the flow piece by piece using that specific user's context, because it's allowed. It really depends on the exact situation that you have. And I
just want to mention and kind of focus
on this feeling of helplessness, if you will. Right.
It's not clear what's going on, and it's really annoying.
It's not something you accounted for. Something is happening,
and the application is not telling you what it is. But Jamie
is a cool headed, very down to earth person,
and the code itself is the next place to
go. So the next thing is indeed to go into the code, and I'm going
to open my IntelliJ, which is simulating Jamie's IntelliJ, and I'm going to start walking through the code in order to
understand what exactly is going on. And the first thing
is looking for the inventory resource, which is the thing that starts
all the things. I'm going to look for inventory resource, and you
can see there's a controller here and, as expected, it's making a call to another microservice, in this case using a Feign client. Looking in, it looks like it is indeed calling the transaction service.
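For readers who haven't used it, a Spring Cloud OpenFeign client is just an annotated interface. A minimal sketch of the kind of client the controller might be calling could look like this (illustrative names, not the actual Oasis code):

```java
import java.util.Map;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;

// Spring generates the HTTP call to the transaction service behind this interface.
@FeignClient(name = "transaction-service")
public interface TransactionClient {

    @PostMapping("/transactions")
    Map<String, Object> submitTransaction(@RequestBody Map<String, Object> request);
}
```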
Now I'm going to look for the same kind of resource inside the transaction service. So I'm going to go for the transaction resource and, again, unsurprisingly, the transaction resource is immediately calling another client, which is calling the credit service.
Now, at this point, Jamie is experienced, and knowing that these services kind of go one after the other, Jamie decides to go all the way to the end, to the final service that is actually doing the work, which in this case looks like the external billing service. There's an external billing resource here, so let's see whether the problem is at the actual end, basically the end of the road, as they say. And when Jamie is looking at the controller, it does appear as if the transaction is finished correctly. It doesn't look like there's some sort of problem here. So Jamie has a choice. It's possible to go back to the last service we looked at, which is the credit service, so we can go back from the end. Again, it's a matter of decision; in really long chains it might make sense to go a bit from the beginning and then a bit from the end and backwards, which is what Jamie decides to do. The service before the external billing service is the fraud service,
so let's look for the fraud resource, and immediately the problem becomes apparent. There is a check-for-fraud method here that throws an exception. And to those familiar with Eclipse (this is IntelliJ), this is what happens when you auto-generate a catch block in Eclipse: there is a try and a catch, and you're expected to fill in the catch, and the developer did not fill in the catch. What should probably happen here, I assume, is that the developer might, after checking for fraud and the exception being thrown, propagate some error downstream. There are a lot of ways to handle it, but the bottom line is that an exception was thrown and caught, and we were none the wiser.
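A hedged reconstruction of the pattern Jamie is looking at, with illustrative names rather than the real Oasis code, might be something like this:

```java
public class FraudResource {

    public void processTransaction(String transactionId) {
        try {
            checkForFraud(transactionId); // throws when the check fails
        } catch (Exception e) {
            // TODO Auto-generated catch block -- never filled in, so the failure
            // is swallowed: nothing is logged and no error is propagated.
        }
        // Execution continues as if the transaction were perfectly fine.
    }

    private void checkForFraud(String transactionId) throws Exception {
        throw new Exception("fraud suspected for transaction " + transactionId);
    }
}
```

At a minimum, the catch block should log the exception and propagate a failure downstream, so the broken transaction shows up in the log trail Jamie was reading earlier.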
And I think this is basically a good time to stop the demo and go back
into the problem at hand, which is how to
make a system observable. We talked about the fact that observability can be defined as the ability to understand how the systems work without shipping new code, and in this specific situation, you see the frustration. It would be really hard to understand what's going on without shipping new code to investigate. This was a very short, very simple application, and it took me, or rather Jamie in this situation, precious minutes of a customer complaining about the application to investigate. And this, again, is a small application, a small use case. This happens way more often in larger systems, with many different developers working on the system and the system being very complex. So perhaps there is a possibility to talk again about how we might be able to account for all these things our tools don't tell us. How can we make sure that during an on-call situation we would be able to ask more questions and get more answers, without resorting to diving through the code in order to understand exactly what is going on, but by getting real production information and understanding what we can do with it? And I think
the best way to define this new practice is continuous observability,
which is this streamlined process we can have for asking new questions
and getting immediate answers. And there are
a few tools in many different areas that can help you depending on
how deep in the stack you want to go, from which perspective you want to
observe. I work for Lightrun. We build a continuous observability toolbox.
You're more than welcome to look inside of it, but the
core of the issue here is that you should have some way of
exploring these situations in real time. And up until then,
I do have a few suggestions for you on how you can win in these
situations and make the experience not suck as much as it did for Jamie.
Basically, looking at the information, or looking at the application, and being very frustrated that they have to dive back into the code again to understand what's going on. And the first thing is, you've got to choose to love it. It's rough, and on call is not fun for anyone, but being angry and self-absorbed and being totally annoyed with the whole situation is not going to help. Step out of your shoes for a second. Remember that everybody's on call, including the DevOps engineer in front of you and the support engineer on the other end, and be a part of the team. Work together.
But having said that, as a developer, you have the obligation to be as prepared
as possible. You know how the application is supposed to behave,
so you should be able to answer their questions much faster, instead of trying to kind of rummage through your things to get an answer. The DevOps team lives inside the monitoring systems, they know them, and the NOC team knows the tooling very well, in and out, and they can answer your questions. So you should be prepared the same way.
And I think what really helps with preparing is documenting. So you see something in an on-call session and you think it's a singular occurrence and you will never see it again. You are incorrect. Sit down. Document it for your future self, right, for three, four, six months down the road, when it's going to come up again and you're going to regret that you haven't written it down. Write everything up, make sure it's prepared, and come with it in a searchable or indexable form for your next session. And speaking of
another set of tooling, I mentioned them before: a lot of teams have these more, let's call them case-specific tools, emergency debugging tools that are used for debugging specific problems that tend to happen in their own systems. They can take the form of tooling built internally, they can take the form of shell scripts, they can take many different forms. And it's on you to make sure that you have them available; having them enables you to solve the problems much quicker. And you won't know that you need these tools, or be able to have these tools there to help you, if you don't write proper postmortems. And that's not
just documenting the steps that you took; that's documenting why the situation occurred, what steps you took to solve it, and what steps are needed in order to make it never happen again. There's a full process here, and a good postmortem is composed of information from all the different teams. What is missing in the playbooks for the support engineers? What is missing in the dashboards of the DevOps team? What tools would the developer need to have that would make it easier for them to solve the problem? And having
said that, there's also this kind of humility
that developers should take here, because if you're running a
large enough system, there are people dedicated to just keeping it healthy. So if there's
something that relates to the underlying infrastructure, ask questions.
Go ahead, be part of a team, and know that you don't have to know everything. It's perfectly fine to not know all the things and to ask questions first, and also to be blunt about things that are just not under your control. If it's obvious that the application is behaving correctly but it's missing some resource, talk to the DevOps team. Explain to them that this resource is missing, this is why, please help me. I think this is pretty much it for this talk.
I'm really happy you guys joined me. Feel free again to reach out to me over Twitter or over GitHub. You can open an issue in one of my repos, but that's not a desirable method of communication; it's much better if you use Twitter or my email, tom at lightrun. Please make sure to check out Lightrun. If you find this talk interesting, I think you would be pleasantly intrigued by everything we have to offer. See you again soon.