Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I'm Shai Almog.
Today we'll talk about developer observability,
which I think should be the fourth pillar of observability.
But first let me say a couple of things about me.
I wrote a few books, including one about debugging,
which covers all of the topics we'll discuss here.
I've worked in this industry for decades, in many companies
and as a consultant. I also worked
at Sun, which became Oracle, etc.
You can contact me over my socials listed here and follow me
on Twitter, LinkedIn, etc. Check out my
YouTube channel and my blog for videos and posts in
this style. This is my book,
Practical Debugging at Scale, published in April. Everything in today's
talk is in there, and a lot more.
I have another book about Java coming out in a couple of months;
it will cover Java 8 to 21.
Let's jump right into the talk.
Most of us know the three pillars of observability.
They are logs, which are mostly written by
the developer but ingested and managed by Ops;
traces, which are usually pretty seamless for developers, who
quite often aren't even aware they exist;
and metrics, which help us measure and quantify
pretty much everything but our application.
They are all great and essential as
part of a healthy production environment.
But they all have a few drawbacks.
The first is the fact that they are static.
I can't add a log into a production system.
A developer needs to make that change, then go
through a process to add the log, and
that takes a while. This is
true for metrics and traces as well. While traces
are typically seamless, that is not universal,
and it's sometimes hard to understand a trace
without custom spans. This is best
summed up in a visual. This is
a familiar pattern: we have a problem in production,
so we need to add more information, a log,
a metric, or sometimes more. Then we need
a pull request or a similar review process.
This can take a while. In some companies we have
double reviews which can really stretch the time.
Then we go through CI/CD and possibly a QA
process beyond testing to finally get
that code into production. This whole process can
take days or sometimes more. If we don't have
a fast CD cycle, then in production
we need the user to reproduce the problem.
This might be a flaky problem that's
hard to reproduce. This might take a while.
Then when we review the problem, it is often the case
that we didn't log enough. We don't have the
exact information that we need.
We then need to go all the way back to square
one and start that cycle all
over again. That's the
CI/CD cycle of death.
I use this term a lot and every time I describe
this I get a lot of nods from the crowd. We all
know the story. It's universal.
We all suffer through that cycle when tracking an elusive
bug. It's a deep pain in our industry,
but things can be worse.
Yes, worse than this painful cycle.
The solution is often to log
more just in case. This seems like a sensible
solution. We solve the problem with lack of data by
adding a lot of data. To be fair, that does
work on some occasions, but it is one
of those cases where the cure is worse
than the disease. Logging ingestion
alone can account for a third of
the total cloud costs. That is often much
more than other costs combined.
Logging seriously impacts application performance.
This has a cascading effect
of requiring additional resources,
slowing the application, et cetera.
Other observability tools have a similar impact on
performance and on storage.
This was discussed a while back on Reddit, and
I love this quote from one of the posters:
A team just set a log level too
high and burned through $100,000
in days. This is a
very common scenario, although this is indeed
extreme. Overlogging can kill
projects, companies, jobs, and even
the rainforest. The amount of pollution
produced by overlogging and wasted resources
is absolutely frightening.
This might have been worth it if it actually solved
the problem, but more often than not,
it doesn't really help. We can't log everything,
due to privacy and security concerns.
Looking over a huge mess of logs and metrics
makes the process of tracking an issue into
a needle in a haystack. At best,
it slows us down. At worst,
it consumes a lot of redundant memory
and a lot of storage, but is still
missing the valuable information that we
need, because, as I said, we can't
truly log everything. We can't log
the truly valuable data.
Another limitation is the heavy focus
of these tools on DevOps. Developers write
logs, but they're ingested and handled by DevOps.
Production issues are often handled by SRE.
This disconnect means that as a developer,
you would need to log something that someone whose
job you don't understand fully
would find useful. That's problematic.
Furthermore, the tooling is very much focused
on the DevOps point of view. Instead of dealing
with source code from the IDE, the tooling talks
about agents, entry points and other ideas that
are less familiar to developers.
With the shift to microservices and serverless,
systems
became resistant to debugging.
In fact, the only way some developers
can check their code is through tests.
That isn't ideal. It means that when they have an
unforeseen problem, they need to use tools designed
for DevOps to understand the problem, and then
try and create a test case for that problem.
This is a major step backwards.
Developers need observability just as much as
DevOps. While a vast majority of production
problems could be handled by Ops, some of the
hardest problems to fix are bugs in
the code. I'm not talking about crashes.
That's where most of us go automatically when thinking
about production problems. Most production problems
are things like a cache miss in the code or a stale cache:
an item is missing from a listing, it only
happens in production, and we have no clue how to fix it.
Existing observability is usually very opaque in
such situations. Ops don't even know much
about such issues. They might flush the cache,
but that's a blunt instrument and a poor workaround.
Developers need their own observability,
but it needs to be different from today's
observability. The first principle of
developer observability is to meet developers
where they are. Working in the IDE
isn't a requirement. Some of these tools work in
the browser, which is also fine, as long as
they use terms and environments that are familiar
to developers. In such a tool,
we would inject a log into a line of code.
We discuss metrics in terms of specific
lines of code, not in terms of spans,
entry points, et cetera. Ideally this happens
directly in the IDE, since that's where developers spend their
time. But the bigger thing is the ability
to inject observability right into production
code. That means I can add
a new log, metric, or snapshot without
going through the whole cycle like before.
Remember this diagram? This is pretty complex.
With developer observability we can simplify this considerably.
We can remove two stages from the process.
Developers can instantly inject a log or
a metric to production without any
changes to the code itself. We can then
reproduce the problem while coordinating with the
end user experiencing it. I can chat with
a customer while they reproduce the issue. The great
thing is, if I don't have all the information,
they are still on the line; I can
immediately add another log or metric and ask
them to try again.
changes the way we look at production.
I used a very loaded word there,
injecting. In fact, when I was working for
a developer observability company, I was
prohibited from uttering the I word.
It's a scary word. It means we
change code in production and the typical association
we have with that word is very negative.
Injecting bugs, changes,
or even injecting a security vulnerability.
I get exactly why my employer didn't want
to be associated with that word, but this
isn't the only tool that uses injection to implement functionality,
so it's not the end of the world.
To be fair, though, security is a major concern.
Most of the tools in the field have similar approaches
to security. A
key aspect of the security is
the management server. As developers, we don't
access production. That is the job
of DevOps. It's segregated: the developer
observability management server is accessible to the developers
like any other observability server. The agents in
production communicate only with that server.
That is pretty familiar if you've worked with other observability
tools, but it is very different from other developer approaches
such as remote debugging. This means that
even if there is a weakness in the
injection code, it would be very hard to exploit,
as even the developers don't have direct access to
the backend servers.
some developer observability solutions enclose the
solution in a sandbox which executes everything
in a controlled environment. Let's say I
add a log and it takes up too much CPU,
or I add a conditional metric that
tries to modify the application state in the conditional
statement. Some developer observability tools will detect
both of these scenarios and limit the amount of resources
or block execution entirely.
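The sandbox behavior described above can be illustrated with a small sketch. This is my own hypothetical illustration in Python (real tools do this at the agent or bytecode level, and the class names and quota values here are invented): injected expressions are rejected if they would mutate state, and evaluation is paused once a CPU-time quota for the current window is exhausted.

```python
import ast
import time

class SandboxViolation(Exception):
    """Raised when an injected expression tries to mutate application state."""

# Statement types that would modify state; purely an AST-level illustration.
FORBIDDEN_NODES = (ast.Assign, ast.AugAssign, ast.Delete,
                   ast.Import, ast.ImportFrom, ast.Global)

def check_read_only(expression: str) -> None:
    """Reject expressions containing assignments, deletes, or imports."""
    tree = ast.parse(expression, mode="exec")
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            raise SandboxViolation(f"state mutation blocked: {type(node).__name__}")

class CpuBudget:
    """Pause injected log evaluation when it exceeds a CPU-time quota."""
    def __init__(self, quota_seconds=0.01, window_seconds=1.0):
        self.quota = quota_seconds
        self.window = window_seconds
        self.spent = 0.0
        self.window_start = time.monotonic()

    def evaluate(self, expression, variables):
        now = time.monotonic()
        if now - self.window_start > self.window:   # start a new accounting window
            self.spent, self.window_start = 0.0, now
        if self.spent > self.quota:                 # over budget: skip the log
            return None
        check_read_only(expression)
        start = time.process_time()
        result = eval(expression, {"__builtins__": {}}, variables)
        self.spent += time.process_time() - start
        return result
```

With this in place, `evaluate("i * 2", {"i": 21})` succeeds, while an expression like `i = 5` is blocked before it runs.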
Since all access to the system is done through
a backend server, it's a trivial matter to keep
an administrative log. That means we can
track every operation performed by any user.
There is always a record. If a user tries
to steal private information, it will be logged and
can be used as evidence.
Some information is problematic, such as
credit card numbers. That is called personally identifiable
information, or PII for short.
We must remove such information from logs,
sometimes by law and sometimes by regulation.
Ideally we will catch that in the review,
but if a log is injected, it might accidentally
print something that shouldn't be printed.
We can recognize those patterns and automatically
block them from logging. This is done with
the PII redaction functionality supported
by some tools. But the most important
feature for security is block lists.
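The pattern-based redaction just described is, at its core, a set of regular expressions applied to every log message before it is emitted. A rough sketch in Python, with hypothetical patterns and placeholder names of my own invention (real tools ship curated, configurable pattern sets):

```python
import re

# Hypothetical patterns; real PII redaction uses curated, configurable sets.
PII_PATTERNS = [
    (re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"), "[REDACTED-CARD]"),    # 16-digit card numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED-EMAIL]"),  # email addresses
]

def redact(message: str) -> str:
    """Scrub recognizable PII from a log line before it is emitted."""
    for pattern, replacement in PII_PATTERNS:
        message = pattern.sub(replacement, message)
    return message
```

An injected log that accidentally prints a card number would then emit the placeholder instead of the sensitive value.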
Imagine a disgruntled developer within our organization.
That developer can add a log to the user login
code and print all the usernames and passwords.
By the time we notice it in the administration log,
they might be in a different country with all of the
ill-gotten gains. We can stop that with a block
list. With it we can block a developer
from logging or adding metrics to a specific set
of sensitive files, classes or packages.
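Conceptually, a block list is just a gate the management server applies before accepting any injected action. A minimal sketch, with hypothetical package names of my own (a real tool would let administrators configure these per file, class, or package):

```python
from fnmatch import fnmatch

# Hypothetical blocklist: no dynamic logs or metrics in auth or billing code.
BLOCKED_TARGETS = ["com.example.auth.*", "com.example.billing.PaymentProcessor"]

def action_allowed(target_class: str) -> bool:
    """Reject any injected log, metric, or snapshot aimed at a sensitive class."""
    return not any(fnmatch(target_class, pattern) for pattern in BLOCKED_TARGETS)
```

With this in place, an attempt to inject a log into the login code is rejected before it ever reaches production.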
I think we've had enough theory. Let's do a
short demo of one such product, Lightrun. As
a disclaimer: I used to work there, but that was last
year. I no longer do. On the left side
you can see IntelliJ IDEA, my IDE of
choice. On the right side I have an
application that counts prime numbers, running on
a remote server. We can see the console of
that demo. The application doesn't print any logs
as it does the counting, which makes
it hard to debug if something didn't work there.
In the middle we can see the currently running agents,
which are the server instances. We also see tags
above them. Tags let us apply an action to a group
of server processes. If we have 1,000 servers, we can assign
the tag production to 500 of them and then perform an
operation on all 500 by performing it on the tag.
A server can have multiple tag
designations such as East Coast,
Ubuntu 20, green, et cetera.
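The tag mechanism just described boils down to fanning one action out to every agent carrying a given tag. A toy sketch, with invented names (real agents register with the management server rather than living in a list):

```python
class Agent:
    """One server process registered with the management server."""
    def __init__(self, name, tags):
        self.name = name
        self.tags = set(tags)        # e.g. {"production", "east-coast"}
        self.actions = []            # injected logs/metrics active on this agent

def apply_to_tag(agents, tag, action):
    """Apply one observability action to every agent carrying the tag."""
    matched = [a for a in agents if tag in a.tags]
    for agent in matched:
        agent.actions.append(action)
    return matched

fleet = [
    Agent("srv-1", ["production", "east-coast"]),
    Agent("srv-2", ["production", "ubuntu-20"]),
    Agent("srv-3", ["staging"]),
]
```

Applying one action to the tag `production` here reaches two of the three servers, which is exactly how one operation scales to hundreds of processes.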
This effectively solves the scale problem typical
debuggers have. We can apply observability operations
to multiple servers. Here we have only one tag
and one server process, because this
is a demo and I didn't want to crowd
it. I can add a new log by
right-clicking a line and adding it.
I ask it to log the value of the variable i,
and it will just print it to the application log.
This will fit in order with the other logs,
so if I have a log in the code, my added
log will appear as if it was written in the code next to it.
They will all get ingested into services like
Elastic seamlessly, or you can pipe
them locally to the IDE.
So this plays very nicely with existing
observability while solving the fact that traditional
observability isn't dynamic enough.
The tools complement each other,
they don't replace one another.
Notice I can include complex expressions like method
invocations, et cetera, but Lightrun enforces
them all to be read-only. Some developer observability
tools do that, while others don't.
But the thing I want to focus on is this:
notice the log took too much CPU, and
Lightrun paused logging for a bit so it won't
destroy the server performance. Logs are restored
automatically a bit later, once the
CPU isn't depleted. This is the
sandbox I was talking about earlier.
With developer observability, we can add debug information
in areas that normally wouldn't make sense. Since
the information will be removed once we're done,
it isn't a big deal. A log that might be too expensive,
as it will blow up ingestion costs because
it's on a line that is invoked very frequently,
can be added for a few
minutes and then removed. That isn't a problem,
but the most important aspect of developer observability
is insight at a developer level.
DevOps know the features that
are used frequently, but they can't
tell if a specific method or block of code
is reached. With developer observability, we can detect
if a block of code is used and get applicable
statistics. If we're considering a
code change, we can evaluate the risk and
reward beforehand by adding a
metric to that block.
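A "was this block reached?" metric is conceptually just a named counter incremented at a specific block of code. A toy sketch with hypothetical names (a real tool injects the counter without touching the source, here a decorator stands in for the injection):

```python
from collections import Counter

hits = Counter()   # metric store; a real tool would feed existing dashboards

def counter_metric(label):
    """Decorator standing in for a counter injected at a block of code."""
    def wrap(fn):
        def inner(*args, **kwargs):
            hits[label] += 1           # the injected metric
            return fn(*args, **kwargs)
        return inner
    return wrap

# Hypothetical candidate for removal: is this path still used in production?
@counter_metric("legacy-discount-path")
def apply_legacy_discount(price):
    return price * 0.9
```

If the counter stays at zero for a few weeks of production traffic, removing the block is a low-risk change; a steadily climbing counter tells us the opposite.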
Developer observability is a new tool
for a new audience, but it's still an
observability solution. First and foremost,
when you inject a metric, it integrates with your
existing dashboards. When you inject a
log, it integrates with your ingested
logs. Developer observability is about
making the crucial benefits of observability accessible
to a new crowd, a crowd of
developers, which is the most important
goal. When I give talks
to developers, I often ask them about
observability, and a surprisingly
small number of developers are actually
using observability tools on a day-to-day
basis. They hear about observability solutions,
they know about them, but they don't truly use them.
Developer observability is a way to
open the world of observability to
the developer community at large. And this
is the time in which developers truly need
these sorts of solutions. With the migration
to microservices and serverless,
they are figuratively blinded by
these new architectures, unlike before.
Thanks for bearing with me. I hope you enjoyed the presentation.
Also, check out debugagent.com, my book
and my YouTube channel where I have many tutorials
on these sorts of subjects. Thank you.