Transcript
This transcript was autogenerated. To make changes, submit a PR.
Are you an SRE, a developer, or a quality
engineer who wants to tackle the challenge of improving reliability
in your DevOps? You can enable your DevOps for reliability with
ChaosNative. Create your free account at ChaosNative Litmus Cloud.
Hello
everyone, glad to be here at SRE 2021.
And today I'd like to share with you the journey that led
us to start using distributed tracing in handling performance
issues in our system. And I'd like to give you some practical
tips on how you can start with distributed tracing and how to
effectively use it in your performance investigations.
A word about myself: my name is Dotan Horovits. I'm a developer
advocate at Logz.io. I've been around as a developer, a solutions architect, a product guy.
I'm an advocate of open source and communities in
general, and the CNCF, the Cloud Native Computing Foundation, in particular.
I organize a local CNCF chapter in Tel Aviv, so if you're around, do come and visit one of our
monthly meetups. I run the OpenObservability Talks podcast, check it out, and in general
you can find me everywhere at @horovits. So if you're
tweeting something interesting out of this talk, feel free to tag me.
And about us: we pride ourselves in being the
masters of observability: logs, metrics, traces, all around.
But we weren't always observability masters.
So let's rewind a couple of years back, before we had tracing.
Back then we were the best. We were the masters
of logs and metrics. Our system was instrumented inside and out with logging and metrics.
We're talking about a system with a microservices architecture,
multi-cloud, on AWS and on Azure,
multi-region, multi-tenant, running at high scale.
And we felt we had it all figured out until
the performance alert.
So it was a UI performance issue, coming from the metrics, obviously. Looking into the metrics,
we saw that the application started loading a little slower, with many requests taking longer
to respond. And now, as anyone here in the audience knows, SRE, DevOps, dev, whoever,
you need to figure out where it comes from.
And our service handles thousands of requests per minute, with
a serious amount of internal server communication.
So these are the conditions we needed to investigate under.
And the first attempt, as always, is to
turn to the logs for answers.
But this time it wasn't that easy. First of all,
we had tons of identical data,
but worse, it seemed like random, degrading performance issues in different
flows all around our application. It was really hard
to point to one thing in particular,
and even harder to connect the problem to a specific service.
What made it more difficult to investigate through the logs is
that the request sequences were pretty long.
It was really hard to get the details from the logs and
figure out what called what and how the flow goes. Bottom line, the logs
weren't enough for this investigation.
So the second attempt was to enhance
the logs. The idea was to add a
request id to each and every one of the logs, so that we
can look at all the logs and visualize all the logs of the same request in Kibana.
Before, we couldn't really understand which logs belonged to a specific problematic request,
where some requests were fine and some behaved oddly.
This is a way to bucket them, and it sounds pretty simple,
right? Add the request id, then visualize based on the request id in Kibana. So it
wasn't that simple. First of all,
it was very difficult to handle the propagation of
the request id. We needed to go into each
and every piece of code, every service and microservice that we suspected had to do
with this request, and add an incoming parameter to propagate, to bring in the request id,
then obviously put the request id as part of the logs, but also
send it downstream, in any downstream call from that service to any other service
or any other operation of the same service. So we needed to handle all this request id propagation.
A big headache. So that's
in the coding. And then on the visualization side, it was pretty
painful to visualize traces using Kibana. Essentially what we got
is a bucket of all the logs that share the request
id, which is nice, but you can't really understand
which invoked which and the relationship between them. And also
it was pretty limited in context: we had the context of all of them
belonging to the same request id, but not much more than that.
What we really needed to see was the communication between
the microservices. So we needed better visualization
options to track a request and analyze it
throughout the request flow.
On the coding side, we also needed easier instrumentation because
it was a real, real pain for us, and there was
no chance that we'd be able to convince other teams in charge of other
microservices to go along this path and start adding these parameters
and handling this request id propagation. So definitely
easier instrumentation was another challenge.
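Just to give a sense of the manual plumbing this request id approach required, here's a minimal hypothetical sketch of the kind of Express middleware and forwarding we had to hand-roll; the route, downstream URL and field names are made up for illustration:

```javascript
// Hypothetical sketch of hand-rolled request id propagation (names are made up).
const express = require('express');
const axios = require('axios');
const { randomUUID } = require('crypto');

const app = express();

// Accept an incoming request id (or generate one) and keep it on the request object.
app.use((req, res, next) => {
  req.requestId = req.headers['x-request-id'] || randomUUID();
  next();
});

app.get('/dashboard', async (req, res) => {
  // Every log line has to carry the request id explicitly...
  console.log(JSON.stringify({ requestId: req.requestId, msg: 'fetching user data' }));

  // ...and every downstream call has to forward it by hand.
  const response = await axios.get('http://user-service/profile', {
    headers: { 'x-request-id': req.requestId },
  });
  res.json(response.data);
});

app.listen(3000);
```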
And then we reached the third attempt. In fact, third time's a charm.
We reached the realization that we need proper distributed tracing.
Distributed tracing is in fact the tool for
pinpointing where failures occur and what causes poor performance.
Then you can ask questions: where did the error occur?
Where did the performance degradation occur? Where's the critical
path? Those sorts of questions are exactly what distributed tracing is meant
to answer. And how does it work?
For those in the audience that are not fluent in distributed
tracing, the way it works is that each
call in the call chain creates and emits
a span for that specific service and operation.
You can think of it as a structured log for that matter.
And this span includes a context, the trace
id, the start time and the duration of that specific span,
the parent span that called it and so on. And this context
propagates between the calls in the trace through the
system. And that's thanks to client libraries,
SDKs, tracers as they're known, that
take care of that. So if you look on the screen,
on the left hand side, in this tree graph,
A, B, C, D, E are all spans, each a specific service and operation.
And we can see A is the root span, the one that got the
client request, and then A calls B, then B
calls C and then D, and then, getting
back to A, A calls E. And so the context propagates from
A to B to C and so on. So this is the context propagation.
And each such span, A, B, C and so on,
when it ends, emits the span outbound,
and then there's the backend that collects all these spans
that my application emits, reconstructs the trace
according to causality, and then visualizes it.
The typical visualization is the famous timeline view, or Gantt chart.
You can see an example here on the right hand side of the screen,
the same A, B, C, D, E on a timeline. So A, the root span, obviously takes
the whole time from beginning to end, because this is the full request.
But then A called B, B called C and then D,
and then when D finished, when B finished, then A called E,
and then going back to A and finishing the entire request.
So it's very easy and very clear to see that from
a timeline view. And this is why it's a preferred view for investigating
performance issues in the system.
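To make that concrete, here's a minimal sketch, using the OpenTracing API for Node.js, of how spans like A and B are created and how the context gets carried along; the operation names are made up, and a real tracer is assumed to be registered at startup:

```javascript
// Minimal sketch of spans and context propagation with the OpenTracing API.
// Operation names are hypothetical; a tracer is assumed to be registered at startup.
const opentracing = require('opentracing');

const tracer = opentracing.globalTracer();

// A is the root span, created when the client request comes in.
const spanA = tracer.startSpan('A: handle-client-request');

// B is a child of A, so the backend can later reconstruct the causality.
const spanB = tracer.startSpan('B: fetch-user-data', { childOf: spanA });

// When B calls another service over HTTP, its context is injected into the headers
// so the downstream service can continue the same trace.
const headers = {};
tracer.inject(spanB.context(), opentracing.FORMAT_HTTP_HEADERS, headers);
// ...make the downstream call with these headers...

spanB.finish();
spanA.finish();
```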
So we realized that we need distributed tracing, which is great.
Now the question is which distributed tracing
tool are we going to use?
And we are open source advocates. We use the ELK
stack for logging and Prometheus for metrics and so on.
And obviously we turned to open source for tracing
as well. We looked into Jaeger and Zipkin, two open source candidates.
In fact we started using both of them, but ultimately
we chose Jaeger. We still have Zipkin somewhere in our system,
but the main path is Jaeger, which seemed to be the leading choice. And also today,
recent surveys still show that Jaeger is by far the leading choice.
Here you can see on the screen, from the DevOps Pulse 2020,
a yearly survey we run in the DevOps community,
that 33% or more of those using tracing use Jaeger.
So Jaeger is the choice. And what's
Jaeger? Just in a nutshell, Jaeger was
built by Uber, and then they open sourced it and contributed it to the CNCF
in 2017. And then it reached graduation in the CNCF by 2019. So it's a
pretty mature project used in production by
quite a few companies and organizations.
I won't go into lots of details about Jaeger here, but if you
are interested, I had the privilege of hosting Yuri Shkuro,
the creator of Jaeger, on this month's episode of OpenObservability Talks.
It was a fascinating talk, so do check it out, the September 2021 episode,
for the latest on the Jaeger project and on distributed tracing in general,
including some very advanced
use cases for tracing that people usually don't consider.
So Jaeger tracing is the choice. And now, before we can start investigating with tracing,
we need our application to actually generate trace
data. So let's instrument our application.
And we realized that we can't instrument everything from day one,
especially since we had a burning issue to handle,
the performance investigation. So we
took baby steps and started with two
front end services, Node.js services.
We used Jaeger's tracer for Node.js.
If you remember, Jaeger has a tracer SDK
per language, so we took the one for Node.js, we used the OpenTracing
API for the code, and we
instrumented all our Node.js HTTP layer with tracing,
as well as the Node.js frameworks that we use, Hapi and Express.
A word about the OpenTracing
API. It's another open source project under the
CNCF that is essentially an open and
vendor neutral API. It's not part of
Jaeger, it's another project. And the advantage is that since
it's a standard API, it means
that if later down the road we decide to switch
the backend from Jaeger to another type of investigation and analytics backend,
it should be pretty simple: just replace the tracer library or the auto instrumentation agent,
and that's about it. So we shouldn't have too many code changes to do.
So going with an open standard is a recommendation.
It is important to note that today, or these days,
OpenTracing is actually deprecated. OpenTracing was
merged with OpenCensus to create OpenTelemetry.
So OpenTelemetry is the future path for instrumentation if you
start today; I'll talk about it later. But back when
we were there, the mature API
was OpenTracing and we went with that.
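Just for illustration, here's a minimal sketch of what that kind of setup roughly looks like with the jaeger-client tracer and the OpenTracing API for Node.js; the service name and config values are made up, not our actual configuration:

```javascript
// Minimal sketch: initializing a Jaeger tracer for a Node.js service and
// registering it as the global OpenTracing tracer (config values are hypothetical).
const { initTracer } = require('jaeger-client');
const opentracing = require('opentracing');

const tracer = initTracer(
  {
    serviceName: 'frontend-service',            // hypothetical service name
    sampler: { type: 'const', param: 1 },       // sample everything, fine for a first step
    reporter: { agentHost: 'localhost', logSpans: true },
  },
  {} // tracer options: logger, metrics, tags, etc.
);

// Register it globally so instrumented code and libraries can pick it up.
opentracing.initGlobalTracer(tracer);
```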
And once we have the application emitting
spans, generating trace data, we can go
ahead and start investigating our traces.
And we opened the Jaeger UI, and
at first glance it immediately gave away two
major issues that had long been in our code, but we
didn't see them until now. It was astonishing to
actually see them, and I want to go over that with you.
The first thing we saw was what you
can see here on the screen. This is the Jaeger UI's timeline view,
and you can see here this sort of staircase type of pattern.
This is an obvious pattern that indicates a serial
call sequence: call A, then when it ends call B,
then when that ends call C, happening sequentially. But
in our case, in our system, there was no real need
for it to be sequential. We could easily turn it into
running concurrently and reduce the overall latency.
So we went ahead and made this change. It was very, very easy to
do. And the result was a significant application
performance improvement, right off the bat. So really
amazing. But this,
despite the improvement, wasn't the root cause of,
if you remember, the spike that we talked about before.
So we needed to carry on with our investigation to look for the
root cause of what triggered the original performance
issues in the UI. So we went
ahead and filtered the results according to the minimum execution
duration, which brought up another performance
issue. We saw one request that fetched user
data, but its duration
varied significantly. In some cases it was as short as
100 milliseconds and sometimes it was almost 10 seconds long. So two orders
of magnitude. We couldn't really figure out why it changed so
much between invocations, but one thing
for sure is that this request executes from the browser and had to
finish before we could render the screen to the user.
So that led to loading times of up to several seconds.
Definitely something unacceptable for UX, for user experience. So this is something that
we needed to investigate as part of the root cause analysis.
And looking at this request, we saw that this request triggers
concurrent requests to fetch all the relevant user data from various
endpoints, where each of those requests
goes through several other microservices.
So it was pretty elaborate. And some of these downstream microservices
were not even instrumented yet, reaching the back end code in
Java, the database and so on. And what
we found was one key that is
read for some user requests and was
mistakenly not cached. And this triggered a call
to the database to retrieve it, causing the latency only on the
calls requiring this particular key.
This is why the behavior was so erratic
and inconsistent and tricky to put the finger on without
tracing. So that was a real aha
moment.
One side benefit that we got out of it, that we didn't even
plan on, is that we realized that
Jaeger actually provides us with a live system
diagram. We don't need to maintain a
sequence diagram anymore and update it every couple of weeks. Jaeger auto-detects
the calls between the microservices
and generates a live system diagram. It's always
up to date, so that's a definite side
benefit. It convinced a lot of our engineers and the architects
and anyone else in the company to adopt tracing,
to finally understand how our microservice architecture works and the interconnections
and everything around it.
It was love at first sight. Since then,
tracing has become a standard practice in the company, not just
on the front end and Node.js, but also on the back end
and Java code and the databases and
frameworks that we use.
We realized that tracing is crucial for observability.
And considering ourselves, or wanting to be, the
masters of observability, we started offering it
to our own users.
So I'd like to move from our
own experience to some useful tips on how you can
get started with distributed tracing, both on
how to instrument and how to investigate
and find common patterns to watch out for.
Let's start with the instrumentation.
Don't try to instrument every part of your system from
day one. If you have even a modest microservices architecture,
it will be too complex.
Prioritize the critical services where you need observability the most
and grow the coverage over time. It is very common
to start with request endpoints in the front end and then work your
way to the back end according to the sequence.
But it really depends on your architecture and what's
critical in your system.
So prioritize your instrumentation.
And the next tip on instrumentation is: use
OpenTelemetry and auto-instrumentation.
As I said before, OpenTracing, which we used back then,
was merged into OpenTelemetry. So now OpenTelemetry,
or OTel as it's commonly called, is the
standard. And OpenTelemetry is an
observability framework. It provides
libraries, agents, collectors and other components that
you need to capture telemetry from
your services. And it's not just for traces, it's also
for metrics and logs. So all the different telemetry
signals are under the OpenTelemetry framework.
It generates, it collects, and then
you can send it to any backend of your choice. You want to monitor with
Prometheus? You can send it to Prometheus. You want to use Jaeger or Zipkin
or another tool? That's fine. So it's really agnostic to what backend analytics
tool you use. My company, for example, Logz.io, offers
a service that can take OpenTelemetry
data from the collector and use it. It really doesn't
matter. The most important thing is that you have a single API and
a single SDK per language for traces,
logs and metrics, and also a unified
protocol, OTLP, that ultimately
will become the de facto standard for transmitting the
telemetry. So use OpenTelemetry. And the
second tip on this is: use auto-instrumentation
to ease the coding. You want to do as little coding as possible to
get the instrumentation done, and auto-instrumentation
exists for many languages, Java, Ruby and more.
In fact, I just wrote a blog post on
auto-instrumentation with Ruby with OpenTelemetry,
you're more than welcome to check it out. In Node.js we also use the
auto-instrumentation that plugs into the frameworks that we work with,
like Hapi and Express. But you have it for other languages too,
like Spring and Log4j for Java, and others.
So it's very common and it significantly reduces
the amount of coding you need to do in order to generate the span data.
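To give a rough idea, here's a minimal sketch of what Node.js auto-instrumentation with OpenTelemetry can look like; the package names are from the OpenTelemetry JS project, while the service name and collector endpoint are hypothetical and should be adjusted to your setup:

```javascript
// Rough sketch of Node.js auto-instrumentation with OpenTelemetry.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'frontend-service',                    // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',           // OTLP/HTTP endpoint of a collector
  }),
  instrumentations: [getNodeAutoInstrumentations()],  // HTTP, Express, Hapi, and more
});

sdk.start();
```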
If you're new to OpenTelemetry or instrumentation,
I wrote a quick guide on OpenTelemetry. Just Google OpenTelemetry
guide or click the URL here on the screen, and I think
it'd be a very good entry point to understanding how OTel works
and getting the links to all the necessary libraries,
instrumentation and SDKs that you need for your system.
So use OpenTelemetry, use auto-instrumentation. You're covered
on the instrumentation side. Now let's move on to tips for
investigating performance issues. And the first,
seemingly very basic but many people don't use it well
enough, is to use the minimum duration filtering to hone in
on the problematic traces that you need to look into. You have a simple filter,
you just ask: please show me only the traces that take longer than 300
milliseconds to execute, and you see only the problematic
ones. You can take it further and implement smart tagging on
your spans, and then you can also filter by tags.
Some auto-instrumentation agents also
add and enrich the spans with useful tags that you
can filter by. So working with the filters will help you
flush out the problematic traces very easily.
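As an example of smart tagging, here's a small sketch using the OpenTelemetry API for Node.js; the attribute names are made up, so pick ones that are meaningful in your own domain:

```javascript
// Small sketch of "smart tagging": attributes added to the active span show up
// as tags you can filter by in Jaeger (attribute names are hypothetical).
const { trace } = require('@opentelemetry/api');

function tagCurrentSpan(tenantId, planType) {
  const span = trace.getActiveSpan(); // span created by the auto-instrumentation, if any
  if (span) {
    span.setAttribute('tenant.id', tenantId);
    span.setAttribute('account.plan', planType);
  }
}
```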
Next, combine and correlate
traces with logs. Logs together with
traces are extremely powerful, and logs contain a lot
more context that can augment what you see from the
traces. And the best practice is to
add the trace id to the logs, very similar to what
we did with the request id. But you don't need to take care
of generating and propagating it, you just need to take it and use it
in your logs, because you have the tracer, the instrumentation
library, taking care of the trace id.
So just embed the trace id as part of your logs and make it very
easy to correlate the traces with the logs.
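Here's a minimal sketch of what that can look like with the OpenTelemetry API in Node.js; the logger is plain console output just for illustration:

```javascript
// Sketch: embedding the current trace id in every log line so logs and traces
// can be correlated (console output stands in for a real structured logger).
const { trace } = require('@opentelemetry/api');

function logWithTrace(message, fields = {}) {
  const span = trace.getActiveSpan();
  const traceId = span ? span.spanContext().traceId : undefined;
  console.log(JSON.stringify({ message, traceId, ...fields }));
}

// Example: this line can now be looked up by trace id in Jaeger, and vice versa.
logWithTrace('fetching user data', { userId: '12345' });
```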
And the last advice around investigation is to look
for common patterns in your timeline.
So this is really the last part of this
talk. I would like to look into a few common patterns
to look for when inspecting your traces
in the timeline view, whether in Jaeger or any other distributed
tracing backend tool.
The first one, which we also saw in our own case, is the staircase
pattern, which often indicates a sequence of serial
calls. When you see this pattern,
you more than likely want to check
if it can be parallelized, something that can run in parallel.
Or maybe you can use some bulk queries or joins on a
database, or something to reduce this sequential
invocation and reduce the overall latency.
It could be an issue in the implementation in the
code. It could be just a matter of misconfiguration, like a thread pool configuration.
So the staircase pattern is very common and
easy to resolve, but definitely can account for a
lot of the latency.
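To illustrate the kind of fix this usually is, here's a hypothetical Node.js sketch of turning a serial call sequence into concurrent calls; the fetch functions are stubs standing in for real downstream requests:

```javascript
// Hypothetical sketch of the staircase fix; stub calls stand in for real requests.
const fetchUser = () => Promise.resolve({ id: 1 });
const fetchSettings = () => Promise.resolve({ theme: 'dark' });
const fetchAlerts = () => Promise.resolve([]);

// Before: serial calls, each one waiting for the previous, the "staircase" in the timeline.
async function loadDashboardSerial() {
  const user = await fetchUser();          // call A
  const settings = await fetchSettings();  // call B starts only after A ends
  const alerts = await fetchAlerts();      // call C starts only after B ends
  return { user, settings, alerts };
}

// After: the calls are independent, so run them concurrently and cut the overall latency.
async function loadDashboardConcurrent() {
  const [user, settings, alerts] = await Promise.all([
    fetchUser(),
    fetchSettings(),
    fetchAlerts(),
  ]);
  return { user, settings, alerts };
}
```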
The second one is the span spike pattern, and
this is the pattern of long spans, especially ones that vary
significantly between executions, like the one we saw in our case.
And this may indicate either slow queries or
caching issues, as we had in our case, or similar issues.
It also helps you focus where you
want to put your investigation efforts.
So when you go about investigating,
you want to inspect and optimize the longest span on the critical path,
as reducing this is likely to
produce the largest benefit on the overall request
latency. So this pattern also helps focus the
efforts on where you'll see the most benefit in optimizing
latency.
The next pattern is a gap between spans, and
such a gap can indicate, like you can see here on the screen,
slow inter-service communication
that contributes to the latency. Or it can also be
a service missing instrumentation. For example, it could be that
you send a query to the database, but the database driver
is not instrumented. In such a case you may want to go ahead and instrument
the database driver to get more visibility.
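For instance, if the gap turns out to be an uninstrumented database driver, a small sketch like this, using the OpenTelemetry instrumentation packages for Node.js with Postgres just as an example, would make that call show up as its own span:

```javascript
// Sketch: registering database-driver instrumentation so the query appears as a span
// instead of an unexplained gap (Postgres is just an example driver here).
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { PgInstrumentation } = require('@opentelemetry/instrumentation-pg');

registerInstrumentations({
  instrumentations: [new PgInstrumentation()],
});
```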
By the way, it's very interesting: actually just this
month, only a couple of weeks ago, Google
contributed Sqlcommenter open source to OpenTelemetry. So now within OpenTelemetry
there's an open source piece
that can automatically generate application-driven
metadata to enrich the queries, so that you can correlate database queries to application
logic much more easily. So check it
out. The next pattern is spans
finishing at the same time, like you can see here on the screen.
And this may indicate a connectivity issue that is
causing timeouts or errors, or some other artificial
constraint like locking issues:
several requests waiting on a specific lock, and once the lock is released,
then all of them finish quickly. In
this case we may want to tune the timeout parameter, or investigate
why things are taking longer than the timeout we
expected them to take, or things like that.
Another important note here about instrumentation is that when you
see such cases, it may also call
for enhancing the instrumentation around the shared resources. So if it is
indeed a lock or, I don't know, a call to a database or
something like that, you may want to increase the instrumentation there
to get better observability into what's happening
there and what is blocking the execution.
So that was
the last pattern. And just to summarize what we've seen:
distributed tracing for us was the
missing piece in our puzzle for battling performance issues
and for observability in general. And if you operate
microservices and a cloud native system, you more than likely
can benefit from it too. If you're looking into
how to go about it, then Jaeger is a great open source option
for the tracing analytics backend, and
for instrumentation it's highly recommended to go with
OpenTelemetry. That's the future-proofed,
standard way for instrumentation, and by the way,
not just for tracing, as I said, but also for logs and metrics.
So definitely look into that, and explore auto-instrumentation
as much as possible to make your instrumentation
as smooth and easy as possible.
So try and enjoy. If you have any questions
or comments, do feel free to reach
out to me here in the chat or the Q and A,
or just reach out to me on Twitter, @horovits,
H-O-R-O-V-I-T-S. Whichever way, I'd be more than happy to
answer questions, hear comments, and get your feedback
and your experience with this. I'm Dotan Horovits
and thank you very much for listening.