Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everybody, my name is Michael Haberman. I am the co-founder
and CTO at Aspecto, and I am
here to talk with you about distributed tracing for Node.js using
OpenTelemetry. So even if you don't know
distributed tracing or OpenTelemetry, this is exactly what
I'm going to focus on: what they are, how they work.
And, to be honest, you don't have to be super
experienced specifically with Node.js,
because we'll focus mostly on the distributed tracing
and OpenTelemetry part. So why did
I choose to speak about this topic? So, I've been working with
distributed applications for the past five years,
and as you can assume, distributed tracing is related to distributed systems.
So distributed systems and microservices are something I've done a lot of,
and for the past two years I've been doing almost only
OpenTelemetry. So I kind of know this space
and I wanted to share my experience with you. So let's
get started. So we will start by understanding
why we need traces. I'll give you a good
real use case, a real example of
when you need traces. We have all sorts of solutions
that we're working with today, things like logs, like metrics.
Why do we need another one? Once we understand why we
need it, we will learn what traces are, how they work and how
to actually implement them. So let me give
you an example. So you're working
in a distributed environment and you
have a scenario where service A cannot
write to a database. And let's assume that you
know that from having a log solution sending
you an exception alert saying: hey, service A cannot
write to DB one. So let's think together.
What can you think of doing at that point? Right? You have
a service that is not able to write to a database.
This is probably a high priority thing. We may have data loss,
and most likely the users are noticing it.
How are you going to understand what's
happening there? So if you look at logs, they most
likely are going to point you to a specific location in your code.
So you'll have a line of code throwing the exception,
and then you'll go to that code base and you would find out
which other lines led to this specific line.
So you kind of play this game where you're trying
to go through the different files, the different components of your code,
and try to figure out what could have led to this specific exception.
Or you are more of a metrics person, and you would go
and say: okay, let's see what's currently happening in
DB one. Maybe there is high CPU, maybe
there is, I don't know, an IOPS issue.
Maybe you have some metric telling
you about it, and maybe it's just an increase in traffic,
right? Maybe I just have way more requests to
service A. And then I need to ask myself,
well, what endpoints in service A are
actually causing a query to DB one? And then I may ask myself:
maybe it's not an HTTP call, maybe it's, I don't know, a Kafka message
that is being sent. So this is kind of
the thought process you are going through when you have
issues in microservices, in a distributed environment.
Let me try to illustrate that a bit. So we have service
A and we have DB one. We know that,
but that's about all we know. Maybe we have two
services producing messages to service A. And then the
question is: is it only the communication between
service B and service A that is causing this issue?
Or maybe it's service C, or maybe it's both.
So in some way,
when we looked at logs, went to our
code base and started to go through the path that
the code execution took, that is very efficient to do within
one process, within one service. But across
multiple services, it's hard to do. It's hard to jump
between services and understand how they interact with one another.
This is basically traces. So the log told
us: hey, this is the situation of a specific process,
a specific service, and this service is unable
to do a specific action in a line of code. The metric
told us kind of the overall situation of
the system. It told us that maybe DB one had high CPU.
The trace is telling us the story between the services.
It's telling us what path this specific
API call took. Maybe it was service B,
service A, DB one. And it kind of gives us the
context between the services within the entire system.
So we're probably going to say that
we need all three. We need logs, metrics and traces
in order to understand how an
incident occurs. So let me give you
a quick look at what a trace could look like.
So here you can see a system that presents
traces, and you can see right here that
this whole trace is starting
from an API call to purchase order
in the order service. And the next thing that is
happening is that we are calling user verify in the user service.
Then we have this API
call to an external API.
Then we have some save interaction
followed by another service that writes
to the database and eventually a Kafka message is being
produced. So I have this view kind of telling me the
map and this view kind of telling me the timeline, what happened in parallel,
what happened in sequence. And as
expected, you could click something and then get the overall
data: what was the request, what was the response. And if we're
talking particularly about Kafka messages, as I mentioned before,
it may not be an HTTP call, it may be
some messaging protocol such as Kafka, and then you want to be
able to kind of correlate
between both the producer and
the consumer. So basically, for me, a trace
is mostly this view, this
tree view, a child-parent relation that
kind of tells you: your request started at this
point, then it got to service
A, or the order service, and then the user service. And basically
this is going to tell me what the interactions were between
the different services. So that's tracing
for me: the ability to see a particular API call
being propagated throughout the different services.
And this is kind of a magic thing, and the
way that it works is, I think, very interesting from
a development point of view. So, OpenTelemetry:
this is the standard way to collect traces. OpenTelemetry
can collect other stuff like metrics and logs, but it is most mature
in traces. Oh,
maybe before I start to explain what it is:
OpenTelemetry is an open source project, of course under
the CNCF, the Cloud Native Computing Foundation.
This is the foundation that is also responsible for Kubernetes, for instance.
So it's in good hands. So the process goes
that you implement an SDK within the code, within your
process, within your microservices. And then
OpenTelemetry is going to collect the data,
collect the traces, and then ship them somewhere so
that you'll be able to have this tree view. And this
is going to be a parent-child relation between
all the different hops. And as you can see here, we have
service A, service B, and a database. You can see that both
services have OpenTelemetry installed in them.
And by this I mean we took the OpenTelemetry SDK
and we actually installed it within the service. And what
happens is that when service A and service B are communicating,
it's very easy, just like with logs, for service
A to just report what happened and for service B to just report
what happened. But we don't want just a report of the
event that, hey, I got an API call. I want something
a bit more sophisticated. I want to know that when service
B is being invoked, it was invoked by service A.
For that, what OpenTelemetry is doing is,
when you send an API call from service A to service B,
OpenTelemetry is going to inject the OpenTelemetry context.
What does that mean? It means that when service A
is sending an API call to service B, it's going
to leave like a breadcrumb that is going to say:
hey, I was the one that sent you this message.
So when you're reporting whatever happened within service B,
please report it as a child of what happened
in service A. So all of those are
going to be shipped into a backend, and let's
call it a tracing backend for simplicity purposes, and
let's see what is being reported.
So service A is going to say:
hey, I sent an API call to service B, and
it's going to say this is trace id number one. So every trace
has its own id, and every interaction,
every hop between services, any action taken
within the trace, we are going to refer to as a
span. So here we are just reporting: this is span
id 55, and we don't
have any parent because this is the root.
Then the 55 span id is going
to be injected into the headers sent to service B,
and service B is going to say: hey, I got an API call from
service A. It is still trace id
one. I am span id 66,
but I have a parent, and unlike what is written
in the presentation (that's a mistake), I do have
a parent, and the parent is 55. And then
when the span reported
by service B says that it's writing to the DB or querying the DB,
then again we have the same idea. We're reporting
the trace id and who our parent is. And by reporting
this parent-child structure, eventually we're able
to render in the UI how this trace looks, in the nice
tree view that we saw earlier.
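To make that parent-child reporting a bit more concrete, here is a rough sketch in plain JavaScript. The objects below are illustrative only (real spans carry timestamps, attributes, status and hex-encoded ids, not the short ids reused from the example above), and the traceparent header mentioned in the comment is the W3C Trace Context format that OpenTelemetry propagates by default:

```js
// Illustrative span shapes only, reusing the ids from the example above.
const spanFromServiceA = {
  traceId: '1',            // every span in this request shares the same trace id
  spanId: '55',
  parentSpanId: undefined, // the root span has no parent
  name: 'HTTP GET -> service B',
};

const spanFromServiceB = {
  traceId: '1',            // same trace
  spanId: '66',
  parentSpanId: '55',      // "I was invoked by service A's span 55"
  name: 'HTTP GET (server side)',
};

// The "breadcrumb" itself: with the default W3C Trace Context propagator,
// service A's outgoing request carries a header shaped roughly like
//   traceparent: 00-<32-hex trace id>-<16-hex parent span id>-<flags>
// which is how service B knows which trace id and parent span id to report under.
console.log(spanFromServiceA, spanFromServiceB);
```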
Okay, so this is how it works. So when
do you use it? We use it mostly when we have production issues.
We can use it in other places as well.
We can use it while developing, we can
use it in our staging environment, but mostly in production.
I actually wrote a cool open source project about
how to use traces in your testing, like how to utilize
traces in your testing. So you can do a lot of stuff.
But the common use case would be: how
do I understand what failed, how do I
improve something that works slowly, how do I understand whether the
system is working as expected
or not? So that would be the common use case. But if I try to
give it like a bigger name, I would say that we are trying to improve
our MTTR, MTTR being mean time to
resolve, or recover, or something starting with
R, meaning that the problem no longer exists.
So that's what we're trying to do.
We're trying to solve things faster, and having this
cool image telling us: oh, service A called B, which called
C, and we have this specific
indication of where the error happened and what led to that,
that's what's going to help us solve things very fast.
So I've spoken quite a
lot and I really want to show you how
it looks in real code. Like what do I need to do in
order to have OpenTelemetry implemented in my code tomorrow?
So let me give you a quick look.
So here I have two services.
I have my user service and the user service
is doing something really simple.
But let's start with the item service. The item service has
a data endpoint, and
what it basically does is call the user
service that we saw a second ago, and
we respond with that data. So the item service is calling
user and then responds with the data. If
something doesn't work in slash data, so for
instance if I put fail in my query string,
what will happen is I will respond with an error.
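As a rough sketch (not the exact demo code), an item service endpoint like the one described could look something like this; the port numbers, the axios client, and the user service route are all assumptions:

```js
// item-service.js -- a minimal sketch of the /data endpoint described above.
const express = require('express');
const axios = require('axios');

const app = express();

app.get('/data', async (req, res, next) => {
  try {
    // Simulate a failure when the request is sent with ?fail in the query string.
    if (req.query.fail !== undefined) {
      throw new Error('really bad error');
    }
    // Call the user service and respond with whatever it returned.
    const { data } = await axios.get('http://localhost:8080/user'); // assumed port/route
    res.json(data);
  } catch (err) {
    next(err); // let Express answer with a 500
  }
});

app.listen(8081, () => console.log('item service listening on 8081'));
```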
And you can see here that I did two specific
things around OpenTelemetry, which I'll explain
in a second. The user service is also
doing a very simple thing. It gets an API
call, it communicates with some mocking solution,
randomizes a number according to the length of the array
that we got, we report this number to OpenTelemetry,
and then we just respond with
that, and that's all good. Both services
are importing a file called tracer and just
provide the name, user service.
And also the item service is doing
the same. So basically that's all you
need to do when it comes to OpenTelemetry. You just
need to have a single file. You will see that the installation
of it is quite simple: the code within the tracer, and that
is it. All the other interactions that I show
you are not mandatory, but you can definitely go ahead
and add them if you would like to. So, the tracer.
The tracer is actually very simple.
So basically OpenTelemetry collects
what's happening in your service and is then going to send it somewhere.
So the Jaeger exporter is going
to export data to Jaeger. We will see
Jaeger in a second. It's an open source tool that can visualize
traces. So it can either be something you spun
up locally or some production endpoint
that you're using, or you choose to use a vendor, and then
you'll get a few more features than Jaeger and you don't need to
operate Jaeger by yourself.
So we're telling OpenTelemetry where we're going
to send the traces. Then we're going to tell
OpenTelemetry: when you send those traces,
those spans, to be more accurate, please indicate
that this is the service name, so we will be able to distinguish between
services. Then a few kinds
of generic setup. Here you are specifying what
kind of libraries you want to
be able to instrument, to collect data from. So here
I went with a simple list of HTTP and Express, but you
can have a lot of other
types of instrumentation, like Kafka,
Mongo, Redis, the AWS SDK, you name it.
Most likely there is an instrumentation for your need.
Basically, instrumentation means: please collect data
from this library. So here specifically we're talking about the HTTP
library, the Node native HTTP and the
Express one. This is it. This is
all there is to it. Everything you will see is
going to work based on that.
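As a sketch of what that tracer file could look like (this is an approximation based on the description above, not the exact bootcamp code; package names and options can shift between OpenTelemetry releases, so treat it as a starting point):

```js
// tracer.js -- one file per service, exporting a setup function that takes the service name.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

module.exports = (serviceName) => {
  // Tag every span with the service name so we can distinguish between services.
  const provider = new NodeTracerProvider({
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    }),
  });

  // Where the spans go: a Jaeger endpoint (local here, could be a vendor instead).
  provider.addSpanProcessor(
    new SimpleSpanProcessor(
      new JaegerExporter({ endpoint: 'http://localhost:14268/api/traces' })
    )
  );
  provider.register();

  // Which libraries to collect data from: the native http module and Express.
  registerInstrumentations({
    instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
  });
};
```

Each service then just requires this file before anything else, for example `require('./tracer')('user-service');` at the top of its entry point.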
I am running two services. I did yarn
users to start the user service and I did yarn items
to start the item service. So let's
go and have a quick look at what happens when
I'm sending an API call to data.
So I'll go to Jaeger. This is Jaeger,
and let me fetch the latest trace. So this
happened right now. And when I'm clicking on it,
you can see that we actually called data.
And you can see this is under the item
service. And then we communicated with
the user service. So you can see here that we
sent an API call to our mocking service.
And you can see everything that you
would like to see that is going to tell you what really happened.
Now this trace is not that
interesting because everything is local,
the communication is quite simple, but this is how you
will be able to debug whatever is working
or not working. Do you remember?
Let me even show you that again. So in the
user service, we're getting an array and we are randomizing
a number. So I got someone
from Madrid; let me refresh that. So now I got someone
from London, and I want
to understand why this thing happened,
why this data
was randomized. So what I did here:
I got the current span, the active span actually handling
this code, and I just wrote a note: hey,
a number was randomized, and this was the number that I randomized.
And you can actually see it right here. This is kind of
a log, right? This is kind of allowing me to send logs within my
trace. So it kind of puts
them together. I can see what happened between services,
but I can also attach to those spans what
happened within the service. So this is a cool
trick that you can use: addEvent.
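A sketch of that trick, assuming an Express handler like the user service described above (the array contents, route, and attribute names here are made up for illustration):

```js
// Inside the user service -- attaching a note (an event) to the span that is
// currently handling the request.
const { trace, context } = require('@opentelemetry/api');

app.get('/user', (req, res) => {
  const users = ['someone from Madrid', 'someone from London']; // stand-in for the mocked data
  const index = Math.floor(Math.random() * users.length);

  // Grab the active span created by the HTTP/Express instrumentation and add an event to it.
  const activeSpan = trace.getSpan(context.active());
  activeSpan?.addEvent('user randomized', { randomizedIndex: index });

  res.json({ user: users[index] });
});
```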
It's very useful. Now let's do
something else. Let's make it fail.
So when I am running with the query string fail,
let's look at the code. When I have the query
string fail, I'm throwing a really bad error, and
I am doing a very interesting
thing. So what I'm doing here is I'm fetching
the current span, and in my logger,
in my console, I'm actually writing
what the current trace id is. So assume that
in your production environment you probably have some log
solution, something like Kibana or so, and you have
an exception. Now that's cool. But now
I want to visualize this specific exception,
not only in logs, but also in traces. So you can see here
I'm printing critical error, and here I have my critical error,
and I have my trace id. So I'm going to copy
that, go back to Jaeger and just paste
it right here. And now I can see the
specifics of this error, and I can see
all the different things that are related
to this specific action.
So we are kind of tying together the
logs with the traces, so we'll be able to
jump from one to the other. So you can
add to your logs the current trace id or span id,
and you can also add to your spans
something similar to logs.
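A sketch of the log-to-trace direction, again assuming the OpenTelemetry API is available in the handler (the helper name and message are mine, not from the demo):

```js
// Print the current trace id next to an error log so you can paste it into Jaeger.
const { trace, context } = require('@opentelemetry/api');

function logWithTraceId(message) {
  const activeSpan = trace.getSpan(context.active());
  const traceId = activeSpan ? activeSpan.spanContext().traceId : 'no-active-trace';
  // In a real setup this would go to your log solution (Kibana, etc.), not just the console.
  console.error(`${message} [traceId=${traceId}]`);
}

// e.g. inside the failing /data handler, right before throwing:
//   logWithTraceId('critical error');
```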
So this is all you need to do.
This is everything there is to know.
And I would urge you to kind of
go and try it, because it's a really simple
bit of code that you can use to get started
and see what you're getting from it. If you are interested,
the code examples are available right here.
So on GitHub, aspecto-io/opentelemetry-bootcamp,
you can grab the first
episode of the bootcamp. That's almost exactly the demo that
I showed you, and that will
get you started with OpenTelemetry quite fast.
So I really hope you enjoyed this talk, and if
you have any questions, feel free to reach out. And best
of luck with having traces.