Transcript
This transcript was autogenerated. To make changes, submit a PR.
Thanks for being here for this talk, an introduction to OpenTelemetry tracing.
I'm Nicolas Fränkel. I've been a developer
or an architect or whatever for a long time,
more than 20 years nowadays, and for a
couple of years I've also been a developer advocate.
So given my history,
I can tell you about what happened a long time
ago, perhaps even
before some of you knew about it.
So in the good old days, or not so good,
we had monitoring. And monitoring
meant that you had a bunch of people
sitting in front of a huge screen full
of widgets, with dashboards,
with graphs, with whatever. Sometimes they even had
an additional screen in front of them and they were actually
monitoring, like visually monitoring
the situation. Sometimes you even had
alerting, or alerting came afterwards.
Basically, monitoring meant you had low-level
information about your information system, and
you had to have people who could realize
that, hey, this slight anomaly
here and this slight anomaly there meant that
actually something bad was happening.
It was working more or less, but then
systems became more distributed.
You can do this kind of stuff on
a huge monolith, but when you have
components starting to interact across the network,
with different machines, with different nodes,
then having this kind of insight
into the system through experience becomes very, very
hard. And so monitoring
gave way to something called observability.
So you can think about observability as the
next level: it provides insight
not only into one single component, but into a
distributed system. And there are
three main pillars
of observability: the first is metrics,
the second is logging and the third is tracing.
And though I will focus this talk on tracing,
I still need to introduce the other two because
they make a lot of sense and you probably need to be aware
of them anyway.
So as I mentioned, the first step toward monitoring was to have metrics. And it was easy
because nearly all operating systems give
some way to expose data about how
they operate. So you can have the CPU usage,
the memory, the swap, the number of threads
or whatever. And this is very easy to get.
But as I mentioned, if your system becomes more complex
and involves more machines, those alone are not enough.
So nowadays we still rely
on those metrics, but we want to have higher-level metrics.
They can still be technical metrics, such as
the number of requests per second or the HTTP
status codes, or they can be business
metrics, such as the number of orders, this kind of stuff.
Now comes logging. Logging is another level of
complexity. There are
different questions you need to solve. The first question is
what to log. For example, in the past
I used a Java agent
to log whenever I entered a method and whenever
I exited a method. So when I read the log
files I could see which steps were
executed, and I could do some kind of debugging in production
just by looking at the logs. This is good,
this gives me some insight. Actually it's more like tracing,
but it doesn't tell much about
the business side of things. So again,
perhaps we want to log more interesting stuff, such
as who is connected, what their session id is,
how many items they have
in their cart, this kind of stuff. This is much harder,
and it requires manual instrumentation:
you need to have a requirement and a developer
actually writing the code to log this. Of course,
with auto-instrumentation you cannot log every
parameter of a method by default, because some of them might
be a password and you don't want to log the password.
So auto-instrumentation, in the case of logging,
is easy but doesn't provide much value,
and can actually be harmful to the operation.
Then there are the logging formats. For a
long time we used whatever logging format the
framework provided us, whether you are using SLF4J,
Log4j, whatever.
But nowadays it's perhaps better to emit
JSON directly, so that when you send the logs to another
system which expects the JSON format, you don't
need an intermediate operation that
transforms the human-readable log into JSON.
Print it directly in JSON format
and you avoid one additional, potentially
expensive operation.
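As a made-up illustration (the field names are mine, not from the talk), here is the same event as a classic human-readable line and emitted directly as JSON:

```
2024-05-04 10:42:03 INFO  CartService - user 42 added 3 items to cart
{"timestamp":"2024-05-04T10:42:03Z","level":"INFO","logger":"CartService","message":"user 42 added 3 items to cart"}
```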
Then where to log is also important. When I started coding,
I was explicitly forbidden to write to the
console: if you use
Sonar or any other quality tool,
it will tell you, hey, it's forbidden to use System.out.println
or System.err.println,
you don't write to the console. Nowadays,
however, with containers you probably will want
to write to the console, so that the logs can be scraped and, again,
sent further.
Logging on a system is good when you've
got a single component, or perhaps two components.
As soon as you've got a distributed system, you need to aggregate
the logs together so as to understand, again,
what happens across all those components.
Therefore you need to get all
the logs from all the components into
a single place, into a centralized logging system.
And that raises one more question you
need to ask yourself: should I send the
logs, meaning that
my application will lose some performance
sending them and might actually crash,
or do I expose only the logs so
that another component can scrape them?
This is how Prometheus works, for example:
you add some endpoint and Prometheus will scrape the
metrics. Loki will do the same.
I mentioned parsing the logs: again,
it's better to write them directly in the format that
the centralized logging system can exploit than
to add an additional operation
in the middle. Then you need to store the logs, of course.
Then you need to somehow search the logs, because,
well, just having the logs one by one is not interesting.
So you need to search them according to some filters, for example a
timestamp, a transaction id or anything.
And then you need to display them in a certain way to
look for the interesting bits, for example
the component that produced them. And I didn't write it,
but of course you need to somehow get rid of the logs afterwards.
Otherwise your logging system will grow and grow and grow and,
well, you probably will have your disk full in no time.
You probably used one of those systems in
the past. I've been using Elasticsearch a lot, it's quite useful,
but any other system will
do. Then comes the third
pillar, which is tracing. There is the Wikipedia
definition, and I love Wikipedia definitions, but in my opinion
this one is not a great one.
So I've come up with my own. You will excuse me: probably it's
not really my own, I got inspired by lots of people that
I don't remember, but, well, credit to them.
So basically, tracing is a set of techniques and tools
that help follow a business request across
the network through multiple components.
This is really, really important. Again,
in a distributed system you've got multiple components.
Your business transaction will require
all those components to work together to achieve
the business goal. If one of them fails,
you're in trouble. And if it's repeatedly the same
component that has an issue, you need to be aware of it. So tracing
the request across all those components is important.
You probably are aware of some
of the already existing tracing pioneers:
there is Zipkin, Jaeger,
OpenTracing. Those are the three
most widespread and famous.
Every one of them has proprietary
interfaces and implementations. But as you
know, we want things to be standardized, so that
whatever software you write,
libraries can target
one standard. And for that there is something called
the W3C Trace Context specification.
It's quite easy to read, so you can already read
it at home. And if
you don't have time, let me summarize it for you.
Basically you've got two important concepts. The first concept
is the idea of a trace, and a trace is
actually the abstraction of the business
transaction. So the trace will go from the entry
point to the most deeply
nested component and back again.
That's the trace. And then you've got a span. A span
is the execution
of one part of the trace in a single
component. And of course, for each
component you can have multiple spans, if you are interested
in where the flow of the request goes
inside this component. Each
of those spans is bound to another in a
parent-child relationship. The first component is
the important one: it generates
the first span id. Then
the next component will check the span
id of the parent and will say, hey,
I am this span id. So it also generates a new span
id and says, hey, I'm bound to this one. And through
all those parent-child relationships you
can actually make sense of the flow and check
the flow overview at a glance.
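For example, the W3C Trace Context travels in a traceparent HTTP header. The values below are the spec's own example:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace id, shared by all spans    |                trace flags (01 = sampled)
             version                             parent span id, i.e. the caller's span
```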
OpenTelemetry is a set of tools that
implements observability
and that actually relies on the
W3C Trace Context.
So it's an implementation and more:
it's compatible, and it also goes beyond,
because W3C Trace Context is good for web stuff,
but perhaps you want
to trace a request through, I don't know,
Kafka, somewhere where you can store messages, and
in this case the specification does not
handle it. OpenTelemetry allows you to trace a
business transaction across many different components,
not only web ones.
It's a merge of the OpenTracing and OpenCensus
projects. So it's one of those few merges that
were successful, where people decided to join their efforts
to create something better. It has become a CNCF project,
which is good, meaning that it has support: it will be supported for
a long time. It's licensed under the Apache
license, so it's also good if you want to use it right away:
you don't need to acquire a license or whatever.
And it's popular,
especially on GitHub. The architecture
in itself is quite easy. Basically you've got
components, whatever they are,
and then you've got what they call an OpenTelemetry
collector. And this OpenTelemetry collector
accepts data in a specific format,
and this specific format keeps the parent-child
relationship between spans. So the idea
is that on the client side you've got
stuff that dumps data into the OTel collector,
and then you've got something that is able
to search and display data from the OTel
collector. The OpenTelemetry collector
in itself doesn't provide anything more; it just
stores the stuff in a certain format.
So we need something afterwards. OpenTelemetry provides
a dedicated collector, but actually
Jaeger and Zipkin, which are also
tracing providers, are able to provide
the same, or let's put it
that way: they also provide a collector which accepts
OpenTelemetry data. So basically what they did is
keep their storage engine and add a new interface where
you can send your data in OpenTelemetry format.
If you already have your tracing
architecture, you can easily move to OpenTelemetry because,
well, the collectors of Zipkin and Jaeger
have this additional interface. You just need
to change the format and the port,
because I think each of them uses a different port.
On the client side it's,
well, I wouldn't say easy, but the first
step is straightforward. The first step is auto-instrumentation.
This is only available when you've got a runtime.
For example, on the Java side
you have the JVM, that is a runtime. On the Python side,
well, Python is a runtime. Node.js is a runtime.
In this case you will delegate the
auto-instrumentation to the runtime. I told you about
automatically logging entering and
exiting a method: it's exactly the same here.
It will be done automatically.
It already gives you a lot of insight. Now if you want to
go further, you can actually get
the library for your tech stack.
Again, you can check the OpenTelemetry website: you will notice
there are lots and lots of stacks that are supported out of
the box. For Java there is one, for Python
there is one, for Rust there is one.
Whatever floats your boat, you will probably find it there.
And then you can either call an
API or use some annotations.
As I mentioned, auto-instrumentation is
very easy to do. You don't need to couple
your application to OpenTelemetry. It's low-hanging
fruit, so you should probably do it right away.
If you are running a distributed system,
it will give you a lot of insight into your application.
I mentioned it's a practical introduction, so
let's finally do some practical stuff now that we have
delved into the theory. Here is my use case.
My use case is, well, simple for
a real application, but a bit more involved
for a demo. At the beginning there is an API gateway;
I'm using the Apache APISIX gateway.
It forwards the request
to my main application, which is
a Spring Boot Kotlin application. It
provides a products API:
it has the details for the product
itself, but it relies on two other components, one
for the pricing, where the pricing is implemented in Python through
a Flask application, and one
for the stocks, so how many items
do I have in which warehouse; for that I have created a Rust application
using the axum framework.
So the entry point is actually the reverse proxy,
the API gateway. Most information
systems have such an entry point. You probably never expose your
application directly over the Internet. You have something
in between, because, well, you want to
protect your information system from illegal access.
So that's the most important part. As I mentioned, I'm using
Apache APISIX. Perhaps you don't know about Apache
APISIX. It's an Apache project, so basically,
again, good for maintenance: everything will
be there with a license that will never change.
It's based on the very successful Nginx reverse proxy. Then you've
got LuaJIT, an
additional OpenResty layer on top which allows you to do
scripting in Lua over Nginx,
and then you've got out-of-the-box plugins.
So to configure it, there is this general configuration.
As I mentioned, Apache APISIX has a
plugin architecture. So here I say, hey, I will be using OpenTelemetry;
it's an out-of-the-box plugin, you don't need to write any code.
And then you can tell it, hey, this is the
name by which I want to be known in the
OpenTelemetry data, and this is where I will send
the data. I will be using Docker Compose,
and so I have a dedicated Jaeger component.
Then comes the per-route configuration. So here I have
a single one, but you can have a different configuration depending
on the route. You will say, okay, how much
do I want to sample? Normally, depending
on your volume, you probably don't want 100% because it will
overflow your system; you want a sample.
Here, this is a demo, so I will say I want to sample everything.
Again, probably not what you want to do. And then
you can log additional attributes.
So here, for example, I decided, for no reason but for demo
purposes, to have the route id, the request method
and an additional header. So I can pass some headers
from the client and they will be traced
along the span.
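A minimal sketch of what this could look like, assuming the APISIX opentelemetry plugin's documented options; the header and attribute names are illustrative, so check the exact schema against the plugin documentation:

```yaml
# conf/config.yaml: enable the plugin globally and point it at the collector
plugins:
  - opentelemetry
plugin_attr:
  opentelemetry:
    resource:
      service.name: APISIX        # the name shown in the tracing UI
    collector:
      address: jaeger:4318        # the Jaeger container from docker-compose
---
# per-route configuration: sampling and extra span attributes
plugins:
  opentelemetry:
    sampler:
      name: always_on             # demo only: sample 100% of requests
    additional_attributes:
      - route_id
      - request_method
      - http_otel-key             # a custom header forwarded by the client
```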
The next step is the JVM level. As I mentioned,
the JVM is a runtime, so I can easily
use auto-instrumentation.
And on the JVM, auto-instrumentation is done through a Java
agent, so this is quite easy actually: I just need to
pass the Java agent when I start
the application and I don't need to write
anything. So your developers are completely
isolated from this tracing concern.
They can write their code and everything will work as
expected, regardless of the language and the
framework, because it works at the JVM level.
So here is how it works. Here is my Dockerfile
to build my Docker container.
This is a multi-stage Dockerfile.
First I compile everything through a JDK, and afterwards
I run it through a JRE, because I don't need a JDK at runtime and
it's actually bigger and less secure. So the
first thing I do is a normal standard build,
and then afterwards I take the
jar that I just built and I add the
Java agent, downloaded from GitHub. And when
I run it, I actually run it through the Java agent.
And this is as simple as it gets; it cannot be
simpler.
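A minimal sketch of such a multi-stage Dockerfile; the base images, build tool and jar name are illustrative, but the agent download URL follows the official GitHub releases pattern:

```dockerfile
# Stage 1: build the application with a full JDK
FROM eclipse-temurin:21-jdk AS build
WORKDIR /app
COPY . .
RUN ./mvnw --no-transfer-progress package -DskipTests

# Stage 2: run it on a smaller, more secure JRE
FROM eclipse-temurin:21-jre
WORKDIR /app
COPY --from=build /app/target/catalog.jar .
# Download the OpenTelemetry Java agent from its GitHub releases
ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar .
# Attach the agent at startup: no change to the application code
ENTRYPOINT ["java", "-javaagent:opentelemetry-javaagent.jar", "-jar", "catalog.jar"]
```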
Afterwards you can do more precise, more fine-grained
instrumentation through manual instrumentation. It needs
an explicit dependency in the application;
this time your developers need to be aware of it.
And then there are two ways to do that: either through a regular
explicit API call or through annotations.
I'm benefiting from Spring Boot,
so basically I will use annotations. I will show the code
just afterwards.
Okay, now it's time to delve into the code. I will just
focus on the Kotlin Spring Boot part.
Everything is on GitHub, so in case you need to check, you can check the
Python part, you can check the Rust part.
Here I will focus on the JVM part.
I've created my application on
the Spring Boot starter, start.spring.io. So here
I'm using the latest versions of the tools.
I'm using the latest LTS version of
Java, which is required by the latest version of
Spring Boot. I'm also using the latest version of Kotlin, and
it's a reactive application. So I will be fetching
data through Spring Data R2DBC, and otherwise I'm using
WebFlux, again to be reactive; the rest
is just standard Kotlin stuff. I didn't
want to bother myself with a regular database, so I'm using
H2, with the R2DBC H2 reactive
driver. On the code side itself,
I'm using Kotlin, and I want to use coroutines
because, well, that's how you can easily write reactive
code. So I'm using the CoroutineCrudRepository;
this is my R2DBC repository.
Here I have a handler, and you can see that
I have suspend functions. Suspend functions
are for coroutines in Kotlin.
Then I have one endpoint and
another endpoint: this endpoint is for all products,
that one is for a single product.
Let's see the first one; the rest will be
exactly the same. So I will fetch all products, I will
find all of them in the repository, which in
turn will look into the H2 database, and for every
one of them, which is probably what you shouldn't do in real life,
I will fetch the product details: their price and their availability in
stock. So here I can see how it works.
Again, I will have two different
calls, wrapped in a nested block,
which means that here, because I'm using Dispatchers.IO,
they will make the calls in parallel, and
we can check in the traces that it works like this.
So I will get the price, I will get the stocks, then I
will merge everything. And here is how I
merge everything: I transform the data into
the expected shape, and at the end I
create a product with details, including the product from
the database in the catalog, plus the price,
plus the stocks, which I have changed a bit.
For example, I don't want to return to the client any
warehouse where the quantity is zero,
because, well, that doesn't
make sense. So I just filter them out.
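A condensed Kotlin sketch of this handler logic; the client interfaces and names are hypothetical stand-ins, not the repository's exact code:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope

data class Product(val id: Long, val name: String)
data class Stock(val warehouse: String, val quantity: Int)
data class ProductDetails(val product: Product, val price: Double, val stocks: List<Stock>)

// Hypothetical clients for the Python pricing and Rust stock services
interface PricingClient { suspend fun priceFor(id: Long): Double }
interface StockClient { suspend fun stockFor(id: Long): List<Stock> }

class ProductHandler(private val pricing: PricingClient, private val stock: StockClient) {

    suspend fun fetch(product: Product): ProductDetails = coroutineScope {
        // async on Dispatchers.IO: both remote calls run in parallel,
        // which is exactly what shows up later as overlapping spans in the trace
        val price = async(Dispatchers.IO) { pricing.priceFor(product.id) }
        val stocks = async(Dispatchers.IO) { stock.stockFor(product.id) }
        ProductDetails(
            product = product,
            price = price.await(),
            stocks = stocks.await().filter { it.quantity > 0 } // drop empty warehouses
        )
    }
}
```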
And at the end I'm using the beans DSL
from Kotlin and the router DSL, or here the
coRouter DSL, to assemble everything. And I start
my application with this beans method. So even if
you are not familiar with Kotlin, if you are a Java developer, I think
it should speak to you. And here you can see my two endpoints:
/products for all products and /products/{id}
for a single product.
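Assembled, it could look roughly like this with Spring's Kotlin beans and coRouter DSLs; a sketch, with placeholder handler bodies instead of the real repository calls:

```kotlin
import org.springframework.context.support.beans
import org.springframework.web.reactive.function.server.ServerRequest
import org.springframework.web.reactive.function.server.ServerResponse
import org.springframework.web.reactive.function.server.bodyValueAndAwait
import org.springframework.web.reactive.function.server.coRouter

// Hypothetical suspend handlers; the real ones delegate to the repository
suspend fun allProducts(req: ServerRequest): ServerResponse =
    ServerResponse.ok().bodyValueAndAwait("all products")

suspend fun oneProduct(req: ServerRequest): ServerResponse =
    ServerResponse.ok().bodyValueAndAwait("product ${req.pathVariable("id")}")

// The beans DSL registers the router; coRouter is the coroutine-aware router DSL
val appBeans = beans {
    bean {
        coRouter {
            GET("/products", ::allProducts)     // all products
            GET("/products/{id}", ::oneProduct) // a single product
        }
    }
}
```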
Now I assemble everything through Docker Compose. So this is the Docker Compose file.
I'm using Jaeger.
Jaeger is available in multiple flavors; you can have
different containers. For example, here I'm using the all-in-one,
the batteries-included
Docker image in this case.
So I already have the OpenTelemetry collector
provided by Jaeger. So it's
not the OpenTelemetry collector;
it's the Jaeger collector that offers an OpenTelemetry
interface. So here I don't need to think about the architecture of
Jaeger, I'm just using the Docker image that does everything.
Then I'm using APISIX, because I want to protect my
services. Then I have the catalog, which is the
Spring Boot Kotlin application that I've just shown. And here
I need to set several configuration
parameters. The first one is: where does the
Java agent need to send the data? Well, to Jaeger,
on this port. How will
it label this component? Here it will be called orders,
which is wrong, it should be catalog. Then,
do I want to export metrics? Here I said no,
I don't want to. Of course, depending on what you want to do, you can
also export metrics, and the same goes for logs.
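In docker-compose terms, the configuration just described might look like this; the image version and service layout are illustrative, but the OTEL_* variables are the standard OpenTelemetry SDK ones:

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:1.50  # batteries included: collector + UI
    ports:
      - "16686:16686"   # Jaeger web UI
      - "4317:4317"     # OTLP gRPC endpoint, where agents send their data
  catalog:
    build: ./catalog
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://jaeger:4317  # where the Java agent sends data
      OTEL_SERVICE_NAME: catalog   # the label (the talk accidentally used "orders")
      OTEL_METRICS_EXPORTER: none  # traces only in this demo
      OTEL_LOGS_EXPORTER: none
```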
And now pricing: I do the same for the Python application,
but again, this is not relevant for this talk. So here,
pricing, same stuff for the Python application,
not relevant for this talk. Stock, same thing
for the Rust application, not relevant for this talk. Let's start
this architecture. It might take
a bit of time to start, especially with the JVM,
so I will just speed up the time
and let's go very fast.
Okay, the logs tell us that it has
started. We can check with
docker ps.
So here it seems that everything has started: we've got the
catalog, the pricing, the stock and Jaeger.
Now we can issue our first curl.
So, curl: I will be using the header that
I have configured Apache APISIX for; if
I remember, it's otel key.
Then I can say, let's say, hello world,
because I have no imagination, and I'm
on localhost:9080,
which is the Apache APISIX
default port, and we'll say /products.
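Something like the following; the otel-key header name is my assumption from the talk, and as we will see later, the actual configured name differed:

```shell
curl -H 'otel-key: hello world' http://localhost:9080/products
```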
It takes a bit of time because it will go
through all our systems. So here you can already see the time the catalog
has taken, then the one of the stocks, the pricing,
and Apache APISIX as well.
So we've got the response, which is not very
interesting as it is, but it still gives you the
data you asked for.
Now the idea is to check the traces.
So I'm now on the Jaeger web UI,
and I can check, and I will go exactly here.
And here we can see all the microservices,
so there are some traces. Here I need to
refresh, because I need to find the traces, and here
we have single requests, because we sample everything.
Here we have it.
And we can see, with only auto-instrumentation,
a lot of interesting data already. So we can see
that we have our APISIX, which is the entry point,
and here we have the orders, which I misnamed, it should be called
catalog, but here it's orders. And here
we have the product; here we have the first auto-instrumentation
inside of product, because we are using Spring Boot.
We have lots of proxies inside, you know how Spring Boot works.
And here it decided, hey, here I make a call through
a proxy, I will trace it. Here we have the
findAll. Why? Because it's an interface provided
by Spring Boot; we didn't provide the implementation. So basically, again,
it's a proxy, so it's automatically traced. We can
see here that there is a call to the
other component called stock, so it's
traced as well, which is good. And here we've
got the second one. So here there is one for stock and
one for pricing. And here we see that in
one case I went directly to the component
and in the other one I went through the API gateway.
Both are completely possible. This is how I
configured it inside,
sorry, my architecture. Basically, in
one case you can say, I want to protect everything, so I
always need to get back to the API gateway to do some authentication,
authorization, whatever you want. And on the other side you
say, oh, it's pretty secure, I can go directly.
But it gives you insight into your architecture as well: in case
you misconfigured something, you can check it through the traces.
Something interesting as well: we can see that the
GET calls to the stock and the pricing
are made in parallel, because we use coroutines. So this
is also a good way to check that you actually coded your
stuff correctly. If you see one going after the
other, then probably your code was not right. Though tracing
is not made for that, you can also validate some
of the code this way. And then, as I mentioned,
it's not good to do that, but here, for each of
them I go to the stock and the pricing: stock and pricing,
stock and pricing. And we can check,
for example, that on Apache APISIX
I actually added the additional attributes I configured,
so basically the route id and the request method.
And here I'm missing the otel header,
so probably I didn't use the right header name,
but believe me, it should work.
Now, that already gives us some information
about our flow. But we might
want better: we might want, for example, on the
GET, to see which internal method
we called, with
which parameter. So let's do it.
So now I want manual instrumentation.
It means I need to explicitly couple
myself to the library. Here,
because I'm using Spring, as I said, I want annotations,
I don't want API calls.
Actually, if you check the documentation of OpenTelemetry
in Java, getting an exporter is not that
fun; it requires a lot of API calls.
And, well, I have annotations; Spring Boot is compatible with
OpenTelemetry. So let's use it. So I've added this
additional dependency to my code.
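For reference, with Gradle's Kotlin DSL the dependency would look roughly like this; the version is illustrative, so check the OpenTelemetry documentation for the current one:

```kotlin
dependencies {
    // Brings in @WithSpan and @SpanAttribute; the Java agent picks them up
    implementation("io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations:1.32.0")
}
```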
Now we can check the application itself, the code itself.
We can go here, and here you can see that I've
added this @WithSpan
annotation. This @WithSpan means that
the method will be instrumented and you will find it in the trace.
So I should have this ProductHandler.products span;
if I'm calling one single product, I will have this one.
But it's also possible to capture additional
details. So, for example, here I will have this
ProductHandler.fetch span, but I also say, hey,
not only capture this call,
capture this id.
So: which product id will I fetch?
This is interesting, because normally
I shouldn't need it. Here you see that
the id parameter is not used, because I already
have the product. But because I want to capture
the id, I need this separate
parameter. I wouldn't be able to capture
the product itself as a span attribute, because then I would have
not the id but the whole memory reference,
unless I create a toString or whatever,
which is not a great idea. So here I changed my
method signature a bit to explicitly pass the
id, and then, well,
I don't use it, but it means that it will
be captured by the tracing,
and I will find it, which might be super
useful, especially if the call fails.
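Put together, the annotated handler could look roughly like this; a sketch reusing the Product class from the earlier snippet, with names that do not match the repository exactly:

```kotlin
import io.opentelemetry.instrumentation.annotations.SpanAttribute
import io.opentelemetry.instrumentation.annotations.WithSpan

class TracedProductHandler {

    // Produces a span named TracedProductHandler.products in the trace
    @WithSpan
    suspend fun products(): List<Product> = emptyList()

    // The id is a separate parameter only so @SpanAttribute can record it;
    // annotating the whole Product would record a reference, not the id
    @WithSpan
    suspend fun fetch(@SpanAttribute("product.id") id: Long, product: Product): Product =
        product
}
```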
Normally now, everything should have restarted and we can try
again with this configuration.
So here it's the same request; I've just changed
the header because I missed the previous header:
it was not otel, it was ot. So let's
run this again. We can check that everything works
on the logging side. So here I'm in the catalog,
then I'm in the stock, the pricing, whatever. I've got
the response, and we can check back on the Jaeger
UI how it looks. We expect more details.
So first we can already see that we have more spans than before.
Just to check, on the Apache APISIX side
we can see that now my ot key
header has been logged, which is good.
And then we can see that I have the product here,
and I have the fetch here. So basically
we added additional data: inside the
component we added a couple more spans to understand
how the flow of the code went inside the component,
not only across components.
You can also see that I did the same in Python, so if you are
interested you can check the code. There I'm logging
the query, manually unfortunately,
but I'm logging the query so you can have additional information about what
you are doing. So thanks for your attention.
I hope you learned something. I showed you how you
could use OpenTelemetry, how you could use auto-instrumentation,
how you could use manual instrumentation,
and I believe now you can start your journey.
You can follow me on Twitter; you can follow me on Mastodon
if you are on Mastodon. I previously wrote
a blog post about OpenTelemetry.
It has a much narrower focus, and I've improved the demo
code a lot since, but perhaps you can read the blog
post; it might give you some insight.
If you are interested in everything, so the Python and
the Rust stuff, everything is on GitHub. I will be very
happy if you check it out and if you star it. Just
to note, there is a bit.ly URL,
so basically I can see how many people were interested in the code.
And though the talk was not about Apache APISIX,
if it got you interested in Apache APISIX,
then you're welcome to come and have a
look. So thanks again for your attention, and I wish you
a great end of the day.