Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to this session. I hope you're enjoying the Conf42 conference dedicated to observability, and thanks for joining my session. So what are we going to talk about? This session is all about OpenTelemetry, but in a specific context: event-driven architecture. My name is Henrik Rexed. I'm a cloud native advocate at Dynatrace, and prior to Dynatrace I worked 15-plus years as a performance engineer. Performance is still very much in my heart; that's why I produce content for a podcast called PerfBytes, the one with the red icon. And then, less than two years ago, I started a YouTube channel called Is It Observable?, where I try to provide content for anyone who wants to get started in the observability world.
All right. We're going to talk about a couple of things in the next 20 or 30 minutes. First, as the title says, OpenTelemetry: we'll look at the various components involved in OpenTelemetry and how to produce traces, because today we're going to focus mainly on tracing. Then we'll jump into our topic of the day, event-driven architecture, and see the various ways of instrumenting your event-driven application. You'll see that each approach has some disadvantages, and that's why we'll then jump into span links: we'll explain what span links are and how you can use them.
So before we start, let me tell you a story. When I was a kid, I really loved to draw, to paint, to create things, because my grandma and my mother taught me how to draw. And it was pretty funny, because when I started my first job in the industry, I was assigned to an engagement as a consultant. In that technical environment we were managing different servers and different applications, and my manager said, hey, we need to draw, to visualize, the current behavior of our application. At that time we had a couple of tools like top, nmon and so on. Those tools are quite precise, but they only give you points. So with the help of those points, the only thing I was able to draw was this: basically a list of points. Think of it as very zoomed in, like individual pixels, so it's very hard to understand the entire context of our application. But with that tooling, this was all I could do. Then 20 years ago we improved our solution: we were able to store data, so we had history, and we attached the data we collected to metadata coming from a CMDB, for example, so we had more context. And because I now had stored points, I was able to draw lines. That's when I started to have a shape of the health of my web server.
Then 13 years ago, the industry delivered a couple of great products called APM, application performance monitoring, which gave us, out of the box, distributed traces. Yes, distributed traces were already out there at that time, and metrics, plus some synthetic monitoring, some real user monitoring and so on. With this we had at least a better understanding, so I was able to start drawing the eyes, the nose and the mouth of the actual server I was trying to represent. Much better: at least we could tell what we had here. Then ten years ago, because we were producing so many logs in our environments, we thought: why don't we start utilizing them? So we started parsing the logs, indexing them, and trying to get even more value out of all the effort we were putting into producing logs. In the end we were better off; that's why the server now has arms. But it's still not perfect. Let me show you what I had wanted to draw and show to the project. This is the actual drawing. You can see that at least I had the kid, and the shape of the kid, like a draft.
And the kid was in a forest, or in a park, or in a garden; there was something surrounding the kid. Unfortunately, with the tooling I had, I was not able to draw the full current situation. So why am I telling you this? Because observability is not a science; observability is like an art. You see things, and then you start visualizing them. And for that you need a couple of tools: pencils, pastels, various colors. So let's jump into the various tools we have to build the art of observability. What are the artistic tools? Well, as you know, observability is about understanding our current environment. So for that we need logs.
We talked about logs a couple of minutes ago. We also need events: a system like Kubernetes generates tons of events, but you also have events from your builds, your pipelines, all the solutions that take your application to production. That's a lot of rich information that gives even more context to the current situation. Then we have metrics, which we're used to, and traces, which we talked about; you can see traces becoming very popular. And then we have profiling. Profiling is one of the most powerful signals, because when you look at traces you obviously want to go down to the code, and profiling helps you do exactly that.
But those pillars, those tools, are only great if you add context, because without context the data doesn't make sense at all. So what context do we need? The technology, where it's been deployed, which server, the service; if it's a Kubernetes environment, the deployment file, the namespace, the pod, the version number (very important), and the geo: where is that server located? If you have multi-cloud deployments, maybe also which cloud this server is coming from. In the end, rich context will help you correlate the data you've been ingesting
in your backend. Now, a natural reaction from the market: if you look around, a lot of organizations and a lot of engineers have started to implement observability, but if you pay attention, they were using dedicated tools, very specific to each signal. For example: metrics go to Prometheus, traces probably to Jaeger, and for logs I'll use Elasticsearch. So in the end we have observability, but everything is disconnected, pretty much siloed, so we are not efficient. We need to stop doing this and keep everything together, so that we are more efficient, especially when we need to troubleshoot and understand a given situation.
So that's the purpose of the OpenTelemetry project. For those who have never heard of it, OpenTelemetry is an open standard: it doesn't provide any storage, it doesn't provide you any software. It provides libraries that help you produce observability data in a standard format: a standard format for distributed traces, a standard format for metrics, for logs, and so on. So we will put something in our code, an SDK (there's another component we'll talk about in a few seconds), and with those SDKs we'll be able to produce telemetry data. The framework will add, on top of that, the metadata we're looking for through the semantic conventions: the server, the HTTP request, anything related to our data.
So if we want to summarize the OpenTelemetry project in a very simple way, you have two things. First you have the instruments, the SDKs: I've got a guitar, and I'm going to produce logs, metrics and traces. I can play my guitar on its own, but if I want to propagate my sound or change it on the fly, I'm probably going to use an amplifier. That's the collector. I send my sound to the collector, where I can add some effects, like a chorus, and then with the collector I can amplify the data and send it to any observability backend of my choice. And like any amplifier, you can have a second output, so you can send the data to several observability backends.
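To make the amplifier analogy concrete, here is a minimal sketch of what such a Collector configuration can look like, with one input and two outputs; the endpoints and the backend names are invented for illustration:

```yaml
receivers:
  otlp:                      # the "input jack": apps send OTLP here
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                  # an "effect": buffer and batch the data

exporters:
  otlp/jaeger:               # first output (hypothetical Jaeger endpoint)
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlphttp/backend:          # second output (hypothetical OTLP/HTTP backend)
    endpoint: https://backend.example.com/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, otlphttp/backend]
```

The `exporters` list on the pipeline is what gives you the amplifier's second output: the same trace stream is fanned out to both backends.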
At the moment, OpenTelemetry supports several types of signals. The first one initially supported was traces, which is now stable in most of the languages we use today. Metrics are also stable for most languages, except two of them, PHP and one other I don't remember, to be honest. Next we have logs, which are under construction; we should expect them probably in Q3 or Q4 of this year. Then we have continuous profiling: the specification is on the way, so we should not expect it until next year. So how do we produce traces?
Because today we're going to focus mainly on traces, a couple of things. To produce traces you can go various ways, but we will always need to add an SDK to our code, and there are two approaches. The first approach is manual. You might say, oh, manual sounds quite expensive, but if it's a fresh new application, it's similar to logging: if you've been used to producing logs from your code, why not start producing traces from your code as well? It's a similar journey. But we rarely start from there. Usually we start with the automatic and semi-automatic instrumentation, which instruments the well-known frameworks of the market. For example, Spring in Java will be fully instrumented; if you use Python, popular frameworks like Django will also be instrumented. There are plenty of examples for every language. So if you rely on a framework, there's a good chance that the traces produced will be quite accurate for your use case. But keep in mind it only produces the data; you will still need to send it to an observability backend to store what you produce.
So what is a trace? Good question. A trace is a transaction. Say I perform an action, for example saving an order. That save-order operation will go through different components in my architecture, and that whole journey is the trace. To complete the save order, it goes through different subtasks within my architecture, within my microservices, and those subtasks are the spans. So in the end a trace is very simple: it's a big JSON array of spans. It's a list of spans, and to glue the trace together we need a context. The context holds the trace id and a span id, and with the help of the trace id, that's how we attach all the spans together to make a trace. A span carries different information, like the name and the attributes, but we're not going to go too much into detail.
So if I want to build my traces in Go, or Java, or Node, or whatever language, I will have a couple of objects that are common to all the OpenTelemetry SDKs. First, to build traces, I add the OpenTelemetry API, no problem. From there I can create a tracer provider. The tracer provider is the component that will help me build my traces, and you can see that there are various objects involved: the span processor, the sampler, the exporter and the resource. We will look at those in detail, but in short, the span processor determines how you're going to send the data, the sampler determines how much detail you want to send, the exporter determines where you're going to send the data, and the resource is the identity of the service. And then we have a propagator, which we'll talk about in a few seconds. Once we have defined all this, we can finally create our spans and our child spans.
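To show how those objects fit together, here is a toy model in plain Python. This is not the real SDK (the real classes live in the `opentelemetry-sdk` package and do much more); it's just a sketch of the wiring described above:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)

class InMemoryExporter:
    """Where the data goes: a real exporter would speak OTLP."""
    def __init__(self):
        self.spans = []
    def export(self, span):
        self.spans.append(span)

class SimpleProcessor:
    """How the data is sent: hand each finished span straight to the
    exporter (a batch processor would buffer spans instead)."""
    def __init__(self, exporter):
        self.exporter = exporter
    def on_end(self, span):
        self.exporter.export(span)

class TracerProvider:
    """Glues resource + sampler + processor together."""
    def __init__(self, resource, sample, processor):
        self.resource = resource    # identity of the service
        self.sample = sample        # how much detail: name -> bool
        self.processor = processor
    def start_span(self, name, attributes=None):
        if not self.sample(name):   # the sampling decision
            return None
        span = Span(name, {**self.resource, **(attributes or {})})
        self.processor.on_end(span)
        return span

exporter = InMemoryExporter()
provider = TracerProvider(
    resource={"service.name": "publisher", "service.version": "1.0.0"},
    sample=lambda name: True,       # "always on"
    processor=SimpleProcessor(exporter),
)
provider.start_span("save-order", {"messaging.system": "pubsub"})
```

Note how every exported span automatically carries the resource attributes: that's what gives each service its identity in the backend.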
So, the resource: as mentioned, it's the identity of the service. The resource is critical because, if you have a cart service and an order service, each of those services needs its own identity so it can be properly displayed in the observability solution. You need a service name, a service version; a couple of attributes are recommended, but the minimum is the service name. Then, trace sampling: what is it? Well, we want to determine how much data we send to the observability backend. Of course we can send 100%, but that would be quite expensive in the end, so we need to sample, and decide how we want to sample. At the moment the OpenTelemetry project makes you configure the sampling decision on your own, and there are several sampling decisions available. You have "always on": 100% of the data produced is sent to the backend. You have "always off", which is the opposite: nothing is sent. Then you have "parent based". Parent based is perfect when you are a dependency, like service B in this slide: it means service B only reports spans when it is involved in a global transaction that is being sampled. Then you have "trace id ratio based", where you define a ratio, a percentage: if I say 20%, or 10%, that share of my data is sent to the backend. And then you have combinations of both: parent based always on, parent based trace id ratio, parent based always off, and so on.
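These two ideas can be sketched in a few lines of plain Python. The keep rates below are mostly assumptions: the talk only gives 20% for service B and 50% for service C, so the other numbers are made up to reproduce the slide's arithmetic:

```python
# Parent-based in a nutshell: a dependency follows its caller's
# decision; only a root span (no parent) decides for itself.
def parent_based(parent_sampled, root_decision):
    return root_decision if parent_sampled is None else parent_sampled

# Illustrative keep rates for services A..E (only B and C are from the talk).
rates = [0.10, 0.20, 0.50, 0.10, 1.00]

# If each hop drops traces independently, an end-to-end trace must
# survive every decision, so the rates multiply:
survival = 1.0
for r in rates:
    survival *= r

requests = 1000
full_traces = requests * survival   # roughly one A-to-E trace per 1000
```

This is why the numbers matter so much: a few innocent-looking per-service ratios compound into almost no end-to-end traces.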
Here in this example I have services A, B, C, D and E, and we're looking for an end-to-end trace. Service A has been configured with trace id ratio because it's the first endpoint, and the others are configured with parent-based trace id ratio, so they send data when they're involved as a dependency in a global transaction. And you can see that I've defined 20% for service B, 50% for service C, and so on. The numbers here matter a lot: how you configure the sampling decisions determines how much detail you get. In this example, if I take 1000 requests, out of those 1000 requests I will have only one end-to-end trace from A to E, which is quite low. So you have to tweak and configure it properly to get the level of detail that you need. Next, what is trace propagation? Very simple: it's how the context is passed along through our architecture.
If I'm the first service, service A, I'm starting the trace, so I have the context; within my own code that's easy, because it's all the same process, so I can keep the right context. But how can I make sure the context is handed over to service B so the trace continues? That is propagation: we inject the trace context, for example into the headers of our HTTP request in the case of an HTTP communication, and then service B extracts it. Once service B has the trace context, it can continue the trace. So now you know everything; let's look at an example. In a traditional microservice architecture, distributed tracing is
just fantastic, because here you can see that I have an ingress controller and several services, and with this I can keep track of all the tasks of a given transaction. And you have a very easy way to visualize this data: for example, here I have an HTTP request going through various services in the architecture, and I can see that this transaction takes 25 milliseconds. If I want to optimize this transaction, I can clearly see that listRecommendations takes 20 milliseconds, so that specific function is probably the one to optimize. And I also discover here that getProduct is called 1, 2, 3, 4 times, so maybe there's a more efficient way of doing that, to reduce the footprint of this transaction even further. Fantastic.
But you say: okay, great, but you're talking about microservices; what's the relationship with event-driven architecture? Be patient, I'm getting there. For event-driven architecture, distributed tracing is a bit different. For example, I have a service, I send my data to a broker or a pub/sub, whatever, and then, based on that event, a couple of different services start working. There are two different ways of tracing this. The first way is: let's build one big trace, from the service close to the ingress all the way through every service triggered by that event. So I will have one big trace, and you will see that sometimes, depending on your architecture, that makes sense, and sometimes it doesn't. Let's have a look at the first example, the end-to-end trace I just mentioned. I prepared a GitHub repo for this event; here's the link, so you can play around with it. I'm using a pub/sub architecture hosted on Solace, I have the OTel demo just to produce traces on the side, and then I have my demo: one publisher and two consumers, where one consumer exposes REST and the other one writes to a database. So let's have a look at how we can do that. What does it mean in terms of coding?
First, let me bring up the code of the publisher. The publisher is nothing special; let's look at the code. You can see that we're using the OpenTelemetry SDK, as expected, and every single thing I explained is here: I define an exporter (the standard OTLP exporter of OpenTelemetry), I use a batch span processor to determine how I'm going to send the data, and the sampling is defined. With that I have a tracer provider, and with that tracer provider I'm able to start creating spans, which I'm doing here: create span, start span. As you can see, I'm adding some attributes to give some details about this specific function, and then I need to attach the context. In the case of messaging, a couple of technologies have SDK support so the trace context is passed along automatically, but in this particular example I'm using a pub/sub where the SDK doesn't necessarily do that for me. So what I do is: once I have the trace, I get the trace id and the span id.
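That inject/extract round trip can be sketched like this in plain Python; the property names and the payload are invented for illustration (production code should rather carry the W3C `traceparent` header via the SDK's propagator API):

```python
import secrets

def new_ids():
    # (trace id, span id) as hex strings
    return secrets.token_hex(16), secrets.token_hex(8)

# --- publisher side: inject the context into the message -------------
trace_id, span_id = new_ids()            # the publisher's current span
message = {
    "payload": b'{"order": 42}',         # hypothetical payload
    "properties": {"trace_id": trace_id, "span_id": span_id},
}

# --- consumer side: extract the context and continue the trace -------
props = message["properties"]
_, child_span_id = new_ids()
child_span = {
    "name": "process-order",
    "trace_id": props["trace_id"],       # same trace id: same trace
    "span_id": child_span_id,
    "parent_span_id": props["span_id"],  # publisher span is the parent
}
```

Because the context travels with the message itself, the broker in the middle doesn't need to know anything about tracing.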
Then I send them as properties of the message, which means the consumer will have to extract them as well. Now if we look at one of the consumers, like the REST consumer: same thing, we define the tracer provider, nothing special. And here you can see that from the message I'm extracting the trace id and the span id. Once I have those, I define a new span context from them and restart the child spans from there. So that's the code perspective, but what does it look like as an actual trace? If I open my browser, I have it already displayed here. First let me bring up the services. Here I'm in the Dynatrace services view, with the different applications running, and you can see first of all the consumer database, the consumer REST, and the publisher.
Those are the three services I'm running in my example. If I click on the publisher, I get details about the actual service, but what I want to show you is the actual end-to-end trace. Remember: I started the trace, I attached the context to the message, and then the consumers extracted the context and continued their tasks. So here I can see one big trace, with the publisher sending the data, and then every single detail of the consumer REST and the consumer database. But the problem is, as you can see, we have a four-minute transaction, and four minutes in an end-to-end distributed trace is very difficult to work with. If I want to optimize the publisher, I have no idea where to look, because in this trace the publisher is only a couple of hundred milliseconds; at this scale its details are technically not visible. So it's not perfect, and that's why I think we should improve it somehow: this design is not well suited to our example. This is what I was showing you. Is it a useful trace?
The answer is, of course, no. I may need to change the way we're doing it, and the great example is my second example, using span links. Before we start with this second example: what is a span link? Well, in the case where you have a long transaction (in our case four minutes), you saw that the first span is barely visible. Instead of that, I can start a trace in the publisher, and then use a span link: the consumer starts a new trace and links that trace back to the publisher. So with this implementation I end up with three traces: one for the publisher, one for the first consumer, and one for the other consumer. Three traces are much easier to consume, much easier to understand. So again, let's jump in. I made a second version of this implementation where I slightly changed the code.
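The difference to the previous consumer is small and can be sketched like this (plain Python again, with simplified span fields): instead of continuing the publisher's trace, the consumer starts a fresh trace and records the publisher's context as a link:

```python
import secrets

def new_ids():
    return secrets.token_hex(16), secrets.token_hex(8)

# Context extracted from the message, i.e. the publisher's span.
publisher_trace_id, publisher_span_id = new_ids()

# Span-link variant: a brand-new trace for the consumer...
consumer_trace_id, consumer_span_id = new_ids()
consumer_span = {
    "name": "consume-order",
    "trace_id": consumer_trace_id,   # its own, short trace
    "span_id": consumer_span_id,
    "parent_span_id": None,          # root span of the new trace
    # ...with a link pointing back at the publisher's span:
    "links": [{"trace_id": publisher_trace_id,
               "span_id": publisher_span_id}],
}
```

The backend can follow the link from the consumer trace back to the publisher trace, while each individual trace stays short enough to read.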
Let me show you what I mean. If we look at the code here, in the other example, it's the same REST consumer that I showed you. The publisher I didn't change: it's still sending the trace context in the message, and I still need to extract my trace context, the trace id and span id. But here I'm starting a new trace: I create a link to the actual context of the publisher, then start a new span and attach that link to it. So the code is only slightly different. But what does it mean in terms of UI, in the observability backend of your choice? Let's jump into Dynatrace. Here I have another example, the REST consumer I was showing you, and you can see that the trace has three steps, much simpler to use. But what's interesting here: if I pay attention to the first one, I have links, and with that link I can click and it brings me back to the publisher trace. And now I have a 100-millisecond trace, much easier to read. Here I can see that if I want to optimize, I obviously need to optimize the connect-to-pub-sub step, which could be difficult, but at least I know where I'm spending most of my time in that specific step. Okay,
so now that we've seen all that, let's jump into the conclusion. First, pros and cons. Using span links is great because it properly keeps track of the consumers, and it's much easier to visualize things. But the major disadvantage is that you lose the connection from the publisher to the consumers. From the consumer to the publisher it's easy: I know this consumer comes from that trace, no problem. But from the publisher, I have no idea how many consumers were triggered by that message, so it's more difficult to get a broader vision of the system. The other disadvantage, I would say, is sampling decisions. We talked about sampling: here I've got a publisher trace and a consumer trace, and if those follow two different sampling decisions, I may lose details. Maybe I won't have the publisher trace anymore, or maybe I won't have the consumer trace anymore. So again, it's not perfect, but it's easier to consume. Depending on your implementation, you will have to decide whether you need an end-to-end trace or span links. That's why observability is an art. So first,
as a takeaway, make sure your code is agnostic, because we don't want any vendor lock-in; that's the concept and the culture of OpenTelemetry. Second, if you start doing observability with OpenTelemetry, make sure you add the right context to your metrics, your logs and your traces, otherwise you won't be efficient. And again, be creative: understand your system and design the right observability technique for your application. All right, just a small teaser for the YouTube channel Is It Observable: there's plenty of content covering OpenTelemetry and other observability frameworks and agents, so check it out. It's a quite young channel, so watching it will help me be more efficient in producing new content. All right, I hope you enjoyed this session and that you learned something. If you have any questions, I will be delighted and honored to answer them. Thanks for watching, see you soon.