Transcript
Hi all, and welcome to this Conf42 session called Hacking OpenTelemetry. My name is Andrea Caretta and I am a senior consultant for Liquid Reply, an IT company, and I'm here with my colleague Alberto Gastaldello, observability expert. Our job consists in researching and/or designing monitoring solutions for systems at full stack level, infrastructure, applications and/or front end user behavior, adopting both enterprise platforms and open source tools, which is the reason why our path collided with OpenTelemetry and we never left that way. Just a quick summary of what we're going to show you today. We would like to start with a brief introduction about how we handle observability and why we consider it so important. Proceeding with a clarification of what OpenTelemetry is, we will mainly focus on distributed traces, one of the three observability pillars, in order to explain how we managed to hack the tool (but be careful, it's not properly only a tool, as we'll see later) and transform the collected data, also showing a little demonstration.
So let's start with a few observability tips. In system theory, observability is a property that measures how well internal states can be determined and interpreted from input data and external outputs. It's a system attribute, of course, not an activity or a tool: I could adopt tools to reach observability, for example. It's a kind of mindset rather than a result, the disposition to build a system not only to work, but also to be observed, to be seen.
Let's represent my system as an iceberg, where the visible part is a kind of black box: I'm able to understand information only through outputs, through symptoms, explained with the RED metrics, request rate, errors and duration. I have to open the black box in order to understand the causes of those outputs, adopting the USE metrics, utilization, saturation and errors, to perform root cause analysis. Logs are just as important, as a diary of the system events and many additional pieces of information. Traces, instead, are able to correlate user behavior with the last query performed on the DB, introducing a cause-effect concept in a single occurrence, compared to the aggregated values of the metrics. All of them contribute to a full understanding of the system behavior, both in fully operational and anomalous situations.
So what is OpenTelemetry, or better, why OpenTelemetry? OpenTelemetry has established itself as a de facto standard for observability. During the last years, many leaders, cloud providers and enterprise observability platform companies contributed to its development. It started as a framework spread across different programming languages, then it was included in the CNCF as an incubating project, where every pillar for which OpenTelemetry defines semantic rules and references has its own maturity level. Only the tracing specification has been released as stable, and this is one of the reasons we're mainly focusing on this pillar instead of the others. Finally, new software with purposes other than monitoring started to adopt the OpenTelemetry standards to generate valuable telemetry data and provide it to external tools out of the box. These kinds of solutions are really important to understand where the market is going. OTel being vendor agnostic matters here, since software companies are not always so confident about letting external agents be installed on their products; on the contrary, they often prevent every kind of external monitoring.
As I said before, OpenTelemetry defined data types, operations and semantic conventions at the beginning, and it was officially born as a framework composed of different SDKs in different languages to be included in an application's project. Then monitoring capabilities were condensed into agents able to instrument scripts and virtual machines in Java, Node.js, Python, .NET and many other technologies. And after a while, a Kubernetes operator was created in order to instrument pods with agents in microservices clusters. So, for supported technologies, we currently have the chance to obtain data out of the box without touching any line of code.
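As a hedged sketch of what that looks like in practice (the pod name and image below are placeholders, and an Instrumentation resource is assumed to already exist in the cluster), enabling the operator's auto-instrumentation comes down to an annotation:

```yaml
# Sketch: asking the OpenTelemetry Operator to inject its Java auto-instrumentation
# agent into a pod. Pod name and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: orders
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
spec:
  containers:
    - name: orders
      image: example/orders:latest
```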
Last but not least, the OpenTelemetry Collector is one of the most important artifacts in the project. It is a kind of proxy able to receive OpenTelemetry data in different formats, edit and/or filter it, and send it in different formats to the desired backend. Taking a look at the data flow, it's clear how fundamental it is, in environments with heterogeneous data sources, to translate the information into the expected format and convey it to the targets with different protocols, different compressions and different serializations. The most important thing to understand from here on out is that the OpenTelemetry Collector is the tool we hacked to perfectly and/or fully handle telemetry data.
So let's go deeper inside the distributed tracing world. Alberto, tell us your point. Thank you, Andrea. Let's start with the definition of the three fundamental concepts in distributed tracing. The trace is the whole request record that follows its path top-down through the architecture of a system. A trace is composed of spans, which are the unit of measure in this field. Each span contains all the details of the operations performed within a service, allowing us to know how much time that step has taken.
But how can we correlate all the spans that are related to a single request? The Trace Context specification, defined by the W3C in 2021, allows span data to be propagated with the traceparent and tracestate headers. I'd like to point out three more important definitions that I'll use throughout the webinar. The trace id identifies the whole distributed trace. When a span is created, a unique id is generated for it, and that is the span id. As the request propagates through services, the caller service passes its own id to the next service as the parent span id. Trace id and parent span id are then included in the traceparent header.
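As a small illustration that is not part of the original talk, this is roughly how a traceparent header can be built and read; the helper functions are ours and only meant as a sketch of the header layout.

```python
# Sketch: composing and parsing a W3C traceparent header
# (layout: version-traceid-parentid-traceflags).
import secrets

def make_traceparent(trace_id=None, parent_span_id=None):
    trace_id = trace_id or secrets.token_hex(16)              # 16 bytes -> 32 hex chars
    parent_span_id = parent_span_id or secrets.token_hex(8)   # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{parent_span_id}-01"               # version 00, sampled flag 01

def parse_traceparent(header):
    version, trace_id, parent_span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "flags": flags}

print(parse_traceparent(make_traceparent()))
```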
Let's see together an example of a distributed application: this is Sock Shop, an open source demo application. It has the usual website architecture: we have a front end and various other microservices in the backend, with two databases and a queue manager. The orders, shipping, queue-master and carts services are auto-instrumented using the OpenTelemetry Operator. When a request is received by the backend, a trace is generated; as the request propagates through the components, spans are added to the tree. All this data is sent to the collector, which eventually processes it and then forwards it to the dedicated storage backends.
Let's see how this works. We open the Sock Shop web page, we choose an article and then add it to the cart. Now we are opening the cart and proceeding with the checkout. Let's see what this last click generated in our observability tracing backend. Each span represents a step in the request timeline. We can reconstruct the path followed by the request: for example, orders calls the carts service and then the shipping service. I'm showing you two tools to highlight the fact that both an enterprise offering like Dynatrace and an open source one like Grafana show traces in the same way, thanks to the standard data format.
Okay, all nice and clear, but what happens when we do not have the possibility to export tracing data? Can we make the system observable in some way? Fortunately, the answer is yes. Let's see how. In many situations we encounter applications that are not instrumentable due to restrictions imposed by the software providers. They just send out logs, which are then stored in a database, eventually passing through a telemetry processing layer that decouples the application and the backend. Unfortunately, in this way we only deliver data for the second observability pillar, logs. For the sake of simplicity, we leave out the metrics pillar and focus on traces from now on.
When application logs contain tracing data, that is, as said, trace id, span id and/or parent span id, this is all we need to correlate them in order to transform a log into a trace span. We can leverage the processing layer: we use the OTel Collector for this. Now let's see how it works and how it can be manipulated to reach our goal.
The collector is composed of three main modules: receivers, processors and exporters. The receivers accept telemetry data with different protocols. Processors allow data filtering, modification or transformation. Exporters send the elaborated data to the storage endpoints. There are two collector versions, the core and the contrib: this last one includes the core modules plus all the additional modules developed by contributors for their purposes.
As you can see, collector components are defined in a YAML config file, listed in their own sections, with the chance to customize them with many settings. The highlighted section is the service section, where pipelines can be configured and each component can be enabled only in the step it is made for.
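To make that structure concrete, here is a minimal, hedged sketch of a collector config; the otlp receiver, batch processor and otlphttp exporter are standard components, while the endpoint is only a placeholder.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    endpoint: http://tracing-backend.example:4318   # placeholder endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```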
We have to be careful about the pipelines in the OTel Collector releases, because they are fundamental to understand. They arise from the division into pillars and they are really independent from each other: trace information that is received can only be handled as a trace and sent anywhere as a trace, and the same goes for metrics and logs at every step.
What we found out breaks this model. We focused our attention on the logs pipeline and we detected a point in which the information could be manipulated to switch from a structure dedicated to logs to another structure related to another pillar; in this case, we were interested in the trace structure. We built an exporter able to retrieve the trace id and span id values contained in the log and represent them as key values for spans, before sending them to a target able to represent distributed traces. With an application able to produce only logs, meaningful logs with tracing ids, we have in the end all we need to create correlated spans and so a distributed trace. The translation takes place in the collector, and it leads to a well-formatted trace span that becomes a fundamental piece across the end-to-end path, where before the trace could have been broken.
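The exporter we actually modified lives in the collector's Go codebase, so the following is only a conceptual Python sketch of that translation step; the input field names (trace_id, span_id, parent_span_id and so on) are assumptions about the log payload, while the output keys follow the OTLP/JSON span encoding.

```python
# Conceptual sketch of the log-to-span translation performed by the modified exporter.
import json

def log_record_to_span(log_line):
    """Turn a JSON log record carrying tracing ids into an OTLP/JSON-style span dict."""
    record = json.loads(log_line)
    return {
        "traceId": record["trace_id"],                     # correlates every span of one request
        "spanId": record["span_id"],                       # unique id of this step
        "parentSpanId": record.get("parent_span_id", ""),  # links back to the caller's span
        "name": record.get("operation", "log-derived-span"),
        "startTimeUnixNano": record.get("start_ns", 0),
        "endTimeUnixNano": record.get("end_ns", 0),
        "attributes": [
            {"key": "log.message", "value": {"stringValue": record.get("message", "")}},
        ],
    }
```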
Then the modified exporter forwards the log to the log storage, in our case Loki, and the trace to the tracing backends, in our case Grafana Tempo and Dynatrace.
Let's now see how this works with a demo. We compiled our modified version of the OpenTelemetry Collector and we run the executable in order to have the service listening on a local port. We created a Python script that simulates an application sending event logs: it sends logs in the syslog format to the local port of the collector that is running locally.
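The script itself is not shown in the talk, so this is only a hedged approximation of what it does; the port number, logger name and log fields are placeholders.

```python
# Sketch: emit syslog-style log lines carrying tracing ids to the local collector.
import json
import logging
import logging.handlers
import secrets

# The collector's local syslog port is a placeholder here.
handler = logging.handlers.SysLogHandler(address=("localhost", 54526))
logger = logging.getLogger("demo-app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

trace_id = secrets.token_hex(16)
span_id = secrets.token_hex(8)

# One event log that the modified exporter can later turn into a trace span.
logger.info(json.dumps({
    "trace_id": trace_id,
    "span_id": span_id,
    "parent_span_id": "",
    "operation": "checkout",
    "message": "order placed",
}))
```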
We decided to focus on modifying the exporter because, at the chain level, it is the last point where data can be modified before transmission. The collector takes each log and transforms it into a trace, then sends the original log to Loki and the trace to both Grafana Tempo and Dynatrace.
Let's open our tools to see how this can be visualized. In Grafana we have the possibility to see logs and traces in the same dashboard. With a Loki query we can find our generated logs, pretty-printed in JSON for better visualization. This highlights the switch from the logs pillar to the traces pillar. From here we can directly find the corresponding trace, thanks to the integration that queries Grafana Tempo looking for the trace id.
The visualization is pretty straightforward.
The same can be seen in the other tool, Dynatrace: we navigate into distributed traces and visualize the trace generated by our Python script.
Today you saw how to manipulate and transform telemetry
data for your purposes. When complex distributed
systems handle data adopting different formats,
OpenTelemetry allows you to define pipelines and move data wherever you prefer, becoming agnostic from every type of vendor. Thank you for watching. For any questions about this session or our offer, feel free to reach us through socials or via email.