Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, my name is Brunfre, and in this talk I will tell you about optimizing observability using the OpenTelemetry Collector for budget-friendly insights. I'm currently a DevOps engineer at TheyDo, and TheyDo is a platform for customer-centric collaboration. The good news is that we are hiring for different roles, so feel free to check out our product and careers page.
I think most of us agree that 2023 was not economically great. We had rising inflation, lots of tech companies doing massive layoffs, and hiring freezes throughout the year. And yeah, taking this context into account, the last thing you want is a $65 million bill for your observability system, as we read in the news last year about a famous cryptocurrency company. In this talk I will share our experience at TheyDo adopting OpenTelemetry, and hopefully you can get some useful tips on how to avoid this kind of spending when setting up your observability system.

Last year we made a major effort at TheyDo to adopt OpenTelemetry as our observability framework, and depending on the platform that you are using to store and process your OpenTelemetry signals, whether metrics, traces, or logs, you might be charged by the amount of ingested data, or with some vendors you will eventually be throttled if you are ingesting more data than is included in your plan. And that was our main issue during the first few weeks after adopting OpenTelemetry.
As you can see in the image on the right, our usage was low during weekends, so we were below the threshold, but during weekdays we were above the daily target. And as we had more load on our product and were ingesting more data than we should, eventually we would be throttled and events would be rejected. So we had to do something about this.
So the first question we asked ourselves was: do we really need all this data? And the answer was, well, probably not. So our first action was: okay, then let's pick the data that is actually useful for us. But how?

The first thing was to really think about how we were using auto-instrumentation. And the first tip I can give you is to be careful with auto-instrumentation. If you're using one of the popular libraries for auto-instrumentation for Node.js or Python, it's really common that the default configuration will send far more spans than you need. Auto-instrumentation is really useful for getting your initial signals to your observability backend, but at some point you will need to optimize it. Here are two examples that caused us some trouble.
First, the auto-instrumentation for the AWS SDK, as you can see on lines five and six: by default it does not suppress any internal instrumentation. This means that for any call you make to S3, for example, you will have at least four more spans. You will have the parent span for the S3 action, in this case a PutObject, and then four more spans: one for the HTTP PUT, another for the DNS lookup, another for the TLS connection, and another for the TCP connection. Of course this might be useful if there's a DNS issue or something similar, because you will see it right away on the trace, but it can also be enabled later when needed, if you detect weird behavior on S3 operations that you suspect might be related to a DNS problem. Most of the time, though, you probably won't need all these internal spans on every trace.
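For reference, here's a minimal sketch of how that option can be turned off when registering the Node.js instrumentations; it assumes the @opentelemetry/sdk-node and @opentelemetry/instrumentation-aws-sdk packages, and the rest of the SDK bootstrap (exporters, resource, and so on) is simplified.

```typescript
// Sketch: registering the AWS SDK auto-instrumentation with internal spans suppressed.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { AwsInstrumentation } from '@opentelemetry/instrumentation-aws-sdk';

const sdk = new NodeSDK({
  instrumentations: [
    new AwsInstrumentation({
      // Drop the extra HTTP/DNS/TLS/TCP child spans under every S3 call;
      // set this back to false if you need to debug a DNS or connection issue.
      suppressInternalInstrumentation: true,
    }),
  ],
});

sdk.start();
```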
A similar situation happened with the auto-instrumentation for Koa. By default it will create a new span for every type of middleware you have on your API, and most of the time you won't need all of them. So you can probably ignore them and enable them later if you suspect that one of the middlewares might be the root cause of a latency issue, for example. Or you can go for manual instrumentation and apply the needed instrumentation inside the middleware logic itself. In this case, as you can see, we had lots of spans on every trace that were automatically sent by the Koa auto-instrumentation, and we solved this by ignoring the middleware layer type and dropping these middleware spans, as shown in the sketch below.
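A sketch of what that can look like, again assuming the Node.js SDK packages; the ignoreLayersType option and the layer type enum come from @opentelemetry/instrumentation-koa, though option names can vary between versions.

```typescript
// Sketch: dropping the per-middleware spans created by the Koa auto-instrumentation.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { KoaInstrumentation, KoaLayerType } from '@opentelemetry/instrumentation-koa';

const sdk = new NodeSDK({
  instrumentations: [
    new KoaInstrumentation({
      // Skip spans for plain middleware layers; router spans are still emitted.
      ignoreLayersType: [KoaLayerType.MIDDLEWARE],
    }),
  ],
});

sdk.start();
```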
But this was not enough. Another essential technique we used to filter out the data we needed was tail-based sampling. Tail-based sampling is basically where the decision to sample a trace happens after all the spans in a request have been completed, and the most popular tool for executing tail-based sampling is an OpenTelemetry Collector. For this we have multiple options; here are some of them. The first one that we tried was the AWS Distro for OpenTelemetry (ADOT). It's an AWS-supported version of the upstream OpenTelemetry Collector and is distributed by Amazon. It supports selected components from the OpenTelemetry community and is fully compatible with AWS compute platforms such as ECS or EKS.
It has some niceties, like being able to load a configuration from an S3 file, so you don't need to bake the OpenTelemetry Collector configuration into the Docker image and can just retrieve it from S3. But we hit some issues that meant we couldn't go with this solution. The first issue we found is that it didn't support all the processors that we needed, especially the transform processor. I actually opened an issue for this, and it's currently still not included in this distribution. The other problem was that it didn't support logs at the time. Now it does, as that was announced a few weeks later, but at the time it didn't support logs, and we needed that because we also wanted to enrich the logs with some extra attributes and we couldn't use the OpenTelemetry Collector for this. So yeah, be aware of these things. The documentation in the repository is pretty good, and it has a list of all the processors available in the distro, so you can see beforehand whether it's the right tool for you or not.
Taking these limitations into account, we had to go with the official upstream OpenTelemetry distribution, and here you have two options: OpenTelemetry Core and OpenTelemetry Contrib, which was the one we used. Core is a limited Docker image that includes components maintained by the core OpenTelemetry Collector team, covering the most commonly used components like the filter and attribute processors and some common exporters for Jaeger and Zipkin. The Contrib version includes almost every component available, with some exceptions where the components are still in development. If you find that OpenTelemetry Contrib has too many things you don't need and you want to create a slimmer image, there's a recent blog post from Martin on how to build a secure OpenTelemetry Collector, so you can create a slimmer image with just the components that you need.
It's also a good idea in terms of security, of course, because you won't include any processors that you don't use, so you reduce the attack surface. And as the blog explains, it's not that hard to build an OpenTelemetry Collector from scratch.
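For example, a builder manifest for the OpenTelemetry Collector Builder (ocb) could look roughly like this sketch, listing only the components mentioned in this talk; the component versions are illustrative and should be pinned to the collector release you actually run.

```yaml
# builder-config.yaml -- minimal sketch of an ocb manifest with only the
# components we need. Versions are illustrative.
dist:
  name: otelcol-custom
  description: Slim collector with only the components we need
  output_path: ./otelcol-custom

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.96.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/filterprocessor v0.96.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor v0.96.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.96.0
```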
Finally, you also have some vendors like Honeycomb that provide their own solution for tail-based sampling. In their case it's Refinery, which is a completely different project from the OpenTelemetry Collector. It's a sampling proxy that examines whole traces and then intelligently applies sampling decisions to each trace. These decisions determine whether to keep or drop the trace data, and the sampled data is forwarded to Honeycomb.
So our current architecture for the OpenTelemetry Collector looks like this. We run the OpenTelemetry Collector as a sidecar, and our app container forwards the spans to it. The collector also calls the metrics endpoint on the app to fetch metrics from the Prometheus client running in the application. As for logs, they are tailed by a Fluent Bit sidecar, another sidecar that we have, which then forwards the logs to the OpenTelemetry Collector container. The OpenTelemetry Collector then filters the spans and also enriches the metrics, spans, and logs with new attributes, like the identifier of the running task and other attributes that are useful to us. Finally, it's responsible for sending the metrics, traces, and logs to one of these backends. It can send them to Honeycomb, Grafana, Datadog, or any vendor that supports the OTLP protocol, and then you can visualize your data there.
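The receiver side of that sidecar setup can be sketched roughly like this; the ports, the scrape interval, and the choice of the fluentforward receiver for the Fluent Bit leg are assumptions for illustration (Fluent Bit could just as well ship logs over OTLP).

```yaml
# Sketch of the receivers matching the sidecar architecture described above.
receivers:
  otlp:                       # spans sent by the app container
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:                 # collector scrapes the app's Prometheus client
    config:
      scrape_configs:
        - job_name: app
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9464"]
  fluentforward:              # logs forwarded by the Fluent Bit sidecar
    endpoint: 0.0.0.0:8006
```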
Regarding the collector configuration, you do that configuration in a YAML file, and here we can see a visual representation of it for logs, metrics, and traces. The image was generated with otelbin.io, which is a really great tool to visualize your OpenTelemetry Collector configuration. On the left, for logs, metrics, and traces, you can see the different receivers, so OTLP or Prometheus. Then, after the data is received, you have the different processors that process the data, filter it, and enrich it with more attributes. In the end you see the destination of the OpenTelemetry signals, which in this case is also OTLP, so the data is sent to a backend where you can then visualize the signals.
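The service section behind that diagram wires those pieces together along these lines. This is a simplified sketch: the receiver, processor, and exporter names (including the transform step that stands in for the attribute enrichment) are illustrative labels for the components discussed in this talk, and their configurations live elsewhere in the same file.

```yaml
# Simplified sketch of the service pipelines behind the otelbin.io diagram.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling/default, tail_sampling/synthetics, filter/spans, transform]
      exporters: [otlp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch, transform]
      exporters: [otlp]
    logs:
      receivers: [fluentforward]
      processors: [batch, transform]
      exporters: [otlp]
```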
In this case, the type of signal generating the most data for us was traces, and that's where we needed to act. So let's focus on the pipeline for traces. The first processor configured on the collector's pipeline is the batch processor, and it's really simple: it accepts spans and places them into batches. Batching helps to better compress the data and reduces the number of outgoing connections required to transmit it, so it's a recommended processor to use.
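Its configuration can be as small as this; the values below are just an illustrative sketch, and leaving the block empty falls back to the defaults.

```yaml
# Minimal batch processor configuration; values are illustrative.
processors:
  batch:
    send_batch_size: 8192   # number of items that triggers a send
    timeout: 5s             # flush a partial batch after this long
```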
After that, the data is handled by the next processor, which is a tail sampler that we call default. As I will explain later, the trace is analyzed here and may or may not be dropped. If it is not dropped, it goes on to the next processor, which in our case is another tail sampler, where the whole trace can again be dropped depending on the configuration. So let's see how these two are configured
in our specific case. The first tail sampler, named default, has three different policies. The first is the errors policy, which will sample any trace that contains a span with an error status. We assume that if it has an error, it will be an important signal that we can then analyze to get to the root cause. The next is the latency policy, where we check if the trace, or in this case the request, took more than 100 milliseconds to be processed. We also sample the complete trace here, and the main idea is the same as before: we sample slow operations so we can analyze them and get to the root cause. These two policies will already filter out most of the simple operations you might have, like status or health check calls on your API. But you could also filter those kinds of calls explicitly using the HTTP path attribute, for example; that would be another way to do it. Finally, the last policy samples any trace that contains a span with a specific attribute. In this case, we sample every trace that contains a span with the GraphQL operation name resync project. This might be an important operation that we always want to sample; for example, it could be a new feature whose usage you want to check, and you will always want all the traces related to that operation.
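Put together, the default sampler looks roughly like this sketch; the decision wait, the GraphQL attribute key, and the operation name are illustrative assumptions, while the policy types (status_code, latency, string_attribute) come from the tail_sampling processor in the Contrib distribution.

```yaml
# Sketch of the "default" tail sampler described above.
processors:
  tail_sampling/default:
    decision_wait: 10s            # buffer spans this long before deciding per trace
    policies:
      - name: errors              # keep any trace that contains an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-requests       # keep any trace slower than 100 ms
        type: latency
        latency:
          threshold_ms: 100
      - name: resync-project      # always keep this specific GraphQL operation
        type: string_attribute
        string_attribute:
          key: graphql.operation.name
          values: [resyncProject]
```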
An important thing to note here, which you maybe already noticed, is that these policies have an OR relationship between them, so the trace will be sampled if any of these conditions is true. You can have multiple tail samplers, and in this case we have two of them. The next one we called synthetics, and it exists
basically because we have synthetic monitors checking our API every minute from different regions, and each of these calls generates multiple spans that are not interesting at all if they run successfully. So for this processor we configured things the same way: we configured an errors policy and a latency policy, in this case with the latency threshold at 1 second. If one of these synthetic monitors throws an error, or takes more than 1 second to complete, then we sample the data, because that's an interesting event, right?

Then we have two extra policies in this case. The first one is to sample only 1% of the synthetic requests that are successful. This can be useful, for example, to have an idea of the average latency of the synthetic requests. As you can see here, we can create a policy with two sub-policies, and it will evaluate them using an AND instead of the default OR.

Finally, the last policy serves as a failover to sample all the other traces that do not have the CloudWatch user agent. It's the opposite of the previous one: we match on the same user agent attribute but set invert match to true, so we sample 100% of the non-synthetic requests. It's basically a failover, sampling all the other traces that do not have the CloudWatch user agent in their attributes. And this is important, because without this failover the processor would discard all the other traces, since they wouldn't be evaluated as true by any of the other policies.
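A rough sketch of such a synthetics sampler is shown below; the user agent attribute key and the CloudWatch Synthetics pattern are assumptions for illustration, while the and, probabilistic, and invert_match mechanics come from the tail_sampling processor.

```yaml
# Sketch of the "synthetics" tail sampler described above.
processors:
  tail_sampling/synthetics:
    policies:
      - name: errors                   # synthetic check that failed
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-synthetics          # synthetic check slower than 1 second
        type: latency
        latency:
          threshold_ms: 1000
      - name: synthetics-1-percent     # AND: synthetic traffic AND 1% probabilistic
        type: and
        and:
          and_sub_policy:
            - name: is-synthetic
              type: string_attribute
              string_attribute:
                key: http.user_agent
                values: [".*CloudWatchSynthetics.*"]
                enabled_regex_matching: true
            - name: one-percent
              type: probabilistic
              probabilistic:
                sampling_percentage: 1
      - name: non-synthetic-failover   # keep everything that is NOT synthetic
        type: string_attribute
        string_attribute:
          key: http.user_agent
          values: [".*CloudWatchSynthetics.*"]
          enabled_regex_matching: true
          invert_match: true
```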
The last processor we have that also filters some data is a filter that excludes spans based on an attribute. The main difference from tail sampling is that the tail sampler filters complete traces, while this one filters specific spans. Because of that, we need to be really careful with the data being dropped, since dropping a span may lead to orphaned spans if the dropped span is a parent. So ideally you would create rules here that guarantee you drop only childless spans, which is true in this case: we are only dropping trivial GraphQL spans. We look at the field type and drop only the ones that we know for sure were trivial and were childless spans. They didn't have any children, so it was safe to drop them, and most of the time they were not interesting anyway because they would complete in just a couple of milliseconds. It was just information that we didn't need to keep.
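That kind of rule can be sketched with the filter processor like this; the attribute key and value are illustrative, and the point is to match only spans you know are trivial and childless.

```yaml
# Sketch of the span filter: drop only spans whose GraphQL field type is known
# to be trivial (and childless).
processors:
  filter/spans:
    error_mode: ignore
    traces:
      span:
        - 'attributes["graphql.field.type"] == "Date"'
```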
In summary, the logic behind the configurations I mentioned is represented pretty well in this image by Reese Lee, posted on the official OpenTelemetry blog. In the configuration for the tail sampler running on your collector, you will first want the traces with high latency and errors, and then also sample traces with specific attributes, as we did, for example, for the resync operation. The others, 99% of the time, won't be interesting for you, and you can either discard them or run some random sampling on them if you prefer and have the budget for it. This way you can have an efficient and economical observability framework.
And that's it. Thank you, and hopefully you got some useful information on how to use the OpenTelemetry Collector to keep your wallet safe. If you have any feedback or questions about what I said here, you can find me on LinkedIn or on Twitter. So, yeah, that's it. Thank you, thanks for watching, and bye.