Transcript
Let me give you an overview of OpenTelemetry. Here's everything that I'll be walking you through today. I'll give you a brief background on OpenTelemetry, its core concepts, and the building blocks and architecture of the project. We'll then dive into the instrumentation part, where we'll look at code and start instrumenting traces, metrics, and logs for a simple Node.js application built with the Koa framework. It's a very basic application with which we'll try to cover all these concepts and extract telemetry that makes sense to us. Lastly, I'll also cover the OpenTelemetry Collector: how you can get started with it, how it's beneficial, and when you should be using it.

We've all heard it from developers across the world: "it works on my machine", "it's an Ops problem", while Ops complains that it's an app problem. These days I've even heard people say "my container is working fine, you're just not deploying it correctly." Let's see how OpenTelemetry and observability help resolve this conflict in today's world.
A quick background on OpenTelemetry. OpenTelemetry today is an incubating project in the CNCF landscape, and a proposal has already been made to move it to graduated status. It was originally formed in 2019 by the merger of two well-known projects, OpenTracing and OpenCensus. OpenTracing was developed at Uber to monitor their fleet of microservices, and OpenCensus was developed by Google for monitoring their microservices and collecting telemetry.

Some of the core goals of OpenTelemetry are to provide a set of APIs, libraries, and integrations to collect the telemetry from across your systems and services. It helps set the standard for collecting telemetry from all of your applications and infrastructure. One of the best parts of OpenTelemetry is that it lets you send all this collected telemetry to the observability backend of your choice, which means you're not locked into a single vendor or any specific tool. Regardless of how you instrument your applications, services, and infrastructure with OpenTelemetry, you are free to choose where you want to store your telemetry: in house, with a third party, or a combination of both.

You'll see in this chart how quickly OpenTelemetry has risen. Today, OpenTelemetry is the second fastest-growing project in the CNCF space, right behind Kubernetes in the number of contributions and adoption. This is because there is strong interest in modern observability. A Gartner report from 2022 noted that a lot of companies are looking to embrace open standards, which is exactly what OpenTelemetry, eBPF, and Grafana are working towards. If you want to read more about it, you can scan the QR code at the top right.

Let's look at some of the core concepts and building blocks of OpenTelemetry.
OpenTelemetry is essentially a specification. It's not one specific framework, language, or SDK. OpenTelemetry provides the specification with which each individual language and framework develops its own set of SDKs. These SDKs are built on top of the API specification provided by OpenTelemetry, and those APIs cover tracing, metrics, and logging. All of these APIs follow the same semantic conventions, so anything built on OpenTelemetry in any language or framework remains standard. Today you might instrument a Java application, and tomorrow you might have to instrument an application in another language; with the same specification and semantics, you don't have to revisit the documentation or reinvent the wheel each time.

Most SDKs today provide the option of automatic instrumentation. For example, Node.js offers automatic instrumentation for libraries like Express, Koa, MySQL, and other common frameworks used with Node.js. We'll see that shortly when we get into the hands-on part.

Lastly, one of the important protocols and parts of OpenTelemetry is the OpenTelemetry Protocol, OTLP. This protocol is used to send all the telemetry collected from your applications, infrastructure, or services to the backend of your choice. OTLP runs over two well-known transports, HTTP and gRPC. Depending on your system architecture or requirements, you can choose to use either, or you can use both.
Let's quickly get into the hands-on part and see how we can get started with instrumenting a simple Node.js service. The conventions remain the same across other languages; the APIs are similar, and the only things that change are the packages and some SDK details. Here's a very simple application built on Node.js using the Koa framework. Koa is a very simple, lightweight framework, similar to Express, that helps you write REST APIs quickly.

We'll go through each of the OpenTelemetry packages that we'll be using. We'll start with tracing and automatic instrumentation for Node.js, see how we export that telemetry to our collector, then move on to metrics and logs, and finally combine all three, sending everything to the collector and having the collector export it to New Relic, which is our observability backend. New Relic helps provide contextual information by stitching together all the telemetry exported from the collector.
This is a very simple application, as I mentioned. You'll see there's nothing much to it; it's a very basic application with a handful of endpoints. Here I have at least four API endpoints: a root path, a POST request, and a GET request that accepts certain parameters. Each of these requests will have its traces captured automatically by the OpenTelemetry SDKs.
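For orientation, here is a minimal sketch of what a Koa application along these lines might look like. The route paths, the service name, and the stubbed weather handler are hypothetical stand-ins for the demo app, not its actual source.

```javascript
// index.js -- a hypothetical Koa app approximating the demo service
const Koa = require('koa');
const Router = require('@koa/router');
const bunyan = require('bunyan');

const log = bunyan.createLogger({ name: 'my-koa-service' }); // placeholder name
const app = new Koa();
const router = new Router();

// Root path: returns a simple hello-world response.
router.get('/', (ctx) => {
  log.info('root endpoint hit');
  ctx.body = { message: 'hello world' };
});

// A POST endpoint.
router.post('/items', (ctx) => {
  ctx.body = { created: true };
});

// A GET endpoint that accepts a parameter, e.g. a location for a weather lookup.
router.get('/api/weather/:location', async (ctx) => {
  // In the demo this calls an external weather service; stubbed here.
  ctx.body = { location: ctx.params.location, forecast: 'sunny' };
});

app.use(router.routes()).use(router.allowedMethods());
app.listen(3000, () => log.info('listening on port 3000'));
```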
Now, the safest and easiest option for getting started with OpenTelemetry is automatic instrumentation. We will not modify anything in the source code of this Koa application. Instead, the recommendation from OpenTelemetry for Node.js is to create a separate wrapper file, which becomes the primary module used to start your Node application. We will start by setting up this file and adding all the packages. I'll walk you through the details of each package that we are using and the topic that we'll be focusing on.
First, we'll start with tracing. For that we'll focus on a couple of packages: the automatic instrumentations for Node (@opentelemetry/auto-instrumentations-node), @opentelemetry/sdk-trace-node, and @opentelemetry/sdk-trace-base. Pay attention to what we are importing from each of these packages. The automatic instrumentation package provides an API called getNodeAutoInstrumentations, which helps you capture telemetry automatically from Node.js and its underlying libraries. There are also some conventions for setting up your OpenTelemetry service correctly; for that we'll use helper packages like @opentelemetry/semantic-conventions and @opentelemetry/resources, which help us configure our application name and other attributes of our application properly.

Let me quickly scroll down to the part where we set up tracing for our application. We'll ignore everything else that's configured for now and focus on what's important for you to get started quickly.
First, we require a tracer provider. A tracer provider is the API that registers your application with the OpenTelemetry API. This is where we provide our resource, and the resource is where we set up our application name. This is the most basic configuration we're doing here: we just add our resource name, which will be our OpenTelemetry service name. Once we have added the name, we can configure how frequently we want to flush our traces. Flushing basically tells the SDK how often the telemetry should be sent out.

Once we have the provider configured, we have to add a span processor. Traces are built from multiple spans. Each span is an operation within your application, carrying information about its execution period, which function ran, and anything that happened in that specific operation, such as an error or exception. A span contains all that information, and stitching all these spans together is what's called a trace.

In this tracer provider we add a span processor. A span processor basically tells the SDK how each of these spans should be processed from this application. For this example, we'll use the batch span processor. It's also the recommended processor, because it avoids overly frequent exports and keeps the operational load of span processing on the SDK low. The batch span processor takes a few optional settings; all the values you see here are the defaults, and you can increase or reduce them as per your requirements. Basically, the batch span processor collects all your spans and processes them in a batch. It takes another parameter, an exporter, which says where all these processed spans should be exported to. For this demo I've configured it to point to a simple collector. The collector will be running in my local setup in a container; I'll talk about that container towards the very end. Once we have instrumented our application for traces, metrics, and logs, all of it will be exported to our collector, and the collector will export it to our observability backend, which will be New Relic. For the exporter there's only one piece of configuration required, because the collector is running locally: a simple localhost URL, which is the default URL for the OpenTelemetry Collector's traces endpoint.
Once we've added our batch span processor, we can optionally register certain propagators. In the same tracer provider, we'll register the W3C baggage propagator and the W3C trace context propagator. These propagators help us find the origin of a request and stitch together a request that hops through multiple services. I'll show an example of what this looks like once we've included these propagators. They basically give you the overview and complete picture of how many different services your request has hopped across, what the operation was, and what went wrong at a particular service, by stitching all of that information together.
Once we have configured everything for our tracer provider, we need to register our instrumentations. registerInstrumentations is part of the OpenTelemetry instrumentation library; you can import it and start adding your instrumentations. Basically, registerInstrumentations tells the SDK what you want to focus on in this application's instrumentation. The first thing we need to provide is a tracer provider, which is the trace provider we just configured, and in the instrumentations list, which is an array, we provide everything we want to focus on.

The getNodeAutoInstrumentations library bundles tons of instrumentations that can automatically capture metrics and traces from your application. I do not want to capture any of the file system operations that Node.js or the Koa framework performs, but I do want anything happening with respect to the Koa framework itself to be captured. In the same getNodeAutoInstrumentations configuration you can add much more; for example, if you have MySQL or anything else, there are tons of prepackaged instrumentation libraries, and all you have to do is add them to this registerInstrumentations list and they will start capturing that information.
We'll come back to one part of this configuration later, when we get to logs: the part that decorates our log records. For now, this is all the configuration required to start capturing traces from your Node.js application.
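To tie this together, here's a minimal sketch of what such a wrapper file might look like, assuming an OTLP/HTTP trace exporter pointed at a local collector. The file name, the fallback service name, and the exact option shapes are assumptions (package layouts have shifted a bit across SDK versions), so treat this as a starting point rather than the exact file from the demo.

```javascript
// otel-wrapper.js -- hypothetical name; preloaded before the app via `node -r`
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} = require('@opentelemetry/core');

// Resource: identifies this service on every exported span.
const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]:
    process.env.OTEL_SERVICE_NAME || 'my-koa-service', // placeholder name
});

const provider = new NodeTracerProvider({ resource });

// Batch span processor: buffers spans and periodically exports them to the
// local collector's default OTLP/HTTP traces endpoint.
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' })
  )
);

// Register the provider globally, with W3C trace-context and baggage propagation.
provider.register({
  propagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),
});

// Enable automatic instrumentation; skip noisy fs spans, keep HTTP and Koa.
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});
```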
Let's quickly see in the console window what the traces look like once we start our Node application through this wrapper. In my terminal, before I start my application, I'm passing a few environment variables. These are two helper variables for starting this application: OTEL_SERVICE_NAME is what the wrapper will use as the name of this specific service, and OTEL_LOG_LEVEL helps us debug all the configuration we just did in the wrapper file. The command itself basically tells Node that before loading the main file, which is our index.js, it should load the OTel wrapper. This loads the wrapper as the primary module and then executes index.js, so we're able to capture telemetry from the very start of our application.
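As a rough illustration, the start command looks something like the following; the wrapper and entry file names are placeholders for whatever your project uses.

```sh
# Hypothetical file names; adjust to your project layout.
export OTEL_SERVICE_NAME="my-koa-service"
export OTEL_LOG_LEVEL="debug"
node -r ./otel-wrapper.js index.js
```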
Now my application is running successfully on port 3000. This is the typical behavior of any basic Node Express or Node Koa service. Let me send a request to this service. I hit the root endpoint of this API, and you'll see that it responded and got executed. I got a console log from my application showing the service name, the host name where it's running, the message, and the timestamp. This is typical output from console.log or whatever logging library you're using; in this case I'm using the Bunyan library for logging, and it adds certain attributes.

The rest of the output you're seeing is not from the API; it's actually OpenTelemetry's debug logs. This is a typical trace that gets exported, and these are the attributes that are automatically attached. You'll notice the trace ID and span ID that got attached to this specific resource. Let me call the API again. You'll see I got the response here, plus the debug output from the OTel SDK, which prints debug logs for every request that comes in. Be mindful that we did not modify anything in our actual source code, which is index.js. All we have done is add a wrapper around it, and all these attributes are being captured by the SDK using automatic instrumentation.
I'll start my application with the environment variables that the OTel SDK needs. One of them is OTEL_SERVICE_NAME, which sets what my service should be called when I run this and export to an observability backend. The other variable I have here is OTEL_LOG_LEVEL, which helps me debug any problems in the OTel configuration inside the wrapper file. The command here just tells Node to load the wrapper file before loading the main module, index.js. Let's execute this and see what happens.

You'll see a lot of debug statements get printed; that's because we set the log level to debug. It says it's trying to load the instrumentation for all these libraries, but it doesn't find many of them. The only libraries it finds are the Node.js HTTP module and Koa, and it applies patches for those. Basically, for the libraries that come prepackaged with the automatic instrumentation, it patches onto them, so any requests going out or any operations happening through those libraries are captured.
You'll also see that a couple of libraries, including Bunyan, are being patched. We'll get to how the logging works, but for now logging is disabled. Once we enable it, you'll see the middleware framework, which is our Koa, get patched, along with some more debug statements.

Let's quickly scroll down and make a request to our service. I'll make a simple call to the root API we configured; it's just going to return a simple hello-world response, and we'll see what trace gets generated by OpenTelemetry. So this is my log line from my Bunyan logger, and let's see what comes after it. All the output you're seeing now is the actual trace generated from that particular request. If I scroll to the top, this is my log line that was emitted as soon as I hit my API, and these are the spans that got created, which capture all the information about the execution of this particular API. You'll see it's capturing telemetry from the Koa library, it has certain attributes, which span processor was used, the body-parser middleware being used in the application, and the different span IDs. As mentioned earlier, multiple spans together build a complete trace, and there's a parent span ID: whatever the first request into the system was becomes the parent span, and that ID is attached throughout the rest of the lifecycle of that particular request.

There's a lot of output here, again mostly debug output, which we won't focus on. We'll see how this looks in our observability backend once it's exported. For now it's being exported to the collector and routed to a backend, and I'll show you directly what it looks like there.
I'll cover the collector and how to configure it in more depth later; we are exporting all our telemetry to New Relic using the OTLP protocol. Let me click on my services, and my service is already available here. You'll see some metrics are already coming in. Since I'm not capturing any metrics for my service yet, I'll switch to the spans captured from my service by the OTel SDK. Some of the requests and response times are already available here. I'll switch to distributed tracing to look at the REST APIs and their traces and analyze them. Let me click on the first request in this list. This gives me an overview of the request/response cycle: it took 3.5 milliseconds for this request to complete, and there were certain operations that happened.

If I click on this particular trace, you'll see the attributes that got attached, most of which we already saw in the debug window: the duration, the HTTP flavor, the host name, the target. We made a request to the root REST API, and that's what's captured, along with the ID of this particular trace and the type of request. You can also see how this was actually captured: the instrumentation-http library was patched, it captured this particular request, and its library version is recorded. There are other useful attributes too, and if you attach any custom attributes, they'll be listed in the same place. This is regardless of where you export your telemetry; it should give you an experience similar to this. New Relic helps us get to the point quickly, which is why we're able to visualize all the traces fairly quickly. Let's look at some other traces.
Let me call a couple of different APIs. I have an endpoint that returns weather information for a particular location, so I'll set the location to where I currently am, and the API responds very quickly. This endpoint makes a request to an external service that is also instrumented with OpenTelemetry, but that service is not on my localhost; it's actually deployed elsewhere on an AWS instance. Let's see how OpenTelemetry captures this request and stitches the information together to give us a complete picture. I'll make a few more requests so we have a sizeable amount of data to go through.

I'll also make a request that should fail, so we can see what errors look like once we've instrumented our application with OpenTelemetry. I'll pass a nonsense location, and we'll see this specific request return a 404. To show you quickly what happened, I'll check the terminal: how many requests came in, what the debug logs were, and whether there were any errors. You'll see this specific request failed with a 404; the request completed, my trace was generated, and this was the parameter I passed, which was my HTTP target. Keep in mind we have not modified the source code of our application; we've just added a wrapper around the main application. This is helpful if you want to get started quickly with OpenTelemetry without disturbing your existing application code while you experiment with the SDKs.
Let's look at this specific trace in our backend. Coming back to distributed tracing, I'll click on trace groups, and I see there are a few more traces. Let me click in and go further. There is one recent request with the longest duration of all; I'm assuming this is the weather request. When I click on it, you'll see a map of the journey of your API request and how many other services it hopped across: my original Node.js service, from which I made the request, made an external call to another service that is instrumented with OpenTelemetry, which in turn made another external request involving at least eight different calls. That last service shows as unknown because it is not instrumented.

Let's expand and see what happened underneath. You'll see the information for all the operations that occurred: there was a GET request that we made from our system, it went to the weather API route of the Node Express service, there were some middleware operations, and it also made a GET request as an external call. We can check which service that external request went to, so you can understand which external services are causing slowness in your application and improve that particular area. In this case the external service is the OpenWeatherMap service, from which this service is requesting all the weather information. Once we have all this information, it becomes much easier to triage and understand the behavior of our application, not just in happy scenarios but also in problematic ones.
Let me quickly go back and click on the errors. You'll see the request that failed with a 404 is now highlighted here. I want to quickly understand what failed, and New Relic provides a good map and overview of which services were impacted. In this map, both services are highlighted in red, which means both had some form of error. We've seen the individual operations captured in the spans, but if you want to focus on your errors, there's a convenient checkbox you can click, and you'll see there's a GET request that actually failed with an error. Since this is just automatic instrumentation and an external service, there isn't much detail. This is where manual instrumentation comes into the picture: once you've identified the areas you want to instrument, you can use manual instrumentation to customize your error messages or even add additional spans to support your debugging and analysis.

That's all for tracing. We set up tracing for our application using the automatic instrumentation and a tracer provider, and registered the instrumentations using the prepackaged libraries released as part of the automatic instrumentation. That's how simple it is to get started with automatic instrumentation for your Node.js applications.
Let's get back to our code and add metric instrumentation. OpenTelemetry provides packages to help us capture metrics from our applications. In this part, we'll focus on configuring metrics and extracting them from our application.

Similar to traces, there are a couple of packages to be aware of, one of which is @opentelemetry/sdk-metrics. This SDK provides the meter provider, the exporting readers, and a helper for debugging, the console metric exporter. Similar to setting up a tracer provider, we set up a periodic exporting metric reader, configuring our meter provider to send all the captured metrics to the console. But what we really want is to capture these metrics from our application and export them to a backend, so we also set up an OTLP metric exporter without any particular URL. One of the default behaviors of the exporter SDKs, for traces, metrics, and logs alike, is that they point to localhost on port 4318 or 4317, depending on the protocol, and try to export directly there. The collector supports both: 4317 receives gRPC, and 4318 receives HTTP.

Once we have our metric exporter, we set up our provider. Just as traces require a tracer provider, metrics require a meter provider. In the meter provider we again supply a resource carrying the service name, plus the readers we want to add. This can be an array or a single value; here I'm adding both a console metric reader and an OTLP metric reader, one exporting to the console and the other exporting to our observability backend.

Once we have our meter provider, we can register it in one of two ways. One is to use the OpenTelemetry API, opentelemetry.metrics, and set a global meter provider; you can use this when you don't have a tracer provider or the registerInstrumentations API available, and it lets you configure just the metrics provider. But since we are using the automatic instrumentation API with a tracer provider and registerInstrumentations, we'll enable it in the larger scope of our application as part of the instrumentation registration: registerInstrumentations also accepts a meterProvider, and here we pass the provider we just configured, the one containing both the console metric reader and the OTLP metric reader. That is all the configuration required to enable the meter provider.
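Here's a minimal sketch of that metrics setup, assuming the OTLP/HTTP metrics exporter and an SDK version that accepts a readers array on the MeterProvider constructor (older versions use addMetricReader instead); the interval, URL, and fallback service name are illustrative.

```javascript
// Metrics setup for the same wrapper file (names and values are illustrative).
const { Resource } = require('@opentelemetry/resources');
const {
  MeterProvider,
  PeriodicExportingMetricReader,
  ConsoleMetricExporter,
} = require('@opentelemetry/sdk-metrics');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');

const resource = new Resource({
  'service.name': process.env.OTEL_SERVICE_NAME || 'my-koa-service', // placeholder
});

const meterProvider = new MeterProvider({
  resource,
  readers: [
    // Console reader: handy while debugging what is actually being captured.
    new PeriodicExportingMetricReader({
      exporter: new ConsoleMetricExporter(),
      exportIntervalMillis: 10000, // default is 60000 ms; tune to your needs
    }),
    // OTLP reader: ships metrics to the local collector over HTTP.
    new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    }),
  ],
});

// Either set it globally through the API...
// const { metrics } = require('@opentelemetry/api');
// metrics.setGlobalMeterProvider(meterProvider);
// ...or, as in this walkthrough, pass it to registerInstrumentations:
// registerInstrumentations({ tracerProvider, meterProvider, instrumentations: [...] });
```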
Let's look at how the output changes for our OpenTelemetry metrics. I'll disable the trace debug information so we only see output from the meter provider. Let me switch back to the console, restart my application, and quickly clear the console. I'll keep the log level the same, since I've changed this directly in the code; everything else remains the same. We still see a bit of debug output because we've only just disabled it, but we'll start seeing output from the meter provider once we hit any of our endpoints. Let me quickly hit some of them and see what it looks like.

One of the settings we've configured is the export interval for the console reader. You can go as aggressive as one second, but the default is 60 seconds; since metric export can become too aggressive and eat up CPU cycles, it's recommended that you tune it to your requirements. Now, this is the output from our meter provider. It's capturing a histogram of the duration of our inbound HTTP requests, which in this case are the endpoints we're calling. There are data points and values attached; it's easier to visualize this than to read the raw data, but it helps you understand what kind of data is being captured. If I make a few more calls to different endpoints, I'll see similar output; there's not much difference except for the value and the start and end times of each request. Since we registered the meter provider as part of our automatic instrumentation, it captures the operations of the supported libraries across each of the endpoints being hit in our application.

Let's switch back to our observability backend, New Relic, and see how the metrics are reflected. Just as the traces were exported via the collector, the metrics are also being sent to New Relic via the collector. I'll switch to the metrics view, and you can see the metrics charts have started to populate. Capturing these metrics populates charts that give you insight into your response time, your throughput, and any errors. Additionally, if you want to dig further into which metrics have been captured, you can go to the Metrics Explorer and see for yourself.
I'll come back to my code. Now that we have captured traces and metrics, it's time to focus on one of the most important kinds of telemetry: logs. One of my previous mentors had a favorite saying: every engineer loves logs; if there are no logs, there's no life. That's particularly true for DevOps and SRE engineers. When services go down, they start digging through the logs to identify what has actually gone wrong before they can recover the services.

Let's focus on the logging side of instrumentation for our Node.js service. Metrics were fairly simple: all you need is the metrics SDK and an exporter. Logs are similar, but require a few more steps than setting up your meter provider. For logs we focus mainly on the logs API package (@opentelemetry/api-logs) and another package, @opentelemetry/sdk-logs. The logs SDK provides the logger provider, the log record processors, and the log exporters. These APIs help us set up our application so that logs are properly attached to all the related trace and metric information; we'll see how all of this ties together towards the end.
how all this ties up towards the end. The library that
I'm using as part of, as part of our application here
is Bunyan. Bunyan is simple logging library,
which is a very famous library for adding any any kind
of logger for simple service. There is
a library available already. If you're using Bernie.
There is a library called Instrumentation Bunyip which helps you capture logs
in open telemetry format. We've seen in the console that logger
log format of bunion is slightly different. We'll see how
that changes automatically. Without modifying any of our application
code. Using this package, we'll set up our logger
to start using and transforming our logs into standard
open dimension format. Now firstly, we require
a logger provider which again accepts a resource and
our resource is again the same global object where we are setting up the
service name. This is particularly important if you want
all these material traces, metrics, traces and logs attached to
the same service. If you do not provide the name, it's assumed as
unknown service. That's the default name that it accepts.
It's always a good practice to add your own service name and the
default value for exporter is this endpoint which
is localhost 4318 version one logs
each of the exporter endpoints and each of the exporter exporter
APIs for different SDKs have these dedicated endpoints configured.
I've included here for your reference. If even if you do not
add this particular endpoint is going to point it to the default connector
receiver to the default endpoint.
Once we have our exporter and provider, we can configure and attach the processors we need. Similar to span processing, logs also have different processors: a simple processor and a batch processor. I'm using a simple processor for the console exporter and a batch processor for exporting to our backend via the OTLP exporter. And similar to the meter provider, there are two ways to register it: if you only want to capture logs from your application, you can use the OpenTelemetry logs API (@opentelemetry/api-logs) and set a global logger provider. But since we are using automatic instrumentation, I'll register it as part of registerInstrumentations, and that is all that's required to successfully include logging with OpenTelemetry in your Node application.

Once we've enabled the logger provider, we don't have to modify anything else. Since I'm already using Bunyan, the instrumentation automatically patches the Bunyan instance with the OpenTelemetry logger. Once we enable the logger provider and register it with the list of instrumentations, we can pass additional options to the Bunyan instrumentation. For example, it provides a logHook option with which we can modify the log record and attach any attributes we want; here I'm attaching a resource attribute, the service name from the provider's resource. Any customization you want to apply to your log records with OpenTelemetry and Bunyan, you can do it here.
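Here's a minimal sketch of that logs setup, assuming the OTLP/HTTP log exporter and the Bunyan instrumentation's logHook option; the file layout, fallback service name, and the commented registration call are assumptions meant to mirror the walkthrough rather than reproduce the demo file.

```javascript
// otel-wrapper.js (continued) -- logs setup with Bunyan auto-instrumentation.
const { Resource } = require('@opentelemetry/resources');
const {
  LoggerProvider,
  SimpleLogRecordProcessor,
  BatchLogRecordProcessor,
  ConsoleLogRecordExporter,
} = require('@opentelemetry/sdk-logs');
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-http');
const { BunyanInstrumentation } = require('@opentelemetry/instrumentation-bunyan');

const resource = new Resource({
  'service.name': process.env.OTEL_SERVICE_NAME || 'my-koa-service', // placeholder
});

const loggerProvider = new LoggerProvider({ resource });

// Simple processor to the console for local debugging...
loggerProvider.addLogRecordProcessor(
  new SimpleLogRecordProcessor(new ConsoleLogRecordExporter())
);
// ...and a batch processor to the collector's default OTLP/HTTP logs endpoint.
loggerProvider.addLogRecordProcessor(
  new BatchLogRecordProcessor(
    new OTLPLogExporter({ url: 'http://localhost:4318/v1/logs' })
  )
);

// Either register it globally:
// const { logs } = require('@opentelemetry/api-logs');
// logs.setGlobalLoggerProvider(loggerProvider);
// ...or, as in this demo, pass it to registerInstrumentations together with
// the Bunyan instrumentation and its logHook:
const bunyanInstrumentation = new BunyanInstrumentation({
  logHook: (span, record) => {
    // Decorate every Bunyan log record with the service name from the resource.
    record['resource.service.name'] = resource.attributes['service.name'];
  },
});
// registerInstrumentations({ tracerProvider, meterProvider, loggerProvider,
//   instrumentations: [getNodeAutoInstrumentations(), bunyanInstrumentation] });
```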
Once I've enabled this provider and added my instrumentation, let me restart my application. I keep the command the same, and you'll see my application is now running. Let's hit the endpoint and see what the output looks like. Now, apart from the default application log line that was already coming through, which is the standard Bunyan log, there is another output coming from the logger provider, and it carries the trace ID, span ID, and severity. This is because, once we've included our log instrumentation, it attaches all the other information for that particular trace, plus any additional custom attributes we've included. In this case, the attribute we added in the logHook is the service name. This can be particularly helpful if your application runs on multiple hosts and you're streaming all the logs to a central location, because it helps identify which particular service is breaking and where it's located.

Now that we have logging set up, we can see it in the backend. Let me quickly generate some more load so we have logs for different requests; I'll just hit it a couple more times, one, two, three, and then switch to our backend, New Relic, and see what the logs look like. You'll see there's conveniently a logs option within the same screen that you can click on. Once I click on it, you can see all the logs have already started to flow in for the eight requests I've made. When I click on any of these requests, you'll see the log body, the service name, the span ID, and the trace ID have all been attached, even though none of this is part of Bunyan's standard logging output. Any logs generated by our application are now patched by the OpenTelemetry logger provider, which decorates our log messages with all this additional metadata.
The beauty of setting up logs, traces, and metrics together comes into the picture now. To get to the root cause of any problem, having the right context is very important. For example, let me make another request to my weather API, and fail it with a wrong parameter; once it has failed, I'll fail it a couple more times. Now I'll come back to my backend, New Relic, and I start seeing all of this stitched together, giving me the full context of the failed request. Information like metrics and spans is already available, but what I want to focus on now are the errors. We've seen errors in the context of traces and what they look like; what I'm particularly interested in now is seeing the logs relevant to a particular trace.

Let me switch back to distributed tracing and click on the errors. You can see the three requests I just made have failed, and there are three different errors. I click on one of these traces and I can see the errors for these particular services. This is something we've already seen as part of the trace exploration, but now I want to understand the logs related to this particular request without navigating away from this screen. The screen itself is particular to the New Relic platform, but this is also the beauty of OpenTelemetry: the trace ID and span ID attached to the log statements become really helpful here. You'll see a small logs tab at the top, and once I click on it, there are a couple of log lines from this particular request and that specific function.

In a scenario where you have tons of requests and a particular trace fails, you want to find the log for exactly that trace, and digging through tons of logs is tedious. Having the right context helps you get to the root cause really quickly; in this case I'm able to reach that particular log line without navigating away much. The other way to get to this stage is through the logs screen. Say I'm exploring logs from all these services, I see a couple of errors, and I want to understand which trace actually failed. The trace ID is already attached, the log message is available, and I can also jump to that specific request directly: once I click on it, it opens directly in that trace. This completes the cycle of combining traces, logs, and metrics. Having the metrics, the trace lifecycle, and the logs all correlated helps you avoid finger-pointing between dev and ops, and helps you reach the root cause of your application problems very quickly.
Now that we've seen how to get started with automatic instrumentation for Node.js and capture traces, logs, and metrics, let's look at another important piece, the OpenTelemetry Collector. The collector is a very important part of OpenTelemetry that helps you gather telemetry from your infrastructure as well as your microservices and other applications. The collector is built from three kinds of components: receivers, processors, and exporters. Receivers are where we define how we want to get data into the collector, which can be push or pull based; with the application auto-instrumentation we've covered, we are using the push-based mechanism, sending all the telemetry from the SDKs. Processors are where we define how we want to process the telemetry: we can modify it, attach custom attributes, or even drop attributes we don't want. Exporters again work on the same push or pull principle, letting us export the telemetry to one or many backends. The collector basically acts as a proxy that understands multiple telemetry formats and can run as an agent or as a gateway; if you want to scale it, you can set it up behind a load balancer and scale it as per your requirements.

Here's a simple example of the OpenTelemetry Collector where we use the configuration file to add a receiver for collecting host metrics. Once we add a simple hostmetrics block in the YAML file, we're able to capture information such as system.memory.utilization, file system information, networking, and paging information. There's a lot more you can capture; all you have to do is define it in the receivers block for hostmetrics.
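As a rough sketch, a hostmetrics receiver block along these lines enables the scrapers mentioned above; the interval and the exact set of scrapers are illustrative.

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:       # memory metrics such as system.memory.utilization
      filesystem:
      network:
      paging:
```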
One of the important concepts with the collector is how we sample our data. We've seen the different processors for spans and logs, the simple and batch processors; with the collector, the concept of sampling determines what we capture and how we capture it. There are two well-known strategies: head-based sampling and tail-based sampling. Head-based sampling is the default behavior, essentially an up-front statistical sampling of all the requests coming through, while tail-based sampling captures and surfaces the most actionable traces by looking at a portion of the trace data rather than the overall statistics. Tail-based sampling is recommended when you want only the right data instead of the tons of data flowing in from across your systems.

You can see how things change depending on the sampling strategy. On the left-hand side of the configuration, in the processors block, we define a policy to enable tail-based sampling, and you can see the throughput before and after. Before we applied tail-based sampling, the collector was consuming a lot of throughput and CPU cycles, because it tries to process and export all the information coming in. With tail-based sampling we can reduce that throughput and also reduce the number of spans we send, which makes it easier for an engineer to start debugging and look only at actionable samples.

There is another form of sampling, probabilistic sampling, which is entirely different from head-based and tail-based sampling. In probabilistic sampling, you set the sampling percentage, how many samples you want to capture from that particular system; it can be 15%, 60%, or even 100%. In my own opinion, this is a good starting point for any new project you're deploying: it helps you understand the behavior of your system, and once you understand what percentage of samples you actually need, you can switch to tail-based sampling and refine your policies to get the most actionable samples from that particular service or infrastructure.
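For reference, a tail-based sampling policy and a probabilistic sampler in the collector config might look roughly like this; the policy names, latency threshold, and percentage are purely illustrative.

```yaml
processors:
  # Tail-based sampling: decide after seeing the whole trace, keeping only
  # the actionable ones (errors and slow requests in this illustration).
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-requests
        type: latency
        latency: { threshold_ms: 500 }

  # Probabilistic sampling: keep a fixed percentage of traces, useful while
  # you are still learning how a new service behaves.
  probabilistic_sampler:
    sampling_percentage: 25
```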
Let me show you the configuration file of the collector I was running locally to export all the telemetry from our Node.js application, and also how you can get started with the collector by simply running it in a Docker container. You can run the OpenTelemetry Collector locally using its Docker image. There are various versions of the collector image available on Docker Hub; the one you should be using is opentelemetry-collector-contrib. The contrib distribution contains most of the processors, exporters, and receivers that are not available in the core distribution, opentelemetry-collector; it's where most of the community plugins live and are contributed to. One thing to be mindful of is that you need ports 4317 and 4318 open. You can get all this information directly from the opentelemetry-collector-contrib GitHub repo or its Docker Hub page. Once you have this container up and running, you can start using the collector to receive telemetry and export it.
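A typical way to run it locally looks something like this; the mounted config path is an assumption, so check the image documentation for the exact default location.

```sh
# Run the contrib image with the OTLP receiver ports exposed; the mounted
# config path is an assumption -- check the image docs for the exact location.
docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest
```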
Since we've already configured our application with instrumentation, let me show you the configuration file we use. The collector requires a config YAML to be present, which is what actually drives it. It contains the three main components we talked about: receivers, processors, and exporters. These are the three blocks defined in the collector config, and with them we configure how we want to receive the telemetry, what we want to do with it, whether we want to process it and attach any custom attributes, and where we want to export it. We can also see debug output from the collector, similar to the SDKs, by adding a debug exporter. Here I'm exporting to New Relic, so I've added the OTLP exporter endpoint, otlp.nr-data.net, with my license key. You can add multiple exporters to different observability backends, or if you just want to store the data in a time-series database, you can do that too.

The particularly important block in the configuration is the service pipelines. This is where you enable the receivers, processors, and exporters. In the pipelines we declare what to enable: for receivers, I'm enabling OTLP for traces, metrics, and logs; for processors, which processors to use for each telemetry signal; and for exporters, which ones to export to. You might not want to export everything while you're still debugging; you can remove an exporter from the pipeline and the collector will still process the data without exporting it. For example, in the processors for logs I'm attaching a custom attribute, the environment. You can choose to add multiple processors for a single signal or for all of them.
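Putting that together, a collector config along these lines matches what was described; the New Relic endpoint, header name, and environment-variable reference are assumptions based on the talk, so double-check them against your backend's OTLP documentation.

```yaml
receivers:
  otlp:
    protocols:
      grpc:          # listens on 4317
      http:          # listens on 4318

processors:
  batch:
  attributes/logs:
    actions:
      - key: environment
        value: demo          # illustrative custom attribute
        action: insert

exporters:
  debug:                     # print telemetry in the collector's own logs
  otlphttp/newrelic:         # assumed New Relic OTLP endpoint and license key
    endpoint: https://otlp.nr-data.net
    headers:
      api-key: ${env:NEW_RELIC_LICENSE_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlphttp/newrelic]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/newrelic]
    logs:
      receivers: [otlp]
      processors: [attributes/logs, batch]
      exporters: [otlphttp/newrelic]
```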
With that, I want to conclude my OpenTelemetry 101 session today. To recap and give you a few highlights of everything I've covered: first of all, it's an exciting time for open source observability. OpenTelemetry is growing and being adopted at a very rapid pace, not just in terms of contributions from the community, but also in terms of adoption; companies like GitHub, Microsoft, and New Relic are contributing heavily and including it in their own ecosystems. But you need to be mindful of your own maturity and plan ahead for your adoption of OpenTelemetry: start with automatic instrumentation and then advance towards manual instrumentation as a way to understand and mature what is important within your system. Just having some form of automatic instrumentation and collecting telemetry is not observability; your instrumentation should include proper contextual information for traces, logs, and metrics to improve observability. Remember the example we covered, where we were able to see logs, metrics, errors, and traces all in a single place: that is a complete, powerful observability setup where you can get to the root cause of your problems.

You can deploy the collector easily, and there are multiple options available: as a standalone agent, or as a gateway behind a load balancer, or, if you are using Kubernetes, in various modes such as a DaemonSet, a StatefulSet, or via the Kubernetes Operator. You can start collecting data from all your pipelines as well as multiple distributed systems, which can help you with your MTTI, MTTD, and MTTR. One final piece of advice I'd like to close with: there is a lot of active investment going on in OpenTelemetry, and it helps engineers work based on data, not opinion. Thank you, and I'll see you next time.