Transcript
Let me give you an overview of OpenTelemetry. Here's everything that I'll be walking you through today. I'll give you a brief background on OpenTelemetry, its core concepts, and the building blocks and architecture of the project. We'll then dive into the instrumentation part, where we'll look at code and start instrumenting traces, metrics, and logs for a simple Node.js application built with the Koa framework. It's a very basic application with which we'll try to cover all these concepts and extract telemetry that makes sense to us. Lastly, I'll also cover the OpenTelemetry Collector: how you can get started with it, how it's beneficial, and when you should be using it.

We've all heard it from developers across the world: "it works on my machine", "it's an Ops problem", while Ops complains that it's an app problem. These days I've even heard people say "my container is working fine, you're just not deploying it correctly." Let's see how OpenTelemetry and observability help resolve this conflict in today's world.
A quick background on OpenTelemetry. OpenTelemetry today is an incubating project in the CNCF landscape, and a proposal has already been made to move it to graduated status. It was originally formed in 2019 by the merger of two well-known projects, OpenTracing and OpenCensus. OpenTracing was developed at Uber to monitor their fleet of microservices, and OpenCensus was developed by Google for monitoring their microservices and collecting telemetry.

Some of the core goals of OpenTelemetry are to provide a set of APIs, libraries, and integrations to collect the telemetry from across your systems and services. It helps set the standard for collecting telemetry from all of your applications and infrastructure. One of the best parts of OpenTelemetry is that it lets you send all this collected telemetry to the observability backend of your choice, which means you're not locked into a single vendor or any specific tool. Regardless of how you instrument your applications, services, and infrastructure with OpenTelemetry, you are free to choose where you want to store your telemetry: in house, with a third party, or a combination of both.

You'll see in this chart how quickly OpenTelemetry has risen. Today, OpenTelemetry is the second fastest-growing project in the CNCF space, right behind Kubernetes in the number of contributions and adoption. This is because there is strong interest in modern observability. A Gartner report from 2022 noted that a lot of companies are looking to embrace open standards, which is exactly what OpenTelemetry, eBPF, and Grafana are working towards. If you want to read more about it, you can scan the QR code at the top right.

Let's look at some of the core concepts and building blocks of OpenTelemetry.
OpenTelemetry is essentially a specification. It's not one specific framework, language, or SDK. OpenTelemetry provides the specification with which each individual language and framework develops its own set of SDKs. These SDKs are built on top of the API specification provided by OpenTelemetry, and those APIs cover tracing, metrics, and logging. All of these APIs follow the same semantic conventions, so anything built on OpenTelemetry in any language or framework remains standard. Today you might instrument a Java application, and tomorrow you might have to instrument an application in another language; with the same specification and semantics, you don't have to revisit the documentation or reinvent the wheel each time.

Most SDKs today provide the option of automatic instrumentation. For example, Node.js offers automatic instrumentation for libraries like Express, Koa, MySQL, and other common frameworks used with Node.js. We'll see that shortly when we get into the hands-on part.

Lastly, one of the important protocols and parts of OpenTelemetry is the OpenTelemetry Protocol, OTLP. This protocol is used to send all the telemetry collected from your applications, infrastructure, or services to the backend of your choice. OTLP runs over two well-known transports, HTTP and gRPC. Depending on your system architecture or requirements, you can choose to use either, or you can use both.
Let's quickly get into the hands-on part and see how we can get started with instrumenting a simple Node.js service. The conventions remain the same across other languages; the APIs are similar, and the only things that change are the packages and some SDK details. Here's a very simple application built on Node.js using the Koa framework. Koa is a very simple, lightweight framework, similar to Express, that helps you write REST APIs quickly.

We'll go through each of the OpenTelemetry packages that we'll be using. We'll start with tracing and automatic instrumentation for Node.js, see how we export that telemetry to our collector, then move on to metrics and logs, and finally combine all three, sending everything to the collector and having the collector export it to New Relic, which is our observability backend. New Relic helps provide contextual information by stitching together all the telemetry exported from the collector.
This is a very simple application, as I mentioned. You'll see there's nothing much to it; it's a very basic application with a handful of endpoints. Here I have at least four API endpoints: a root path, a POST request, and a GET request that accepts certain parameters. Each of these requests will have its traces captured automatically by the OpenTelemetry SDKs.
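For orientation, here is a minimal sketch of what a Koa application along these lines might look like. The route paths, the service name, and the stubbed weather handler are hypothetical stand-ins for the demo app, not its actual source.

```javascript
// index.js -- a hypothetical Koa app approximating the demo service
const Koa = require('koa');
const Router = require('@koa/router');
const bunyan = require('bunyan');

const log = bunyan.createLogger({ name: 'my-koa-service' }); // placeholder name
const app = new Koa();
const router = new Router();

// Root path: returns a simple hello-world response.
router.get('/', (ctx) => {
  log.info('root endpoint hit');
  ctx.body = { message: 'hello world' };
});

// A POST endpoint.
router.post('/items', (ctx) => {
  ctx.body = { created: true };
});

// A GET endpoint that accepts a parameter, e.g. a location for a weather lookup.
router.get('/api/weather/:location', async (ctx) => {
  // In the demo this calls an external weather service; stubbed here.
  ctx.body = { location: ctx.params.location, forecast: 'sunny' };
});

app.use(router.routes()).use(router.allowedMethods());
app.listen(3000, () => log.info('listening on port 3000'));
```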
Now, the safest and easiest option for getting started with OpenTelemetry is automatic instrumentation. We will not modify anything in the source code of this Koa application. Instead, the recommendation from OpenTelemetry for Node.js is to create a separate wrapper file, which becomes the primary module used to start your Node application. We will start by setting up this file and adding all the packages. I'll walk you through the details of each package that we are using and the topic that we'll be focusing on.
First, we'll start with tracing. For that we'll focus on a couple of packages: the automatic instrumentations for Node (@opentelemetry/auto-instrumentations-node), @opentelemetry/sdk-trace-node, and @opentelemetry/sdk-trace-base. Pay attention to what we are importing from each of these packages. The automatic instrumentation package provides an API called getNodeAutoInstrumentations, which helps you capture telemetry automatically from Node.js and its underlying libraries. There are also some conventions for setting up your OpenTelemetry service correctly; for that we'll use helper packages like @opentelemetry/semantic-conventions and @opentelemetry/resources, which help us configure our application name and other attributes of our application properly.

Let me quickly scroll down to the part where we set up tracing for our application. We'll ignore everything else that's configured for now and focus on what's important for you to get started quickly.
First, we require a tracer provider. A tracer provider is the API that registers your application with the OpenTelemetry API. This is where we provide our resource, and the resource is where we set up our application name. This is the most basic configuration we're doing here: we just add our resource name, which will be our OpenTelemetry service name. Once we have added the name, we can configure how frequently we want to flush our traces. Flushing basically tells the SDK how often the telemetry should be sent out.

Once we have the provider configured, we have to add a span processor. Traces are built from multiple spans. Each span is an operation within your application, carrying information about its execution period, which function ran, and anything that happened in that specific operation, such as an error or exception. A span contains all that information, and stitching all these spans together is what's called a trace.

In this tracer provider we add a span processor. A span processor basically tells the SDK how each of these spans should be processed from this application. For this example, we'll use the batch span processor. It's also the recommended processor, because it avoids overly frequent exports and keeps the operational load of span processing on the SDK low. The batch span processor takes a few optional settings; all the values you see here are the defaults, and you can increase or reduce them as per your requirements. Basically, the batch span processor collects all your spans and processes them in a batch. It takes another parameter, an exporter, which says where all these processed spans should be exported to. For this demo I've configured it to point to a simple collector. The collector will be running in my local setup in a container; I'll talk about that container towards the very end. Once we have instrumented our application for traces, metrics, and logs, all of it will be exported to our collector, and the collector will export it to our observability backend, which will be New Relic. For the exporter there's only one piece of configuration required, because the collector is running locally: a simple localhost URL, which is the default URL for the OpenTelemetry Collector's traces endpoint.
Once we've added our batch span processor, we can optionally register certain propagators. In the same tracer provider, we'll register the W3C baggage propagator and the W3C trace context propagator. These propagators help us find the origin of a request and stitch together a request that hops through multiple services. I'll show an example of what this looks like once we've included these propagators. They basically give you the overview and complete picture of how many different services your request has hopped across, what the operation was, and what went wrong at a particular service, by stitching all of that information together.
Once we have configured everything for our tracer provider, we need to register our instrumentations. registerInstrumentations is part of the OpenTelemetry instrumentation library; you can import it and start adding your instrumentations. Basically, registerInstrumentations tells the SDK what you want to focus on in this application's instrumentation. The first thing we need to provide is a tracer provider, which is the trace provider we just configured, and in the instrumentations list, which is an array, we provide everything we want to focus on.

The getNodeAutoInstrumentations library bundles tons of instrumentations that can automatically capture metrics and traces from your application. I do not want to capture any of the file system operations that Node.js or the Koa framework performs, but I do want anything happening with respect to the Koa framework itself to be captured. In the same getNodeAutoInstrumentations configuration you can add much more; for example, if you have MySQL or anything else, there are tons of prepackaged instrumentation libraries, and all you have to do is add them to this registerInstrumentations list and they will start capturing that information.
We'll come back to one part of this configuration later, when we get to logs: the part that decorates our log records. For now, this is all the configuration required to start capturing traces from your Node.js application.
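To tie this together, here's a minimal sketch of what such a wrapper file might look like, assuming an OTLP/HTTP trace exporter pointed at a local collector. The file name, the fallback service name, and the exact option shapes are assumptions (package layouts have shifted a bit across SDK versions), so treat this as a starting point rather than the exact file from the demo.

```javascript
// otel-wrapper.js -- hypothetical name; preloaded before the app via `node -r`
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} = require('@opentelemetry/core');

// Resource: identifies this service on every exported span.
const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]:
    process.env.OTEL_SERVICE_NAME || 'my-koa-service', // placeholder name
});

const provider = new NodeTracerProvider({ resource });

// Batch span processor: buffers spans and periodically exports them to the
// local collector's default OTLP/HTTP traces endpoint.
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' })
  )
);

// Register the provider globally, with W3C trace-context and baggage propagation.
provider.register({
  propagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),
});

// Enable automatic instrumentation; skip noisy fs spans, keep HTTP and Koa.
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});
```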
Let's quickly see in the console window what the traces look like once we start our Node application through this wrapper. In my terminal, before I start my application, I'm passing a few environment variables. These are two helper variables for starting this application: OTEL_SERVICE_NAME is what the wrapper will use as the name of this specific service, and OTEL_LOG_LEVEL helps us debug all the configuration we just did in the wrapper file. The command itself basically tells Node that before loading the main file, which is our index.js, it should load the OTel wrapper. This loads the wrapper as the primary module and then executes index.js, so we're able to capture telemetry from the very start of our application.
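As a rough illustration, the start command looks something like the following; the wrapper and entry file names are placeholders for whatever your project uses.

```sh
# Hypothetical file names; adjust to your project layout.
export OTEL_SERVICE_NAME="my-koa-service"
export OTEL_LOG_LEVEL="debug"
node -r ./otel-wrapper.js index.js
```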
Now my application is running successfully on port 3000. This is the typical behavior of any basic Node Express or Node Koa service. Let me send a request to this service. I hit the root endpoint of this API, and you'll see that it responded and got executed. I got a console log from my application showing the service name, the host name where it's running, the message, and the timestamp. This is typical output from console.log or whatever logging library you're using; in this case I'm using the Bunyan library for logging, and it adds certain attributes.

The rest of the output you're seeing is not from the API; it's actually OpenTelemetry's debug logs. This is a typical trace that gets exported, and these are the attributes that are automatically attached. You'll notice the trace ID and span ID that got attached to this specific resource. Let me call the API again. You'll see I got the response here, plus the debug output from the OTel SDK, which prints debug logs for every request that comes in. Be mindful that we did not modify anything in our actual source code, which is index.js. All we have done is add a wrapper around it, and all these attributes are being captured by the SDK using automatic instrumentation.
I'll start my application with the environment variables that the OTel SDK needs. One of them is OTEL_SERVICE_NAME, which sets what my service should be called when I run this and export to an observability backend. The other variable I have here is OTEL_LOG_LEVEL, which helps me debug any problems in the OTel configuration inside the wrapper file. The command here just tells Node to load the wrapper file before loading the main module, index.js. Let's execute this and see what happens.

You'll see a lot of debug statements get printed; that's because we set the log level to debug. It says it's trying to load the instrumentation for all these libraries, but it doesn't find many of them. The only libraries it finds are the Node.js HTTP module and Koa, and it applies patches for those. Basically, for the libraries that come prepackaged with the automatic instrumentation, it patches onto them, so any requests going out or any operations happening through those libraries are captured.
You'll also see that a couple of libraries, including Bunyan, are being patched. We'll get to how the logging works, but for now logging is disabled. Once we enable it, you'll see the middleware framework, which is our Koa, get patched, along with some more debug statements.

Let's quickly scroll down and make a request to our service. I'll make a simple call to the root API we configured; it's just going to return a simple hello-world response, and we'll see what trace gets generated by OpenTelemetry. So this is my log line from my Bunyan logger, and let's see what comes after it. All the output you're seeing now is the actual trace generated from that particular request. If I scroll to the top, this is my log line that was emitted as soon as I hit my API, and these are the spans that got created, which capture all the information about the execution of this particular API. You'll see it's capturing telemetry from the Koa library, it has certain attributes, which span processor was used, the body-parser middleware being used in the application, and the different span IDs. As mentioned earlier, multiple spans together build a complete trace, and there's a parent span ID: whatever the first request into the system was becomes the parent span, and that ID is attached throughout the rest of the lifecycle of that particular request.

There's a lot of output here, again mostly debug output, which we won't focus on. We'll see how this looks in our observability backend once it's exported. For now it's being exported to the collector and routed to a backend, and I'll show you directly what it looks like there.
I'll cover the collector and how to configure it in more depth later; we are exporting all our telemetry to New Relic using the OTLP protocol. Let me click on my services, and my service is already available here. You'll see some metrics are already coming in. Since I'm not capturing any metrics for my service yet, I'll switch to the spans captured from my service by the OTel SDK. Some of the requests and response times are already available here. I'll switch to distributed tracing to look at the REST APIs and their traces and analyze them. Let me click on the first request in this list. This gives me an overview of the request/response cycle: it took 3.5 milliseconds for this request to complete, and there were certain operations that happened.

If I click on this particular trace, you'll see the attributes that got attached, most of which we already saw in the debug window: the duration, the HTTP flavor, the host name, the target. We made a request to the root REST API, and that's what's captured, along with the ID of this particular trace and the type of request. You can also see how this was actually captured: the instrumentation-http library was patched, it captured this particular request, and its library version is recorded. There are other useful attributes too, and if you attach any custom attributes, they'll be listed in the same place. This is regardless of where you export your telemetry; it should give you an experience similar to this. New Relic helps us get to the point quickly, which is why we're able to visualize all the traces fairly quickly. Let's look at some other traces.
Let me call a couple of different APIs. I have an endpoint that returns weather information for a particular location, so I'll set the location to where I currently am, and the API responds very quickly. This endpoint makes a request to an external service that is also instrumented with OpenTelemetry, but that service is not on my localhost; it's actually deployed elsewhere on an AWS instance. Let's see how OpenTelemetry captures this request and stitches the information together to give us a complete picture. I'll make a few more requests so we have a sizeable amount of data to go through.

I'll also make a request that should fail, so we can see what errors look like once we've instrumented our application with OpenTelemetry. I'll pass a nonsense location, and we'll see this specific request return a 404. To show you quickly what happened, I'll check the terminal: how many requests came in, what the debug logs were, and whether there were any errors. You'll see this specific request failed with a 404; the request completed, my trace was generated, and this was the parameter I passed, which was my HTTP target. Keep in mind we have not modified the source code of our application; we've just added a wrapper around the main application. This is helpful if you want to get started quickly with OpenTelemetry without disturbing your existing application code while you experiment with the SDKs.
Let's look at this specific trace in our backend. Coming back to distributed tracing, I'll click on trace groups, and I see there are a few more traces. Let me click in and go further. There is one recent request with the longest duration of all; I'm assuming this is the weather request. When I click on it, you'll see a map of the journey of your API request and how many other services it hopped across: my original Node.js service, from which I made the request, made an external call to another service that is instrumented with OpenTelemetry, which in turn made another external request involving at least eight different calls. That last service shows as unknown because it is not instrumented.

Let's expand and see what happened underneath. You'll see the information for all the operations that occurred: there was a GET request that we made from our system, it went to the weather API route of the Node Express service, there were some middleware operations, and it also made a GET request as an external call. We can check which service that external request went to, so you can understand which external services are causing slowness in your application and improve that particular area. In this case the external service is the OpenWeatherMap service, from which this service is requesting all the weather information. Once we have all this information, it becomes much easier to triage and understand the behavior of our application, not just in happy scenarios but also in problematic ones.
Let me quickly go back and click on the errors. You'll see the request that failed with a 404 is now highlighted here. I want to quickly understand what failed, and New Relic provides a good map and overview of which services were impacted. In this map, both services are highlighted in red, which means both had some form of error. We've seen the individual operations captured in the spans, but if you want to focus on your errors, there's a convenient checkbox you can click, and you'll see there's a GET request that actually failed with an error. Since this is just automatic instrumentation and an external service, there isn't much detail. This is where manual instrumentation comes into the picture: once you've identified the areas you want to instrument, you can use manual instrumentation to customize your error messages or even add additional spans to support your debugging and analysis.

That's all for tracing. We set up tracing for our application using the automatic instrumentation and a tracer provider, and registered the instrumentations using the prepackaged libraries released as part of the automatic instrumentation. That's how simple it is to get started with automatic instrumentation for your Node.js applications.
Let's get back to our code and add metric instrumentation. OpenTelemetry provides packages to help us capture metrics from our applications. In this part, we'll focus on configuring metrics and extracting them from our application.

Similar to traces, there are a couple of packages to be aware of, one of which is @opentelemetry/sdk-metrics. This SDK provides the meter provider, the exporting readers, and a helper for debugging, the console metric exporter. Similar to setting up a tracer provider, we set up a periodic exporting metric reader, configuring our meter provider to send all the captured metrics to the console. But what we really want is to capture these metrics from our application and export them to a backend, so we also set up an OTLP metric exporter without any particular URL. One of the default behaviors of the exporter SDKs, for traces, metrics, and logs alike, is that they point to localhost on port 4318 or 4317, depending on the protocol, and try to export directly there. The collector supports both: 4317 receives gRPC, and 4318 receives HTTP.

Once we have our metric exporter, we set up our provider. Just as traces require a tracer provider, metrics require a meter provider. In the meter provider we again supply a resource carrying the service name, plus the readers we want to add. This can be an array or a single value; here I'm adding both a console metric reader and an OTLP metric reader, one exporting to the console and the other exporting to our observability backend.

Once we have our meter provider, we can register it in one of two ways. One is to use the OpenTelemetry API, opentelemetry.metrics, and set a global meter provider; you can use this when you don't have a tracer provider or the registerInstrumentations API available, and it lets you configure just the metrics provider. But since we are using the automatic instrumentation API with a tracer provider and registerInstrumentations, we'll enable it in the larger scope of our application as part of the instrumentation registration: registerInstrumentations also accepts a meterProvider, and here we pass the provider we just configured, the one containing both the console metric reader and the OTLP metric reader. That is all the configuration required to enable the meter provider.
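Here's a minimal sketch of that metrics setup, assuming the OTLP/HTTP metrics exporter and an SDK version that accepts a readers array on the MeterProvider constructor (older versions use addMetricReader instead); the interval, URL, and fallback service name are illustrative.

```javascript
// Metrics setup for the same wrapper file (names and values are illustrative).
const { Resource } = require('@opentelemetry/resources');
const {
  MeterProvider,
  PeriodicExportingMetricReader,
  ConsoleMetricExporter,
} = require('@opentelemetry/sdk-metrics');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');

const resource = new Resource({
  'service.name': process.env.OTEL_SERVICE_NAME || 'my-koa-service', // placeholder
});

const meterProvider = new MeterProvider({
  resource,
  readers: [
    // Console reader: handy while debugging what is actually being captured.
    new PeriodicExportingMetricReader({
      exporter: new ConsoleMetricExporter(),
      exportIntervalMillis: 10000, // default is 60000 ms; tune to your needs
    }),
    // OTLP reader: ships metrics to the local collector over HTTP.
    new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    }),
  ],
});

// Either set it globally through the API...
// const { metrics } = require('@opentelemetry/api');
// metrics.setGlobalMeterProvider(meterProvider);
// ...or, as in this walkthrough, pass it to registerInstrumentations:
// registerInstrumentations({ tracerProvider, meterProvider, instrumentations: [...] });
```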
Let's look at how the output changes for our OpenTelemetry metrics. I'll disable the trace debug information so we only see output from the meter provider. Let me switch back to the console, restart my application, and quickly clear the console. I'll keep the log level the same, since I've changed this directly in the code; everything else remains the same. We still see a bit of debug output because we've only just disabled it, but we'll start seeing output from the meter provider once we hit any of our endpoints. Let me quickly hit some of them and see what it looks like.

One of the settings we've configured is the export interval for the console reader. You can go as aggressive as one second, but the default is 60 seconds; since metric export can become too aggressive and eat up CPU cycles, it's recommended that you tune it to your requirements. Now, this is the output from our meter provider. It's capturing a histogram of the duration of our inbound HTTP requests, which in this case are the endpoints we're calling. There are data points and values attached; it's easier to visualize this than to read the raw data, but it helps you understand what kind of data is being captured. If I make a few more calls to different endpoints, I'll see similar output; there's not much difference except for the value and the start and end times of each request. Since we registered the meter provider as part of our automatic instrumentation, it captures the operations of the supported libraries across each of the endpoints being hit in our application.

Let's switch back to our observability backend, New Relic, and see how the metrics are reflected. Just as the traces were exported via the collector, the metrics are also being sent to New Relic via the collector. I'll switch to the metrics view, and you can see the metrics charts have started to populate. Capturing these metrics populates charts that give you insight into your response time, your throughput, and any errors. Additionally, if you want to dig further into which metrics have been captured, you can go to the Metrics Explorer and see for yourself.
I'll come back to my code. Now that we have captured traces and metrics, it's time to focus on one of the most important kinds of telemetry: logs. One of my previous mentors had a favorite saying: every engineer loves logs; if there are no logs, there's no life. That's particularly true for DevOps and SRE engineers. When services go down, they start digging through the logs to identify what has actually gone wrong before they can recover the services.

Let's focus on the logging side of instrumentation for our Node.js service. Metrics were fairly simple: all you need is the metrics SDK and an exporter. Logs are similar, but require a few more steps than setting up your meter provider. For logs we focus mainly on the logs API package (@opentelemetry/api-logs) and another package, @opentelemetry/sdk-logs. The logs SDK provides the logger provider, the log record processors, and the log exporters. These APIs help us set up our application so that logs are properly attached to all the related trace and metric information; we'll see how all of this ties together towards the end.
how all this ties up towards the end. The library that
I'm using as part of, as part of our application here
is Bunyan. Bunyan is simple logging library,
which is a very famous library for adding any any kind
of logger for simple service. There is
a library available already. If you're using Bernie.
There is a library called Instrumentation Bunyip which helps you capture logs
in open telemetry format. We've seen in the console that logger
log format of bunion is slightly different. We'll see how
that changes automatically. Without modifying any of our application
code. Using this package, we'll set up our logger
to start using and transforming our logs into standard
open dimension format. Now firstly, we require
a logger provider which again accepts a resource and
our resource is again the same global object where we are setting up the
service name. This is particularly important if you want
all these material traces, metrics, traces and logs attached to
the same service. If you do not provide the name, it's assumed as
unknown service. That's the default name that it accepts.
It's always a good practice to add your own service name and the
default value for exporter is this endpoint which
is localhost 4318 version one logs
each of the exporter endpoints and each of the exporter exporter
APIs for different SDKs have these dedicated endpoints configured.
I've included here for your reference. If even if you do not
add this particular endpoint is going to point it to the default connector
receiver to the default endpoint.
Once we have our exporter and provider, we can configure and attach the processors we need. Similar to span processing, logs also have different processors: a simple processor and a batch processor. I'm using a simple processor for the console exporter and a batch processor for exporting to our backend via the OTLP exporter. And similar to the meter provider, there are two ways to register it: if you only want to capture logs from your application, you can use the OpenTelemetry logs API (@opentelemetry/api-logs) and set a global logger provider. But since we are using automatic instrumentation, I'll register it as part of registerInstrumentations, and that is all that's required to successfully include logging with OpenTelemetry in your Node application.

Once we've enabled the logger provider, we don't have to modify anything else. Since I'm already using Bunyan, the instrumentation automatically patches the Bunyan instance with the OpenTelemetry logger. Once we enable the logger provider and register it with the list of instrumentations, we can pass additional options to the Bunyan instrumentation. For example, it provides a logHook option with which we can modify the log record and attach any attributes we want; here I'm attaching a resource attribute, the service name from the provider's resource. Any customization you want to apply to your log records with OpenTelemetry and Bunyan, you can do it here.
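Here's a minimal sketch of that logs setup, assuming the OTLP/HTTP log exporter and the Bunyan instrumentation's logHook option; the file layout, fallback service name, and the commented registration call are assumptions meant to mirror the walkthrough rather than reproduce the demo file.

```javascript
// otel-wrapper.js (continued) -- logs setup with Bunyan auto-instrumentation.
const { Resource } = require('@opentelemetry/resources');
const {
  LoggerProvider,
  SimpleLogRecordProcessor,
  BatchLogRecordProcessor,
  ConsoleLogRecordExporter,
} = require('@opentelemetry/sdk-logs');
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-http');
const { BunyanInstrumentation } = require('@opentelemetry/instrumentation-bunyan');

const resource = new Resource({
  'service.name': process.env.OTEL_SERVICE_NAME || 'my-koa-service', // placeholder
});

const loggerProvider = new LoggerProvider({ resource });

// Simple processor to the console for local debugging...
loggerProvider.addLogRecordProcessor(
  new SimpleLogRecordProcessor(new ConsoleLogRecordExporter())
);
// ...and a batch processor to the collector's default OTLP/HTTP logs endpoint.
loggerProvider.addLogRecordProcessor(
  new BatchLogRecordProcessor(
    new OTLPLogExporter({ url: 'http://localhost:4318/v1/logs' })
  )
);

// Either register it globally:
// const { logs } = require('@opentelemetry/api-logs');
// logs.setGlobalLoggerProvider(loggerProvider);
// ...or, as in this demo, pass it to registerInstrumentations together with
// the Bunyan instrumentation and its logHook:
const bunyanInstrumentation = new BunyanInstrumentation({
  logHook: (span, record) => {
    // Decorate every Bunyan log record with the service name from the resource.
    record['resource.service.name'] = resource.attributes['service.name'];
  },
});
// registerInstrumentations({ tracerProvider, meterProvider, loggerProvider,
//   instrumentations: [getNodeAutoInstrumentations(), bunyanInstrumentation] });
```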
Once I've enabled this provider and added my instrumentation, let me restart my application. I keep the command the same, and you'll see my application is now running. Let's hit the endpoint and see what the output looks like. Now, apart from the default application log line that was already coming through, which is the standard Bunyan log, there is another output coming from the logger provider, and it carries the trace ID, span ID, and severity. This is because, once we've included our log instrumentation, it attaches all the other information for that particular trace, plus any additional custom attributes we've included. In this case, the attribute we added in the logHook is the service name. This can be particularly helpful if your application runs on multiple hosts and you're streaming all the logs to a central location, because it helps identify which particular service is breaking and where it's located.

Now that we have logging set up, we can see it in the backend. Let me quickly generate some more load so we have logs for different requests; I'll just hit it a couple more times, one, two, three, and then switch to our backend, New Relic, and see what the logs look like. You'll see there's conveniently a logs option within the same screen that you can click on. Once I click on it, you can see all the logs have already started to flow in for the eight requests I've made. When I click on any of these requests, you'll see the log body, the service name, the span ID, and the trace ID have all been attached, even though none of this is part of Bunyan's standard logging output. Any logs generated by our application are now patched by the OpenTelemetry logger provider, which decorates our log messages with all this additional metadata.
The beauty of setting up logs, traces, and metrics together comes into the picture now. To get to the root cause of any problem, having the right context is very important. For example, let me make another request to my weather API, and fail it with a wrong parameter; once it has failed, I'll fail it a couple more times. Now I'll come back to my backend, New Relic, and I start seeing all of this stitched together, giving me the full context of the failed request. Information like metrics and spans is already available, but what I want to focus on now are the errors. We've seen errors in the context of traces and what they look like; what I'm particularly interested in now is seeing the logs relevant to a particular trace.

Let me switch back to distributed tracing and click on the errors. You can see the three requests I just made have failed, and there are three different errors. I click on one of these traces and I can see the errors for these particular services. This is something we've already seen as part of the trace exploration, but now I want to understand the logs related to this particular request without navigating away from this screen. The screen itself is particular to the New Relic platform, but this is also the beauty of OpenTelemetry: the trace ID and span ID attached to the log statements become really helpful here. You'll see a small logs tab at the top, and once I click on it, there are a couple of log lines from this particular request and that specific function.

In a scenario where you have tons of requests and a particular trace fails, you want to find the log for exactly that trace, and digging through tons of logs is tedious. Having the right context helps you get to the root cause really quickly; in this case I'm able to reach that particular log line without navigating away much. The other way to get to this stage is through the logs screen. Say I'm exploring logs from all these services, I see a couple of errors, and I want to understand which trace actually failed. The trace ID is already attached, the log message is available, and I can also jump to that specific request directly: once I click on it, it opens directly in that trace. This completes the cycle of combining traces, logs, and metrics. Having the metrics, the trace lifecycle, and the logs all correlated helps you avoid finger-pointing between dev and ops, and helps you reach the root cause of your application problems very quickly.
Now that we've seen how to get started with automatic instrumentation for Node.js and capture traces, logs, and metrics, let's look at another important piece, the OpenTelemetry Collector. The collector is a very important part of OpenTelemetry that helps you gather telemetry from your infrastructure as well as your microservices and other applications. The collector is built from three kinds of components: receivers, processors, and exporters. Receivers are where we define how we want to get data into the collector, which can be push or pull based; with the application auto-instrumentation we've covered, we are using the push-based mechanism, sending all the telemetry from the SDKs. Processors are where we define how we want to process the telemetry: we can modify it, attach custom attributes, or even drop attributes we don't want. Exporters again work on the same push or pull principle, letting us export the telemetry to one or many backends. The collector basically acts as a proxy that understands multiple telemetry formats and can run as an agent or as a gateway; if you want to scale it, you can set it up behind a load balancer and scale it as per your requirements.

Here's a simple example of the OpenTelemetry Collector where we use the configuration file to add a receiver for collecting host metrics. Once we add a simple hostmetrics block in the YAML file, we're able to capture information such as system.memory.utilization, file system information, networking, and paging information. There's a lot more you can capture; all you have to do is define it in the receivers block for hostmetrics.
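As a rough sketch, a hostmetrics receiver block along these lines enables the scrapers mentioned above; the interval and the exact set of scrapers are illustrative.

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:       # memory metrics such as system.memory.utilization
      filesystem:
      network:
      paging:
```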
One of the important concepts with the collector is how we sample our data. We've seen the different processors for spans and logs, the simple and batch processors; with the collector, the concept of sampling determines what we capture and how we capture it. There are two well-known strategies: head-based sampling and tail-based sampling. Head-based sampling is the default behavior, essentially an up-front statistical sampling of all the requests coming through, while tail-based sampling captures and surfaces the most actionable traces by looking at a portion of the trace data rather than the overall statistics. Tail-based sampling is recommended when you want only the right data instead of the tons of data flowing in from across your systems.

You can see how things change depending on the sampling strategy. On the left-hand side of the configuration, in the processors block, we define a policy to enable tail-based sampling, and you can see the throughput before and after. Before we applied tail-based sampling, the collector was consuming a lot of throughput and CPU cycles, because it tries to process and export all the information coming in. With tail-based sampling we can reduce that throughput and also reduce the number of spans we send, which makes it easier for an engineer to start debugging and look only at actionable samples.

There is another form of sampling, probabilistic sampling, which is entirely different from head-based and tail-based sampling. In probabilistic sampling, you set the sampling percentage, how many samples you want to capture from that particular system; it can be 15%, 60%, or even 100%. In my own opinion, this is a good starting point for any new project you're deploying: it helps you understand the behavior of your system, and once you understand what percentage of samples you actually need, you can switch to tail-based sampling and refine your policies to get the most actionable samples from that particular service or infrastructure.
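For reference, a tail-based sampling policy and a probabilistic sampler in the collector config might look roughly like this; the policy names, latency threshold, and percentage are purely illustrative.

```yaml
processors:
  # Tail-based sampling: decide after seeing the whole trace, keeping only
  # the actionable ones (errors and slow requests in this illustration).
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-requests
        type: latency
        latency: { threshold_ms: 500 }

  # Probabilistic sampling: keep a fixed percentage of traces, useful while
  # you are still learning how a new service behaves.
  probabilistic_sampler:
    sampling_percentage: 25
```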
Let me show you the configuration file of the collector I was running locally to export all the telemetry from our Node.js application, and also how you can get started with the collector by simply running it in a Docker container. You can run the OpenTelemetry Collector locally using its Docker image. There are various versions of the collector image available on Docker Hub; the one you should be using is opentelemetry-collector-contrib. The contrib distribution contains most of the processors, exporters, and receivers that are not available in the core distribution, opentelemetry-collector; it's where most of the community plugins live and are contributed to. One thing to be mindful of is that you need ports 4317 and 4318 open. You can get all this information directly from the opentelemetry-collector-contrib GitHub repo or its Docker Hub page. Once you have this container up and running, you can start using the collector to receive telemetry and export it.
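A typical way to run it locally looks something like this; the mounted config path is an assumption, so check the image documentation for the exact default location.

```sh
# Run the contrib image with the OTLP receiver ports exposed; the mounted
# config path is an assumption -- check the image docs for the exact location.
docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest
```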
Since we've already configured our application with instrumentation, let me show you the configuration file we use. The collector requires a config YAML to be present, which is what actually drives it. It contains the three main components we talked about: receivers, processors, and exporters. These are the three blocks defined in the collector config, and with them we configure how we want to receive the telemetry, what we want to do with it, whether we want to process it and attach any custom attributes, and where we want to export it. We can also see debug output from the collector, similar to the SDKs, by adding a debug exporter. Here I'm exporting to New Relic, so I've added the OTLP exporter endpoint, otlp.nr-data.net, with my license key. You can add multiple exporters to different observability backends, or if you just want to store the data in a time-series database, you can do that too.

The particularly important block in the configuration is the service pipelines. This is where you enable the receivers, processors, and exporters. In the pipelines we declare what to enable: for receivers, I'm enabling OTLP for traces, metrics, and logs; for processors, which processors to use for each telemetry signal; and for exporters, which ones to export to. You might not want to export everything while you're still debugging; you can remove an exporter from the pipeline and the collector will still process the data without exporting it. For example, in the processors for logs I'm attaching a custom attribute, the environment. You can choose to add multiple processors for a single signal or for all of them.
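Putting that together, a collector config along these lines matches what was described; the New Relic endpoint, header name, and environment-variable reference are assumptions based on the talk, so double-check them against your backend's OTLP documentation.

```yaml
receivers:
  otlp:
    protocols:
      grpc:          # listens on 4317
      http:          # listens on 4318

processors:
  batch:
  attributes/logs:
    actions:
      - key: environment
        value: demo          # illustrative custom attribute
        action: insert

exporters:
  debug:                     # print telemetry in the collector's own logs
  otlphttp/newrelic:         # assumed New Relic OTLP endpoint and license key
    endpoint: https://otlp.nr-data.net
    headers:
      api-key: ${env:NEW_RELIC_LICENSE_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlphttp/newrelic]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/newrelic]
    logs:
      receivers: [otlp]
      processors: [attributes/logs, batch]
      exporters: [otlphttp/newrelic]
```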
With that, I want to conclude my OpenTelemetry 101 session today. To recap and give you a few highlights of everything I've covered: first of all, it's an exciting time for open source observability. OpenTelemetry is growing and being adopted at a very rapid pace, not just in terms of contributions from the community, but also in terms of adoption; companies like GitHub, Microsoft, and New Relic are contributing heavily and including it in their own ecosystems. But you need to be mindful of your own maturity and plan ahead for your adoption of OpenTelemetry: start with automatic instrumentation and then advance towards manual instrumentation as a way to understand and mature what is important within your system. Just having some form of automatic instrumentation and collecting telemetry is not observability; your instrumentation should include proper contextual information for traces, logs, and metrics to improve observability. Remember the example we covered, where we were able to see logs, metrics, errors, and traces all in a single place: that is a complete, powerful observability setup where you can get to the root cause of your problems.

You can deploy the collector easily, and there are multiple options available: as a standalone agent, or as a gateway behind a load balancer, or, if you are using Kubernetes, in various modes such as a DaemonSet, a StatefulSet, or via the Kubernetes Operator. You can start collecting data from all your pipelines as well as multiple distributed systems, which can help you with your MTTI, MTTD, and MTTR. One final piece of advice I'd like to close with: there is a lot of active investment going on in OpenTelemetry, and it helps engineers work based on data, not opinion. Thank you, and I'll see you next time.