Conf42 DevOps 2023 - Online

Practical introduction to OpenTelemetry tracing

Abstract

OpenTelemetry is a standard for tracing across multiple components. Let's see how to set it up. Tracking a request's flow across different components is essential in distributed systems; with the rise of microservices, its importance has become critical. Some proprietary tools for tracing already exist: Jaeger and Zipkin naturally come to mind. Observability is built on three pillars: logging, metrics, and tracing. OpenTelemetry is a joint effort to bring an open standard to them, and Jaeger and Zipkin have joined the effort, so they are now OpenTelemetry-compatible. In this talk, I'll describe the above in more detail and showcase a (simple) use case to demo how you could benefit from OpenTelemetry in your distributed architecture.

Summary

  • Nicolas Frankel gives an introduction to OpenTelemetry tracing. There are three main pillars of observability: the first is metrics, the second is logging, and the third is tracing.
  • OpenTelemetry allows you to trace a business transaction across many different components, not only web ones. It's a merge of the OpenTracing and OpenCensus projects. It has become a CNCF project, which is good, meaning that it has support.
  • The use case is simple for a real application, but a bit more involved for a demo. It uses the Apache APISIX gateway, which forwards requests to the main application. Afterwards, you can add more precise, more fine-grained spans through manual instrumentation.
  • The Kotlin Spring Boot part: everything is on GitHub. I've created my application with the Spring Boot starter at start.spring.io. I assemble everything through Docker Compose, and at the end I create a product with details.
  • Let's start this architecture. The idea is to check the traces, so I'm now on the Jaeger web UI. In case you misconfigure something, you can spot it through the traces. It gives you insight into your architecture as well.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Thanks for being here for this talk, an introduction to OpenTelemetry tracing. I'm Nicolas Frankel. I've been a developer, an architect, or whatever you want to call it, for a long time, more than 20 years now, and for a couple of years I've also been a developer advocate. Given my history, I can tell you about what happened a long time ago, perhaps even before some of you knew about it.

In the good old days, or not so good, we had monitoring. Monitoring meant that you had a bunch of people sitting in front of a huge screen full of widgets, with dashboards, with graphs, with whatever. Sometimes they even had an additional screen in front of them, and they were actually monitoring, like visually monitoring, the situation. Sometimes you even had alerting, or alerting came afterwards. Basically, monitoring meant you had low-level information about your information system, and you had to have people who could realize that this slight anomaly here plus that slight anomaly there meant that something bad was actually happening. It worked, more or less, but then systems became more distributed. You can do this kind of thing with a huge monolith, but when you have components starting to interact across the network, on different machines, on different nodes, then getting this kind of insight into the system through experience alone becomes very, very hard. So monitoring gave way to something called observability. You can think of observability as the next level: it provides insight not into one single component, but into a distributed system. There are three main pillars of observability: the first is metrics, the second is logging, and the third is tracing. Though I will focus this talk on tracing, I still need to introduce the other two, because they make a lot of sense and you probably need to be aware of them anyway.

As I mentioned, the first step toward monitoring was to have metrics. That was easy, because nearly all operating systems give you some way to read data about how they operate. You can get the CPU usage, the memory, the swap, the number of threads, whatever. This is very easy to collect. But as I mentioned, if your system becomes more complex and involves more machines, that is no longer enough. Nowadays we still rely on those metrics, but we also want higher-level metrics on top. They can still be technical metrics, such as the number of requests per second or the HTTP status codes, or they can be business metrics, such as the number of orders, this kind of stuff; a small sketch of recording such a metric follows this paragraph.
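As a concrete illustration of the business-metrics idea, here is a minimal Kotlin sketch using Micrometer, a common JVM metrics facade. This is purely illustrative: the metric name is invented, and the talk's demo does not actually use this code.

    import io.micrometer.core.instrument.Metrics

    // Counter registered against Micrometer's global registry; a Spring Boot
    // application would normally inject a MeterRegistry instead.
    private val ordersCreated = Metrics.counter("orders.created")

    fun placeOrder() {
        // ...business logic...
        ordersCreated.increment() // one more order, exported as a business metric
    }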
Now comes logging. Logging is another level of complexity. With logging, there are different questions you need to answer. The first question is what to log. For example, in the past I used a Java agent to log whenever I entered a method and whenever I exited a method. When I read the log files, I could see which steps were executed, so I could do some kind of debugging in production just by looking at the logs. This is good, it gives me some insight; actually it's more like tracing, but it doesn't tell you much about the business side of things. So perhaps we want to log more interesting stuff, such as who is connected, what their session ID is, how many items they have in their cart, this kind of stuff. This is much harder, and it requires manual intervention: you need to have a requirement, and a developer actually writing the code to log it. Of course, with auto-instrumentation you cannot log every parameter of a method by default, because some of them might be a password, and you don't want to log passwords. So auto-instrumentation, in the case of logging, is easy but doesn't provide much value, and it can actually be harmful to operations.

Then there are the logging formats. For a long time we used whatever logging format the framework provided, whether you were using SLF4J, Log4j, Log4j 2, whatever. Nowadays it's perhaps better to write JSON directly, so that when you send the logs to another system that expects JSON, you don't need an intermediate operation transforming the human-readable log into JSON: you print it in JSON format directly and avoid one additional, potentially expensive, operation.

Then, where to log is also important. When I started coding, I was explicitly forbidden to write to the console; if you use Sonar or any other quality tool, it will tell you, hey, it's forbidden to use System.out.println or System.err.println, you don't write to the console. Nowadays, however, with containers, you probably do want to write to the console, so that the output can be scraped and shipped further. Logging on a single system is fine when you've got one component, or perhaps two. As soon as you've got a distributed system, you need to aggregate the logs so as to understand what happens across all those components. Therefore you need to get all the logs from all the components into a single place, a centralized logging system. And that raises more questions. Basically, you need to ask yourself: should I push the logs, meaning my application will lose some performance sending them, and pushing might even crash the app, or do I only expose the logs so that another component can scrape them? This is how Prometheus works, for example: you add an endpoint and Prometheus scrapes the metrics; Loki does the same for logs. I mentioned parsing the logs: again, it's better to write them directly in a format the centralized logging system can exploit than to add an extra transformation in the middle. Then you need to store the logs, of course. Then you need to search them, because just having logs one by one is not interesting; you search according to some filters, for example a timestamp, a transaction ID, or anything else. Then you need to display them in a way that lets you look for the interesting bits, for example by the component that produced them. And I didn't write it on the slide, but of course you eventually need to get rid of the logs; otherwise your logging system will grow and grow, and your disk will be full in no time. You've probably used one of those systems in the past. I've used Elasticsearch a lot, it's quite useful, but any other system will do. A small sketch of attaching such searchable context to log statements follows this paragraph.
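To make the "search by transaction ID" idea concrete, here is a minimal Kotlin sketch using SLF4J's MDC (mapped diagnostic context). The field names are invented for the example, and whether they end up as JSON fields depends on your logging backend's encoder configuration; the demo does not use this code.

    import org.slf4j.LoggerFactory
    import org.slf4j.MDC

    private val logger = LoggerFactory.getLogger("checkout")

    fun logCheckout(transactionId: String, itemCount: Int) {
        // Values put into the MDC are attached to every log statement on this
        // thread; a JSON encoder can then emit them as searchable fields.
        MDC.put("transactionId", transactionId)
        try {
            logger.info("checkout started, items={}", itemCount)
        } finally {
            MDC.remove("transactionId") // avoid leaking context to other requests
        }
    }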
Then comes the third pillar, which is tracing. For the definition: I love Wikipedia definitions, but in my opinion this one is not a great one, so I've come up with my own. Well, it's probably not really my own; I got inspired by lots of people whom I don't remember, so credit to them. Basically, tracing is a set of techniques and tools that help follow a business request across the network, through multiple components. This is really, really important. Again, in a distributed system you've got multiple components, and your business transaction will require all those components to work together to achieve the business goal. If one of them fails, you're in trouble. And if it's repeatedly the same component that has an issue, you need to be aware of it. So tracing the request across all those components is important.

You are probably aware of some of the tracing pioneers that already exist: Zipkin, Jaeger, OpenCensus are among the most widespread and famous ones. Each of them has its own proprietary interfaces and implementation. But as you know, we want things to be standardized, so that whatever software you write, libraries can target a single standard. For that there is the W3C Trace Context specification. It's quite easy to read, so you can do that at home, but in case you don't have time, let me summarize it for you. There are two important concepts. The first is the trace: a trace is the abstraction of the business transaction. The trace goes from the entry point to the most deeply nested component and back again. Then you've got the span. A span is the execution of one part of the trace in a single component, and each component can of course have multiple spans, if you are interested in where the flow of the request goes inside that component. All those spans are bound together in a parent-child relationship. The first component is the important one: it generates the first span ID. The next component reads the span ID of its parent, generates a new span ID of its own, and says, hey, I'm bound to that one. Through all those parent-child relationships you can make sense of the flow and check the overview of the whole request in one place. Concretely, this context travels between components in a traceparent header, shown below.
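For reference, a W3C traceparent header looks like this (the example value is the one given in the Trace Context specification):

    traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

The four dash-separated fields are the version (00), the trace-id (16 bytes in hex, identical across every component the transaction touches), the parent-id (8 bytes in hex, the span ID of the calling component), and the trace-flags (01 meaning sampled).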
OpenTelemetry is a set of tools that implements observability, and it actually relies on the W3C Trace Context specification. So it's an implementation, and more: it's compatible with the standard, and it also goes beyond it, because W3C Trace Context is fine for web stuff, but perhaps you want to trace a request through, I don't know, Kafka, somewhere you can store messages, and that case the specification does not handle. OpenTelemetry allows you to trace a business transaction across many different kinds of components, not only web ones. It's a merge of the OpenTracing and OpenCensus projects, one of those few merges that were successful, where people decided to join their efforts to create something better. It has become a CNCF project, which is good, meaning that it has support and will be supported for a long time. It's licensed under the Apache license, so you can use it right away, no license to acquire. And it's popular, especially on GitHub.

The architecture in itself is quite easy. You've got your components, whatever they are, and then you've got what they call an OpenTelemetry collector. This collector accepts data in a specific format, and this format carries the parent-child relationships between spans. So the idea is that on the client side you've got stuff that sends data to the OTel collector, and then you've got something that is able to search and display the data from the collector. The OpenTelemetry collector in itself doesn't provide anything beyond storing the stuff in a certain format, so we need something on top. OpenTelemetry provides a dedicated collector, but Jaeger and Zipkin, which are also tracing providers, can fill the same role; or let's put it this way: they also provide collectors that accept OpenTelemetry data. Basically, they kept their storage engines and added a new interface where you can send your data in OpenTelemetry format. So if you already have a tracing architecture, you can easily move to OpenTelemetry, because the collectors of Zipkin and Jaeger have this additional interface; you just need to change the format and the port, because I think each of them uses different ports.

On the client side it's, well, I wouldn't say easy, but the first step is straightforward: auto-instrumentation. This is only available when you've got a runtime. On the Java side you have the JVM, that is a runtime; Python is a runtime; Node.js is a runtime. In this case you delegate the auto-instrumentation to the runtime. I told you about automatically logging when entering and exiting a method: it's exactly the same here, done automatically, and it already gives you a lot of insight. On the JVM, this takes the form of a Java agent, as sketched after this paragraph. If you want to go further, you can get the library for your tech stack. Check the OpenTelemetry website and you will notice there are lots and lots of stacks supported out of the box: there is one for Java, one for Python, one for Rust, whatever floats your boat, you will probably find it there. And then you can either call an API or use annotations. As I mentioned, auto-instrumentation is very easy to do: you don't need to couple your application to OpenTelemetry, it's low-hanging fruit, so you should probably do it right away. If you are running a distributed system, it will give you a lot of insight into your application.
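As an illustration, running the OpenTelemetry Java agent looks roughly like this. This is a minimal sketch: the service name and endpoint are assumptions matching the demo's Docker Compose setup, not copied from it.

    # The agent jar comes from the opentelemetry-java-instrumentation GitHub
    # releases. Configuration goes through environment variables.
    export OTEL_SERVICE_NAME=catalog                      # how spans are labeled
    export OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317 # OTLP gRPC port on Jaeger
    export OTEL_METRICS_EXPORTER=none                     # traces only, as in the demo
    java -javaagent:opentelemetry-javaagent.jar -jar catalog.jar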
I said this is a practical introduction, so let's finally do some practical stuff, now that we have delved into the theory. Here is my use case. It is simple for a real application, but a bit more involved for a demo. At the entry, an API gateway: I'm using the Apache APISIX gateway. It forwards requests to my main application, a Spring Boot Kotlin application that provides a products API. For the detail of a product, it relies on two other components: one for the pricing, implemented in Python with the Flask framework, and one for the stocks, so how many items do I have in which warehouse, a Rust application using the axum framework. The entry point is the reverse proxy, the API gateway. Most information systems have such an entry point; you probably never expose your application directly over the Internet, you have something in between, because you want to protect your information system from illegal access.

As I mentioned, I'm using Apache APISIX. Perhaps you don't know about Apache APISIX: it's an Apache project, so again, good for maintenance, everything will stay there under a license that will never change. It's based on the very successful Nginx reverse proxy; then you've got LuaJIT, an additional OpenResty layer on top, which allows you to do scripting in Lua over Nginx; and then you've got out-of-the-box plugins. To configure it, there is first the general configuration. As I mentioned, Apache APISIX has a plugin architecture, so here I say, hey, I will be using OpenTelemetry; it's an out-of-the-box plugin, you don't need to write any code. And then you can tell it: this is the name by which I want to be known in the OpenTelemetry data, and this is where I will send the data. I will be using Docker Compose, so I have a dedicated Jaeger component. Then, for each route (here I have a single one, but you can have different configurations depending on the route) you say how much you want to sample. Normally, depending on your volume, you probably don't want 100%, because it would overwhelm your system; you want a sample. Here, since this is a demo, I sample everything; again, probably not what you want to do in production. And then you can log additional attributes. Here, for no reason but demo purposes, I decided to record the route ID, the request method, and an additional header, so whatever the client passes in that header will be traced along the span. A configuration sketch follows this paragraph.
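The APISIX configuration described above might look roughly like this. This is a hedged sketch based on the APISIX opentelemetry plugin's documented options; the service name, collector address, and header name are assumptions matching the demo's description, not copied from it.

    # conf/config.yaml - global plugin attributes
    plugin_attr:
      opentelemetry:
        resource:
          service.name: APISIX
        collector:
          address: jaeger:4318          # OTLP/HTTP endpoint of the Jaeger all-in-one

    # per-route plugin configuration
    plugins:
      opentelemetry:
        sampler:
          name: always_on               # demo setting: sample 100% of requests
        additional_attributes:          # Nginx variables recorded on the span
          - route_id
          - request_method
          - http_ot_key                 # custom header (dashes become underscores)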
The next step is the JVM level. As I mentioned, the JVM is a runtime, so I can easily use auto-instrumentation, and on the JVM auto-instrumentation means a Java agent. This is quite easy: I just pass the Java agent when I start the application, and I don't need to write anything. Your developers are completely isolated from this tracing concern; they can write their code and everything works as expected, regardless of the framework, because it happens at the runtime level. Here is how it works: this is my Dockerfile to build my Docker container, and it's a multi-stage Dockerfile. First I compile everything with a JDK, and afterwards I run it with a JRE, because I don't need a JDK at runtime, and the JDK is bigger and less secure. So the first stage is a normal, standard build; then I take the jar that I just built, I download the Java agent from GitHub, and when I run the application, I run it with the Java agent. This is as simple as it gets; you cannot be simpler. Afterwards, you can add more precise, more fine-grained spans through manual instrumentation. That needs an explicit dependency in the application; this time your developers need to be aware of it. There are two ways to do it: either through regular, explicit API calls, or through annotations. I'm benefiting from Spring Boot, so I will use annotations; I will show the code just afterwards.

OK, now it's time to delve into the code. I will focus on the Kotlin Spring Boot part; everything is on GitHub, so in case you need to, you can check the Python part and the Rust part. I've created my application with the Spring Boot starter at start.spring.io. I'm using the latest versions of the tools: the latest LTS version of Java, which is required by the latest version of Spring Boot, and also the latest version of Kotlin. It's a reactive application: I will be fetching data through Spring Data R2DBC, and I'm using WebFlux, again to be reactive; the rest is just standard Kotlin stuff. I didn't want to bother with a regular database, so I'm using H2 with the R2DBC H2 reactive driver. On the code side, I'm using Kotlin coroutines, because that's how you can easily write reactive code, so I'm using a CoroutineCrudRepository; this is my R2DBC repository. Then I have a handler, and you can see that I have suspend functions; suspend functions are for coroutines in Kotlin. Then I have one endpoint and the other endpoint: this one is for all products, this one is for a single product. Let's see the first one; the rest is exactly the same. I fetch all products, finding all of them in the repository, which in turn looks into the H2 database, and for every one of them (which is probably what you shouldn't do in real life) I fetch the product details, meaning their price and their availability in stock. The two calls are wrapped in async blocks, which means that, because I'm using Dispatchers.IO, they are made in parallel, and we can check in the traces that it really works like this; a sketch of the pattern follows this paragraph. So I get the price, I get the stocks, then I merge everything: I transform the data into the expected shape, and at the end I create a product with details, combining the product from the database catalog with the price and the stocks, which I massage a bit. For example, I don't want to return to the client any warehouse where the quantity is zero, because that doesn't make sense, so I just filter those out. And at the end I'm using the beans DSL from Spring and the router DSL, or here the coRouter DSL, to assemble everything, and I start my application with this beans method. Even if you are not familiar with Kotlin, if you are a Java developer, I think it will speak to you. And here you can see my two endpoints: /products for all products and /products/{id} for a single product.
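The parallel-fetch pattern described above might look roughly like this in Kotlin. This is a simplified, hypothetical sketch: fetchPrice, fetchStocks, and the data classes are invented stand-ins for the demo's actual code.

    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.async
    import kotlinx.coroutines.coroutineScope

    data class Stock(val warehouse: String, val quantity: Int)
    data class ProductDetails(val price: Double, val stocks: List<Stock>)

    // Stand-ins for the demo's HTTP calls to the pricing and stock services.
    suspend fun fetchPrice(id: Long): Double = TODO("call the pricing service")
    suspend fun fetchStocks(id: Long): List<Stock> = TODO("call the stock service")

    suspend fun productDetails(id: Long): ProductDetails = coroutineScope {
        // Both calls start immediately and run concurrently on the IO dispatcher;
        // in the trace, the two client spans should overlap in time.
        val price = async(Dispatchers.IO) { fetchPrice(id) }
        val stocks = async(Dispatchers.IO) { fetchStocks(id) }
        ProductDetails(
            price = price.await(),
            stocks = stocks.await().filter { it.quantity > 0 }, // drop empty warehouses
        )
    }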
Now I assemble everything through Docker Compose. In the Compose file, I'm using Jaeger. Jaeger is available as multiple Docker images, different containers; here I'm using the all-in-one, the batteries-included image, so I already get the OpenTelemetry-compatible collector provided by Jaeger. To be precise, it's not the OpenTelemetry collector: it's the Jaeger collector exposing an OpenTelemetry interface. This way I don't need to think about the architecture of Jaeger; I'm just using the one image that does everything. Then I'm using APISIX, because I want to protect my services. Then I have the catalog, which is the Spring Boot Kotlin application that I've just shown, and here I need to set several configuration parameters. The first one is where the Java agent needs to send the data: to Jaeger, on its port. Then, how this component is labeled: here it's called orders, which is wrong, it should be catalog. Then, whether I want to export metrics: here I said no; of course, depending on what you want to do, you can also export metrics, and logs, the same. And then pricing, the same for the Python application, and stock, the same thing for the Rust application, but those are not relevant for this talk.

Let's start this architecture. It might take a bit of time, especially with the JVM, so I will just speed up time; let's go very fast. OK, the logs tell us that it has started. We can check with docker ps: it seems that everything has started, we've got the catalog, the pricing, the stock, APISIX, and Jaeger. Now we can issue our first curl, using the header that I configured in Apache APISIX before; if I remember correctly, it's called otel-key, so let's send, say, "hello world", because I have no imagination. I'm on localhost:9080, which is Apache APISIX's default port, and I request /products (see the sketch after this paragraph). It takes a bit of time, because the request goes through all our systems: you can see that the catalog handled it, then the stocks, the pricing, and Apache APISIX as well. We've got the response, which is not very interesting as it is, but it still gives you the data you asked for.
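The request might look like this (a sketch; the port is APISIX's default, and the header name follows the talk, where the speaker first uses a name that turns out to be wrong and corrects it to ot-key later):

    # The custom header is meant to be recorded as a span attribute by the
    # APISIX opentelemetry plugin configured earlier.
    curl -H 'otel-key: hello world' http://localhost:9080/products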
Now the idea is to check the traces, so I go to the Jaeger web UI. Here we can see all the microservices, and there are some traces; I need to refresh to find them, and because we sample everything, we have our single request. Here we have it, and we can see, with only auto-instrumentation, already a lot of interesting data. We have APISIX, which is the entry point, and then we have orders, which I misnamed, it should be called catalog, but here it's orders. Here we have the product span, and here the first auto-instrumented span inside the product handling, because we are using Spring Boot, which has lots of proxies inside (you know how Spring Boot works), and the agent decided: here there is a call through a proxy, I will trace it. Here we have the repository call. Why? Because it's an interface provided by Spring Data, we didn't provide the implementation, so it's a proxy again, and it's automatically traced. We can see here that there is a call to the other component, stock, so it's traced as well, which is good. And here we've got the second one: one for stock and one for pricing. And here we see that in one case I went directly to the component, and in the other case I went through the API gateway. Both are completely possible; this is just how I configured my architecture. In one case you can say, I want to protect everything, so I always go back through the API gateway to do authentication, authorization, whatever you want; and on the other side you say, oh, I'm pretty secure, I can go to the component directly. But it gives you insight into your architecture as well: in case you misconfigure something, you can spot it through the traces. Something interesting, too: we can see that the GET calls to the stock and the pricing are made in parallel, because we use coroutines. This is also a good way to check that you actually coded your stuff correctly: if you see one call going after the other, then probably your code was not right. Tracing is not made for that, but you can also validate some of your code this way. And then, as I mentioned, it's not good to do it like this, but for each product I go to the stock and the pricing, stock and pricing, stock and pricing. We can also check that on Apache APISIX I actually recorded the additional attributes that I configured, so the route and the GET method; but here I'm missing the custom header, so probably I didn't use the right name. Believe me, it should work.

That already gives us some information about our flow, but we might want better: we might want, for example, to see which internal method we called, with which parameter. So let's do it. Now I want manual instrumentation, which means I need to explicitly couple myself to the library. Because I'm using Spring, as I said, I want annotations; I don't want API calls. Actually, if you check the documentation of OpenTelemetry in Java, getting an exporter is not that fun, it requires a lot of API calls. But I have annotations, and Spring Boot is compatible with OpenTelemetry, so let's use that. I've added this additional dependency to my code, and now we can check the application code itself. Here you can see that I've added this @WithSpan annotation; @WithSpan means the method will be instrumented, and you will find it in the trace. So I should get this ProductHandler.products span, or, if I'm calling one single product, this other one. It's also possible to capture additional details: for example, here I will have this ProductHandler.fetch span, but I also say, hey, don't only capture this call, capture this ID, so which product ID will I fetch. Which is interesting, because normally I shouldn't need the parameter: you can see that the id parameter is not used, because I already have the product. But because I want to capture the ID, I need a separate parameter; I wouldn't be able to capture the product itself as a span attribute, because then I could get not the ID but a whole object reference, unless I wrote a toString or whatever, which is not a great idea. So I changed my method signature a bit to explicitly pass the ID, and even though I don't use it, it means the ID will be captured by the tracing, by Spring, and I will find it in the trace, which might be super useful, especially if the call fails. A sketch of what the annotated handler might look like follows this paragraph.
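The annotated handler described above might look roughly like this. This is a hedged sketch using OpenTelemetry's instrumentation annotations; the class, method, and attribute names are stand-ins for the demo's actual code.

    import io.opentelemetry.instrumentation.annotations.SpanAttribute
    import io.opentelemetry.instrumentation.annotations.WithSpan

    data class Product(val id: Long, val name: String)
    data class ProductWithDetails(val product: Product, val price: Double)

    class ProductHandler {

        @WithSpan("ProductHandler.products") // appears as its own span in the trace
        suspend fun products(): List<Product> = TODO("fetch from the repository")

        @WithSpan("ProductHandler.fetch")
        suspend fun fetch(
            // Recorded as a span attribute. The parameter duplicates product.id
            // on purpose: capturing the whole object would record an object
            // reference, not the plain ID.
            @SpanAttribute("product.id") id: Long,
            product: Product,
        ): ProductWithDetails = TODO("fetch price and stocks, then merge")
    }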
Now everything should have restarted, and we can try again with this configuration. It's the same request; I've just changed the header, because I got the previous header name wrong: it was not otel-key, it was ot-key. So let's run this again. We can check that everything works on the logging side: I'm in the catalog, then in the stock, the pricing, whatever, and I've got the response. And we can check back on the Jaeger UI what it looks like; we expect more details. First, we can already see that we have more spans than before. Just to check on the Apache APISIX side: we can see that now my ot-key header has been logged, which is good. And then we can see that I have the products span here, and the fetch span here. So basically we added additional data: inside the components, we added a couple more spans, to understand how the flow of the code went inside the component, not only across components. You can also see that I did the same in Python, so if you are interested, you can check the code: there I'm logging the query manually, unfortunately, but I'm logging the query, so you can have additional information about what you are doing.

So thanks for your attention. I hope you learned something. I showed you how you could use OpenTelemetry, how you could use auto-instrumentation, how you could use manual instrumentation, and I believe now you can start your journey. You can follow me on Twitter, or on Mastodon if you are on Mastodon. I previously wrote a blog post about OpenTelemetry; it's much more narrowly focused, and I've improved the demo code a lot since, but perhaps you can read the blog post, it might give you some insight. If you are interested in everything, so the Python and the Rust stuff, everything is on GitHub. I will be very happy if you check it, and if you star it. There is a bit.ly URL, just so I can see how many people were interested in the code. And though the talk was not about Apache APISIX, if it got you interested in Apache APISIX, then you're welcome to greet us and have a look. So thanks again for your attention, and I wish you a great end of the day.
...

Nicolas Frankel

@ Apache APISIX
