Conf42 Observability 2024 - Online

Tracing For Everything - RESTifying OpenTelemetry Traces

Abstract

OpenTelemetry is an easy way to get instant observability in your system. But it is not available for everything. What if it were, and it was as simple as calling a service? With our new OpenTelemetry REST service, it is now as simple as calling a couple of endpoints!

Summary

  • Today we're going to talk about RESTifying OpenTelemetry. It's both an open source protocol and a set of SDKs for all of your telemetry data. But there are still many issues facing the OpenTelemetry ecosystem, for example, integration into legacy technologies.
  • The idea here is to extend that instrumentation by placing trace creation strategically throughout our code, where it makes more sense for our processes. We can create a trace that tracks an entire process and then has spans within each sub-element of that process. The information from the end-to-end trace will then be viewable in whatever observability backend you're using.
  • So what are the next steps for our tool? Well, we need to standardize around trace creation. Number two is ease of use. Then finally, we're looking at open sourcing our tooling.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name's Justin Scherer, and I'm accompanied by Justin Snyder. We both belong to Northwestern Mutual, and today we're going to talk about RESTifying OpenTelemetry. As I said, I'm Justin Scherer. I'm a lead software engineer at Northwestern Mutual. I belong to an internal consulting group where we help move tech initiatives forward; we basically provide acceleration to moving a bunch of things forward. Yep, and I'm Justin Snyder. I am a software engineering manager on one of our cloud native applications that regularly works with the team that Justin is on to pilot and proof-of-concept some of those technology initiatives. So, as I said in the beginning, we're going to be talking about RESTifying OpenTelemetry. But maybe you're not familiar with OpenTelemetry, or you're familiar with it because it's been used around your APM tooling but you don't really know it in depth. OpenTelemetry is both an open source protocol and a set of SDKs for all of your telemetry data. Your usual traces and metrics, it allows you to easily ingest those into your APM tooling, but also your, I would say, non-typical telemetry data, at least until the past few years: it allows you to easily ingest your logs into your APM tooling as well. And thanks to the helpful open source community, we can now even get code profiling loaded into APM. It's very open source and extensible; we'll showcase how we did that with our tooling, and it's also something you can just pull down yourself and create your own tooling with, like we did. It supports a lot of popular languages, everything from C, C#, and Java to languages that maybe aren't used as much, like Erlang or Elixir, and there are popular integrations with all of your favorite cloud native tooling, even things that are more tailored for on-premise systems. Yeah, but with all of that said, we laid out that it's this open source piece of tooling and that there are integrations across the board, but there are still many issues facing the OpenTelemetry ecosystem. It's very easy for developers like Justin and myself to put this stuff in. We're in code almost every day, so for us it seems very simple: oh, I just add these couple of lines and everything's done. But for anyone else that looks at it, it's arcane; it's something they're just not going to understand. And to me, I should say to us, this is a limitation of OpenTelemetry. Number two is the integration into legacy technologies. It's very sparse. If you're interested and you're a mainframe user, check out Open Mainframe; they have some tips and tricks to get this loaded into your mainframe systems. But there's still plenty of legacy tooling out there that is not mainframe. The first one that comes to my mind right now is ActionScript. It's something that Adobe basically said, we're not going to support this thing anymore, we're ripping it all out. But there are companies out there that still use ActionScript, and they are basically APM-toolless in that entire ecosystem. Then finally, manual and human processes. You have everything from someone needing to accept or decline some type of request; personally, I think of a four-eyes system: one person says, yeah, this is good, then sends it off to a manager, who also needs to accept or decline. How do you add that into your APM tooling? There are solutions out there, but what if we could use the open source ecosystem that has been built out?
This led to the question that Justin and I thought of: how could we provide a simple-to-use system to develop traces for all of this data that's out there and all of the use cases or problems that we saw? The big problem we saw with trying to integrate all of these use cases is that it's not very easy to ingest your data with these ingestion APIs in the absence of an instrumentation module that you can easily include in your codebase. The solution we came up with was obfuscating that instrumentation behind an open interface that more processes can readily interact with: an API service or a module. This particular implementation solves two, three-ish pretty large issues that come up when you attempt to ingest data through the ingestion API. First, the ingestion API calls for some pretty complicated request formatting that a lot of processes just might not even be able to produce in order to send that request off. So protocol buffers and complicated API responses make interacting with the API difficult; this handles that by obfuscating that interaction. And finally, to be able to interact with that API, you also need to maintain an internal set of schemas that OpenTelemetry has produced for everybody to use when they're interacting with the ingestion API. These get updated over time, and maintaining those updates within your own codebase is another level of maintenance activity you have to do to keep pushing traces up to that ingestion API. By obfuscating this, you solve these issues. Before this obfuscation, what it looks like is: you have your Node.js application, and in some cases you might have some instrumentation that you can easily pull in that predefines where your traces are coming from and what you're capturing. So if you want custom traces outside of the basic instrumentation, you have to do all of that instrumentation yourself to be able to get it to the backend. With this implementation, you now get the instrumentation and the connection to your observability backend broken off as its own single point of ingestion. Now you have a simpler interface that all of these other processes can interact with to push that data up in the form of traces and spans. So you can do Node.js applications, or whatever programming language you use that has the instrumentation or custom traces; you can do manual human processes or CI processes. Basically, anything that can send out an HTTP request is now able to send trace data to your observability backends. With that architecture in mind, this is introducing what we developed and what we're really here to showcase: the OTel REST trace service. And really quick, before we move on, I want to highlight some of the nomenclature Justin just spoke of, this trace pusher. There is an excellent GitHub project out there that is literally called Trace Pusher, and I will say that was kind of the start for us. We started off utilizing it; there were some things that we wanted to extend on it, but also things we wanted to understand internally, and that's really what bred our REST trace service project. So thank you to the developer that wrote Trace Pusher, and to those that also used it as their inspiration and thought process in building their own tooling. We've showcased this, we've showcased the problems, we've shown the architecture of what the REST trace service would be. Really, let's highlight those use cases.
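To keep that architecture concrete as we walk through the use cases, here is a minimal sketch, purely illustrative, of the kind of HTTP calls everything below boils down to. The service URL, endpoint paths, and payload fields are hypothetical placeholders, not the service's actual contract.

    // Hypothetical sketch: the service URL, endpoint paths, and payload fields
    // below are assumptions for illustration, not the real service's API.
    const SERVICE = "http://otel-rest-trace-service.internal"; // placeholder URL

    async function post(path, payload) {
      const res = await fetch(`${SERVICE}${path}`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(payload),
      });
      return res.json();
    }

    async function main() {
      // Ask the service to open a trace; hold on to the returned id.
      const { traceId } = await post("/trace/start", { name: "nightly-report" });

      // ... do the actual work here, opening and closing spans the same way ...

      // Close the trace; the service assembles the telemetry payload and
      // forwards the finished trace to the observability backend.
      await post("/trace/end", { traceId });
    }

    main();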
Coming back to those use cases: again, I really want this to be a point that people think about, because it's something we maybe don't think needs to be in our APM tooling, but it does. The first really big use case is hooking into technology that just doesn't have an SDK built out for it. By utilizing simple REST calls, we're now able to hook into languages that don't have an SDK built out for them yet, or into very esoteric technologies that maybe no one else is really using and that were built in-house at your tech firm. If you're able to make a simple REST call, you can use this. Number two, manual processes can now be tracked. I talked about that four-eyes system, but delivery services, phone-based support, all of these systems could now be loaded into your APM tooling. You now don't just have your DevOps loaded into APM, you also have BizOps loaded in. Number three, no-code solutions. A lot of startups, and even larger companies, when they need to PoC something or create an MVP, turn to a no-code solution. Well, you lost some observability because you utilized that no-code solution. Maybe your no-code solution has an APM that's provided, but it's not the one you want to utilize. Well, now, if your no-code solution can make some REST calls, you have your observability added into a no-code solution. And finally, CI/CD pipelines. You use Jenkins, you use GitLab, you use GitHub, CircleCI, all of these tools. Again, maybe there's a proprietary way that they can ingest this; well, now you can get it, again, through simple REST calls. So to demonstrate some of those use cases, we're going to go through a couple of code-level demos that walk through very simple processes and show how you can just inject a couple of lines of code to make those calls, generate traces and spans, and add as much information as you can possibly shove into the one trace. First is going to be the Node.js implementation. As you can see here, we're going to be working within an Express implementation. There is OpenTelemetry Express instrumentation already available in the form of a package, but the idea here is to show that we can actually extend that instrumentation by placing trace creation strategically throughout our code, where it makes more sense for our processes. Within this we have a particular API that handles a particular business logic process, very generic, very complicated, I know. In this, we can create a trace that tracks that entire process and then has spans within each sub-element of that process, so we can dive deeper into each of those pieces. When we enter this API, we make our first call, which is to just start the trace. And when we make that call, note that we pull out a trace id from it. That is very important, because the main requirement of using this tool is just making sure that that trace id is available everywhere in your process; beyond that, it's just making those HTTP requests with that trace id included. So we start the trace, and then we go into our first bit of logic, business logic A. When we make that function call, we pass in the trace id as a parameter to make sure it's available in the new context that we're shifting to. And similarly, as we go into this sub-process, we start a span, and the span id is what's returned from that call.
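A rough reconstruction of that pattern could look like the sketch below. This is not the demo's actual code; the service URL, endpoint paths, payload fields, and helper names are placeholders assumed for illustration.

    const express = require("express");

    const SERVICE = "http://otel-rest-trace-service.internal"; // placeholder URL

    // Small helper for the hypothetical REST trace service endpoints.
    async function call(path, payload) {
      const res = await fetch(`${SERVICE}${path}`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(payload),
      });
      return res.json();
    }

    async function businessProcessPartOne(traceId) {
      // Start a span inside the existing trace; keep the returned span id around.
      const { spanId } = await call("/span/start", {
        traceId,
        name: "business-process-part-one",
        attributes: { step: 1 }, // shows up as span attributes in the backend
      });

      console.log("part one finished"); // the actual business logic would run here

      await call("/span/end", { traceId, spanId });
    }

    async function businessProcessPartTwo(traceId) {
      const { spanId } = await call("/span/start", {
        traceId,
        name: "business-process-part-two",
        attributes: { step: 2 },
      });

      console.log("part two finished");

      await call("/span/end", { traceId, spanId });
    }

    const app = express();

    app.post("/run-business-process", async (req, res) => {
      // Start the end-to-end trace and hang on to its id for the whole request.
      const { traceId } = await call("/trace/start", { name: "business-process" });

      await businessProcessPartOne(traceId);
      await businessProcessPartTwo(traceId);

      // End the trace; the service packages everything up and ships it to the backend.
      await call("/trace/end", { traceId });

      res.json({ ok: true });
    });

    app.listen(3000);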
And the same logic applies: we've got to make sure that that span id is available within the entire context where we want to work with the span. Luckily, it's a lot simpler for spans, because spans are usually contained in a single context. But that doesn't have to be the case; this is very flexible. As long as you can access that span id and make the request, you can make the span as large or as small as you want. So we start the span, and we inject some information about the span that we're creating: we give it a name, and we add a couple of properties, which will manifest as attributes in the observability backend. Then we do business process part one, follow the steps, log out the message, and then close the span to signify that this particular sub-process has been completed. Then we return from our function, so we go back to that main operating context, and we move into the next function. The next function is going to look eerily similar to business process part one, but it's part two, and we do the exact same things: we start up another span to capture this second sub-process, we close that span, and then we return to the main function. Then in that main function, now that we have gone through our entire business logic, it is time to end the trace. When we make this final call to the REST trace service to end the trace, it'll package that all up and send it to your observability backend. At that point, the information from the end-to-end trace, with all of its spans, will be viewable in whatever observability backend you're using. Justin and I use Dynatrace day to day, so what you're seeing is a Dynatrace trace and the two spans. At the top you can see the nice bundling of all of those spans and the timing: how long business process part one took, how long business process part two took. And then you can drill down into each of those spans to find the attributes that you specified within that call. So if there's any information that you want to make sure gets tracked within the observability backend, this is where you'll be able to view it. And most backends have ways that you can query on these attributes, so you can do a lot of cool visualizations and monitoring and alerting off of them. As Justin mentioned, there are more use cases than just APIs and services. You can actually interact with non-standard trace sources like your CI/CD. We use GitLab CI, so the context of this particular code-level demo is going to be that, but this would apply to any of the CI frameworks that Justin mentioned, as well as any human processes; again, as long as you can make that HTTP request. I'm going to go through this one a little bit faster, because the thought process is pretty much the same. Essentially, you start your trace in the initial stage of your pipeline. You start that trace, and you put the trace id into a file that gets passed between each of these jobs to make sure that you can access it and continue to update that trace. Then you get into the first step of your pipeline, which is a build: you add a span to that trace, you do your build, and then you close that span, and you continue on to the next step of your pipeline, which is the test. It's the exact same thing as the build: you start your span, execute your tests, and then your after_script closes things out with all of the information about how your tests panned out. One thing to note, though: right here it just says that the test is complete.
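A pipeline following that pattern could look roughly like the sketch below (GitLab CI, matching the demo). The stage and job names, the $TRACE_SERVICE URL, the endpoint paths, and the use of jq to parse responses are illustrative assumptions rather than the service's real interface.

    # Illustrative sketch only: endpoint paths, payload fields, $TRACE_SERVICE,
    # and jq are assumptions, not the real service's API.
    stages: [prepare, test, finish]

    start-trace:
      stage: prepare
      script:
        # Start the trace and stash its id in a file that later jobs pick up as an artifact.
        - >-
          curl -s -X POST "$TRACE_SERVICE/trace/start"
          -H "Content-Type: application/json"
          -d '{"name":"pipeline"}'
          | jq -r '.traceId' > trace_id.txt
      artifacts:
        paths:
          - trace_id.txt

    # A "build" job would look the same as "test" below, with its own span.
    test:
      stage: test
      script:
        # Open a span for this job and remember its id, then run the real work.
        - >-
          curl -s -X POST "$TRACE_SERVICE/span/start"
          -H "Content-Type: application/json"
          -d "{\"traceId\": \"$(cat trace_id.txt)\", \"name\": \"test\"}"
          | jq -r '.spanId' > span_id.txt
        - npm test
      after_script:
        # Close the span; a status field could carry success or failure here.
        - >-
          curl -s -X POST "$TRACE_SERVICE/span/end"
          -H "Content-Type: application/json"
          -d "{\"traceId\": \"$(cat trace_id.txt)\", \"spanId\": \"$(cat span_id.txt)\", \"status\": \"complete\"}"

    end-trace:
      stage: finish
      script:
        # Close out the whole trace; the service forwards it to the observability backend.
        - >-
          curl -s -X POST "$TRACE_SERVICE/trace/end"
          -H "Content-Type: application/json"
          -d "{\"traceId\": \"$(cat trace_id.txt)\"}"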
That closing call doesn't add any context about whether the test was successful or failed, but in that span closure you could actually set the success or failure status of that span. You can inject a ton of information into this particular call that depends on the previous execution, so there's a lot of flexibility in how you can use this tool. Then finally, as the final step of our pipeline, we close that trace, and the REST trace service pushes it off to your observability backend, where you can see the trace yet again. And it looks exactly the same as the other trace that we just looked at. That's awesome, because what you can then start to do is link these traces together from all of these disparate contexts and have a full end-to-end picture of all of these processes working together. Even if each of those processes lives in a very different context, that's fine; you can now visualize all of this by linking on shared data across all of these traces. So we've looked at a couple of things that we can do with this tool and how flexible it is, but there's still a little bit more to do, and Justin's going to talk about that. Yep. So what are the next steps for our tool? Well, we need to standardize around trace creation. Right now, within our organization, everyone's doing it their own way because we're still exploring it, so we need to start standardizing around what needs to be a trace, what doesn't need to be a trace, what goes into the APM tooling and what doesn't. Number two is ease of use. Right now, a lot of the properties that you send via the REST calls are one-to-one mappings to OTel keys. Some of these make sense, but for some of them we should probably change the language a little bit to help those that maybe aren't familiar with the OpenTelemetry tooling, and also to make sure that we're not all saying our own thing with it; we don't need to go into our APM and have a million different attributes because we didn't standardize on what we were going to do. Then finally, we're looking at open sourcing our tooling. We're going to be working with our organization to hopefully bring this out to everyone, so that you can use it and hopefully help us out by adding new use cases. And I'd like to say thank you. This was our talk, our pretty quick talk, on the REST tooling and our start into this whole ecosystem. Any other final words, Justin? No. Just want to say that it was a lot of fun getting to this point. I'm really excited to see the future of this tool and the OpenTelemetry project as a whole. Thanks. Thanks,
...

Justin Scherer

Lead Software Engineer @ Northwestern Mutual

Justin Scherer's LinkedIn account

Justin Snyder

Manager DevOps Engineering @ Northwestern Mutual

Justin Snyder's LinkedIn account


