Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, my name's Justin Scheer, and I'm accompanied by Justin Snyder. We both belong to Northwestern Mutual, and today we're going to talk about RESTifying OpenTelemetry.

As I said, I'm Justin Scheer. I'm a lead software engineer at Northwestern Mutual. I belong to an internal consulting group where we help move tech initiatives forward; we basically provide acceleration to move a bunch of things forward.

Yep, and I'm Justin Snyder. I'm a software engineering manager on one of our cloud native applications that regularly works with the team that Justin is on to pilot and proof-of-concept some of those technology initiatives.
            
            
            
So, as I said in the beginning, we're going to be talking about RESTifying OpenTelemetry. But maybe you're not familiar with OpenTelemetry, or you're familiar with it because it's been used around your APM tooling but you don't really know it in depth. OpenTelemetry is an open source protocol and set of SDKs for all of your pieces of telemetry data. So your usual traces and metrics, which it allows you to easily ingest into your APM tooling. But then also your, I would say, non-typical telemetry, at least in the past few years. So it allows you to easily ingest your logs into your APM tooling, and thanks to the helpful open source community, we can now even get code profiling loaded into APM. It's very open source and extensible; we'll showcase how we did that with our tooling, and it's also something you could just pull down yourself and create your own tooling, like we did. It supports a lot of popular languages, so everything from C, C#, and Java to tooling you're maybe a little less used to, stuff like Erlang or Elixir. And there are popular integrations with all of your favorite cloud native tooling, even stuff that's more tailored for on-premises systems.
            
            
            
Yeah. But with all of that said, we laid out that it's this open source piece of tooling and that there are integrations across the board, but there are still many issues facing the OpenTelemetry ecosystem.

First, it's very easy for developers like Justin and myself to put this stuff in. We're in code almost every day, so for us it seems very simple: oh, I just add these couple of lines and everything's done. But for anyone else that looks at it, it's arcane; it's something they're just not going to understand. And to me, I should say to us, this is a limitation of OpenTelemetry.

Number two is integration into legacy technologies. It's very sparse. If you're interested and you're mainframe users, check out Open Mainframe; they have some tips and tricks to get this loaded into your mainframe systems. But there's still plenty of legacy tooling out there that is not mainframe. The first one that comes to my mind right now is ActionScript. It's something that Adobe basically said, we're not going to support this thing anymore, we're ripping it all out. But there are companies out there that still use ActionScript, and they basically have no APM tooling in that entire ecosystem.

Then finally, manual and human processes. You have everything from someone needing to accept or decline some type of request in a system. Personally, I think of a four-eyes system: one person says, yeah, this is good, and they then need to send it off to a manager, who also needs to accept or decline. How do you add that into your APM tooling? There are solutions out there, but what if we could use the open source ecosystem that has been built out? This led to the question that Justin and I thought of: how could we provide a simple-to-use system to develop traces for all of this data that's out there and all of the use cases or problems that we saw?
            
            
            
So the big problem we saw with trying to integrate all of these use cases was that it's not very easy to ingest your data with these ingestion APIs in the absence of an instrumentation module that you can easily include in your codebase. The solution we came up with was obfuscating that instrumentation behind an open interface that more processes can more readily interact with: an API service or a module.

This particular implementation solves two, three-ish pretty large issues that come up when you attempt to ingest data through the ingestion API. First, the ingestion API calls for some pretty complicated request formatting that a lot of processes just might not even be able to produce in order to send that request off. So protocol buffers and complicated API responses make interacting with the API difficult; this handles that by obfuscating that interaction. And finally, to be able to interact with that API, you also need to maintain an internal set of schemas that OpenTelemetry has produced for everybody to use when interacting with an ingestion API. These get updated over time, and maintaining those updates within your own codebase is another level of maintenance activity you have to do to keep pushing traces up to that ingestion API. By obfuscating this, you solve these issues.
            
            
            
Before this obfuscation, what it looks like is this: you have your Node.js application, and in some cases you might have some instrumentation that you can easily pull in that predefines where your traces are coming from and what you're capturing. So if you want custom traces outside of the basic instrumentation, you have to do all of that instrumentation yourself to get it to the backend. With this implementation, you now get the instrumentation and the connection to your observability backend broken off as its own single point of ingestion. Now you have a simpler interface that all of these other processes can interact with to push that data up in the form of traces and spans. So you can do Node.js applications, or whatever programming language you use, with the instrumentation or custom traces. You can do manual human processes or CI processes. Basically, anything that can send an HTTP request is now able to send trace data to your observability backends.

With that architecture in mind, this is introducing what we developed and what we're really here to showcase: the OTel rest trace service.
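To make that "anything that can send an HTTP request" idea concrete, here is a minimal sketch of what calling such a service could look like from Node.js. The endpoint path, payload fields, and response shape below are assumptions for illustration only, not the actual rest trace service API, and Node 18+ is assumed for the global fetch.

```javascript
// Hypothetical sketch: the endpoint, payload, and response fields are
// assumptions, not the real rest trace service API.
const SERVICE_URL = process.env.TRACE_SERVICE_URL || "http://localhost:8080";

// Build the HTTP request for starting a trace. Anything that can send an
// HTTP request can construct and send this, regardless of language.
function buildStartTraceRequest(name, attributes = {}) {
  return {
    url: `${SERVICE_URL}/traces`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ name, attributes }),
    },
  };
}

// Send the request and pull the trace id out of the (assumed) response shape.
async function startTrace(name, attributes = {}) {
  const { url, options } = buildStartTraceRequest(name, attributes);
  const res = await fetch(url, options); // Node 18+ global fetch
  const { traceId } = await res.json();
  return traceId;
}
```

The same request could just as easily come from a shell script, a no code platform's HTTP action, or a CI job, which is the whole point of the abstraction.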
            
            
            
And really quick, before we move on, I want to highlight some of the nomenclature Justin just spoke of with this trace pusher idea. There is an excellent GitHub project out there that is literally called Trace Pusher, and I will say that was kind of the start for us. We started off utilizing it. There were some things that we wanted to extend on it, but we also wanted to understand it internally, and that's really what bred our rest trace service project. So thank you to the developer that wrote Trace Pusher, and to those that also used it as their inspiration and their thought process in building their own tooling.
            
            
            
We showcased this, we showcased the problems, and we showed the architecture of what the rest trace service would be. Really, let's highlight those use cases again. I really want this to be a point that people think about, because it's something that we maybe don't think needs to be in our APM tooling, but it does.

So the first really big use case is hooking into technology that just doesn't have an SDK built out for it. By utilizing simple REST calls, we're now able to hook into languages that don't have an SDK built out yet, or into very esoteric technologies that maybe no one else really is using and that were built in-house at your tech firm. If you're able to make a simple REST call, you can use this.

Number two, manual processes can now be tracked. I talked about that four-eyes system, but delivery services, phone calls fielded by support: all of these systems could now be loaded into your APM tooling. You now don't just have your DevOps loaded into APM, you also have BizOps loaded in.

Number three, no code solutions. A lot of startups, and even larger companies, when they need to PoC something or create an MVP, turn to a no code solution. Well, you lost some observability because you utilized that no code solution. Maybe your no code solution has an APM that's provided, but it's not the one you want to utilize. Well, now, if your no code solution can make some REST calls, you have your observability added into no code.

And finally, CI/CD pipelines. You use Jenkins, you use GitLab, you use GitHub, CircleCI; with all of these tools, maybe, again, there's a proprietary way they can ingest this. Well, now you can get it, again, through simple REST calls.
            
            
            
So to demonstrate some of those use cases, we're going to go through a couple of code-level demos of very simple processes that show how you can easily inject a couple of lines of code to make those calls, generate traces and spans, and add as much information as you could possibly shove into the one trace.

First is going to be the Node.js implementation. As you can see here, we're going to be working within an Express implementation. There is OpenTelemetry Express instrumentation already available in the form of a package, but the idea here is to show that we can actually extend that instrumentation by placing trace creation throughout our code strategically, where it makes more sense for our processes. Within this we have a particular API that handles a particular business logic process. Very generic, very complicated, I know. In this, we can create a trace that tracks that entire process, and then has spans within each sub-element of that process to dive deeper into each of those pieces.
            
            
            
When we enter this API, we make our first call, which is to just start the trace. And when we make that call, note that we pull out a trace id from it. That is very important, because the main requirement of using this tool is just making sure that that trace id is available everywhere in your process. Beyond that, it's just making those HTTP requests with that trace id included.

So we start the trace and then we go into our first bit of logic, business logic A. We make that function call, and we pass in the trace id as a parameter to that function to make sure the trace id is available to the new context we're shifting to. Similarly, as we go into this sub-process, we start a span; the span id is the return from that call. The same logic applies, where we've got to make sure that span id is available within the entire context where we want to make the span. Luckily, it's a lot simpler for spans, because spans usually are contained in a single context. But that doesn't have to be the case; this is very flexible. As long as you can access that span id and make the request, you can make that span as large or small as you want. So we start the span, and we inject some information about the span we're creating: we have a name, and we're adding a couple of properties which will manifest as attributes in the observability backend. Then we do our business process part one, follow the steps, log out the message, and close the span to signify that this particular sub-process has been completed. And then we return from our function.
            
            
            
So we go back to that main operating context and we move into the next function. The next function is going to look eerily similar to business process part one, but it's part two. And we do the exact same things: we start up another span to capture this second sub-process, we close that span, and then we return to the main function. Then in that main function, now that we have gone through our entire business logic, it is time to end the trace. When we make this final call to the rest trace service to end the trace, it'll package everything up and send it to your observability backend. At this point, the information from the end-to-end trace, with all of its spans, will be viewable in whatever observability backend you're using.
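The flow just described can be sketched end to end. Everything here is illustrative: the helper names, the /traces and /spans routes, and the field names are assumptions standing in for the real service's API, with Node 18+ assumed for the global fetch.

```javascript
const SERVICE_URL = process.env.TRACE_SERVICE_URL || "http://localhost:8080";

// Small helper: POST JSON to the (hypothetical) rest trace service.
async function post(path, payload) {
  const res = await fetch(`${SERVICE_URL}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  return res.json();
}

// Assumed routes; the real service's endpoints may differ.
const startTrace = (name) => post("/traces", { name });
const startSpan = (traceId, name, attributes) =>
  post("/spans", { traceId, name, attributes });
const closeSpan = (traceId, spanId) =>
  post(`/spans/${spanId}/close`, { traceId });
const endTrace = (traceId) => post(`/traces/${traceId}/end`, {});

// First sub-process: open a span, do the work, log, close the span.
async function businessProcessPartOne(traceId) {
  const { spanId } = await startSpan(traceId, "business-process-part-one", {
    "process.step": "part-one", // manifests as an attribute in the backend
  });
  console.log("business process part one complete");
  await closeSpan(traceId, spanId);
}

// The Express handler: start the trace, pass the trace id into each context,
// then end the trace so the service ships everything to the backend.
async function handler(req, res) {
  const { traceId } = await startTrace("business-logic-process");
  await businessProcessPartOne(traceId);
  // businessProcessPartTwo(traceId) would look eerily similar...
  await endTrace(traceId);
  res.send("done");
}
```

The one hard requirement, as noted above, is threading the trace id into every context that needs to add a span.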
            
            
            
Justin and I use Dynatrace in our day to day, so what you're seeing is a Dynatrace trace and the two spans. At the top you can see the nice bundling of all of those spans: the timing, how long business process part one took, how long business process part two took. And then you can drill down into each of those spans to find the attributes that you specified within that call. So if there's any information you want to make sure gets tracked within the observability backend, this is where you'll be able to view it. And most backends have ways that you can query on these attributes, and you can do a lot of cool visualizations and monitoring and alerting off of them.
            
            
            
So, as Justin mentioned, there are more use cases than just APIs and services. You can actually interact with non-standard trace sources like your CI/CD. Again, we use GitLab CI, so that's the context of this particular code-level demo, but this would apply to any of the CI frameworks that Justin mentioned, as well as any human processes. Again, as long as you can make that HTTP request.

I'm going to go through this one a little bit faster, because the thought process is pretty much the same. Essentially, you start your trace in the initial pre stage of your pipeline. You start that trace, and you put the trace id into a file that gets passed between each of the jobs, to make sure that you can access it and continue to update that trace. Then you get into the first step of your pipeline, which is a build. You add a span to that trace, you do your build, and then you close that span. Then you continue on to the next step of your pipeline, which is the test. It's the exact same thing as the build: you start your span, execute your tests, and then in your after script you have all of the information about how your tests panned out. One thing to note, though: right here it just says that the test is complete. It doesn't add any context about whether the test was successful or failed. But in this span closure, you could actually change the success or failure status of that span. You can inject a ton of information into this particular closure that is dependent on the previous execution, so there's a lot of flexibility in how you can use this tool. Then finally, as the final step of our pipeline, we close the trace, and the rest trace service pushes it off to your observability backend, where you can see the trace yet again.
            
            
            
And this looks exactly the same as the other trace we just looked at. That's awesome, because what you can start to do with this is link these traces together from all of these disparate contexts and have a full end-to-end picture of all of these processes working together, even if each of those processes lives in a very different context. That's fine; you can now visualize all of it by linking on the shared data across all of these traces. So we've looked at a couple of things that we can do with this tool and how flexible it is, but there's still a little bit more to do.
            
            
            
Justin's going to talk about that.

Yep. So what are the next steps for our tool? Well, we need to standardize around trace creation. Right now, within our organization, everyone's doing it their own way, because we're still exploring it. We need to start standardizing around what needs to be a trace, what doesn't need to be a trace, what goes into the APM tooling, and what doesn't. Number two is ease of use. Right now, a lot of the properties you send via the REST calls are one-to-one mappings to OTel keys. Some of these make sense, but for some of them we should probably change the language a little bit, to help those that are maybe not familiar with the OpenTelemetry tooling, and also to make sure we're not all saying our own thing with it. We don't need to go into our APM and have a million different attributes because we didn't standardize on what we were going to do. Then finally, we're looking at open sourcing our tooling. We're going to be working with our organization to try to, hopefully, bring this out to everyone, so that you can use it and hopefully help us out by adding on new use cases.
            
            
            
And I'd like to say thank you. This was our talk, our pretty quick talk, on the rest tooling and kind of our start into this whole ecosystem. Any other final words, Justin?

No. Just want to say that it was a lot of fun getting to this point. I'm really excited to see the future of this tool and the OpenTelemetry project as a whole. Thanks.

Thanks.