Transcript
Hi, my name's Justin Scheer, and I'm accompanied by Justin Snyder. We both belong to Northwestern Mutual, and today we're going to talk about RESTifying OpenTelemetry.

As I said, I'm Justin Scheer. I'm a lead software engineer at Northwestern Mutual. I belong to an internal consulting group where we help move tech initiatives forward; we basically provide acceleration for moving initiatives forward.
Yep, and I'm Justin Snyder. I am a software engineering manager on one of our cloud-native applications that regularly works with the team Justin is on to pilot and proof-of-concept some of those technology initiatives.
So, as I said in the beginning, we're going to be talking about RESTifying OpenTelemetry. But maybe you're not familiar with OpenTelemetry, or you're familiar with it because it's been used around your APM tooling but don't know it in much depth. OpenTelemetry is an open source protocol and set of SDKs for all of your pieces of telemetry data. It handles your usual traces and metrics, letting you easily ingest those into your APM tooling, but also what I would call the non-typical telemetry, at least until the past few years. So it lets you easily ingest your logs into your APM tooling, and thanks to helpful open source contributions, we can now even get code profiling loaded into APM. It's very open source and extensible.
We'll showcase how we did that with our own tooling, and it's also something you can just pull down yourself and create your own tooling with, like we did. It supports a lot of popular languages, everything from C, C#, and Java to maybe less commonly used ones like Erlang or Elixir, and there are popular integrations with all of your favorite cloud-native tooling, down to stuff that's more tailored for on-premise systems.
Yeah, but with all of that said, we've laid out that it's this open source piece of tooling with integrations across the board, but there are still many issues facing the OpenTelemetry ecosystem. It's very easy for developers like Justin and myself to put this stuff in. We're in code almost every day, so for us it seems very simple: oh, I just add these couple of lines and everything's done.
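For context, those "couple of lines" are roughly the standard OpenTelemetry Node.js SDK bootstrap; a minimal sketch might look like this (the package names are the public OpenTelemetry ones, but the exact setup varies by project):

```js
// tracing.js - typical OpenTelemetry bootstrap for a Node.js service
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  // Export traces over OTLP/HTTP to a collector (4318 is the default OTLP/HTTP port)
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  // Auto-instrument common libraries (http, express, and so on)
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```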
But for anyone else who looks at it, it's arcane; it's something they're just not going to understand. And to me, I should say to us, this is a limitation of OpenTelemetry.

Number two is the integration with legacy technologies: it's very sparse. If you're interested and you're a mainframe user, check out Open Mainframe; they have some tips and tricks for getting this loaded into your mainframe systems. But there's still plenty of legacy tooling out there that is not mainframe. The first one that comes to my mind right now is ActionScript. It's something Adobe basically said, we're not going to support this anymore, we're ripping it all out. But there are companies out there that still use ActionScript, and they are basically APM-toolless in that entire ecosystem.

Then finally, manual and human processes. You have everything from someone needing to accept or decline something in some type of system.
Think of a four-eyes system: one person says, yeah, this is good, and they then need to send it off to a manager, who also needs to accept or decline it. How do you add that into your APM tooling? There are solutions out there, but what if we could use the open source ecosystem that has already been built out? This led to the question that Justin and I asked: how could we provide a simple-to-use system to develop traces for all of this data that's out there and all of the use cases and problems that we saw?
So the big problem we saw with trying to integrate all of these use cases was that it's not very easy to ingest your data with these ingestion APIs in the absence of an instrumentation module that you can easily include in your codebase. The solution we came up with was abstracting that instrumentation behind an open interface that more processes can readily interact with: an API service rather than a module.
This particular implementation solves two or three pretty large issues that come up when you attempt to ingest data through the ingestion API. First, the ingestion API calls for some pretty complicated request formatting that a lot of processes might not even be able to produce: protocol buffers and complicated API responses make interacting with the API difficult. The service handles that by abstracting the interaction away. And finally, to be able to interact with that API, you also need to maintain an internal copy of the schemas that OpenTelemetry has produced for everybody to use when interacting with an ingestion API. These get updated over time, and maintaining those updates within your own codebase is another level of maintenance activity you have to do to keep pushing traces up to that ingestion API. By abstracting this away, you solve these issues.

Before this change, what it looks like is you have your Node.js application, and in some cases you might have some instrumentation that you can easily pull in that predefines where your traces are coming from and what you're capturing. If you want custom traces outside of that basic instrumentation, you have to do all of the instrumentation yourself to get it to the backend. With this implementation, you now get the instrumentation and the connection to your observability backend broken off as its own single point of ingestion. Now you have a simpler interface that all of these other processes can interact with to push that data up in the form of traces and spans. So you can do Node.js applications, or whatever programming language you use, with the instrumentation or custom traces. You can do manual human processes or CI processes. Basically, anything that can send an HTTP request is now able to send trace data to your observability backend.
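As a concrete illustration, here's a hypothetical sketch of what that could look like for a manual four-eyes approval step. The service URL, endpoint paths, and payload fields are our illustrative assumptions, not a documented contract, and Node 18+ is assumed for the global fetch:

```js
// A four-eyes approval step reporting itself as a span. The service URL,
// endpoint paths, and payload fields here are illustrative assumptions.
const TRACE_SERVICE = 'http://rest-trace-service.internal';

async function recordApproval(traceId, approver, decision) {
  // Open a span on the shared trace for this workflow...
  const res = await fetch(`${TRACE_SERVICE}/span/start`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      traceId,
      name: 'manager-approval',
      attributes: { approver, decision },
    }),
  });
  const { spanId } = await res.json();
  // ...and close it once the decision has been recorded.
  await fetch(`${TRACE_SERVICE}/span/end`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ traceId, spanId }),
  });
}
```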
With that architecture in mind, this is introducing what we developed and what we're really here to showcase: the OTel REST Trace Service.

Really quick, before we move on, I want to highlight some of the nomenclature Justin just spoke of around this trace pusher. There is an excellent GitHub project out there that is literally called tracepusher, and I will say that was the start for us; we started off utilizing it. There were some things we wanted to extend on it, but we also wanted to understand it internally, and that's really what bred our REST Trace Service project. So thank you to the developer who wrote tracepusher, and to those who also used it as their inspiration and thought process in building their own tooling.
We've showcased this, we've showcased the problems, and we've shown the architecture of what the REST Trace Service would be. Now let's really highlight those use cases again. I want this to be a point that people think about, because it's something we maybe don't think needs to be in our APM tooling, but it does.
So the first really big use case is hooking into technology that just doesn't have an SDK built out for it. By utilizing simple REST calls, we're now able to hook into languages that don't have an SDK built out yet, or very esoteric technologies that maybe no one else is using because they were built in-house at your tech firm. If you're able to make a simple REST call, you can use this.

Number two, manual processes can now be tracked. I talked about that four-eyes system, but also delivery services, phone-based support: all of these systems could now be loaded into your APM tooling. You don't just have your DevOps loaded into APM, you also have your BizOps loaded in.

Number three, no-code solutions. A lot of startups, and even larger companies, when they need to PoC something or create an MVP, turn to a no-code solution. Well, you lose some observability because you utilized that no-code solution. Maybe your no-code solution comes with an APM, but it's not the one you want to use. Now, with your no-code solution, you can make some REST calls and have your observability added into a no-code tool.

And finally, CI/CD pipelines. Maybe you use Jenkins, GitLab, GitHub, CircleCI, any of these tools. Again, there may be a proprietary way they can ingest this data, but now you can get it, again, through simple REST calls.
So to demonstrate some of those use cases, we're going to go through a couple of code-level demos of very simple processes that show how you can easily inject a couple of lines of code to make those calls, generate traces and spans, and add as much information as you could possibly shove into one trace.

First is going to be the Node.js implementation. As you can see here, we're going to be working within an Express implementation. There is OpenTelemetry Express instrumentation already available in the form of a package, but the idea here is to show that we can actually extend that instrumentation by placing trace creation strategically throughout our code, where it makes more sense for our processes. Within this, we have a particular API that handles a particular business logic process, very generic, very complicated, I know. In it we can create a trace that tracks the entire process and then has spans within each sub-element of that process to dive deeper into each of those pieces.
When we enter this API, we make our first call, which is just to start the trace. When we make that call, note that we pull out a trace ID from it. That is very important, because the main requirement of using this tool is making sure that trace ID is available everywhere in your process. Beyond that, it's just making those HTTP requests with that trace ID included.
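In code, the skeleton of that flow might look roughly like the following sketch. As before, the service URL and endpoint paths are hypothetical assumptions, and one of the business-logic functions is sketched a bit further below (part two would mirror it):

```js
const express = require('express');
const app = express();

// Hypothetical service URL and endpoint paths, for illustration only.
const TRACE_SERVICE = 'http://rest-trace-service.internal';

app.post('/business-process', async (req, res) => {
  // First call: start the trace and pull the trace ID out of the response.
  const startRes = await fetch(`${TRACE_SERVICE}/trace/start`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name: 'business-logic-process' }),
  });
  const { traceId } = await startRes.json();

  // Pass the trace ID into each sub-process so their spans can attach to it.
  await businessLogicPartOne(traceId);
  await businessLogicPartTwo(traceId);

  // Final call: end the trace so the service packages everything up and
  // sends it to the observability backend.
  await fetch(`${TRACE_SERVICE}/trace/end`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ traceId }),
  });
  res.sendStatus(200);
});

app.listen(3000);
```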
So we start the trace, and then we go into our first bit of logic, business logic part one. When we make that function call, we pass in the trace ID as a parameter to make sure the trace ID is available in that new context we're shifting to. Similarly, as we go into this sub-process, we start a span. The span ID is the return from that call, and the same logic applies: we've got to make sure that span ID is available within the entire context where we want the span. Luckily, it's a lot simpler for spans, because spans are usually contained in a single context. But that doesn't have to be the case; this is very flexible. As long as you can access that span ID and make the request, you can make the span as large or small as you want.

So we start the span and inject some information about the span we're creating: we give it a name, and we add a couple of properties, which will manifest as attributes in the observability backend. Then we do our business process part one, follow the steps, log out the message, and close the span to signify that this particular sub-process has been completed. Then we return from our function.
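Continuing the sketch above, a sub-process wrapped in its own span could look like this (same hypothetical endpoints; the attribute names are made up for illustration):

```js
// Hypothetical sketch of a sub-process that wraps itself in a span.
async function businessLogicPartOne(traceId) {
  // Start the span; keep the returned span ID available for the whole
  // context the span is meant to cover.
  const startRes = await fetch(`${TRACE_SERVICE}/span/start`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      traceId,
      name: 'business-process-part-one',
      // These properties manifest as span attributes in the backend.
      attributes: { step: 1, owner: 'team-a' },
    }),
  });
  const { spanId } = await startRes.json();

  // ... the actual business logic happens here ...
  console.log('business process part one complete');

  // Close the span to signify this sub-process has completed.
  await fetch(`${TRACE_SERVICE}/span/end`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ traceId, spanId }),
  });
}
```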
So we go back to that main operating context and move into the next function. The next function is going to look eerily similar to business process part one, but it's part two. We do the exact same things: we start up another span to capture this second sub-process, we close that span, and then we return to the main function.

Then in that main function, now that we have gone through our entire business logic, it's time to end the trace. When we make this final call to the REST Trace Service to end the trace, it packages that all up and sends it to your observability backend. At this point, the information from the end-to-end trace, with all of its spans, will be viewable in whatever observability backend you're using.
Justin and I use Dynatrace in our day-to-day, so what you're seeing is a Dynatrace trace with the two spans. At the top you can see the nice bundling of all of those spans and the timing: how long business process part one took, how long business process part two took. You can then drill down into each of those spans to find the attributes you specified within that call. So if there's any information you want to make sure gets tracked in the observability backend, this is where you'll be able to view it. Most backends have ways to query on these attributes, and you can do a lot of cool visualizations, monitoring, and alerting off of them.
So as Justin mentioned, there are more use cases than just APIs and services. You can actually generate traces from nonstandard sources like your CI/CD. We use GitLab CI, so the context of this particular code-level demo is going to be that, but this would apply to any of the CI frameworks Justin mentioned, as well as any human processes. Again, as long as you can make that HTTP request.
I'm going to go through this one a little faster, because the thought process is pretty much the same. Essentially, you start your trace in the initial .pre stage of your pipeline. You start that trace and put the trace ID into a file that gets passed between each of the jobs, to make sure you can access it and continue to update that trace. Then you get into the first step of your pipeline, which is the build. You add a span to the trace, do your build, and then close that span, and then you continue on to the next step of your pipeline, which is the test. It's the exact same thing as the build: you start your span, execute your tests, and then you have your after_script with all of the information about how your tests panned out.
One thing to note, though: right here it just says that the test is complete; it doesn't add any context about whether the test succeeded or failed. But in this span closure, you could actually change the success or failure status of that span. You can inject a ton of information into this particular call that depends on the previous execution, so there's a lot of flexibility in how you can use this tool. Then finally, as the final step of our pipeline, we close the trace, and the REST Trace Service pushes that off to your observability backend, where you can see the trace yet again.
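Since every CI job can run arbitrary commands, one way to make those calls is a small Node.js helper invoked from each job. This is purely a hypothetical sketch: the endpoints match the assumptions above, and the CLI shape, state file name, and status field are made up for illustration (Node 18+ assumed for the global fetch):

```js
// trace-ci.js - hypothetical helper, e.g. `node trace-ci.js span-end --status=failure`
const fs = require('fs');

const TRACE_SERVICE = process.env.TRACE_SERVICE_URL;
const STATE_FILE = 'trace.json'; // passed between jobs, e.g. as a CI artifact

async function call(path, body) {
  const res = await fetch(`${TRACE_SERVICE}${path}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  return res.json();
}

async function main() {
  const [cmd, statusArg = ''] = process.argv.slice(2);
  if (cmd === 'trace-start') {
    // .pre stage: start the trace and persist the ID for the later jobs.
    const { traceId } = await call('/trace/start', { name: 'pipeline' });
    fs.writeFileSync(STATE_FILE, JSON.stringify({ traceId }));
    return;
  }
  const state = JSON.parse(fs.readFileSync(STATE_FILE, 'utf8'));
  if (cmd === 'span-start') {
    const { spanId } = await call('/span/start', {
      traceId: state.traceId,
      name: process.env.CI_JOB_NAME, // job name provided by GitLab CI
    });
    fs.writeFileSync(STATE_FILE, JSON.stringify({ ...state, spanId }));
  } else if (cmd === 'span-end') {
    // after_script: report how the job panned out via a status field.
    const status = statusArg.replace('--status=', '') || 'success';
    await call('/span/end', { traceId: state.traceId, spanId: state.spanId, status });
  } else if (cmd === 'trace-end') {
    await call('/trace/end', { traceId: state.traceId });
  }
}

main().catch((err) => { console.error(err); process.exit(1); });
```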
And this looks exactly the same as the other trace we just looked at, and that's awesome, because what you can then start to do is link these traces together from all of these disparate contexts and have a full end-to-end picture of all of these processes working together. Even if each of those processes lives in a very different context, that's fine; you can now visualize all of it by linking on this shared data across all of these traces. So we've looked at a couple of things we can do with this tool and how flexible it is, but there's still a little bit more to do.
Justin's going to talk about that.

Yep. So what are the next steps for our tool? Well, we need to standardize around trace creation. Right now, within our organization, everyone's doing it their own way because we're still exploring it. We need to start standardizing around what needs to be a trace and what doesn't, what goes into the APM tooling and what doesn't.

Number two is ease of use. Right now, a lot of the properties you send via the REST calls are one-to-one mappings to OTel keys. Some of these make sense, but for some of them we should probably change the language a little to help those who maybe aren't familiar with the OpenTelemetry tooling, and also to make sure we're not all saying our own thing with it. We don't want to go into our APM and find a million different attributes because we didn't standardize on what we were going to do.

Then finally, we're looking at open sourcing our tooling. We're going to be working with our organization to hopefully bring this out to everyone, so that you can use it and hopefully help us out by adding new use cases.
And I'd like to say thank you. This was our talk, our pretty quick talk, on the REST tooling and kind of our start into this whole ecosystem. Any other final words, Justin?

No, I just want to say that it was a lot of fun getting to this point. I'm really excited to see the future of this tool and the OpenTelemetry project as a whole. Thanks.

Thanks.