Transcript
Hello and welcome.
Today I'm going to be talking about OpenTelemetry, specifically how I got OpenTelemetry working in a polyglot environment. Before we begin: my name is Michael Sickles, and I'm a solutions architect at Honeycomb.io. I've been working with our customers recently, setting up instrumentation for their environments, using OpenTelemetry to get application insights. And that's what it's really for: OpenTelemetry gives you tools, APIs, and SDKs to ask questions about your system, seeing things like traces, metrics, and logs. Why it's important and useful is that it's a standard, so you can spend the work to instrument and get these application insights once, and not have to redo it later.
Typically, the old style is that a lot of the vendors out there have their own proprietary format for how you add instrumentation and get insights. And it's frustrating when you want to either try multiple tools or switch to a new tool, because it takes work: you have to lift and shift and rewrite code in order to try things out. With OpenTelemetry, you do that once, and then as you move along you can keep adding the OpenTelemetry insights as you develop new code. You can then point it at one or multiple vendors, to be honest, and from there you can see which tool is going to be best for your situation and where you're going to get the most value. You're no longer locked in just because of all that work you put in upfront. And so I wanted to find a good demo and a good environment to help our customers see how to use OpenTelemetry. There are a lot of examples for individual languages: here's a Node app, how do you set it up for a Node app? Here's a Java app, how do you set it up for a Java app? But there aren't a lot of good tutorials or examples for something more complex, something that is distributed tracing talking across multiple different kinds of code systems. And so I found the CNCF Google Microservices Demo. This is a microservices application on Kubernetes, and it's polyglot: it's got Node, Java, Go, .NET, and Python.
It also uses OpenCensus; OpenCensus was the way it got its telemetry insights. OpenCensus was a standard before OpenTelemetry, and there was another standard called OpenTracing. OpenCensus and OpenTracing decided to come together, to unify the standards and have one, and that's where OpenTelemetry came about: they merged together. So I could take that OpenCensus instrumentation and know I can switch it over to the OpenTelemetry style. There might be some semantic differences, but I could say: look, here it's using OpenCensus to get insight; let's get insight in OpenTelemetry as well.
The application itself is on the right, and it's an ecommerce website. It allows you to find items to add to your cart and check out; you can convert different currencies, you can get ads. And these are the different services that make up the application. What's going to happen is we're going to start at our front end, which is written in Go, and we're going to take some kind of action, maybe add to cart or check out or see an ad. These are the traces (I'll get more into that later) that we're going to follow through as the front end makes a call to a backend service, which might make calls to other backend services. And these are on different servers, in different coding languages. We want to be able to watch one call, one action, one transaction, and see all the different pieces that that action talks to and connects to. What that allows us to do is see where slowness might be in the system, or where there are errors in my system. I can target and get to root cause faster with a tool like OpenTelemetry and some vendor out there.
When I was considering how to instrument and what I wanted to do, I right away thought: I'm going to reuse that OpenCensus code, like I mentioned before. It's already in place; I can just convert it to OpenTelemetry semantics and OpenTelemetry libraries, and it will give us good insights. I'm going to follow the front end to the back end, so I'll start with that front end first, then move my way to the other services it talks to. And then I'm going to use automatic instrumentation when possible. When you're going through and adding telemetry to your system, all the different code languages have various ways to get automatic insights. That can vary from automatically hooking into, say, the JVM for Java, to pulling in specific wrappers that wrap the libraries themselves, say in Node.js, to automatically do tracing for us, automatically starting and stopping units of work for those libraries. If it can do the work for us, that's great, right? That's another benefit of OpenTelemetry: we have all these different organizations working on it, so if a library is being used at one organization and they add the instrumentation pieces, that might make its way upstream into OpenTelemetry, and then any other organization that uses those libraries can get insights.
Finally, though, I'm going to want to add manual instrumentation as well. You can get a lot from automatic instrumentation, but it can only take you so far. You know your system better than anyone else, and no automatic instrumentation is going to get you to where you need to be to really understand the state of your system. You've got this auto-instrumentation, which is good, but to take it to the next level you can add things like user details or server details or product details, the nuanced differences in why your system might be performing or breaking in different ways. And by adding in those details, you can ask more questions about your system. That's what we ultimately want to do: we want to understand how it's performing, and maybe for whom it sucks in my system. Before I go on, a little bookkeeping here on terminology.
So a span is a unit of work. It's an action in code, maybe a function or a method, and it took some amount of time: three milliseconds, 20 milliseconds. It's something we measured. Then we have span attributes, where we add contextual details: that user id, that session id. We want to add the variables in our code to the span so I can understand what was going on in my system at that specific point in time. And then I can add span events; a span event is essentially a log attached to a span, more or less. It's something that doesn't necessarily have a duration but is interesting. For example, an exception: if there is an exception, we want to take that information and attach it to this specific point in time, that span, that unit of work where it happened. An exception doesn't span a certain amount of time; it's just something that happened at a point in time. A trace, then, is a collection of spans for a certain action: that add to cart, that check out. That is something your users are doing, and it's going to touch different spans as it goes through different pieces of your code and connect them together using a unique id. OpenTelemetry automatically generates that id and connects it across your code, and your vendor tool of choice is then able to render it some way in the UI, in some kind of view, to make sense of it. If you ever see me talk about OTLP in this presentation, OTLP is just the specific protocol and format for OpenTelemetry itself. And then exporters: an exporter is where we are sending our data. You can export to the console window, or we can export to a vendor, export those application insights somewhere.
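To make those terms concrete, here's a minimal sketch in Go (written against the current stable go.opentelemetry.io/otel API rather than the exact versions from this talk; the tracer name and attribute key are made up for illustration):

```go
package demo

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func addToCart(ctx context.Context, sessionID string) {
	// Span: one unit of work, with a measured duration.
	ctx, span := otel.Tracer("frontend").Start(ctx, "addToCart")
	defer span.End()

	// Span attribute: contextual detail about this unit of work.
	span.SetAttributes(attribute.String("app.session_id", sessionID))

	// Span event: a point-in-time log line attached to the span,
	// with no duration of its own.
	span.AddEvent("cart item validated")

	doWork(ctx) // pass ctx along so child spans join the same trace
}

func doWork(ctx context.Context) {}
```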
So I started with the front end, and we can see that I ripped out the OpenCensus libraries. By the way, the GitHub repo where I made these changes will be at the end of the presentation as a follow-up; if you want to go in and see the specific things I changed from OpenCensus to OpenTelemetry, you can see all the changes I did using the git history and compare. I'm not going to walk through every single little code change, just because it took me a couple of hours to do and thus I can't fit it in one presentation. But yeah, I loaded the OpenTelemetry SDKs, so you can see the change. You can see some of the libraries have similar namings: for example, there's an OpenCensus trace, and there's also an OpenTelemetry trace. The idea of traces and spans is similar across the two. And then there's that idea of auto-instrumentation: in Go, we're going to have wrappers around the various libraries. In this case I have a gorilla/mux router, and I want to automatically get insights on my different HTTP requests. There is a library out there that automatically does that for me; I import it and then just wrap my router.
When we go through in OpenTelemetry (and this is going to be pretty common across all the different coding languages), we're going to create some kind of exporter; we need to send our data to a location. I've removed some vendor-specific information here, but the gist is you're going to send it to some API endpoint. If that API endpoint is secure, you're going to have to be careful with that: you're going to want some SSL credentials. This is a nuance I found going through: Go doesn't automatically infer whether it's HTTP or HTTPS. You have to add in these blank credentials to say, hey, this is going to a secure endpoint; if yours is going to an insecure endpoint, you wouldn't need that specific piece. And we're exporting in that OTLP format over gRPC. The next thing we're going to do is create the tracer.
What this tracer does is automatically propagate the trace context, the tracing information, and connect and create that unique id. Then we have this span processor that's just processing our spans. We have a batch span processor: rather than hitting the endpoint for every single event, it's going to batch them together so you can save some network bandwidth. And we just have to add a little bit more contextual information; we added a service name for my front end here.
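Sketched out, that Go setup looks roughly like this. It's written against a recent version of the Go SDK and OTLP exporter (the package layout was different pre-1.0), and the endpoint is a placeholder, not a real vendor URL:

```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
	"google.golang.org/grpc/credentials"
)

func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	// OTLP-over-gRPC exporter. The blank TLS credentials are the nuance:
	// they tell the client this is a secure endpoint. For an insecure
	// endpoint you'd omit them.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("api.example.com:443"), // placeholder
		otlptracegrpc.WithTLSCredentials(credentials.NewClientTLSFromCert(nil, "")),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		// Batch span processor: send spans in batches, not one call per span.
		sdktrace.WithBatcher(exporter),
		// The extra contextual information: a service name for the front end.
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("frontend"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```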
Beyond that, there are a couple of other pieces in code that I wanted to add. You can see at the top that r.Use middleware call; when you look at the code, that is me taking the OpenTelemetry middleware, wrapping my gorilla/mux router, and getting that auto-instrumentation piece. That's what I wanted, right? Beyond that, the microservices demo uses gRPC to communicate with all the different backends, and there is actually an OpenTelemetry gRPC wrapper as well. So I was able to utilize that to automatically add trace context and span durations for my gRPC calls. Great: less work for me.
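Both wrappers end up being a couple of lines apiece, roughly like this (current contrib package paths, with a placeholder address; the demo's in-cluster backend calls are plaintext, which is why the dial itself is insecure):

```go
import (
	"github.com/gorilla/mux"
	"go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

func setup(addr string) (*mux.Router, *grpc.ClientConn, error) {
	// Incoming HTTP: one middleware call gives every request a span.
	r := mux.NewRouter()
	r.Use(otelmux.Middleware("frontend"))

	// Outgoing gRPC: interceptors propagate the trace id to the backends
	// and time each call.
	conn, err := grpc.Dial(addr,
		grpc.WithInsecure(), // plaintext in-cluster call
		grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
		grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
	)
	return r, conn, err
}
```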
But I do need some manual instrumentation.
Ultimately, if I'm going to ask questions about my system, I want to understand what is going on in my code. So, for example, maybe it's that session id: I get a support email, and I can look at that session id to see what happened for that user. Not only am I understanding high-level details of how my system is performing and how my calls are doing, I'm now empowered with extra details from the variables in code. I have things like an email, and in this case a zip code, state, country, and session. But I can add absolutely anything that I think will be useful later when I want to ask questions about that data.
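In code, that enrichment looks something like this (the attribute keys here are illustrative, not the exact names from the repo):

```go
import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// Attach business context to whatever span is currently active, whether it
// was started manually or by the auto-instrumentation.
func annotateCheckout(ctx context.Context, email, zip, state, country, session string) {
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(
		attribute.String("app.user_email", email),
		attribute.String("app.zip_code", zip),
		attribute.String("app.state", state),
		attribute.String("app.country", country),
		attribute.String("app.session_id", session),
	)
}
```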
So with that, I was able to deploy it. The Google Microservices Demo uses Skaffold to automatically deploy into this Kubernetes environment; this is in our AWS cluster. And honestly, I just took that front end instrumentation and copied it. There are three other Go services, and once you know how to do it in one coding language, you reuse it: copy and paste, just change the service name. I used the same gRPC wrappers, because those services also use gRPC to communicate. It was pretty easy once I got the first front end working.
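If you haven't used it, Skaffold boils the whole build-and-deploy loop down to roughly one command (the registry value here is a placeholder):

```sh
# Build every service image, push to the registry, and deploy the demo's
# Kubernetes manifests to the current kubectl context.
skaffold run --default-repo=gcr.io/my-project
```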
So from there I needed to do my next service, and I was looking downstream, like I said, front end to backend. Well, this front end is touching this ad service.
Java is really nice. For Java, there's an agent. This is different from the other libraries and other coding technologies, which use SDKs; Java has an agent that can hook into the JVM itself. As it hooks into the JVM, you just set a couple of environment variables (you can see them down here on the bottom), and it comes to the rescue: it hooks in with a long list of automatic instrumentations, tons of different libraries like Spring, databases, HTTP calls, Kafka. It's going to wrap those automatically for you from the JVM context, and you're going to get a lot of good insights in Java.
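The shape of that configuration is roughly as follows. The exact variable names depend on the agent version (these match the post-GA agent), and the endpoint, key, and jar path are placeholders:

```sh
# Attach the agent to the JVM; it instruments supported libraries at startup.
export OTEL_SERVICE_NAME=adservice
export OTEL_EXPORTER_OTLP_ENDPOINT=https://api.example.com:443
export OTEL_EXPORTER_OTLP_HEADERS=x-api-key=placeholder
java -javaagent:/opt/opentelemetry-javaagent.jar -jar adservice.jar
```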
Java is GA now. When I originally did this (and if you look at the code, I still need to update it), it was version 16 or 17; now that it's GA, I do need to eventually update to the newest Java agent. Starting with this Java agent is great.
But there was existing OpenCensus manual instrumentation, and one of the things I wanted to do, as I said, was reuse it, right? You can see how the terminology is very similar from OpenCensus to OpenTelemetry, and this is common across a lot of the other languages as well. Here, I wanted to add attributes: in OpenCensus it was putAttribute, and I just switch it over to setAttribute. Easy change. addAnnotation became addEvent, so now I have a span event, and I have that logging information with context about my Java application, and that's awesome. Java also has a really neat thing in that you can take the manual instrumentation SDK and hook into the automatic instrumentation. It has this @WithSpan annotation that will automatically wrap your function call and do the tracing and the timing for that span (that getAds span), which is great. And then I can call Span.current() to get the specific span from the auto-instrumentation, the point in time I'm at, and add in different attributes: that setAttribute, that addEvent, et cetera. This allows me not to have to sit there and manually start and stop my spans, which is something I was trying to avoid if at all possible; it's just a little bit more work. With this I got going pretty quickly, and I started seeing ad service information in my vendor tool.
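Put together, that pattern looks roughly like this. The annotation import matches the pre-1.x extension artifact I was using, and the class, method, and attribute names are illustrative:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.extension.annotations.WithSpan;

public class AdService {

  // The agent sees the annotation and wraps this call in a span for us:
  // no manual start/stop, no try/finally.
  @WithSpan("getAds")
  public String getAds(String category) {
    // Grab the span the auto-instrumentation just started...
    Span span = Span.current();
    // ...and enrich it: putAttribute became setAttribute,
    // addAnnotation became addEvent.
    span.setAttribute("app.ad_category", category);
    span.addEvent("ads fetched from cache");
    return "some ads";
  }
}
```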
So I'm going to continue down that path of wanting to trace across multiple parts of my system. Moving down from the checkout service, I see that it touches two Node services, so I decided: let's just go there next.
So, inside my payment service (this is one of them): Node is a little bit different, and this is a common theme; there are going to be little nuances. The Node code uses this tracing.js file, and that gets started up with the node command. You'll see in the Docker image for this, in the source code, I just added: hey, start up with this tracing.js. And it's going to go through and do similar things; I'm going to have an auto-instrumentation piece.
Node is nice in that you can wrap, or rewrite, the JavaScript code. In this case I have plugins, and I'm loading in a gRPC plugin (once again, gRPC calls are what it's making) and an HTTP plugin. But there are multiple plugins out there; there's an Express plugin. If you go on npm you can find them, and they're also on the GitHub. They might not be in the docs yet, but I'm sure they'll be there soon. You just npm install these plugins.
In reality, I shouldn't have had to manually put in this plugin configuration, the names and enabled flags and locations; according to the docs, it should have done it for me automatically. It wasn't working, so I manually said: hey, these are my plugin names, this is where they're located, please enable them.
From there, we're setting up a collector exporter again. Java didn't have this create-SSL step, but Node does: I have to create SSL credentials if I'm sending to a secure endpoint, and in this case, for this specific implementation, I am. And with that, I'm also exporting in my OTLP format over gRPC.
So I have to be mindful of using the right Node.js libraries: gRPC is usable in the backend, so I could use that, and with the OTLP format I can send directly to this vendor. With that, I create my tracer; in this case I just kick it off, register it, add the auto-instrumentation pieces, and go see what ends up in my UI.
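My tracing.js ended up looking roughly like this. It's a sketch of the pre-GA, plugin-style API I was using at the time (the package names changed around GA), with a placeholder endpoint:

```js
// tracing.js -- loaded before the service code so the plugins can wrap
// the grpc and http modules as they are required.
const { NodeTracerProvider } = require('@opentelemetry/node');
const { BatchSpanProcessor } = require('@opentelemetry/tracing');
const { CollectorTraceExporter } = require('@opentelemetry/exporter-collector-grpc');
const grpc = require('grpc');

const provider = new NodeTracerProvider({
  // The docs said plugins load automatically; that wasn't working for me,
  // so I named them and their locations explicitly and enabled them.
  plugins: {
    grpc: { enabled: true, path: '@opentelemetry/plugin-grpc' },
    http: { enabled: true, path: '@opentelemetry/plugin-http' },
  },
});

provider.addSpanProcessor(new BatchSpanProcessor(new CollectorTraceExporter({
  serviceName: 'paymentservice',
  url: 'api.example.com:443', // placeholder vendor endpoint
  credentials: grpc.credentials.createSsl(), // SSL credentials for a secure endpoint
})));

provider.register();
```

The Docker image then starts the service with something like node -r ./tracing.js server.js, so this file loads first.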
But there are some things to understand; there's more nuance, and it's not always great. For example, when I wrote this, Node.js was not GA (I'm pretty sure it's still not GA, but it's working towards it). Node not being GA means there are sometimes going to be changes to the spec, the APIs, and how you do things. And this is a case where there's a little difference between when I originally did this and now. I will eventually update the repository to use the new method and add in the versions, or update to the newest version of the Node.js OpenTelemetry libraries. But essentially it's similar: instead of a plugin, it's an instrumentation. It's still loading that auto piece and wrapping around it.
All right, so now I have my two Node services (once again, I'm just copying and pasting my tracing.js), and I'm going to move on to the next piece.
For that next piece, I decided to go downstream to this cart service, coming from that checkout service. It's .NET Core, and with .NET Core and .NET there are, once again, more nuances. These are things I had to work out by going through the documentation and the GitHub repos. .NET uses the built-in Microsoft profiling libraries, so there are some differences in naming: if you use the manual instrumentation, you'll see slightly different terminology for adding attributes, putting attributes, et cetera.
But at a high level, getting that automatic instrumentation was very similar, pretty straightforward, in that I have a startup file where I am configuring my services and initializing my telemetry. So I add my tracer, my OpenTelemetry tracer, to the services themselves. From there, I have my instrumentations; instrumentations are my wrappers, my automatic instrumentations. They automatically take that trace id and automatically take the durations of how long pieces took. Less work for me.
With that, I then also want to add an exporter: once again, OTLP format, going to a specific endpoint. I removed the vendor-specific code here, because you might have to add things like API keys and the vendor URL where you want to send the data, but you should be able to take this and apply it to different vendors, at least from a reference standpoint. And then, of course, we finally see that same issue: we need to make sure we add those empty SSL credentials if we are sending to a secure endpoint, and in this case it's a gRPC secure endpoint.
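The startup wiring looked roughly like this. It's a sketch against the pre-1.0 OpenTelemetry .NET packages I was using, where the gRPC-based OTLP exporter took a Grpc.Core credentials object; names shifted around GA, and the endpoint is a placeholder (API-key headers would go in the same options):

```csharp
using Grpc.Core;
using Microsoft.Extensions.DependencyInjection;
using OpenTelemetry.Trace;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddOpenTelemetryTracing(builder => builder
            // Auto-instrumentation wrappers: trace ids and durations for free.
            .AddAspNetCoreInstrumentation()
            .AddGrpcClientInstrumentation()
            // OTLP exporter to the vendor endpoint.
            .AddOtlpExporter(options =>
            {
                options.Endpoint = "api.example.com:443";   // placeholder
                options.Credentials = new SslCredentials(); // "empty" SSL creds for a secure endpoint
            }));
    }
}
```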
And I left it at the automatic instrumentation for .NET; I didn't add in the manual instrumentation yet. That is still on my to-do list. So I decided to move downstream once again.
All that's left is two Python services. I'm getting close to the end, to being able to see this grand trace of my system, to see communication between services.
So I have this email service in Python, and I personally have not used a lot of Python in a production environment; I've used Python for scripting, but not really for a web application. So I had to do a little more reading up and figure out how a requirements.in file works. And using that requirements.in, I was able to once again remove the OpenCensus code, libraries, and references, and instead add the OpenTelemetry pieces.
The documentation was a little bit lacking on the Python side. I think it's just the nature of things still growing and still being in flux for some of the languages, but it will get there eventually. We're following the same thing we did before: we have an exporter, and with our exporter we are doing the OTLP format to our endpoint, like before, with empty SSL credentials. Great. We have a tracer provider, a tracer, whatever you want to call it in the different languages; in this case we give it our service name, and we add our span processor. We're just going to simply export to this location. And then I wanted to add some manual instrumentation as well. There is this server interceptor; it's an OTel-specific wrapper for my gRPC server, and it allows it to get the trace id from the upstream calls and automatically add it to the Python calls. And I like that automatic instrumenting: as I've mentioned before, I kind of want to get going quickly, see what I get out of it, and then add my manual instrumentation later.
So we got that.
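For reference, here's the shape of that Python setup. I've sketched it against the post-1.0 package layout rather than the exact pre-GA code in the repo, with a placeholder endpoint:

```python
from concurrent import futures

import grpc
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.grpc import server_interceptor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Exporter: OTLP over gRPC, with the "empty" SSL credentials because the
# endpoint is secure.
exporter = OTLPSpanExporter(
    endpoint="api.example.com:443",              # placeholder vendor endpoint
    credentials=grpc.ssl_channel_credentials(),
)

# Tracer provider: give it the service name and a span processor.
provider = TracerProvider(resource=Resource.create({"service.name": "emailservice"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# The OTel server interceptor picks the upstream trace id out of incoming
# gRPC metadata and continues the trace in the Python spans.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    interceptors=[server_interceptor()],
)
```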
What does it all look like? Did we actually do it? The short answer is yes. The long answer is that it took me a couple of tries. There were a lot of learnings in there, where I didn't get tracing propagating right. For example, in that Go piece, I had to add the automatic gRPC instrumentation, because without it, it wasn't propagating the trace ids. And that's really the problem, and why I wanted to do this: there are all those good examples individually, but we really need more examples of a complex environment, a microservices architecture talking to different types of code environments, with tracing through it. And so, here's the grand trace.
This is in Honeycomb, because I work for Honeycomb, but it might look similar in the tool that you choose. We can see that we are tracing across services: it's taking that trace id from my front end and sending it to my checkout backend, and it's telling me how long I spent. That's great. This is the power of a trace and a span: in a trace waterfall view, I can see where most of the time is being spent. I can follow through, and I see that the Java piece, the Go pieces, or whatever have you, all have those span events, and I can see the span events, in this case, as those dots; those are maybe the exceptions or log messages that are important. And those span attributes are there too.
So now I can ask questions about my system. I can see: someone@example.com, what was their experience when they did a checkout? And that's ultimately the thing we want to solve for, right? We want to understand where the bottlenecks are; we want to get to root cause. Because if your system is down, your users are not happy, and we want our users to be happy so they continue using our products.
A lot of lessons learned.
Honestly, the biggest problem I had going through all this is that the documentation can be lacking for some of the languages. I did have to sift through the GitHub code itself. That is continually changing: already, since I did this a few months ago, there's better documentation, and more of the languages are GA. OpenTelemetry is moving very fast, and it is becoming more robust, like, daily. And this is why it's important to have this nice and open format: you have the mind share of everyone out there who is interested in it working on it. That's great.
But with that, we saw that some of the languages are pre-GA. That is a risk you might have to take, but it is, once again, also continually changing. More and more languages are becoming GA, and I expect tracing, at least, to be pretty robust and GA across all the different languages we've seen today. Metrics and logs are in the pipeline for OpenTelemetry, and eventually we'll get there too.
There are different nuances I saw. I was having trouble with that SSL piece, and I had to just dig into the code to figure out why I needed to do it for some of the languages but not for Java. And I figured out that it's just one of those nuances.
We definitely also need more examples. Like I mentioned, the individual examples are good: you can go into the GitHub, and they all have examples of how to use OpenTelemetry; tons of vendors have examples of how to use OpenTelemetry for the individual languages. We need more examples of complex, real environments, because that's where you're going to run into more of the nuances, the edge cases: how to set something up, how to get tracing across the different languages, for example. Hopefully this Google Microservices Demo fork is going to help some more people out there, and I'm going to keep updating it. With auto-instrumentation, your mileage may vary, so that is something to always keep in mind. Auto-instrumentation is good, definitely, for getting up and going, but it's not going to be the be-all and end-all; it's not going to solve all your problems for you. I don't think it ever will. So be prepared to go in and do some manual instrumentation. And yeah, I need to add back in some of the health checks. I had to remove them because, for some reason, when I added my OpenTelemetry pieces, the Kubernetes pods would crash because they weren't starting fast enough; but when I removed the health check that sees if the pod is ready, they started up fine. That's just something I need to do to get this demo back to the good standing the original OpenCensus version was in. So, my next steps.
Add more information. Just recently, I wanted to figure out how to use baggage, and I was able to dig through the docs, dig through code, and figure out how to add baggage, and I got baggage working. Baggage is taking something like that session id you saw earlier and propagating it as well to all the downstream calls, not just the trace id, so that I can set something like that session id on every single one of my spans. And I was able to add that into the code.
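Conceptually, baggage in Go looks something like this (stable otel/baggage API; the key name is illustrative):

```go
import (
	"context"

	"go.opentelemetry.io/otel/baggage"
)

// In the front end: put the session id into baggage, so it propagates to
// every downstream service along with the trace context.
func withSessionBaggage(ctx context.Context, sessionID string) (context.Context, error) {
	member, err := baggage.NewMember("app.session_id", sessionID)
	if err != nil {
		return ctx, err
	}
	bag, err := baggage.New(member)
	if err != nil {
		return ctx, err
	}
	return baggage.ContextWithBaggage(ctx, bag), nil
}

// In any downstream service: read it back, for example to stamp it on the
// local span as an attribute.
func sessionFromBaggage(ctx context.Context) string {
	return baggage.FromContext(ctx).Member("app.session_id").Value()
}
```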
You'll see that, at least for the front end and a couple of the pieces it touches. I definitely need to add more manual instrumentation. That's another piece that might be lacking in some of the documentation: how to get specific situations set up. So I want to make sure I have all those situations covered for any of my customers, and for anyone out there who is interested in setting up specific attributes and such for the different coding languages.
Eventually, we also want to use this as a demo environment, and with that we want to be able to add some arbitrary slowness. Your tracing tool should be able to identify bottlenecks, and we want to be able to showcase that, even in a complicated environment, we can add slowness and quickly identify it. You should make sure the tool you're using can do that as well. Obviously, as I just mentioned, the health checks need to go back in, and then, finally, I need to update to the latest versions.
Feel free to make some PRs to the code. It moves fast: the OpenTelemetry libraries have updated through multiple versions in just a few months, as everyone works towards that GA for all the different languages. Eventually, all of them will get to stable, and it'll be perfect and great. Until then, I'm just going to have to keep monitoring and keep it updated.
Thank you. I really appreciate you taking the time to watch my presentation. Here are some of the resources. At the bottom, you can see the microservices fork that I created on our GitHub repository page. You can also look at the OpenTelemetry docs and GitHub; that's where you're going to find a lot of the information to really understand how to use it. And there is also Slack: the CNCF Slack has OpenTelemetry channels for you to ask questions as well. Thank you.