Transcript
Hi, everybody. Thank you so much for joining me.
I am Nocnica Mellifera. Let me put on
the full face to say hi. Hi, everybody. Thank you so much for coming out.
This is open source observability with open telemetry.
I'm Nocnica Mellifera. You can find me most places at serverless mom.
You can also just google the name Nocnica, turns out, and I come up.
So that's fun.
Given in association with telemetryhub.com. Go check it out
while we're talking this through. Okay, so what
is observability? Frankly, observability is
a term that is much more familiar on the west coast of the United States
than it is across the entire tech sector. So I
think it's fair to say, hey, hopefully you're here
because you understood something about it or you've heard the term before.
It is not a single tool or a special case or
a standard. It is a design criteria.
And I think of observability
as being about the time to understanding, not just the time to know. People like Charity Majors
and the open telemetry project have talked about defining it this way.
We think of the time to understanding of a particular problem or
issue or service interruption as the
first half of your time to resolution. And so
a lot of the time throughout this talk, I'm really going to be referring to
these situations where a service is completely down or otherwise really not
performing as it should. But observability can also
cover these cases where it's like, hey, why is this so slow for
users in this region? Some people have reports of a bug that we haven't
been able to replicate. These are also problems that observability can address.
Right. It's possible to have a fix without understanding
the problem. This is an example where, hey, you know that eventually the
service runs out of memory, so you go ahead and restart it. And we've all
seen those setups where it's like, yeah, we just
need to restart this thing every 24 hours because we know it's running out
of memory. We don't know why. And so that's an example where we have
no observability, really no understanding of the system, but we do have a fix.
But without understanding, the stress of a particular problem
is pretty high. Right. Much better to have some
understanding of what's going on.
Okay,
so why are microservices a little bit harder
for this? Why do they make the challenge larger? So let's talk about
a historical time where we were really thinking about monoliths
as ways of creating production
software, right. In the era of the monolith, only a
few people understood the whole system.
So most people were working in little areas of it.
They often felt like they needed the
expertise of a small group of people who really understood the whole system.
But those who did understand the whole system,
they had a very full explanation of problems
that were happening on the stack. And the
biggest problem with a monolithic architecture is
actually not at all about how it performs.
Some people will say, hey, we don't do monoliths anymore because they don't scale correctly.
That may or may not be true. That's not always the case. But the
problem that monoliths really created was
that it often took months for someone to become an effective team
member once they joined your team. And with a lot of people
averaging just two years in a particular position,
monoliths just don't work anymore. So you have to have these
microservices so that people can get up to speed on a single microservice
and be contributing within weeks instead
of months. And so that's the
reason for the migration. It's not really because a monolith performs
so poorly. And one of the things that a monolith did a lot better was that
all the information is available on the stack at
any time that you choose to stop and see what's going on.
So a person who understands the monolith well can very quickly get to
the bottom of a particular problem because all the information is available.
So with microservices, right, someone understands
each of the interconnected dots completely. They completely
understand how that dot works. But nobody
understands the map that covers all of these dots.
Microservices obviously have a ton of performance advantages,
scaling advantages, and again, that advantage with how quickly people can
start contributing to the team. But with observability and
any kind of understanding of an outage, in almost all
cases, microservices are going to be a dead weight loss. They're
going to make the situation worse. Like, for example, if it's a
5:00 a.m. or 3:00 a.m. outage: with a monolith, once everyone on the team is awake
and the people who understand the system best are awake and have
gotten connected,
somebody's going to understand what's going on. But very
often with microservices, one of those common questions I have gotten
when working with observability tools and people have
these very deep microservice architectures, people just say,
hey, on a normal request, no problem with the request, no failure,
how do I find out which services are being
hit by that request? So a simple question like, hey, when somebody comes
and checks out from our ecommerce store, what services are
involved in that checkout? Okay, so that shows you this is
an oversimplified version of microservices, right? They really are
multifaceted, very, very complex, and quickly build to a complexity
where it's very hard to even understand where a successful request is
going. And so we
can move very quickly to that chaos where it's very hard for
us to understand what's going on inside a microservice architecture.
Okay, let's talk about how we solve this with observability.
There are three major components to observability that we
need to ensure. I'm going to be a little bit quick with this because we're
going a little bit deeper into concepts after this. So we're going
to zip through this just a little bit. But there are really good
write-ups on opentelemetry.io about the concepts of logs, traces,
and metrics, which are the three pillars of observability.
So let's start with metrics, right? When you don't
know what's happening, count something. I've actually lost where I got that quotation
from; it's a quotation from a statistician.
One way to think about what a metric is: the speedometer
on your car is a metric, a
numerical measurement of a complex system.
So instead of saying, hey, you've just passed Slough
and you're going into this next place, and then you're going to get
there in this much time, or these other things, a metric is a very simple
measurement. Hey, you're currently going this fast.
They're very easy, or they should be an easy way to get
a high level view. This is a nuclear control station.
So it really gives you a sense of how you can get so
many metrics very quickly that you don't have a very quick and easy
view, but you do have a high level view of what's going on. And metrics
are also very easy to store in a high volume. So metrics don't present
usually a challenge for like, hey, where are we going to keep all of these?
If you're getting to a point where your database is struggling to contain the metrics
that your production service is generating, either you're Netflix, and
I'm sorry, or you have an issue with
configuration, things like metric explosion, which we're
not going to get into here. But yeah, normally metrics are very easy to store.
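To make that concrete, here is a minimal sketch of recording a metric with the open telemetry Python SDK. The counter name and attributes are hypothetical examples, and the console exporter is only there so the sketch is self-contained; in a real setup you would point this at a collector or backend.

```python
# Minimal metrics sketch with the OpenTelemetry Python SDK.
# The counter name and attributes are hypothetical; the console
# exporter just prints locally so the example is self-contained.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export accumulated metrics periodically (to the console here).
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")

# A simple numerical measurement of a complex system: count requests.
request_counter = meter.create_counter(
    "http.server.requests", unit="1", description="Handled requests"
)

# Somewhere in your request handler:
request_counter.add(1, {"route": "/checkout", "status": "200"})
```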
You have logs, right? Logs,
as I say, they always have a complete and thorough explanation of
the problem somewhere, right. But storage and management are their
own challenge. Logs can be so
complex, can contain so much data, that very often
the real challenge is just sorting through them during a crisis.
And so there are people who are of the opinion that if there is an
outage and if something is not working, they really don't want to be starting
with logs. They know that they're not in a good place if logs
are where they're starting. And then finally
is our new entrant in the last five to ten years into
the story of observing our systems, which is tracing, right.
Informally, traces are a hybrid between metrics and logging.
And they're trying to generalize observed time
spans, which is a
little bit obtuse. But essentially tracing is supposed to show
us the components that are hit by a request. And because
we use a modern architecture, those are not going to be
sequential, right. They're going to be multiple time spans happening
all at once or at the same time. And we have
a few more figures here to kind of help us see that.
So, one little side note about tracing. Tracing is
relatively dense. It should be as dense as
logging, possibly more so.
And one of the dirty secrets about tracing is that most trace data is
never viewed. And by most we mean like three nines of data.
The vast majority of trace data is never viewed.
I see my little face is covering my joke there, right?
Really, most of it is never viewed.
That's kind of worth noting when we think about our data retention problems
and other problems like that. Okay, that is
not what I wanted to do. Let's come over here.
There we go. Okay, so from
tracing, we came to the concept of distributed
tracing. Fix that.
So distributed tracing really
is just the implementation of tracing, but one that is able to track
an event across multiple microservices. So here
you see this request being passed around,
which is creating multiple events which are sent to other APIs
and getting back responses. And each time this is happening,
there's some kind of persistence going out that is saying,
hey, here's the stuff that we want to log about what's happening, and we
want to be able to connect all those together. We don't want to just be
filtering logs to see that connection. We want to be able to see it easily
that this request is connected.
So at a very high level, how does distributed tracing happen?
Right. You add a trace header somewhere close to the start, you
pass it around with the request, and then you have some collector-side
logic or some data-gathering logic to stitch those
pieces together.
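As a rough sketch of that flow, assuming Python and the requests library on the sending side: the propagator writes the W3C traceparent header into the outgoing request, and the receiving service extracts it to continue the same trace. The service URL and span names are made up for illustration, and the SDK setup (tracer provider, exporter) is assumed to be configured elsewhere.

```python
# Sketch of trace-header propagation between two services (Python SDK).
# URL and span names are hypothetical; error handling is omitted and a
# TracerProvider is assumed to be configured elsewhere.
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("example")

# Sending side: start a span, inject the trace context (traceparent
# header) into the outgoing HTTP request.
def call_payment_service(order_id: str):
    with tracer.start_as_current_span("checkout"):
        headers = {}
        inject(headers)  # adds the W3C traceparent header
        requests.post(
            "http://payments.internal/charge",
            json={"order_id": order_id},
            headers=headers,
        )

# Receiving side: extract the context from the incoming headers so the
# new span is stitched into the same distributed trace.
def handle_charge(incoming_headers: dict):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("charge", context=ctx):
        pass  # ... do the work ...
```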
So the goal of tracing is to get something like this waterfall chart,
right, which is showing us here are the components that were hit by
this request, and ideally seeing them in some kind
of hierarchy to say, hey, general, we had a request to the API.
It had these components that were hit. These were the ones that were running simultaneously.
Here's how long they took. So beyond just,
hey, this went to here, which again, as I mentioned earlier,
is often where people come in, because that's what they really need from the system:
they just say, hey, I want to know what the
heck is being touched by this request.
These widths on the x axis here have a meaning. They have
a meaning of how much time something took. So you see a lot of discussion
when we talk about tracing of spans, which is the measurement of the amount of
time that each of these components took.
And then you get some kind of visual indicators of what
was blocking what. Right. Like in this case, auth needed
to be completed before we could get to payment gateway and dispatch.
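The parent/child relationships behind that waterfall come from nesting spans. Here is a rough Python sketch with hypothetical span names matching the example; again the SDK setup is assumed to be done elsewhere.

```python
# Nested spans produce the parent/child hierarchy shown in a waterfall
# view. Span names are hypothetical; assumes a TracerProvider is set up.
from opentelemetry import trace

tracer = trace.get_tracer("api")

with tracer.start_as_current_span("api.request"):      # the whole request
    with tracer.start_as_current_span("auth"):          # must finish first
        pass  # ... verify the user ...
    # in a real system these two might run concurrently
    with tracer.start_as_current_span("payment.gateway"):
        pass  # ... charge the card ...
    with tracer.start_as_current_span("dispatch"):
        pass  # ... queue fulfillment ...
```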
Okay,
so once we start
thinking about distributed tracing, one of the problems that we run
into is this problem of how
do we get these individual pieces to communicate.
And so in the sort of closed source SaaS world,
there were these efforts to say, okay, well, we'll create a
library for maybe front end measurement, for measurement of your back end system,
for measurement of your database, and then we can tie those together.
If you use our closed source tools, use our SaaS tools,
we'll be able to tie those together into a single trace.
But as microservice world started to explode,
it really got difficult to negotiate
that trace header value to be passed successfully
between all these things. And a single company, a single
effort, no matter how big, just could not maintain
a system that could be installed everywhere that would successfully pick up
this trace, report it successfully up to their system
and give you this nice unified trace. There were always going to be
these large black boxes within your trace where either
the trace data was totally lost or it's just, yeah, we were waiting for something
here, and we don't have observation of what happened.
So that is how we get to this point with open telemetry.
The history of open telemetry and distributed tracing
are intimately linked, as this is a project to define
an open standard for the communication between components so that
distributed tracing can work successfully. Open telemetry covers
the other components of observability too, as we'll get into.
But this is kind of where we start.
So a big key idea with open
telemetry is this thing of the collector. So while open
telemetry is in part just a standard for the communication of metrics,
trace and logging data, to say, hey, here's how that should be transmitted.
And that's supremely useful for distributed tracing because it means if you work
on your little project for instrumenting
Laravel, Symfony, or a particular build of Rails, or
what have you, you can follow these open standards and
be able to get traces that you can tie together. But there's
this kind of superpower involved there because we
mentioned that there's these steps to creating traces. And one of the key steps is
that we have some way to tie those traces together,
right? And that is one of the
problems that is solved by the open telemetry collector.
So the collector is where a lot of this magic happens.
And let me zoom in a little bit on this chart. So you
have these open telemetry standards and they can communicate
out to a third party service, as you can see up here. And I'll mention
a little later that one of the ways to get started is to try just
directly reporting from your service up to a Prometheus endpoint
or up to another open telemetry endpoint.
But one of the other ways to do it
is to be running a service that is an open telemetry collector, where you have
your multiple components that are reporting over
into the collector, and then the collector is saying, okay, let me go ahead
and write out really nice,
clear observability data.
And the collector is not just a data exporter
or a sort of data middleman. The collector
has all of these multiple components that can do things like filtering,
batching, attributing. And attributing,
adding attributes, I don't know, that doesn't quite feel like attributing, it feels like
it should be a separate word, but whatever. So these
processors are a key part of the story with the
open telemetry collector where
these questions that previously, maybe from a SaaS service, were pretty hard to
cover. Like hey, I had this very particular kind of
PII data, like a specific format of health data, and I
need to filter that out and make sure it's never sent, even if it got
observed accidentally. Instead of waiting on a SaaS
company to say, oh well, don't worry, we'll implement a filter for that, with the collector
you can just go ahead and grab a processor component and do that
filtering. And since a collector can be run within your own cloud,
you can say, hey, I want to do this filtering before it's ever sent
along the network.
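For a sense of how an application points at a collector, here is a sketch using the Python OTLP exporter. The localhost:4317 endpoint is the collector's default OTLP gRPC port, and the filtering and batching described above would live in the collector's own configuration rather than in this code.

```python
# Sketch: send spans to a local OpenTelemetry Collector over OTLP/gRPC.
# Assumes a collector is listening on its default port 4317; PII
# filtering would be configured in the collector's processors, not here.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("checkout"):
    pass  # ... handle the request ...
```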
Along with these three pillars, there is this concept in open
telemetry of baggage where you're able to add
a little bit of information that gets passed along. So an example might
be client id. It's kind of a classic one: all
of these microservices are maybe seeing this request, but only right at the start did
we see what the client id was. And we say, yeah, because that's useful
to us, to tie this together and to add filtering data later, we're going to add
this baggage that is this client id. Now, baggage is not reported automatically.
It's not like an attribute on a trace, but it can be useful.
You can explicitly say, hey, I want to go ahead and check this baggage here.
And if we got a client id, I want to write that to this trace.
So yeah, that's kind of the idea of
baggage, right. It's just sort of something that contains a little something else that
comes along with you. And so it's very nonspecific
about what it may contain, but it can be a useful concept as you're getting
a little bit more advanced.
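A small sketch of that pattern in Python: set the baggage once near the edge of the system, then later explicitly read it and copy it onto a span as an attribute. The client id value here is of course made up.

```python
# Sketch: carry a client id as baggage and explicitly copy it onto a span.
# Baggage is propagated with the context but is not recorded automatically.
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("example")

# Near the start of the request, where we actually know the client id:
ctx = baggage.set_baggage("client.id", "client-1234")
token = context.attach(ctx)
try:
    # Much later, possibly in another service (baggage travels with the
    # propagated context), decide to record it on the current span:
    with tracer.start_as_current_span("payment") as span:
        client_id = baggage.get_baggage("client.id")
        if client_id:
            span.set_attribute("client.id", client_id)
finally:
    context.detach(token)
```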
And support for open
telemetry is a lot better than you think. And I say that because I
was actually writing one of the write ups of, hey, here's the state of open
telemetry support. And I commented, oh, hey,
maybe Ruby is kind of not ready for use. And this is because I
was looking on the opentelemetry.io page and
just seeing, like, hey, you know, a couple of these things are listed as not
yet implemented. But think of
the way shops like Shopify use the
Ruby open telemetry project. So pretty
advanced actually, even though metrics, right on this table at the top
level, are listed as not implemented. You can actually, if you click in, see,
oh, they're experimental, but a lot of people are using them in production now.
So it is great that there is this sort of top level list of like,
here's the level of support. And for obvious reasons,
traces are kind of the first thing that's implemented. But I
really think it's worth a look. And especially because so many of these languages,
it's only logs that are missing. And the fact is you've had a
way to report up logs and filter logs for a long time, almost certainly.
So that's not really going to be the missing piece for you.
So what are we talking about when we say hey, how's this language
support? This means what is the state of the open telemetry SDK
for this language, including automated instrumentation.
So in languages like Java and .NET,
you should be able to get a ton of metrics out from this
project, automatically doing instrumentation for you and automatically writing
it to whatever data endpoint you want to send it to.
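As a Python analog of that kind of automatic instrumentation, here is a sketch using the Flask instrumentation package; the app itself is a made-up example, and the SDK and exporter setup shown earlier is assumed to be in place. The instrumentor creates server spans for every request without hand-written tracing code.

```python
# Sketch: library-level auto-instrumentation in Python via
# opentelemetry-instrumentation-flask. The app is a toy example; the
# tracer provider and exporter setup shown earlier is assumed.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # spans created for each request

@app.route("/checkout")
def checkout():
    return "ok"
```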
So getting back into that just for a moment: ways to get started.
This is from the AWS blog, but one of the things to
remember is that you do have this option about whether or not something is going
to go to a collector or go to some other data
endpoint. And so what's so cool about the collector is it lets you
decide how the data is going to be batched and how it's going to be
filtered, again removing PII and doing other kinds of clever
stuff with your data. But if you want to have things work just from day
one, if you want to just try things out, having stuff report directly to Prometheus
is totally an option that you have.
And if you're doing stuff like you want to report metrics every
few seconds or you want to report individual spans for a trace,
yeah, that's going to result in a lot of network requests if you're just
reporting directly and you don't have batching and stuff with the collector, but that's
fine for a beta project or a proof of concept.
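In the Python SDK, that difference between reporting each span immediately and batching is literally the choice of span processor. A sketch, with a console exporter standing in for whatever backend you would report to directly:

```python
# Sketch: direct, unbatched export vs. batched export in the Python SDK.
# ConsoleSpanExporter stands in for a real backend endpoint here.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

provider = TracerProvider()

# Fine for a proof of concept: export every span as soon as it ends,
# which means one export (one network request) per span.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

# Closer to production: queue spans and export them in batches instead.
# provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
```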
And then obviously once you do implement a collector,
it's very easy to change over. Also, if your data is
quite predictable, if you know what you're going to be doing, if you're using handwritten
calls to report up data, then maybe you're managing pretty well,
you're batching without having to define that on the collector's
side. These are all really good reasons to say,
hey, I'm not going to implement the open telemetry collector quite yet.
Okay folks, that's been my time. I want to thank you so much for
joining me. Again, go check out telemetryhub.com for a
really nice, cheap, efficient way to go ahead and report up
open telemetry data. So that's an open telemetry endpoint. The
collector and the endpoint can get a little bit conflated there, right?
Your endpoint is where the collector is going to report its data
for users to be able to go and see it. I'm Nocnica Mellifera.
You can find me almost every place at serverless mom and
I want to thank you so much for joining me. Okay. Have a great
conference.