Transcript
Hi, and welcome to this talk today on
observability and some of the data science that you may need to consider when
you're looking at an observability solution. My name is Dave
McAllister and I'd like to thank Conf42 for letting me come and share some
of my observations about living with observability,
especially in production systems.
I currently work for Nginx, the open source web server
and reverse proxy. I focus on
a lot of outreach. I've been an open source geek from way back, actually before
it was even named open source, and I've continued in
that space for quite a while
now. So with that, let's get started on talking about
some of the data side. Well, first of all,
let's point out that quite often you hear a number of
rules or laws that fall into place.
The 22 Immutable Laws of Marketing by
Trout and Ries, the twelve immutable rules
for observability, or the 70 Maxims of
Maximally Effective Mercenaries, which is really quite hard to
say. And of course, some of these laws are enforced a little harder
than others; the law of gravity, for instance, is pretty strictly enforced.
But we're going to talk about just a couple of rules when
it comes to observability. So let's get started.
Well, first we need to sort of talk about
why we're even interested in this space. And the short
answer is that we're all on this cloud native journey, and we're
doing it to be able to increase velocity, to keep up with the way things
are transforming and with customer
demands, and to outpace our competition. We need to be agile and
we need to be responsive. This has rapidly
accelerated, especially over the last couple of years, where we've had even
more of an online presence. And more
and more, we are engaging our users and
our customers through these digital channels. So every customer is
on this cloud native journey, whether their goal is migration,
modernization, or new app development.
This is what companies are doing to increase that velocity,
and they're finding that cloud native technologies help them do so,
but cloud increases complexity.
There are more things to monitor, if you will.
Cloud is an enabler of transformation, but how do you maximize
this investment you're making in cloud, and make sure you have visibility
into everything that's happening in that environment? The
farther along you are in adopting Kubernetes, using
microservices, or looking at service meshes or other cloud native technologies
that don't even fit on this slide, and there are thousands of possibilities,
the harder it is to have visibility into everything that's happening.
You can't just monitor a monolithic stack. You can't just look at the flashing
lights on the server. You've got to look at the hybrid,
the ephemeral, the abstracted infrastructure. And this
becomes very hard to manage and understand.
And so that visibility, what we call observability, is table stakes. It is the basis
we need to make sure that we are modernizing successfully
and efficiently. And because of
that, while we've been monitoring for quite some time,
we have to make sure that monitoring evolves into
observability. The reason is that we need
to be able to look at these new complexities.
Okay, you've heard of observability, obviously, but what
is it? Well, it's the unique data that allows
us to understand just what the heck's going on in these apps and infrastructures.
And it's a proxy for our customer experience,
which we've learned is more important than ever.
So monitoring kind of keeps an eye on things that we
know could fail, or the known knowns. And observable
systems allow teams to look at issues that can
occur, including unknown failures and
failures with root causes that are buried somewhere in a giant maze
of passageways that all look identical.
It lets us ask questions that we didn't think of before
something goes wrong. So observability goes
beyond monitoring to detect these unexpected
failure conditions, as well as making sure that we have the
data behind the scenes to be able to
figure out what went wrong. And getting
these insights and taking those actions are something traditional
monitoring tools really weren't designed
to handle. So keep in mind that observability and monitoring
work hand in hand. Observability is the data. It allows us
to find things that are unexpected. Monitoring keeps an eye
on things that we know can go wrong. They both work closely
together. But observability is
built for certain challenges. Over here,
you're looking at what's called the Cynefin framework, and hat tip
to Kevin Brockhoff for this drawing.
When we look at a monolith, we're in a simple environment. You look at it,
you see what's going on, and this is the best-practice domain:
sense, categorize, and respond. It's broken, fix it,
if you will. But we've added two new dimensions.
One, we've made things more complicated. We've added
microservices. We've added loosely coupled communications
pathways. We're building these things so that there are
lots of building blocks, and those building blocks may interact
in very different and complicated ways.
Likewise, at the same time, we've added another totally
different environment, which is the ephemeral or the
elastic approach here. Now things scale
as needed and go away. So when
you go to look for something, it may no longer be there, it's ephemeral.
Think serverless, for instance. A serverless function does its job and
gets out of the way here. And so putting those two together, we end up
in a complex environment. Now, complex environments are not
just sense-and-respond; they're not passive. We want to actually
probe into what's going on inside of this environment,
then sense what's going on and respond to it.
Our cloud native world lives in this complex,
emergent world of functionality. And so
when we look through all of these pieces, it's very clear that we
need some very unique capabilities.
Traditional monitoring cannot save us alone here.
Multitenancy can be really painful inside of here.
And honestly, failures don't exactly
repeat. In fact, it's quite unusual for a failure to repeat exactly the
same way. But of course, observability gives us,
and you've heard this before, better visibility,
precise alerting, and end-to-end causality. So we know
exactly what happened, where it happened, why, what the system believed
was happening, and when it happened, which gives us a reduced
mean time to clue and mean time to resolution.
And when something goes wrong, we can get answers, and we can be proactive in adding
that to the monitoring stack so that we can watch for it in the
future. But this is all
data intensive, and the amount of data that's flowing into our
systems now has massively increased. There's just so
much more data coming in between metrics,
distributed traces, and logs, which make up those
three categories. So rule one: use all of
your data to avoid blind spots. And I kind of mentioned
this, but observability really is a data problem. We have a
ton of data coming in here. Generally speaking,
we look at three classes of data in observability: metrics,
traces, and logs. Or: do I have a problem? Where is
the problem? Why is the problem happening? And each of
these pieces becomes equally important.
They all actually overlap.
So logs can yield metrics,
traces can yield metrics, and metrics can point you at the right
trace or the right log. Each of these pieces gives us additional information for
monitoring and assists in our recognition of issues, helping
find those underlying root causes. Today,
we generate tons of data. I've got hundreds of services
calling each other here. Every transaction generates on the order of
kilobytes of metadata about that transaction.
Multiply that by even a small number of concurrent requests,
and you can suddenly have megabytes per second, or 300 GB
per day, coming in for a single concise
application space. You need all this data to
inform your decision making. In fact,
it can be terabytes of data, but you
need to be aware of where your data limitations
are, and where your data decisions can cause problems.
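As a rough illustration of how those three signals overlap, here's a minimal Python sketch. The log format, field names, and numbers are all invented for the example; the point is simply that metrics such as request rate and error rate can be derived from structured logs, the same way span durations can be rolled up into latency metrics.

```python
from collections import Counter

# Hypothetical structured log records; real logs would come from files or a pipeline.
logs = [
    {"ts": 1, "service": "checkout", "status": 200, "duration_ms": 42},
    {"ts": 1, "service": "checkout", "status": 500, "duration_ms": 310},
    {"ts": 2, "service": "checkout", "status": 200, "duration_ms": 38},
    {"ts": 2, "service": "cart",     "status": 200, "duration_ms": 12},
]

# Metrics derived from logs: request count and error count per (second, service).
requests = Counter((r["ts"], r["service"]) for r in logs)
errors = Counter((r["ts"], r["service"]) for r in logs if r["status"] >= 500)

for key in sorted(requests):
    print(key, "requests:", requests[key], "errors:", errors.get(key, 0))
```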
So going forward here,
data is this driving factor, and it drives things like our
new artificial intelligence and machine learning directed troubleshooting.
It also has this thing called cardinality. So cardinality
is the ability to slice and dice.
So it's not just that I've got a metric that says my CPUs
are running at 78%. It can be broken down:
this specific CPU is running at this level,
or even this specific core, or this core running in this
virtual machine on this particular infrastructure environment.
We also are now living with streaming data.
Batch just doesn't cut it when we're
analyzing this data. And we want
as much data as we can get: full-fidelity metrics and as much information
from our traces as possible. At the same point in time,
we also want it to be standards-based and available as open source
so that we're never locked into those various pieces.
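To make the cardinality point concrete, here's a back-of-the-envelope sketch in Python. The counts are made up, but it shows why each extra dimension you allow yourself to slice by multiplies the number of distinct time series you have to store and query.

```python
# Rough cardinality math: each label you can slice by multiplies the series count.
hosts = 200          # virtual machines (assumed)
cores_per_host = 8   # per-core CPU metrics
containers = 50      # containers per host (assumed)
regions = 3

cpu_series = hosts * cores_per_host                  # per-core utilization series
container_series = hosts * containers                # per-container CPU series
total = (cpu_series + container_series) * regions    # sliced again by region label

print(f"per-core series:      {cpu_series}")
print(f"per-container series: {container_series}")
print(f"with region label:    {total}")
```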
But what happens if you miss ephemeral data? What happens
if the data that you need is no longer available,
or your data shows something that you can't drill into?
Well, there are some interesting things.
First of all, let's take a look at this massive amount
of data coming in. Quite often you'll hear people say that
data in mass is noise.
Well, there are ways of dealing with that noise, and in the
observability space they really come down to either filtering
the signals, which can be linear smoothing, which smooths the data
but destroys the sharp edges, band-pass filtering,
which removes the outliers,
or smart aggregations.
We'll go a little bit more into sampling; sampling can be random
grab-one, head-based, tail-based, or post-predictive dimensionality
reduction. But the real answer is improving the
visualization. Use the data, but improve the visualization so
that you're not lost in the complexity of the data and so that you can drill
into the data you need.
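Here's a minimal sketch, with made-up latency numbers, of why linear filtering is a double-edged sword: a simple moving average quiets the noise, but it also blurs away exactly the one-sample spike you might need to troubleshoot.

```python
def moving_average(xs, window=3):
    # Simple linear filter: each point becomes the mean of its trailing window.
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

latency_ms = [20, 22, 19, 21, 250, 20, 23, 21]   # one genuine spike at index 4 (made up)
smoothed = moving_average(latency_ms)

print("raw max:     ", max(latency_ms))            # 250 ms -- the edge case you care about
print("smoothed max:", round(max(smoothed), 1))    # much smaller; the spike is blurred away
```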
So I mentioned sampling. So let's talk a
little bit about sampling. So trace sampling
is a common approach to doing this. Traces are noisy.
They produce a tremendous amount of data because the trace is actually
looking at your request as it moves through your system. Especially if
you move from a front end environment, your client's laptop,
for example, all the way through the back end, through the various infrastructure
pieces, there can be a huge amount of data here. And so quite
often you see people do trace sampling, and this
routinely misses troubleshooting edge cases. It can
also miss intermittent failures because it's not
being sampled correctly. If you don't understand the service dependencies
in a microservices environment in particular, you can get alert storms,
because several things may fail because of
one point. You also need to be able to look at not
being simplistic on your triage. So you're seeing an
error happen. You need the data to be able to determine where the error is
actually occurring. And so if you have too
little data, then you end up with a simple approach to this and you
may find yourself spending a lot of time actually looking for where the underlying
cause lives. And then if you're just looking
at observability as sampling, you're separating your application
environment and your infrastructure environment, and as we all know, those
two pieces work pretty closely together and can have impacts
beyond the traditional oh, I'm out of space,
I'm out of cpus. But it can even have things such as noisy
neighbor problems or communications bandwidth problems.
So all these things become a problem with observability sampling.
So here we're looking at typical trace sampling. And there
are two types here: head-based and tail-based.
No one in their right mind really goes after some
of the others; it's easier not to do this than to attempt dimensionality
collapse. Tail-based sampling looks at the
whole trace and then decides whether to keep it or not.
And so a tail-based sample catches things that are outliers
and can catch things that you know are errors.
However, the problem is that a trace may not lead
to an obvious known error. Remember, we're looking at unknown unknowns.
And so all of a sudden that
sample, while reducing the data flow, may remove
the single trace that would answer your question
when you're looking into the underlying causes.
Head-based sampling is more random: the decision is made up front,
before you know how the trace turns out. Okay, I've got 100
traces, I'm going to save ten of these. It's a little
worse at this. Think about it: 100 traces, I'm going to save
five. That means I have a 95% chance of
missing that outlier. So tail-based sampling catches my
outliers, catches my known errors. Head-based sampling may
catch them, but probably misses them, depending on how much you sample.
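A minimal sketch of that difference, with entirely made-up traces: a head-style decision is essentially a coin flip made before the outcome is known, while a tail-style decision can look at the finished trace and keep the error and the slow outliers. The thresholds and rates here are arbitrary.

```python
import random

random.seed(7)

# Hypothetical traces: one in a thousand carries the error we need to explain.
traces = [{"id": i, "error": (i == 500), "duration_ms": random.randint(5, 200)}
          for i in range(1000)]

# Head-based: decide up front, before the outcome is known -- essentially random.
head_kept = [t for t in traces if random.random() < 0.05]

# Tail-based: decide after the trace completes, so errors and slow traces can be kept.
tail_kept = [t for t in traces
             if t["error"] or t["duration_ms"] > 190 or random.random() < 0.01]

print("head-based kept the error trace:", any(t["error"] for t in head_kept))  # usually False
print("tail-based kept the error trace:", any(t["error"] for t in tail_kept))  # always True here
```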
But when we start pulling all the data into place, we've got
the ability to go back and reconstruct, if you will,
the scene of the crime. So another example.
So this is sampling. In the first instance here,
data set one is sampled; in data set two there is no sampling.
They are running the same environment, just at different times.
So in sample one, I'm actually pulling
a sample rate that's only looking at pieces of
the data, and this sampling is giving me a selection bias. It's not
necessarily showing me things that are out of range.
So in this particular case, sampling came back and said that the duration for
my traces is in the one to two second range, and that
even at the high end of that range, I've got one trace that fell outside of
it. I'm sure you can simply look at the
data there and start getting some interesting indications
about how much is actually not being looked at. Take the
exact same thing where we're not sampling, showing the
same application and impact, and all of a sudden you can start
seeing some of those errors creeping in. Now my latency distribution
is showing that I have a trace that's in the 29 to 42 second
range. In other words, sampling is causing
you to have a blind spot. And in an ecommerce world,
29 seconds is known as a lost sale. I've given
up purchases for less time than that while waiting for them. But nonetheless,
no sampling is giving me better results and a better
understanding of my user experience than the sampling environments.
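Here's a small, deterministic sketch of the same blind spot with invented numbers: 2,000 traces mostly in the one to two second band, three multi-second outliers, and a 5% "keep every 20th trace" sample. The full data set sees the 40-second trace; the sample never does, and its percentiles look perfectly healthy.

```python
def percentile(values, p):
    xs = sorted(values)
    idx = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[idx]

# Made-up durations: 2,000 requests mostly in the 1-2 s band...
durations = [1.0 + (i % 100) / 100 for i in range(2000)]
# ...plus a handful of 29-42 s outliers at arbitrary positions (not multiples of 20).
for i, slow in [(333, 29.4), (777, 35.2), (1501, 41.8)]:
    durations[i] = slow

sampled = durations[::20]   # keep every 20th trace: a 5% sample

print("full    max: %.1f s  p99: %.2f s" % (max(durations), percentile(durations, 99)))
print("sampled max: %.1f s  p99: %.2f s" % (max(sampled), percentile(sampled, 99)))
# The sampled view never sees the 29-42 s traces at all.
```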
But I can hear you right now: wait, my metrics tell me everything.
And that can be true. Your metrics are not usually
sampled, particularly for your infrastructure.
They may still end up being sampled for your tracing data,
because tracing data is, in its own right,
ephemeral, stopping when the request stops, as well
as massive. And it comes in not necessarily
as metrics; it comes in a form that you can extract metrics
from, usually rate, errors, and duration, or RED monitoring.
The problem this leads to is that you start
missing those duration results, and you can even miss
some of the alert structures, less so if you're looking at a metrics space
and you're capturing all the metrics data that's inside your traces.
But your duration data may be impacted
by this. After all, when does a trace end
is one of those always interesting questions. So one
whole third of this observability space resides on tracing
data, and duration is probably the single most important
element of the user experience to be looking at
inside that trace data.
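As a hedged sketch of what "extracting RED from traces" can look like, here's a tiny Python example over hypothetical root-span records. The field names and the ten-second window are assumptions, and the median calculation is deliberately crude; real tooling would use proper histograms.

```python
# Hypothetical root-span records for one service over a 10-second window.
spans = [
    {"start_s": 0, "duration_ms": 120,  "error": False},
    {"start_s": 1, "duration_ms": 95,   "error": False},
    {"start_s": 1, "duration_ms": 2400, "error": True},
    {"start_s": 3, "duration_ms": 110,  "error": False},
    {"start_s": 7, "duration_ms": 130,  "error": False},
]

window_s = 10
rate = len(spans) / window_s                              # R: requests per second
error_rate = sum(s["error"] for s in spans) / len(spans)  # E: fraction of failed requests
durations = sorted(s["duration_ms"] for s in spans)
p50 = durations[len(durations) // 2]                      # D: a crude median duration

print(f"rate: {rate:.1f} req/s, errors: {error_rate:.0%}, p50 duration: {p50} ms")
```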
So, TL;DR: your ability
to use observability is dependent on your data source, and your
data source needs to be as complete as possible. But don't
let the chosen data bias your results.
Don't force yourself into selection bias before you have a
chance to understand what selection bias means for you.
Keep it all, understand it all; otherwise you
can't track customer happiness as a proxy. And finally,
getting the data in real time matters, because we need to know when things
go wrong as soon as they possibly can go wrong.
And we'll find out more about that when
we hit rule two. So rule two states,
very simply, operate at the speed and resolution of your app
and infrastructure. And again, this starts falling
into a data problem. But we're going to start by talking about
this thing. Some of you are probably familiar with what's known as the von Neumann
bottleneck. The von Neumann bottleneck, very simply, is the idea
that a computer system's throughput is limited by the speed of
the processors relative to the
rate of data transfer. Our CPUs have
gotten a whole lot faster; our memory is not that much faster,
so the processor can sit idle while it's waiting
for memory to be accessed. And yeah, there are ways around
that, but this is a fairly standard model
for computer systems today: the
memory and the CPU are basically
competing across a transfer bottleneck.
When we look at that from observability, there's a similar
issue here, a little different,
but the resolution of our data, the precision and
accuracy coming in, is massive. The speed
of our data is the deterministic response: how fast
we can get things in. It is usually
less than the resolution of our data, and that impacts the way we
aggregate, analyze, and visualize this data.
And honestly, by the way, data is pretty worthless unless you can do that
aggregation, analysis and visualization.
And that resolution and speed together impact
the insights you get from the data that's coming
in. Pretty straightforward, makes perfect
sense. But when you start looking at this at volume, that's where
life gets interesting. So I've mentioned
precision and accuracy, and we need to talk a little bit about this.
We discuss these things as if they were the same,
but quite often we need to understand that
they're not. Accuracy states that the measure
is correct. Precision says it's
consistent with the other measurements. So each time I measure the
CPU, I want the in-use percentage
to be precise... I mean accurate. I want
it to tell me what's actually happening. Is it running
at 17%? Is it running at 78%?
I also want it, when the CPU
is in the same state, to give me that number again.
So if it looks at this and says it's 17, now it's
78, now it's back to 17, I want to trust that that
is actually consistent with the measurements we're seeing.
Aggregation and analysis can actually
skew our precision and accuracy categories.
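A tiny numeric illustration of that distinction, with made-up gauge readings against a pretend "true" value: one gauge is very consistent but wrong (precise, not accurate), the other is roughly right on average but all over the place (accurate-ish, not precise).

```python
from statistics import mean, pstdev

true_cpu = 17.0  # pretend we know the real utilization

gauge_a = [78.1, 78.0, 78.2, 77.9]   # precise (consistent) but wildly inaccurate
gauge_b = [10.0, 24.0, 12.0, 22.0]   # roughly accurate on average but imprecise

for name, readings in [("A", gauge_a), ("B", gauge_b)]:
    print(f"gauge {name}: mean={mean(readings):.1f} "
          f"(accuracy error {abs(mean(readings) - true_cpu):.1f}), "
          f"spread={pstdev(readings):.1f}")
```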
So here's another example.
And what this is, is looking at the number of requests per second going
through a system. So this one is not duration based. This is literally the number
of requests going through here. And as you can see over the first
ten second viewpoint, my aggregations are
going to happen. If I look at a ten second
aggregation, my average is 13.9 and my 95th
percentile is 27.5. If I look at
just the first 5 seconds, I have a 16.4, and my second
5 seconds has an 11.4.
Both of these miss the fact that one of my
measurements has gone over 30 requests per
second, which happens to be where I had set the
alert. Now again, easy enough to say bring
in the data, show me things when it crosses 30, alert me. But your aggregation
may never show that. The aggregation may show you at under 20,
a nice safe number in all these categories. And so when
you start looking at that functionality, you can see how aggregation
can change what you see.
It's a simple example, but now multiply it by the
hundreds of thousands of requests that can be rolling through your system at any given
moment.
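Here's that same example as a quick sketch. The per-second values are invented so that the averages roughly match the numbers on the slide; the point is that every aggregation sits comfortably under the alert line while one raw measurement is well over it.

```python
# Ten one-second request-rate measurements, invented to roughly match the example.
rps = [12, 14, 9, 16, 31, 11, 13, 10, 12, 11]   # one second crosses the alert line
ALERT = 30

avg_10s = sum(rps) / len(rps)
avg_first_5s = sum(rps[:5]) / 5
avg_last_5s = sum(rps[5:]) / 5

print(f"10 s average:   {avg_10s:.1f}")                        # 13.9, comfortably under 30
print(f"first/last 5 s: {avg_first_5s:.1f} / {avg_last_5s:.1f}")  # 16.4 / 11.4
print(f"any second >= {ALERT}? {any(x >= ALERT for x in rps)}")   # the raw data still knows
```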
So, data resolution, the speed at which we pull data in, and reporting resolution
are never, well, not never,
but seldom ever going to be the same thing. They both can
be problematic. So throwing away data points
means that you can't go back
and reconstruct. So always deliver those data points regardless of
what your reporting structure looks like. And the finer your granularity
is, the more potential precision you have, and the more likely you are
to get the same number for the same results. So this is a
simple little kind of drawing. When you measure something,
when you measure it, it's actually in the middle of your granularity
point somewhere, but you're not quite sure where.
So in this case, if I'm measuring on
a second-by-second basis, where a measurement falls
relative to that second boundary varies by the size
of that boundary. Milliseconds give me finer granularity,
and picoseconds finer still. That granularity
may become more and more important, especially as we scale
things and especially as things get complex,
but at the same point in time, it creates more
data. So there's a trade-off between granularity,
data resolution, and reporting resolution that you need to
consider. And when we bring this in,
we have native resolution, which is how often we collect data,
and chart resolution, which is the aggregation points that
we use. So in this particular case, I can be
showing you that I'm bringing in requests per second, but I'm actually aggregating
these on a ten second basis. And so
we want to be able to look at this in a way that makes sense. We get
to watch things move, we can look for patterns.
Humans are incredibly good at pattern matching, and so we need to be
able to break it down and take a look at what's going on here.
So, native resolution is the data collection, and chart resolution is
the aggregation that we're using for our charts and graphs;
we want speedy data collection and sufficient chart
resolution so that we can understand what's going on.
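A minimal sketch of that native-versus-chart-resolution trade-off, with made-up numbers: data collected at one-second native resolution, charted in ten-second buckets. How you roll the bucket up matters; a mean rollup nearly hides a one-second burst that a max rollup preserves, which is one more reason to keep the raw points and choose the aggregation deliberately.

```python
# One minute of 1-second native-resolution measurements (made up), with a brief spike.
native = [14] * 60
native[37] = 45   # a one-second burst

def rollup(points, bucket, fn):
    # Aggregate consecutive buckets of `bucket` points with the chosen function.
    return [fn(points[i:i + bucket]) for i in range(0, len(points), bucket)]

mean_chart = rollup(native, 10, lambda xs: sum(xs) / len(xs))
max_chart = rollup(native, 10, max)

print("10 s chart (mean):", [round(x, 1) for x in mean_chart])  # spike mostly washed out
print("10 s chart (max): ", max_chart)                          # spike preserved
```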
Add to that complexity. We talked a little about
this complexity: we now have
cloud compute elasticity. Spin up more
Kubernetes, spin up more, spin down more; serverless
functionality pulls in the functions when they're needed
and lets those pieces happen. That's the ephemeral side of this.
But we can also have drift and skew. We are now running on
infrastructure that is not all in one place. These are virtual
machines, in general, in a cloud environment, and we're running wherever;
we actually don't quite know where we're running. So when
we start bringing data in to aggregate it,
we can have drift and skew. And so we need to measure
how far ahead or behind the incoming data
source is, relative to where we currently are.
Drift is continual: something keeps getting further ahead or behind, faster and faster.
Skew is something that's just out of line with everything else.
And keep in mind, this is not just the compute infrastructure; it can
also be the networking infrastructure. So it's worthwhile
looking for ways to manage drift and skew so that we understand
the functionality. The easiest way, once again, is to keep all the data
so that you can keep track of where your drift and skew are happening.
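Here's one way to picture that measurement, as a hedged sketch with invented timestamp pairs: compare each source's reported time against the collector's receive time. An offset that keeps growing over the window looks like drift; a source sitting at a constant large offset from everyone else looks like skew.

```python
from statistics import median

# (collector_receive_time, source_reported_time) pairs per source, in seconds. Made up.
observations = {
    "host-a": [(100, 100.0), (110, 110.1), (120, 120.2), (130, 130.3)],  # offset growing: drift
    "host-b": [(100, 100.0), (110, 110.0), (120, 119.9), (130, 130.0)],  # stable, in line
    "host-c": [(100, 94.0),  (110, 104.1), (120, 113.9), (130, 124.0)],  # ~6 s off: skew
}

for host, pairs in observations.items():
    offsets = [reported - received for received, reported in pairs]
    trend = offsets[-1] - offsets[0]   # how the offset changes across the window
    print(f"{host}: median offset {median(offsets):+.1f} s, "
          f"change over window {trend:+.1f} s")
```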
And when we get into this data, we really want to
understand one basic construct.
Most of what we do in the observability and monitoring space in particular is
to be predictive on alerts,
or to have an alert based on something.
But predictions are data intensive.
Again, if something's stationary, a straight line,
oh yeah, piece of cake: we know what the distribution curve is going to look
like, and you can set a static threshold. If not,
we look for sudden change. If we have a linear trend,
things are always going to be sloping up, and if they don't,
then all of a sudden we want to know; and we can start looking
for things like resources running out, approaching the end of our
block of capacity. If it's seasonal,
we need to look at historical anomalies, and historical anomalies
can have some interesting impacts. Here's one where I'm going
to say: don't use the mean, use the median. All
these pieces give us the ability to do some
level of predicting and alerting.
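As a hedged sketch of the linear-trend case, here's a least-squares fit over a week of made-up disk-usage numbers, extrapolated to estimate when the volume fills up. Real systems would use more robust fitting and confidence bounds; this just shows the shape of the prediction.

```python
# Daily disk usage in GB over the last week (made up) and the volume's capacity.
usage = [410, 422, 431, 445, 452, 466, 477]
capacity_gb = 600

# Least-squares slope/intercept for a straight-line trend over days 0..6.
n = len(usage)
xs = range(n)
x_mean = sum(xs) / n
y_mean = sum(usage) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

# Extrapolate: when does the trend line cross capacity, counted from today (day n-1)?
days_until_full = (capacity_gb - intercept) / slope - (n - 1)
print(f"growing ~{slope:.1f} GB/day; roughly {days_until_full:.0f} days until full")
```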
So what we really want to know is the predictive
behavior. We want to know what's coming. We don't necessarily
care as much about what's happened; we can usually
track that, but we want to know what's coming as well. Prediction is
only as good as that precision and accuracy. Can you trust the data?
And is the data telling you the truth?
When we look at this, at the historical change environments,
we want to make sure that we're using the right thing.
So a sudden change basically says, I was running along fine and all of
a sudden I've got 70% more demand coming in,
let me know. Historical says on Tuesdays
at noon, my workload drops down
so I can shut things down. But at 01:00 it comes back
up. So let me plan to bring things back up.
Oh, look, this Tuesday at 12:00 the workload didn't
fall off, so let me know. And of course, there are also the trend-line and stationary cases.
In any case, with predictive behavior, you can expect to see some
level of false positives as well as false negatives inside of here.
Does it make sense to compare current signals to the observed value
last week? Or could values from the preceding hour
make a better baseline? Sort of that historic change
versus historic anomaly versus
sudden change structures. However, whenever you're
looking at this, particularly at historic data,
use the median. The mean will float up and down should you
have outliers; the median pretty much stays in
a much more comfortable range.
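A quick sketch of that mean-versus-median point, with made-up history: the same hour over the previous eight weeks, one of which had an unrelated traffic spike. The mean baseline gets dragged up by that one week and waves the current value through; the median baseline stays near typical behavior and flags it. The 1.3x tolerance is an arbitrary assumption for the example.

```python
from statistics import mean, median

# Requests/s at Tuesday noon over the previous eight weeks (made up);
# one week had an unrelated traffic spike.
history = [120, 118, 123, 119, 121, 560, 117, 122]
current = 160

baseline_mean = mean(history)      # dragged up by the one-off spike
baseline_median = median(history)  # stays near typical behaviour

verdict_mean = "fine" if current < 1.3 * baseline_mean else "anomalous"
verdict_median = "fine" if current < 1.3 * baseline_median else "anomalous"

print(f"mean baseline:   {baseline_mean:.0f}  -> current looks {verdict_mean}")
print(f"median baseline: {baseline_median:.0f} -> current looks {verdict_median}")
```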
So again, summing up: observability is only as
useful as your data's precision and accuracy. If you can't trust it, and
can't trust it each time, it's worthless.
And you need to consider the elastic, ephemeral,
and skew aspects of these complex infrastructure
environments. And while we look at prediction as a target,
we need to keep in mind that there's a difference between extrapolation and interpolation,
and that in any case we may end
up with false positives or false negatives.
And finally, a closing thought. "The most effective debugging tool
is still careful thought, coupled with judiciously placed print
statements." Brian Kernighan said this back in 1979
in Unix for Beginners. Observability is
the new print statement. And with
that, thanks for listening to me today, thanks for letting me come to the
Conf42 conference, and enjoy the rest of your conference.