Transcript
Are you an SRE, a developer, or a quality engineer who wants to tackle the challenge of improving reliability in your DevOps? You can enable your DevOps for reliability with ChaosNative. Create your free account at ChaosNative Litmus Cloud today.
My name is Dave McAllister and I'm here to talk about Murphy's laws for observability. This is going to cover a couple of interesting concepts: both how we look at observability, and how Murphy's laws, which we're all familiar with, apply to what is driving our need for this new paradigm of observability.
I'd like to thank you for joining me today to listen to me talk about observability, and I'd like to thank Conf42 for giving me the chance to get up here and talk about this. But first, I'm Dave McAllister.
I'm the open source technology evangelist at NGINX, part of F5, and my role is to help people understand both how to get involved in the NGINX open source projects, of which there are actually a lot now, as well as how to best make use of the open source aspects. I am an open source geek. I started with Linux at version 0.93. I've also been a standards wonk, and as you can probably tell, I'm perfectly willing to talk about almost anything at the drop of a hat. Here's my LinkedIn: you can find me as Dave Mack (davemc), and I'd love to hear from you. However, nobody is
just their job role. So let me share a couple of other interesting data points,
or interesting to me at least. One, I'm owned by three cats, maybe four, because we now have a little kitten who has moved into the backyard that we're trying to figure out how to tame. But I am owned by cats, which means I am absolutely used to being ignored. I have also spent ten years as a soccer ref (football for those of you in Europe and other sensible places in the world). So I'm also used
to people disagreeing with me. So feel free to do either one
of those things, but I'm hoping that you won't spend too much time ignoring me.
So let's start with Murphy. Murphy's law is very simple.
Whatever can go wrong will go wrong. And this is something we're
all familiar with. We constantly see things that we think are going to go right, and all of a sudden something has changed around that. But when we add the first corollary to this, "at the worst possible time," then life starts getting interesting. It's not just enough that something
has gone wrong, it's always when it makes a major
difference. And so we need to start looking at how we can sort
of mitigate some of this impact of what Murphy
is doing. And part of that comes into this whole
concept around observability. There are lots of Murphy's law categories, and in fact I'm introducing one now: Murphy's laws for observability. But there are things like Murphy's technology laws, or Murphy's military laws. Among the military laws, one of my favorite ones is: if you need to find an officer, take a nap. By the way, that works just as well for VPs in high tech. Or from the technology laws: logic is a way of arriving at an incorrect conclusion with absolute certainty. But you'll find them on love, on cooking, on cars. And there are spin-offs such as axioms and corollaries, reaching even into our humor: the Seventy Maxims of Maximally Effective Mercenaries. So Murphy has a
big impact on all sorts of things. And people are constantly
coming up with new things that make sense for a
Murphy's law approach. But let's jump into
it. Murphy's law for observability number one: if you perceive that there are four possible ways in which a procedure can go wrong and circumvent these, a fifth way, unprepared for, will promptly develop. So far I have yet to hear a better description of the life of an SRE. Our jobs are to try to both mitigate what could happen as well as be prepared for that fifth way to show up. So when we start looking at this, this becomes the necessary point for making sense of our environments. We need to know what's going on at all times. And not just the "oh look, the lights are blinking" approach.
We need to be able to look and see what's going on and how it's
impacting our users, our systems and our environments at
any given time. And that leads us to this concept of
observability. Observability is a hot topic in
the SRE world, in fact, in almost all of the technology world
here. And observability is really all around data.
It's the deep sources of data that let us see what's going
on inside of our systems, inside of our applications,
all the way from the ground, all the way up to the
user viewpoint and user journey experience going through
this overall system here. Generally speaking, you'll hear it as metrics (do I have a problem?), traces (where is the problem?), and logs (why is this problem happening?). So: detect, troubleshoot, or root cause analysis.
However, observability is not limited to those classes
of data. Observability can and should make use
of any data that's necessary for us to be able to
understand and infer the operation of our
underlying system. And in fact,
observability is about those deeper sources, the new sources
of data, and the data that tie our environment together,
that lets us understand at each point in time what's happening here.
Don't limit yourself just because you have a source of
data. Be ready to look at more sources of data here.
And interestingly enough, observability really is a proxy for
customer happiness. If we can understand what's going on and understand
the driving influence on our user base, then we can
actually use observability data to help us understand their experience.
Down at the bottom: observability has been around for a while. The engineering definition is defining the exposure of state variables in such a way as to allow inference of internal behavior. We've expanded that: our
internal behavior now encompasses a lot of different points,
and we need to be able to also correlate across those.
And that leads us to Murphy's observability
number two here. Every solution breeds new problems.
So now that we've got this new data, we now have a whole new class
of issues that are coming into play here. There's also
the underlying concepts of where this data is coming
from. Why do we have all this data? Well, this is the Cynefin framework. The Cynefin framework is a way of approaching and looking at the transition of an activity over time. And we start with simple: we used to have monolithic systems,
and we used to have single source languages,
and we used to be able to look at the blinking lights and say,
oh look, things are working, so things must be fine here.
But now we've gone into a cloud environment.
A lot of things have moved to public and private clouds here,
which means we now have elastic and ephemeral behavior. Things change. Things may not be there when we go to look for them. Therefore, failures don't exactly repeat. And because we now have microservices, a service that is as small as necessary, pulled together through a loose communications mechanism, we find that debugging is no longer as capable as it traditionally was. Therefore, traditional monitoring can't save us anymore. What we've now done is something that you would never do in math class: we've changed two variables at the same time. We've added the complicated world of microservices (where they run, how they run, how many of the services are running at any given time) to the elastic, ephemeral behavior of cloud environments or orchestrated environments, which gives us a chaotic model. It's not there; in fact, we already planned that it wasn't going to be there very quickly. That has led us to these complex environments. A complex environment means we need to be able to probe deeper, we need to be able to sense better, and we need to be able to respond in ways that may not be as clear as they used to be in a monolithic world.
This is massively important and it is
the driving change for what is creating this buzz
around observability. And that leads us to
Murphy's observability number three: you can never run out of ways that things can go wrong. And the octopus riding a unicycle, juggling balls, is a perfect example of this. We've got lots of things going on. There's lots of balls in the air at any given moment. We're now keeping
track of not only the virtual environments,
the communication environments, we're keeping track of our orchestration environments.
Kubernetes as an example for that. We're keeping track of
the applications and the application pathways can change every
single time that a transaction crosses those pathways.
And that gives us this thing, observability, which lets us start monitoring for those things called unknown unknowns. So when we know something could happen or
we're worried about it, we know to watch for it and we probably understand
what caused it. This can be running out of disk space or
running out of memory. We can watch for those things and we kind of know
what we're doing here. We can also be looking at
things that we are aware could happen, but we don't necessarily understand
why they happened. These can be outside influences. When we get into the unknown categories,
things can happen that we are not aware could happen,
but when they happen, we can immediately understand, oh, that's why
that happened. And that's an unknown known. But when we move into
that last category of unknown unknowns,
we're not even aware something could occur. And when it occurs,
we don't understand why it occurred. And so observability gives
us the ability to do that forensic exercise. Now let's just
move back in time, in a sense, to see what was going on, and basically infer, understand, and deduce what happened at that given moment. Once we've done this, we can now move this into the next category. We are now aware that something could occur, so maybe we move it into a known category. It may still be a known unknown. We may not understand why it occurred,
but we now know to watch for it because it
has occurred. That means it could occur. And once we've done the forensic
exercise and the resolution, we actually want to make sure that we don't
run into the same category of having to start over again. And so,
observability gives us the ability to move from the unknown
unknowns into the known unknowns, and even into the known knowns.
Try saying that really fast, three or four times.
So, sounds really great.
But Murphy's number four tells us nothing is as easy as
it looks. And that's because we're
building in two different ways here. This is a microservices architecture.
In fact, this is an ecommerce architecture. It's got checkout services,
it's got the Internet coming into a front end, it's looking at cart
services, it's emailing things out here. It could even be doing recommendations.
There's lots of things. If you've ever touched any of the major ecommerce
environments, you're probably seeing a front page for a single product that's made up of somewhere in the neighborhood of 43 microservices.
Every one of those microservices connects to probably somewhere
between four to eight additional microservices,
and that's for a single transaction.
Now imagine that you are scaling that, that you suddenly
have 100,000 transactions going into your system at one point
in time, and that 43 plus each
of those additional pieces here now has to scale to manage
the volume. We'll talk a little bit more about scale here.
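To get a feel for that fan-out, here is a quick back-of-the-envelope sketch. The 43-service figure and the four-to-eight fan-out come from the example above; the load figure and the "each call fans out once" simplification are assumptions made just for the arithmetic.

```python
# Back-of-the-envelope fan-out. The 43 services and 4-8 downstream calls come
# from the talk's example; the 100,000-transaction load is illustrative.
front_page_services = 43
fanout_low, fanout_high = 4, 8
transactions = 100_000

calls_low = front_page_services * (1 + fanout_low)    # each service plus its downstream calls
calls_high = front_page_services * (1 + fanout_high)

print(f"Service calls per transaction: {calls_low}..{calls_high}")
print(f"Service calls for {transactions:,} transactions: "
      f"{calls_low * transactions:,}..{calls_high * transactions:,}")
```

Even with that simplification, a single page view turns into hundreds of service calls, and the burst of traffic turns into tens of millions.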
The headache is that every time we look at one of these things,
we need to know where in the cycle we are when
the problem occurred. That becomes incredibly
important information. Nobody really can
grasp the entire architecture in
a gestalt, in a single picture viewpoint.
And so we need to have our capabilities, our services
and our tools help us understand that.
And that's what's led to things like service maps,
where we can see how the services connect, where the
transactions go to, and what's happening in each independent
transaction. We can also start looking at what the
metrics are telling us. Metrics are the piece that lets us
know when something has gone wrong. And so
metrics are incredibly important here. And then, even at the transaction level, we can look at something called RED (rate, errors, duration), one of my favorite monitoring patterns of all time. And that will actually help us understand what the user's experience is. Keep in mind that we do have this concept where users are unique individuals.
They really only care about this transaction, the one
they're looking at right now, how long it took and whether it was successful or
failed. And so RED gives us the ability to look at that in a concrete overview, so we can look at the aggregate model and then drill into it should something show up. So for instance, my RED view here is showing a 25% error rate. I would love to know what's causing the 25% error rate. I would love to know who it's impacting. My service map, if it's smart, can actually show me where things are not going through. That's the little red dot that
you're seeing here, but I now understand the flow of the transaction
and where its stoppage points are. So metrics tell me something is not looking right. Traces show me where something might not be looking right.
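To make that concrete, here is a minimal sketch of the three RED signals computed over a batch of requests; the Request class, field names, and sample values are illustrative, not taken from any particular tool.

```python
# Minimal sketch of the RED method (Rate, Errors, Duration) for one service
# over one time window. The Request class and sample values are illustrative.
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    ok: bool

def red_summary(requests: list[Request], window_seconds: float) -> dict:
    rate = len(requests) / window_seconds                          # Rate: requests per second
    error_rate = sum(not r.ok for r in requests) / len(requests)   # Errors: fraction that failed
    durations = sorted(r.duration_ms for r in requests)
    p95 = durations[int(0.95 * (len(durations) - 1))]              # Duration: tail latency, not just the mean
    return {"rps": rate, "error_rate": error_rate, "p95_ms": p95}

# Eight requests in a two-second window, two of them slow and failing.
window = [Request(120, True), Request(90, True), Request(2100, False), Request(110, True),
          Request(95, True), Request(1800, False), Request(130, True), Request(105, True)]
print(red_summary(window, window_seconds=2.0))
```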
Then there is that added complexity. Now, we've talked a little bit about this already: cloud-based elasticity.
When we see a single service, that single service is not necessarily
a single instance.
That service could be multiplied times the
number of elastic pieces needed to meet the
scale. And because it's no longer necessary
to scale the entire thing ("here's my monolith, I'm running out of space, so here are my next two monoliths"), it's now just: scale the service that is having problems.
And so our scaling becomes
different. Our scaling is not random, but our scaling
does come into play here around making sure
that the right pieces are scaled the right time. With scaling
up comes scaling down, and so things can disappear
or reappear based on workloads. We're now moving
into this thing called ephemeral behavior. This is where we get into
serverless and serverless functions. And so you can look at things such as AWS Lambdas or Google Cloud Functions,
but these things are now designed to not be there.
And when we start looking at the ephemeral
capabilities here, the serverless capabilities,
it's not unusual for a warm-start AWS Lambda to be about
30 milliseconds and for the complete
execution time of the lambda to be about 1.2 seconds.
And so we can see a lot of serverless behavior that's not there anymore by the time you look for it; it's not there, because we're now also in multiple environments, multiple virtual machines, multiple containers, pods, worker nodes. We can also have these two concepts called drift and skew. And we'll get a little bit more into drift and skew in a moment. But imagine that
we've got to bring all these things together to be able to correlate
them. Timestamps are the way that we try to correlate them
the best. We can also look at transaction ids or trace
ids and so forth. But to know what's happening at a given moment of time,
we have to be able to align on that time. So we need to
be able to make sure that we understand how drift is happening between systems and how the systems are getting skewed over a time period.
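As a rough illustration of that alignment, here is a small sketch that merges events from two hosts onto one timeline after correcting for a known clock offset; the hosts, offsets, field names, and events are all made up for the example.

```python
# Merge events from two hosts into one timeline, correcting for known clock
# offsets before sorting. Offsets, hosts, and events are illustrative only.
CLOCK_OFFSET_MS = {"web-1": 0.0, "db-1": -35.0}  # db-1's clock runs 35 ms ahead

events = [
    {"host": "web-1", "trace_id": "abc123", "ts_ms": 1_000.0, "msg": "request received"},
    {"host": "db-1",  "trace_id": "abc123", "ts_ms": 1_030.0, "msg": "query started"},
    {"host": "db-1",  "trace_id": "abc123", "ts_ms": 1_090.0, "msg": "query finished"},
    {"host": "web-1", "trace_id": "abc123", "ts_ms": 1_120.0, "msg": "response sent"},
]

def corrected(e):
    # Shift each event onto a common clock before comparing timestamps.
    return e["ts_ms"] + CLOCK_OFFSET_MS[e["host"]]

timeline = sorted((e for e in events if e["trace_id"] == "abc123"), key=corrected)
for e in timeline:
    print(f'{corrected(e):8.1f} ms  {e["host"]:5}  {e["msg"]}')
```

Without the offset correction, the order of the database events relative to the web events can come out wrong, which is exactly the drift and skew problem described above.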
So, okay, as if life weren't complex enough: things get worse under pressure. That's Murphy's number five. And that's because we're
scaling and our scale is massive these days.
Yes, we do tend to start off small, but I'll also tell you one
of the things is that testing for production is not the
same as running in production. I don't care how you engineer it. Testing for
production will catch a lot of problems.
But keep in mind, whatever can go
wrong will go wrong and it will show up when you hit scale.
So in this environment, I'm looking at 2247
instances for this, and I can't
watch 2247
instances. And so I need to be able to look at this from a tooling
basis here. I'm looking at this nice little picture
viewpoint. I can tell you that I've got some hotspots; those are the little sort of reddish dots inside of here. I can drill into any dot. The data is there to tell me what's going on
when I choose to drill into it. But in the meantime,
I've also got to be able to take a look and see all the different
things that are happening. But thinking
about the scale here, this is a very simple picture and it's only
one viewpoint. This is simply the
scale for Kubernetes. We have Kubernetes objects,
secrets, namespaces, nodes, ingress points. We have pod churn and pods versus nodes. Inside of here, we actually have the containers now inside of pods. So our scale
is multidimensional and our scale unfortunately
does not decrease. So this is one
piece of the picture of what our scale looks like.
This is not the underlying virtual environments.
This is not the application environments built on microservices. And it's not necessarily the communication
environments. This is just the Kubernetes led
scale environment,
which leads us to number six here.
If it's not in the computer, it doesn't exist.
And I love dealing with this one in
some ways, because this one is one that most times most people will nod their head yes to: if you didn't keep track of it, it never existed in the first place. And so why does bad data happen to good computers?
Well, one of the things you'll hear, particularly when you have
as much data as we're now throwing at you, is this thing called sampling.
And sampling is very useful here. And you can have lots of sampling,
you can have lots of capabilities for cutting down
the amount of data that you are receiving
or keeping. But here's an example.
The first one is a sampled environment. The second one is a non sampled
environment. They are the same environments. They're running reasonably
close to the same. They are hitting about the same hot points at
points in time. However, the first one is doing
a traditional head-based sampling approach: I'm going to grab a sample someplace inside of here. And it came back and told me my latency, based on the tracing effort here, was one to two seconds. Piece of cake. However, 3.7 seconds is considered to be where people will abandon their shopping carts, where people will abandon their pages and go someplace else. When we look at the non-sampled data, we actually discover that our 95th percentile traces are running somewhere between 29 to 40 seconds. We have unhappy customers, at least one, in
this particular case. And so when we look at this,
there's two things we also look at. The first one: the sampling. Sampling didn't show us much; I think it showed us one error during that sampling period, because again, when you've sampled, you've only got what you grabbed. No sampling shows me that I've got lots of errors showing up here, some of them significant. Before we get into it, however, you're going, "well, okay, so I don't sample my metrics, so I know that there are errors." But what do you do then? You now know there was an error, but you didn't keep the data. How do you keep track of what's going on? "Oh, well, okay, if I saw an error, then I saved the data." Think back to that unknown unknown. We don't even know necessarily what was an error until we get a chance to figure out post facto that there was an error, and then we need to
be able to go back into it to figure out what's going on. So sampling
is useful but problematic.
Keep track of where you are and make sure that you get the data you
need and the results you need at all those times.
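Here is a small sketch of how head-based sampling can miss the rare slow requests and errors almost entirely; the latency distribution, error rate, and one-in-a-hundred sample rate are made-up values for illustration, not numbers from the environments shown.

```python
# Illustration of how head-based sampling can hide rare slow requests and
# errors. The latency distribution, error rate, and sample rate are made up.
import random

random.seed(7)
requests = []
for _ in range(10_000):
    if random.random() < 0.005:      # 0.5% of requests are very slow and fail
        requests.append({"latency_s": random.uniform(29.0, 40.0), "ok": False})
    else:
        requests.append({"latency_s": random.uniform(0.5, 2.0), "ok": True})

# Head-based sampling: the keep/drop decision is made up front, before we know
# whether the trace will turn out to be slow or broken.
sampled = [r for r in requests if random.random() < 0.01]

def summary(rs):
    slow = sum(r["latency_s"] > 3.7 for r in rs)   # 3.7 s: the abandonment threshold above
    errors = sum(not r["ok"] for r in rs)
    return (f"{len(rs):6} traces, {slow:3} slow, {errors:3} errors, "
            f"max {max(r['latency_s'] for r in rs):5.1f} s")

print("all traces:    ", summary(requests))
print("sampled traces:", summary(sampled))
```

Run it and the sampled view will usually contain few or none of the slow, failing traces, which is exactly the "one to two seconds, piece of cake" picture from the sampled environment above.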
Because, as Murphy's number seven tells us, availability is a function of time, and the speed and resolution of your data impact the insights you get. So again, I need to
discuss something that's a little bit problematic,
but pretty much straightforward, and that's this concept of accuracy
and precision. Quite often in technology,
we tend to use those terms interchangeably.
They aren't. So accuracy is that
the measurement is correct, that we correctly measured the results.
Precision means it's consistent with all of the other
measurements. So consider that you're target shooting with a bow and
arrow, and you shoot six arrows, and one of
them nails the bullseye and the other five are
randomly scattered from ring three to
ring five, maybe even outside the rings here. My God,
you were accurate, but you weren't precise.
And so which of those measurements was accurate is a challenge here. Precision means it was consistent. So take those same six arrows and group them within a two-inch circle, all in the outer ring.
Amazingly precise, completely not accurate.
Observability needs both. It needs accuracy and precision.
But again, that aggregation and analysis can skew
this behavior. Remember back when we talked about drift and skew? When we look at this, this is how you can actually miss the target. For this here, I've taken pretty much one second at a time, requests coming in per second, and I think I've got ten
of them here. My ten second average is 13.9 requests
per second coming in here. My 95th percentile over that 10 seconds is 27.5. For the first five seconds, you can see, the average is 16 and the 95th percentile is 29; for the second five, 11 and 19. However, if all you looked at were the aggregations, you would have missed the fact that there is one of these that actually crossed my trigger threshold, that one of them went up to about 32; 31, 32, I can't remember the exact numbers off the top of my head. We need to be able to look at every single data
point, not just the aggregations in here,
particularly when we're looking at alert capabilities.
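A tiny worked example of how an average can hide the data point that matters; the per-second values and the alert threshold below are illustrative, chosen only so the arithmetic lines up with the numbers above.

```python
# Per-second request rates over a ten-second window. The values and the
# threshold are illustrative; the point is that the mean hides the spike.
per_second_rps = [12, 14, 16, 10, 13, 11, 9, 32, 12, 10]
THRESHOLD = 30  # hypothetical alert trigger

mean_rps = sum(per_second_rps) / len(per_second_rps)
peak_rps = max(per_second_rps)

print(f"10-second average: {mean_rps:.1f} rps")   # 13.9: looks completely healthy
print(f"peak second:       {peak_rps} rps")       # 32: crossed the trigger
print("breached seconds:  ",
      [i for i, v in enumerate(per_second_rps) if v > THRESHOLD])
```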
Yes, you need to be able to tailor your alerts. You don't want to be
thrashed to death. But keep in mind that when
you see an alert, you need to be able to use either AI/ML
technology or your own knowledge to determine
how critical that alert may be and get it to
the right place. Part of the issue is that our precision, our resolution, actually impacts the data that we're seeing, in terms of that precision I'm talking about here. So if you're picking something that's being sampled, or not sampled (maybe "sampled" is a bad word), something that's being chosen every second, then the second that you're seeing is somewhere between those two. If you report on a second basis here, you're not actually quite sure where in that second that data point is. When we actually have data
now being produced or actually transmitted,
telemetry wise, in the nano and picosecond
ranges here, suddenly this becomes a very large issue. So keep
in mind that aggregation is not your final
point. It's incredibly useful for your visualizations when we've got lots of data. Data is only as useful as you can aggregate it, analyze it,
visualize it, and respond to it.
So Murphy's number eight, if it can go wrong,
it will.
This one, this one has burned me a few times as well. And to keep track of that, we now
have this larger, complex picture of the
technology. We now have front end users, and we
have web applications that are now living in the front end. We have back end
systems, and the back end systems are made up of lots of different pieces.
We have supply chain issues, we have packaged apps connected to microservices. We have hybrid environments with
on prem, with cloud. We have networks all over the place here,
and then we have containers and orchestrations.
Fortunately, we've got the data to allow us to
figure out what's going on with each of these pieces. And so,
synthetics: this helps us test our environment against a known pathway so that we can see if we're improving or getting worse. When we look at this, remember, the user only
cares about his or her personal experience.
Then we have user monitoring, real user monitoring, which lets us
track a user's experience going through the system.
We have endpoint monitoring, where we know where they're coming from, a mobile device
or an IoT device in a car going through a cave, or from a desktop, as well as being able to look at
all of the different things that make up that underlying environment. But then
we need to be able to aggregate it, analyze,
visualize, and respond in any of the ways
that are necessary. So here we come into dashboards,
and here we start looking at application performance monitoring. How is
the application performing? We look at the infrastructure monitoring, we look
at incident response, the alerting structures here, we look at
code profiling, what's happening inside of our code here.
And in all of these cases, we're still dependent on looking into that data set for the final thing, for that root cause analysis, which is still probably going to be a log environment. Crossing all of this is
network performance, and so we have to have network performance monitoring that
goes from end to end so that we can truly understand the
user environment. So, Murphy's number nine: whenever you set out to do something,
something else is going to have to be done first.
I can't tell you the number of trips I've made to the local hardware store
because I started a project and then suddenly realized that one,
I didn't have something or two, the thing I thought I
had wasn't any good anymore. In particular, I can tell you PVC glue and plumber's putty are my two nightmarish conditions that I always end up running to the hardware
store to get. So when
we look at this, one of the things that's happened is that we have changed.
We had observability 1.0. These topics have been
around: collecting logs has been around, collecting traces has been around, collecting metrics has been around. The problem is that we need to
be able to correlate them, and each of them was being handled through a separate
agent into a separate back end. Fortunately, now we have
approaches, observability 2.0, which does that correlation. And it's heavily driven by this thing called OpenTelemetry. OpenTelemetry is the next version of both OpenTracing and OpenCensus, two open source projects that merged together to create a unified environment to produce the data necessary for observability. And if you want to get involved, the OpenTelemetry community on GitHub will get you off to a great start. It'll introduce you to the concepts, the whole works. It's the place you want to consider starting. But I also want
to talk a little bit about one specific piece. OpenTelemetry covers traces, metrics, and is in the process of covering logs. So it's bringing those three classes of data together. But it didn't want to disrupt existing practices, and so we had observability 1.0 with their separate back ends and their separate agents.
The collector architecture allows us to tie those together.
You can bring it in in whatever protocol you want.
You can actually process it inside the collector: should you decide to sample, should you decide to apply machine learning, any of those things can be done inside the collector. Keep in mind that as you add things to the collector, it becomes more heavyweight. And then you can pass the data out into anything you want. So you can bring it in via the OpenTelemetry protocol and pass it out as a Jaeger protocol or as a Prometheus protocol. You can bring Prometheus in and send it both to a tracing environment as well as to a metrics environment, or keep the logs in. Plus, you can bring in Fluentd and bring all those pieces together.
So the collector architecture allows me to be incredibly flexible
about the types of data I collect, as well as the methods by
which I collect them. This means if you've already got solutions in place, it's very easy to move to OpenTelemetry without disrupting your current work.
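As a rough sketch of what feeding a collector looks like from the application side, here is a minimal Python example using the OpenTelemetry SDK to export spans over OTLP. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed and that a collector is listening on localhost:4317; the service name, endpoint, and span names are illustrative choices, not something from the talk.

```python
# Minimal sketch: an application emitting spans over OTLP to a local collector.
# Assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages and a
# collector listening on localhost:4317; names and endpoint are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("cart.items", 3)  # the collector decides where this data goes next
```

The application only ever speaks OTLP to the collector; the collector's own configuration decides whether those spans go on to Jaeger, Prometheus-style metrics pipelines, logs back ends, or all of them.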
So I want to cover a couple of axioms. We've covered nine of the
Murphy's laws, but let's start with this one.
This is the Ashley-Perry statistical axiom:
Numbers are tools. They are not rules.
And quite often we tend to treat numbers as
rules, and that's dangerous. Again,
think back to that prediction accuracy. But basically, we tend to use things as predictive behavior. And honestly,
yeah, sometimes you just want to know what's coming inside of here. The problem is
that prediction is only as good as the data. Precision and accuracy.
Flashback to that. Did I measure the thing? Is the measurement correct,
and are the measurements all in the right alignment?
So, are they both precise and accurate?
But we find this most heavily used for things like historical
versus sudden change, where we can look back and say,
okay, on Mondays in the last four weeks,
the median says it should be here; therefore, we're out of range. Or we can look at it and say, hey, wait a minute, this thing's suddenly gone wrong. And all of a sudden, we've seen a jump in transaction requests. The latency has gone up fivefold.
So this starts giving us the ability to look at some of those things.
If your trend is stationary, either a standard, predictable line or a flat line, yeah, you're probably pretty safe.
But when you start looking at predictive behavior, you have to
be ready to expect false positives and false negatives,
and so things will not necessarily be absolutely
precise. Extrapolation is
better closer to the point of contact. The farther
out you go, the less likely that you can successfully predict that.
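As a loose sketch of that historical-versus-sudden-change idea, here is a small example that compares the latest value against the median of the same weekday over the previous weeks; the four-week window, the 1.5x threshold, and the data are all illustrative assumptions.

```python
# Sketch: flag a value as out of range against the median of recent history.
# The four-week window, 1.5x threshold, and sample values are illustrative.
import statistics

monday_latencies_ms = [210, 195, 225, 205]   # same weekday, last four weeks
latest_ms = 1020                             # today's reading

baseline = statistics.median(monday_latencies_ms)
if latest_ms > 1.5 * baseline:
    print(f"out of range: {latest_ms} ms vs median {baseline} ms "
          f"({latest_ms / baseline:.1f}x the baseline)")
```

A check like this will still produce false positives and false negatives, which is exactly the caveat above: numbers here are tools for drawing attention, not rules.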
Baker's law: misery no longer loves company, it now insists on it.
In our SRE worlds, in our DevOps environments
here, everybody is responsible for things running correctly.
For that, this is now a shared issue. It's not "oh, operations needs to fix that." It's where the devs get involved. And this ties really closely to observability,
because we need to be able to exchange information as
well as not repeat forensic steps. Observability gives us
all this data. Each of the forensic steps, because we are in correlation
mode, means people don't have to go back and rediscover things.
So having all the data gives us that capability,
as well as having the ability to share
the previous environments at any given time.
Make sure that your use of observability brings in the capability of sharing the data, not just
a result, but sharing the data, its context
and its correlation.
Hill's commentaries: there are four of them here, but I love the
fourth one. If it doesn't matter, it does not matter.
So if something breaks and nobody cares, we don't
care either. You know, literally, if the machine isn't running and we don't get any complaints, then probably nobody even knows the machine is not running. The problem is my corollary here: it doesn't matter, it does not matter, until it does. Flashback to our unknown unknowns environment
here. Flashback to that concept of we don't know when
things are going to go wrong, we don't know why necessarily
they're going to go wrong. And in fact, the only thing we can guarantee is that something, sooner or later, is going to go wrong. It doesn't matter until it does, and when that happens, you need
to have all of the data,
all of the observability data: not data that's been sampled, not data that's been filtered, not data that's been bandwidth-limited.
Make sure that you have all of the data here. Observability gives you the ability
to have all that data. Figure out how you can best make use of
it. And finally, Murphy's law number ten.
All's well that ends. And with that,
I'd like to thank you for listening here. Again, I'm Dave
Mack on LinkedIn and I would love to
hear your thoughts and ideas around Murphy's laws
on observability. If you've got a law that you believe applies
to observability, please share it with me. Or honestly,
if you think that I need to expand on something, or you don't agree with me, I'd love to hear from you as well. So with that, thanks again for listening, and enjoy the rest of the show.