Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome.
Today I'm going to be talking about observability versus performance
monitoring, the difference between these two ideas and
why you should care about them. So before we get started, just a quick
overview of what we'll be covering today.
First off, why do we care?
Going to look a little bit into the history of monitoring.
Going to think about why
monitoring has had to evolve over time.
We'll look at a high level overview of observability
and where that term came from and what it means.
How you can benefit from observability. Talk a little bit about
the three pillars. We'll do a recap. We'll talk
a little bit about OpenTelemetry as a path to observability,
and then we'll wrap it up and you can be on your way.
So first off, why do we care? We really care about
this because we want to work on the good stuff. We don't want to spend
our time debugging, troubleshooting, doing support.
Our companies also don't want us to spend our time doing this. It can
be very expensive for an organization to have an entire engineering team
doing troubleshooting and support and not working on future-looking
features. Additionally, just for our quality of life,
we don't want to spend our time, our free time doing this.
You rarely take a job with the hope that someone will wake you
up in the middle of the night to try to keep the lights
on for an organization. So the better and more
stable your environments can be, the better off everybody
in the organization really is.
So starting at the beginning,
in the good old days, back when I was actually doing
development, we were working with typically a single code base.
So it was something that could be run locally. So if I
needed to step through something, I could bring the code down
onto my computer, run it in a debugger with some breakpoints,
and really be able to understand what was happening from start to finish.
So this is monolithic architecture, and it's great at what
it does. So I'm not here to talk
about the differences between microservices and monolithic architecture
or why one is better than the other. I would say there are
appropriate uses for each case. But monolithic
architecture is kind of where monitoring evolved.
So performance monitoring, this really came out of
the data center. So your applications were running on
hardware that you could have some insight into.
You knew what you needed to be able to keep an eye on to
make sure that everything was up and healthy. We'd be able to kind
of take a look and see what are the trends, what's happening here?
Are we headed in a bad direction? Are things pretty stable?
And then this is where alerting really came into play.
When do I need to stop what I'm doing and pay attention to something that's
happening within my infrastructure or application?
So this pager is a picture of
the exact same kind of pager that I carried back in the day.
I would say that a lot of people who have been
on call are familiar with this sort of thing. And the idea is
that you want this to beep at you as little as possible.
So it sounds like monitoring pretty much has it covered.
Alerting can tell us when we need to drop everything and
fix some stuff. We know the health of our servers and our applications,
so why would we need to know anything else?
Distributed systems are why we need to know more than
just the high-level aggregate.
This architecture brings with it a lot of benefits
like improved scalability. It can be more efficient to work in,
it's easier to do kind of rapid deployments.
It's also easier as people join to be able to understand a
smaller piece of the overall system and ramp up quickly and
be able to contribute quickly. But what it does introduce is more complexity,
which makes it more difficult to monitor. While you can monitor a
lot of the different parts of the system effectively, it's hard to
get a really good understanding of what's happening from start to finish.
So when there are issues, it's a lot more challenging to know where those issues
are.
Another aspect of kind of some of the
new technology that's emerged is ephemeral resources.
So it's a lot harder to monitor something if you don't
know when it's going to be there and when it's going to disappear.
So you can't set a monitor for something that you can't
see. So you really need this sort of thing to kind of automatically
pick up. And when it disappears, you no
longer want to be alerted about it because it's supposed to work like that.
It's not a negative that something has
shut down, it's doing that to save your resources, but you don't want to get
pinged about it every single time it happens. So this is another challenge
of monitoring today.
So it makes sense: things are more complex, we need to step
up our game. How are we going to do that?
So that's where observability comes into play. So the answer
to the ultimate question of life, the universe and everything is
observability. It will give you all the answers you need to all of the questions
you could possibly ask in theory.
So observability is really the ability to understand the state
of internal systems by observing the output.
So the idea is that you can collect information
that will tell you what's happening within your system. And that
sounds just like monitoring, and that's because it is.
Monitoring is an aspect of observability.
And if it gives you everything that you need to be able
to answer the questions about your system,
then you have an observable system.
So terms get a little bit muddled and we'll dig into some of that.
So the things that I want everybody to remember is
that the outcomes are more important than the labels. So whether or not you call
it observability or monitoring really doesn't matter. You just need
to know that you can support your
applications the best way possible. It's also
something that's never done. So just like software evolves and advances,
you need to update your monitoring or observability.
It's not a checkbox where you can just say, all right, now we're observable,
and set it and forget it. It doesn't have to be all or nothing.
So I think a lot of people shy away from exploring new methods
of introducing observability into their environments because it
sounds overwhelming. But I would say you can just kind of start
small, get familiar with what you need and what you might like to use and
kind of take it from there. Additionally, like I said, it's not
a set it and forget it, it's a spectrum. So your systems
can be very observable or very opaque, and what you want
to do is kind of get to the level that you need to be successful.
So again, it really doesn't matter what you call it. What matters is
that you can answer any question that you might need to ask of your system.
But I'm going to keep talking about observability because that's the name of this talk.
So the origin of the term observability, it's a measure
of how well the internal states of the system can be inferred from knowledge of
its external outputs. So this came from the general theory
of control systems from the 1960s. So even though we've just
started to hear about this in the past couple of years, it's not a
new concept and it's not a new buzzy term. This is
an idea that's been around for a long time, can be applied to
basically any kind of system, but obviously now we're talking about
distributed systems and software architecture.
So it sounds like all I need to have
to be able to answer any question that I could ask of my system is all
of the data. If I have all of the information, I'll be
able to answer all the questions, right? So one approach
to that would just be logging everything. If we have every little
aspect of everything that's happened, then we'll be in good shape,
right? Not really, because you're kind of building
this dumpster of data that is hard to navigate.
You don't want to spend all of your time digging through logs;
that can be just as ineffective as throwing hypotheses at a
wall and trying to fix something in production.
So you really want a better, more thoughtful
approach to how you want to kind of achieve observability
in your system. So the
big question here is going to be, do you know the unknowns? So monitoring
lets you answer a specific known question like what's a response
time? What's my average response time? Observability really takes
it a step beyond that and will let you say something like,
oh, one of my customers is having a problem,
but only one of them. So what's unique to them that's causing issues?
So that's where observability really comes into play. It lets you work
with higher cardinality data to really be able to
kind of slice and dice information that you're gathering from your systems
to be able to get to root cause analysis very effectively.
So you might ask yourself, can I just buy something to do this? So you
can. There's a lot of tools out there that are labeled as observability
tools, but it's going to take some sweat equity to make it valuable.
So just because you have purchased an observability
platform or tool doesn't mean that your system is observable.
So you really want to be sure that you're picking the tools that are right
for your environment. So the thing that is right for
a six-person startup is not necessarily going to be the right
tool for an enterprise
company with thousands of people. So you really just want to be
thoughtful about your approach to this and not just jump on the
buzziest new thing that lands in your inbox
from a vendor.
So how does any of this help me?
An observable system will really help you fix problems that
you didn't anticipate and be able to navigate
requests across your system in a way that you weren't able to do before.
So what I always kind of like to highlight
here is that at a company that I worked at, we had an engineer,
we'll call him Bob, and he knew everything about
one legacy part of our platform.
He was the only one that knew it. He was the only one that supported
it, unfortunately for him. And I think that's a common scenario
that happens. And the risk that you run when you kind of silo knowledge
that way and you don't make an effort to kind of make that
part of your architecture more observable, is that if Bob
quits or something tragic happens,
you no longer have any insight into that and it leaves you in a really
bad place. Bob gets hit by a bus and suddenly you're trying to
reverse engineer something that nobody has any familiarity
with. So it'd be a lot better if you were getting some useful outputs of
it. So it's kind of the idea of slowing down to
speed up: it's better to put a little bit of effort
into this before you need it, so that you're not caught on your heels when
you do.
And just to kind of paint that picture again.
So you've been paged, so what's going to happen now?
A lot of us would try switching it off and
then switching it on again. So when it's the middle of the
night and you're getting paged and you don't have all the information you need to
really know what's going on, but somewhere in a runbook it says to just
restart the service if this particular state
gets reported, then that's probably what you're going to do. You're not going to spend
a lot of time trying to figure out the root cause, and you may be
off call tomorrow and this is going to be somebody else's problem.
But really what you want to be able to do is make your systems better.
So to do that, you need to be able to answer some questions about this
sort of incident. You want to know who's being impacted. And if you're just working
with aggregate data, you may not be able to understand that. So if I
know who's being impacted, I might have a better sense of urgency. So if
it's something like a canary in production that triggered
an alert, I might be able to ignore it until tomorrow. But if
it's potentially like our biggest customer and they just onboarded a
bunch of users, it really may be an all hands on deck scenario. But having
that information lets me really evaluate and assess the
priority of the incident. Do I have
what I need to resolve it? So do I have enough information available to me
to either resolve the issue or kind of hand it off to somebody who
has the information that they need, without just sort
of flying blindly into an issue.
I want to know where the problem is. So I don't want
to just start kind of at what
I consider the beginning or the end. I'd like to have some information about
where this is happening. If it's something that is caused
upstream and I'm just feeling the pain of it here at the end where the
customer sits, I want to know that. I don't want to have to kind of
guess. I also want to know when the issue started. So we
want to be aware of whether or not this is something that we've
been trending towards over time, or is this very sudden.
And we also want to know how we ended up in the state so that
we can prevent it from happening again, obviously.
So that brings us to the three pillars of observability.
So we've talked a little bit about some of the kind of more
conceptual ideas around observability, and we're going to get into some of the
more kind of nitty gritty of the nuts and bolts of what people
consider traditional observability today.
So metrics is a good starting point. Metrics are
intended to provide statistical information in aggregate.
So this is what we're all kind of familiar with. It can give you a
really good indication of the current state of things,
and it's a great place to set your alerting. And this is more
like that traditional monitoring where it's kind of at a high
level. It's a really good vehicle
for storing information about your systems, but it's not
great for doing diagnostics because you've lost all of that
good kind of connective tissue data that the
metric is made up of. So once you have an incident, you can't
drill in any further to understand what happened. So if you see a spike,
you kind of have to do your own correlation when you start to dig into
your logs, based on timestamps or other information that you have; it's not
done for you.
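To make that concrete, here's a minimal sketch of recording a request-duration metric with the OpenTelemetry Python SDK (jumping ahead a bit to the tooling covered later in the talk). The meter name, metric name, and attribute values are illustrative, not from the talk:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export aggregated metrics on a fixed interval (to the console here).
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
request_duration = meter.create_histogram(
    "http.server.duration",
    unit="ms",
    description="Time spent handling a request",
)

# Each recording is rolled up into an aggregate. The individual request
# behind a spike isn't kept, which is why metrics are great for alerting
# but limited for drill-down diagnostics.
request_duration.record(87.0, {"http.route": "/checkout", "http.status_code": 200})
```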
Distributed traces are the shiny new things
that we've gotten with application observability. That's really exciting.
So traditionally, a trace traces something within a particular
location. So I always think of like a traditional database,
you can run a query, you can trace that query and understand
everything that's happening along the way. But a distributed trace will let
you do a similar kind of following
of a request, but it can hop across different resources,
which is what makes it really kind of magical when you're thinking about
trying to understand maybe a customer experience. So you
know that they started by trying to make a particular request,
say you're selling something and they're
trying to check out. So it starts there and it may hit
a whole bunch of different back-end services. So you may be looking at customer
IDs, you may be looking up SKUs, you might be
checking inventory and that sort of thing. And those may all be different systems.
So with distributed tracing, you'll be able to trace that the
whole way.
So this is just a look at what a distributed trace might look like.
And what you can see is that the trace itself
is comprised of spans which are little units of work.
And this is demo data, the OpenTelemetry demo data,
but it shows you that the request is crossing different
resources and languages. So it really
kind of gives you a good visualization of that start to finish sort of
understanding you can get of something that is requested of
your application.
And this is just a look at one of the spans expanded.
So you can see that we've got some
kind of custom resource information here, and we have detailed down
to the level of the actual product name.
So that National Park Foundation Explorascope is an actual product
that we've looked up. So just kind of highlighting just how
granular you can get with your span and your trace
data here.
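As a rough sketch of how those custom attributes end up on a span, assuming the OpenTelemetry Python API (the span names and attribute key are illustrative, and you'd still need a configured SDK to actually export anything):

```python
from opentelemetry import trace

tracer = trace.get_tracer("product-catalog")

# A trace is built out of spans, each one a small unit of work; spans
# started inside another span become its children.
with tracer.start_as_current_span("get-product") as span:
    # High-granularity attributes like the actual product name are what
    # let you drill from a trace down to one specific lookup.
    span.set_attribute("app.product.name", "National Park Foundation Explorascope")
    with tracer.start_as_current_span("check-inventory"):
        ...  # the inventory lookup would happen here
```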
And then logs are another really important part of observability.
So earlier I said you don't want to just dump everything
into your logs and assume that that's the best path to
being able to resolve issues. But logs really do
hold a whole lot of really great information that can help you troubleshoot
things, and they're more powerful when you can correlate them to other signals,
like a distributed trace or a span. So the great thing
about logs is that they can have
really high cardinality, which means that you've got more
independent pieces of data that you can kind of pivot on. So something like
a user id, an organization id, some of your custom
resources from your services, that can really help you understand
things at a very precise level, as opposed to a more aggregate
level where you're looking at things maybe just rolled up by
time or by a service name or something
like that.
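As one possible sketch of that correlation, here's a small log helper that stamps the current trace and span IDs onto a log line using the OpenTelemetry Python API; the field names like user_id and org_id are just illustrative high-cardinality attributes:

```python
import logging

from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

def log_with_trace(message: str, **fields):
    # Attaching the trace and span IDs lets a backend jump from this log
    # line straight to the distributed trace it belongs to.
    ctx = trace.get_current_span().get_span_context()
    extras = " ".join(f"{key}={value}" for key, value in fields.items())
    logger.info("%s trace_id=%032x span_id=%016x %s",
                message, ctx.trace_id, ctx.span_id, extras)

log_with_trace("checkout failed", user_id="u-1234", org_id="org-42")
```

The opentelemetry-instrumentation-logging package can inject these IDs into standard log records automatically, so you don't have to hand-roll this.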
So to quickly review so far, just want to reiterate
that collecting data does not make a system observable.
You do have to collect data to achieve observability but collecting data alone
will not accomplish that for you.
The value really lies in the ability to answer questions.
So again, when we talk about outcomes instead of outputs, this is
the outcome that we want. We want to be able to answer questions that we
need to ask of our systems, and to have healthy
systems that we really understand thoroughly.
So one of the downsides of just amassing a lot of data,
storing it for later in case you ever need it, is that it's very expensive
and it's hard to navigate and it's just wasting space and resources
for you. So just kind of hammering home the point that
just collecting data is not the answer that we're looking for here.
And when you kind of start on your
journey to observability, there's a lot of solutions out there.
So you may find yourself really kind
of experiencing fatigue with the different tools that
you're attempting to implement, the number of sales calls
you're getting about different sorts of observability tools, and really just kind of
the concept of observability as this kind of huge
army of different tools and services you need to implement within
your environment. And it's not necessarily that;
it can be simpler than that, but it can be kind of overwhelming
when you're trying to figure out the best approach for your
needs. And again, your needs and your team
will really dictate the solution needed. So you
can start very simply and small if that's what your team needs.
So you don't need to go all in and buy the most
expensive, shiniest thing. It's not necessarily better for what you're trying to
accomplish. So you really need to evaluate and choose the right
solution for you, be that something that a vendor provides or
something that you can kind of build and maintain in house. It really
depends on your specific situation. There's not
just like a one size fits all approach for this.
And that brings us to OpenTelemetry. So before we jump into
this, I do want to just say OpenTelemetry is not the only way to
achieve observability. It's something that we
really like at TelemetryHub because it introduces a standard for observability.
And that standard enables correlation of
the different signals.
It's managed very effectively by the OpenTelemetry instrumentation
so that you don't have to do it yourself. So it takes a lot of
the effort out of achieving a really observable system
and does it for you through a really amazing project.
So OpenTelemetry is an open source
project. It's the second most active CNCF project after Kubernetes.
All of the big players have kind of bought in and
started to provide support for it. It's integrated directly into
a lot of cloud native stacks, which is great and it's
fairly simple to use and there's a
lot of customization so that you can
really instrument something that's specific to the details of your application
so you can really understand what you need to know. But it's a
great project to be able to sort of start simply. There's some great documentation on
the website about how you can get up and running, and some really great tools
that you can use. And again, as we said, it's not
all or nothing, you can kind of start playing around with this, get an idea
of what it can do for you, and see if it's something you want to explore
without having to go all in and spend tons and tons of cycles on it.
This really introduces a shared standard, so it provides a
shared concept of those metrics, traces and logs that we were talking
about, and a shared protocol for sending and receiving those signals.
It comes with SDKs in a lot of popular languages,
in varying degrees of maturity, and all of that
is available on the website. So you
can see where all of those lie. And the great thing about it being open
source is that if your preferred language isn't as mature as you would like
it to be, you can contribute to it.
So the components of the project are really the cross-language specification;
tools to collect, transform, and export the data;
the SDKs; and the auto-instrumentation
and contrib packages.
So you might be saying, I thought that open source meant you have to do
it yourself. Sometimes that's the case, and you can make this
a very complex implementation if you want to, and that's where your
journey takes you. But it doesn't have to be. There's some really good auto instrumentation
that you get with a very simple implementation of
OpenTelemetry. So it doesn't have to be hard.
So this is just a quick screen grab from the TelemetryHub
documentation, but this is basically what you would need to kind of get started
instrumenting a Python application. Pretty straightforward and
again a great place to start, and then you can add complexity
as you go and as you know what you really want to be able to
get out of your system.
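Since the slide itself isn't reproduced here, this is a minimal sketch of that kind of getting-started instrumentation using the standard OpenTelemetry Python SDK; it isn't the exact TelemetryHub snippet, and the service and span names are placeholders:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Name the service and wire up a simple exporter (console, to start).
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle-checkout"):
    ...  # your application code
```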
Another really cool thing you get from OpenTelemetry
is the OpenTelemetry Collector: it can receive,
process and export your signal data, but it's a lot more powerful
than just that. Also, just to
clarify, you don't have to run the OpenTelemetry Collector to be
able to get your signal data out of your application. Once you've instrumented,
you can actually send that data directly to a backend, but the OTel Collector
gives you some really good control with that processing step so
that you can be very particular about what you're sending to
your back end.
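As a sketch of what that choice looks like in code, assuming the OTLP exporter from the OpenTelemetry Python packages (the endpoints and header values are placeholders, not real credentials):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Option 1: send spans to an OpenTelemetry Collector running alongside
# the app (4317 is the default OTLP gRPC port).
exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)

# Option 2: skip the Collector and send directly to any backend that
# accepts OTLP, typically with an API key in the headers.
# exporter = OTLPSpanExporter(
#     endpoint="https://otlp.example-backend.com:4317",
#     headers={"x-api-key": "YOUR_KEY"},
# )

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```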
So you might say, okay, sounds good: I can instrument OpenTelemetry in my
application and in my infrastructure, and that'll give me all the information I need
to achieve observability. But what do I do with it?
So this is the great thing about OpenTelemetry: it
gives you a really vendor agnostic approach to generating
and sending your telemetry data. So you can send
it to us at TelemetryHub. You can send it to one
of the big monitoring vendors like Datadog. You can keep it in
house and build your own tools around it. You can use other open source solutions.
It really leaves you in a good position to sort of try things out
and see what works for you and also to let your observability
implementations kind of evolve over time. So if you outgrow
a solution, you don't have to rip out proprietary agents and install
something new. You can just point your signal data somewhere else that gives you a
better visualization of what you want to see.
One of the other things about the OpenTelemetry Collector that helps to support this
is that you can send data to multiple places. So if
you want to keep your log files in house, as well as sending them to
a log exploration tool, you can do that using the collector.
All right, so one quick analogy to
kind of bring this into the physical world conceptually,
and we are all set. So I
stole this from our engineering lead,
Lance here at TelemetryHub, and I really like this kind of illustration
of observability. So if you think of home
cooking, so it's you by yourself in your kitchen,
you're the one that's touching everything, so you know exactly what's happening. So when you
create your scrambled eggs for breakfast, you know,
when you took the eggs out of the fridge, you could theoretically know
how cold the fridge was, you know, whether you
put in milk or butter, and you
kind of have all the information you need to be able to understand why
that meal turned out the way it did.
But once you move into a restaurant, everything kind of
turns on its head. So if you've ever worked in food service, you know this,
but there's many stations, and the bigger the restaurant,
the more complex it is and the more things that can go wrong along the
way, because an order can pass through many different stations.
So it starts with a server at a table
taking an order. She may pin that somewhere for
the person that's executing the order to start on it.
And it can already have fallen apart right there. And you're not
going to know as easily as you would if it was just you by yourself.
So she takes an order,
the chefs work on it. It goes down the line through all the sous chefs
who are adding salt and adding sides and all this, and it
ends up back on the table. And the soup is too salty.
But we don't know who did that or where it happened or how to prevent
it from happening again because we don't have all the information we need to be
able to really understand our entire restaurant system.
So the system being all the different things we're using and the people
involved. So just a good way to kind of think about
observability. And when you need more or less. So if it's
just you in the kitchen by yourself, then maybe the thermometer
on your oven is all you need to be fully observable. And you don't need
to invest in anything more complex or more expensive than
that. But in a restaurant scenario, you may need a lot more monitoring to
be able to really understand what's happening. And that can be applied
to your system.
And so that is it for me today.
I love to talk about this stuff, so feel free to email me at
Sarah@telemetryhub.com, and thanks for listening.