Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, y'all, thank you, Conf42 Observability, for having
us today. We're going to be having a panel around observability,
the future of it, what we can do with current tooling, and
what we want from the next generation of tooling. So to get started, I have an
amazing group of panelists here. Let's get started with Adriana
introducing herself. Who are you and what do you do?
Hey, I work with Ana.
My name is Adriana Villela. We work together at ServiceNow Cloud Observability.
Let's say that five times fast.
Yeah, I'm a senior staff developer advocate
and I talk about all things observability and I spend most of
my time in the OTel End User SIG, where I am
a maintainer there. How about you, Amy?
I'm Amy Tobey. I'm a senior principal engineer at Equinix.
I've been there for three and a half years or something like that. And I
kind of started off doing OpenTelemetry there.
And these days I do reliability work across the company
and write code and all kinds of weird stuff in between the cracks.
My name is Charity Majors. I am co-founder and CTO of
Honeycomb.io, a company that does observability tooling.
What an amazing group of folks we have here, from all walks of the observability
journey, from end user to doing the reliability work to leading companies in
this space. So the first question that we have for
getting folks a little bit more adjusted to who we are, how did you all
get started on observability? Well,
I was looking for a word that was
not monitoring. And I looked at Wikipedia and I was like,
observability, how well can you understand the inner workings
of a system just by observing the outsides?
That's a pretty cool idea. But, like, honestly, I come
from operations, and my entire career I have hated monitoring. Like,
I do not think graphically, I don't like. Like,
I've always been the person who would, like, bookmark other people's dashboards and
graphs. Because I only love debugging.
I do not come by this naturally. So the
idea is to make more people like me able to be more like Amy.
Well, I guess, like, it kind of starts with a long time ago in
a land far, far away kind of thing, right? Like Charity, I've been doing this
for 25 years and change, but I guess
observability started probably around the same time Charity started saying it.
But it's stuff I'd been doing for years before. And I think that's
an important point. This is all stuff we've been doing throughout,
but we're getting better at it. So the term
is relatively new in the consciousness, in tech, but I think
what we're actually doing is a better version of what we
used to do. So that's kind of my take on
it, right? I don't remember exactly when I started using it. It was probably the
first time I heard it from Charity or somebody else like that.
For me, I got into observability because of Charity. I stumbled across one
of your tweets randomly one night and started
following you on Twitter. And then I got like really into it when I was
hired to be a manager on an observability team at
Tucows. That Tucows,
if anyone remembers those days. I do remember.
It was that Tucows, but they had changed their business.
So obviously it wasn't like downloading Windows software.
They were doing domain wholesale
when I joined. So anyway, yeah, so I was hired
to manage this observability team and I was like, oh shit,
I better know what this thing is if I'm going to manage a team.
And so I started like educating myself
and asking lots of questions on the interwebs and blogging about it to
educate myself and discovered that I really like the stuff.
So that's how I got into it. And you were so good at it.
Like, I just remember these articles starting to pop out from out of
nowhere. Just like, you kept writing about all these things, but it was like so
succinct and so well informed and so grounded in actual experience.
And I was like, this person knows what they're talking about. This is cool.
We should be friends. Thank you.
Well, thanks for letting us know how y'all got started.
And now, in your journey, what is one thing that
you wish you could get across, if attendees could nail down only one
thing today? The way you're doing it now is the
hard way. Oh my God. I don't know how many times I've heard someone
be like, wow, our team hasn't really mastered the 1.0 stuff,
so I don't think we're ready for 2.0, which is something we would never say
about any other version. I'm not ready for MySQL 12, we haven't found
all the bugs in MySQL 8 yet. No. Nobody says we're
going back to Windows 3, it's all about Windows 3.
Yeah. Like, the level of difficulty
of dealing with metrics and cost management and trying
to... because, you know, metrics, time series databases store
no connective tissue. They store no context, no relational data.
And so when you're trying to understand your systems, the only thing
that connects your dashboards and your logs and your traces and your RUM and
your APM, the only thing, is you. You sitting here guessing
and, like, remembering past outages and
having instinctive, like, intuitive leaps.
It's really freaking hard. And it can be so much easier.
I'm with Charity on that, but I think I'll give it my own flavor,
which is, most of my career, at least the last 10 to 15 years,
A common thing people ask me is, hey, teach me to do that thing
you do in incidents, right? And they'll be like, you know that thing where
you show up and you pull up a couple of dashboards and you go,
it's this thing, right? Because. Because I have this
deep mental model of the system, right? And it's so freaking hard to
teach. And basically, like, what we end up saying a lot of times is like,
well, you kind of have to experience it to build that intuition.
That's a terrible answer for teaching anybody anything, right? Like, people need
something to start on. They need some way to start going down that journey instead
of being dumped in a sea of, as Charity put it, like,
contextless metrics. Now,
as we move into this easier mode and we go into the easier way,
obviously now I can dump my engineers into, hey, look, this is what your
system looks like, and you can click around and explore it and start to
see what it's actually doing instead of trying to reason backwards from a
bunch of meaningless logs, which actually, weirdly, gets worse
with structured logs. For me, I would say I definitely agree
with both of you. I think the one thing that I've noticed
is, like, people still insist on
continuing to build shitty
products and not take a moment to instrument their
code to understand why their products continue behaving
shittily. And so
I think that's like, the most frustrating thing that I've observed is
like, you know, the analogy I like to use is like, your house is burning
down. Are you going to continue building out the bedroom? Because it makes
no freaking sense. Okay. But my product manager said I need a
bedroom feature, and that's the most important thing I
could be working on right now. But do you have a front door
that works? Yeah. Right? Product said.
That's okay, right? I have to focus on the bedroom.
We need to get a new bedroom out fast.
The fire is the features that we already built. Those. Those are in the
past. We don't, we don't, we never look back.
That's quite fair. I heard some chatter about
observability 2.0. I had a question around where observability is headed.
Would any of you like to chime in?
Yeah. Yeah. Like, I think what's going to
happen over the next few years is people are going to go through all their
applications and they're going to log everything, and then
what's going to happen is their CFO is going to call and
say, what happened? And then
people have to come back to the drawing board and start over.
And then they're going to start to realize that we need to actually
put some context into this, bundle up more of these
attributes and edges and
cardinality, right? Like, the attributes
on there, and put them in a succinct way so
that we get value from that data. Right. Because our CFOs
aren't, like, afraid of spending money. That's their whole job. They love it, actually.
What they hate is when you spend money on dumb stuff.
And so if you want to make your CFO happy, like don't go put logs
on everything, go and use your judgment to put the right
signals in the right places. And usually that's
easier in this day and age with, like, OpenTelemetry
auto-instrumentation. So if you have an auto-instrumenter, you do a few
steps and now you've got high signal.
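To make that concrete, here is a minimal sketch of what this can look like with the OpenTelemetry Python SDK; the service name, span name, and attributes below are hypothetical illustrations, not something the panelists prescribed.

```python
# A minimal sketch of adding a few deliberately chosen, high-signal
# attributes with the OpenTelemetry Python SDK (pip install opentelemetry-sdk).
# The service name, attribute keys, and values are hypothetical.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export to the console here; in practice you would point an OTLP exporter
# at your collector or backend of choice.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def charge_card(user_id: str, cart_total_usd: float) -> None:
    with tracer.start_as_current_span("charge_card") as span:
        # The judgment call described above: a handful of attributes that
        # actually answer questions, instead of logging everything.
        span.set_attribute("user.id", user_id)
        span.set_attribute("cart.total_usd", cart_total_usd)
        span.set_attribute("payment.provider", "hypothetical-psp")
        # ... the real work would happen here ...

charge_card("user-123", 42.50)
```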
So I think that's where things are going, right? Is now that we've freed these
SDKs from being, like, vendor-specific, it's no
longer, like, vendor lock-in for me to actually integrate them with my application.
I can integrate OpenTelemetry under, usually, the Apache license.
So I don't have to worry about licensing, I don't have to worry about vendor
lock in. And so things are moving faster, the tools are
getting better, the tools you all make at ServiceNow
and at Honeycomb and even the open source stuff is all improving rather rapidly
over the last few years. So I think it's just going to keep growing more
and more towards tracing and hopefully we'll see a little less emphasis
on metrics and logs over the next, like, five-ish years.
Yeah, I mean, so big question.
Yeah. So, like, the difference between observability 1.0 and 2.0, for those
who are kind of new to this space, is: 1.0 tools have
many sources of truth, right? The traditional thing is there are three pillars,
metrics, logs and traces, which you probably know that none of us think very highly
of. But in reality, like, you've got way more than three, right? You've got your
APM, you've got your RUM, you've got your dashboards, you've got your logs, both
structured and unstructured, traces, profiling,
whatever, many sources of truth, and nothing
knits them together except you.
In your 2.0 world, you have a single source of
truth. It's these arbitrarily wide structured events. You can
call them logs, canonical logs, if you want. I'm not opposed to calling them logs,
but you have the information so that you can visualize them over time as a
trace. You can slice and dice them like logs,
you can derive metrics from them, you can derive your dashboards, you can derive
your SLOs, but because there's one source of truth,
you can actually ask any arbitrary question of them, which means
you can understand anything that's happening. And so, in the near term,
what this means is the most impactful thing you can probably
do is wherever you've got metrics,
stop. Stop sending money and engineering
cycles into the money hole that is metrics. Reallocate that to logs.
And where you've got logs, make sure they're structured. And where you've got structured logs,
make them as wide as you can. You want to have fewer loglines,
wider ones, because the wider they are, the more context
you have, the more connectivity you have, the more your ability to
just instantly compute and see outliers and weird
things and like, you know, correlate and all this stuff, the more powerful your
ability is to understand your systems in general. So while this sounds pretty complicated,
I think it's actually really simple. Fewer, wider log lines.
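To make "fewer, wider log lines" concrete, here is a rough sketch of the idea: accumulate context across one request and emit a single wide, structured event at the end. Every field name below is invented for illustration; it is not a prescribed schema.

```python
# A rough sketch of "one wide, structured event per request": accumulate
# context as the request runs and emit a single canonical log line at the end.
import json
import time
import uuid

def handle_request(user_id: str, route: str) -> None:
    start = time.time()
    # One dict accumulates everything we learn while serving the request.
    event = {
        "timestamp": start,
        "trace.id": str(uuid.uuid4()),
        "service.name": "checkout-service",   # hypothetical
        "http.route": route,
        "user.id": user_id,
    }
    try:
        # ... real work happens here; keep enriching the event as you learn things ...
        event["cart.item_count"] = 3
        event["payment.provider"] = "hypothetical-psp"
        event["status_code"] = 200
    except Exception as exc:
        event["status_code"] = 500
        event["error.message"] = str(exc)
        raise
    finally:
        event["duration_ms"] = (time.time() - start) * 1000
        # One wide line instead of dozens of narrow ones: traces, metrics,
        # dashboards, and SLOs can all be derived from events shaped like this.
        print(json.dumps(event))

handle_request("user-123", "/checkout")
```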
If I may add to that, because I love that,
to me, I interpret it as:
your main thing is the event, and the
event can be made
into whatever you need it to be. Anything can be derived from the event.
You can correlate on any of the attributes, right? Yeah,
exactly. You have your choice at query time what you're going to correlate on.
Yeah, exactly. But your base, like, everything
stems from this event. And then, you know, it can be
whatever. But the other thing that I wanted to mention as well
is, like, treating observability as not this, like,
afterthought, you know. I think a lot of people have schlepped
observability to the end of the SDLC, shifting it
as far right as possible. And we need to really start
thinking of observability as a more shift left, like an integrated part of
our SDLC. Right. Because guess who's going to instrument
the code? It's going to be the developers, but the QAs are going to take
advantage of that. Who do you have to convince? Who do you have
to convince? Oh, that's the trick. I mean you don't have
to convince the developers. The developers love this shit.
But, yeah, yeah, yeah. It's usually
their leadership team that doesn't understand the value. Right. It is obvious and
that's usually. Yeah. Anyway, I interrupted you. Oh no, that's fine.
No, that's a very fair point. And this is somewhere where
I feel like... So there's a super
basic level, you know, it's: you've got one source of truth or many. But there
are so many sociotechnical factors and consequences that flow from this fundamental
choice, the way that you arrange your data on disk. One of
them is I think that observability 1.0 is really about
how you operate your code. Right. It's about how you run it. It's kind of
the afterthought as you're getting ready to deploy it. Observability 2.0 is
about how you develop your code, which includes operating, of course,
but it's the substrate that allows you
to hook up these really tight feedback loops and
pick up speed. And there's so much of what I think of as, like, dark
matter in software engineering. It's just like, why is everything so slow?
Why? You can't do OODA without the first O.
It's just ODA: just orient, decide,
act. Like... I've worked at companies like that,
they haven't really done much. But being
able to see what you're doing, being able to move swiftly with confidence,
it's like a whole different profession when your development cycle includes
production. Like, when you look at the DORA metrics and how year over year
the good teams are, like, achieving liftoff and most of us
are just getting slightly worse year over year.
I attribute a lot of that to the quality of
your observability. I think that's how
I see it too, right? Because a lot of why it's happening is the increasing
complexity of our stacks too. Yeah. And so we got away with it before.
15 years ago, our stacks were limited to a few
servers and a few services. And now
you can easily have a single product that has 50 to 70 services
distributed across the world. And you can't grapple with the
complexity like that with disparate tools anymore. Good luck reading
lines of code and understanding what it's doing anymore. Like, you can't.
If you ever cut, you really can't. Now you can only understand
it by including production into your core
development loop. Sorry,
I was going to say also, like, this is why I get, like, so mad,
too, when you have, like, people at companies who are like, let's continue
doing dashboards the way that we used to before because it
worked for me 15 years ago when it was wildly successful.
But what they're saying, though, right, what these folks are reaching
for, is that dashboards are
a way that we've encoded what Charity is talking about, the mental model.
So the reason why people would often hire people like me is they'd say,
hey, come in, figure out our system and build us some
dashboards. And I have leaders right now that are asking for dashboards.
And I'm trying to explain, I'm saying, well, do you actually want
a dashboard, or are you asking me if the team understands their service?
They're not the same, but they actually do get conflated pretty frequently.
I think that's that real difference that's emerging.
That's an amazing point, bringing me to my next question: what are some
ways that we can validate the quality of our data? We can say, hey,
we're instrumenting, we're getting all of our different resources.
We have some attributes that we can look at, but how do you
know that the work is working?
Because you're not moving very slowly. I think it's nuts,
too. Another interesting thing that is sort of a characteristic of the 1.0
to 2.0 shift is that if all we do is stuff
the same data into a faster storage engine,
it will be better, but it will be kind of a lost
opportunity. Because, I feel like, you can't run a
business with metrics. Right? On the business side of the house, they have had nice
things for a very long time. They've had these columnar stores. You can slice and
dice, you can break down, group by, you can zoom in, you can zoom
out, you can ask all these fancy questions about your marketing or
your click-throughs or, like, your sales conversions and all that stuff. It's only,
it's like the cobbler's children have no shoes. It's only on the systems
and application side that we're like, we're just going to get by on starvation rations,
thank you very much. But, like, the future, I think,
as, you know, as it becomes increasingly uncool for engineers to just be
like, I don't care about the business. All I do is care about the tech,
right? Like, we know that doesn't work anymore, and yet we're still siloing
off business data from application data, from systems data,
when all of the interesting questions that you need to ask are a combination.
They're a combination of systems, user data,
how users are interacting with your systems, application data, and business.
Right? And, like, when you have all of these different
teams using different tools that are siloed off from each other.
That's why shit gets so slow. How are we going to know when we're doing
well? Because we can move fast. And that sounds kind of
hand-wavy, but it's actually the only thing that
matters. You can't run fast if you're dragging a
boat anchor behind you, right? And if
you don't have good tools to understand where you are and where you're going next,
might as well just tie that boat anchor up. And I'd say the
quality of the data, too. Like, you know, you can instrument your code
full of crap. I think, you know, a mistake that a lot of people make
when they start instrumenting their code is like, auto instrumentation
is salvation. Right? That's awesome. It's a great starting point, but, like,
you'll learn very fast that it produces a lot of crap that you don't need.
And now you have to wade through the crap. And I think when you get
to the point in your instrumentation where, like,
say, you write your code, you hand
it over to QA. If your QAs can
figure out why a problem is occurring when they're running
their tests and go back to the developer and say, hey, I know why, then you've
instrumented your code sufficiently. If they can't figure out what's going on,
then it's an indicator you haven't instrumented your code sufficiently,
so we can actually help to improve the process
earlier on before it gets to our SREs,
before they're like, what's going on? I don't know what's going
on. Yeah. And, well, post QA, where you want to be is
you want that observability loop. And actually, that's how
you tell if it's working is because every time you make a change,
if you go out and look at your data, you've now, one, gotten value from
that data right away. And then, two,
you know that your stuff is working not just the code you
wrote, but also your observability system. And so you're just kind of continually
evaluating that. And that's, I think, part of the job now.
Right. Unless you're still in like, a waterfall shop,
which is still like, honestly, the vast majority of tech.
Yeah. In a way that, like, 15 years
ago, we were all getting used to going from a model where my code
compiles, it must be good over to, like,
my test pass. That's how I know it's good, you know,
now I think we're moving into the world where you don't know if it's good
whether your tests pass or not. All you know is that your
code probably, logically, still has the magic smoke inside
the box. Right. That's what I tell people, even back at the beginning.
Like, the first CI/CD system I deployed, it was using smoke, or smoke
test. It was a Python thing, way early.
Right. But the reason it was called that is from an old EE term,
the smoke test, which is the first time you power up
a circuit, you put power on it. You don't care about the inputs
or outputs. You just see if the magic smoke comes out. Right. If the
chip goes and you get a little puff of smoke, you failed the smoke test.
Right. And most of our unit tests are basically smoke tests. Right. We're applying some
power and see if the magic smoke comes out. It gets a little bit higher
value. Once you get into functional testing or, or full integration testing,
maybe in a staging environment, nothing will ever behave
the same as production because there are so many freaking parameters
that we can never reproduce. Because production isn't production from
moment to moment either. People should
never ship a diff without validating it's
doing what I expect it to do and nothing else looks weird.
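As a small aside, here is what the "apply power, see if the magic smoke comes out" idea can look like as a deploy-time check; the base URL and health endpoint below are hypothetical, and this is a sketch of the idea rather than anyone's actual pipeline.

```python
# A minimal deploy-time smoke test in the "apply power, watch for smoke"
# spirit: it only checks that the service answers at all, not that its
# outputs are correct. BASE_URL and the /healthz endpoint are hypothetical.
import os
import urllib.request

BASE_URL = os.environ.get("BASE_URL", "http://localhost:8080")

def smoke_test() -> None:
    with urllib.request.urlopen(f"{BASE_URL}/healthz", timeout=5) as resp:
        assert resp.status == 200, f"unexpected status: {resp.status}"

if __name__ == "__main__":
    smoke_test()
    print("no magic smoke escaped")
```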
Do you all have some tips for folks to get comfortable on
that data that they're observing? Just to get more comfortable looking
around at it? Or, apart from doing the iteration of, I'm
going to go instrument something as code shows up. But, like,
how do you actually understand what that is and, like, how that plays into,
like, the bigger mental model of a system? Look at it.
Yeah, I was gonna say I have a product feature I use, which is every
once in a while, I go into our Honeycomb and I go into
the searches everybody else has run, and I go snoop in
on their searches. And so I can see, like, what other people are looking,
looking at. And then often they'll be like, oh, that one's interesting. Like, what problem
are they working on? I'll DM them later if it's really interesting. Be like,
hey, what was that? But that's, that's one of my favorites right
there. It's like because I can see what other teams are doing, too. Yeah.
I mean, so much of this is about knowing
what good looks like. And, you know, this is why I
always pair: is it doing what I expect it to do? And does anything else
look weird? And there's something that's irreducible about that,
you know, because you don't know what you don't know. But if
you make a practice of looking at this, you'll start to know if there's
something that you don't know there. And this is where,
like, I feel like another of the shifts from 1.0 to 2.0 is a shift
from known unknowns to unknown unknowns, which parallels
the shift from doing all of your aggregating at write time,
where you're like, okay, future me is only ever going to want to ask these
questions, so I'm defining them now, collecting that data. Versus:
I don't know
what's going to happen, right? So I'm just going to collect the telemetry I can,
I'm going to store the raw results and I'm going to have the ability to
slice and dice and breakdown and group by and explore and zoom in and zoom
out and all this other stuff because you don't know what you don't know.
And so much of debugging boils down to this very simple,
simple loop, which is: here's the thing I care about.
Why? Why do I care about it? Which means, how is it different from everything
that I don't care about?
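A toy sketch of that loop over wide events is below: take the events you care about, say the errors, and compare how an attribute is distributed there versus everywhere else. The sample events and field names are made up; in practice this is the kind of computation a columnar event store does for you at query time.

```python
# A toy version of "here's the thing I care about; how is it different from
# everything I don't care about?" over wide events.
from collections import Counter

# Hypothetical wide events; in real life these come from your event store.
events = [
    {"status_code": 500, "region": "us-east", "build_id": "abc123"},
    {"status_code": 500, "region": "us-east", "build_id": "abc123"},
    {"status_code": 200, "region": "us-east", "build_id": "abc122"},
    {"status_code": 200, "region": "eu-west", "build_id": "abc122"},
]

def compare(events, care_about, key):
    """Distribution of `key` among events we care about versus the rest."""
    care = Counter(e.get(key) for e in events if care_about(e))
    rest = Counter(e.get(key) for e in events if not care_about(e))
    return care, rest

care, rest = compare(events, lambda e: e["status_code"] >= 500, "build_id")
print("errors:", care)           # e.g. almost entirely one build
print("everything else:", rest)
```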
One other thing that I would add to this, and it happened to me as, like,
an aha moment: I think I added something to the OpenTelemetry
demo, the community demo.
And I'm like, how do I know what's going on? I'm like, wait,
I can instrument it.
And I think, like, if you can instrument your code
and then use whatever observability backend to
understand what's going on,
then I think that's already, like, you know... developers
should get into the habit of
debugging their own code by using the instrumentation that
they add to their code. Yep. Which is
right. The ultimate answer here is like,
it's just like playing piano or playing any music instrument or
really any kind of detailed skill, right? You have to do the
thing to get good at the thing. That's, it's neuroplasticity,
right? Like, your brain builds that ability to reason and make the intuitive
leaps by doing the practice.
And the other neat thing about this though is when I say it that way,
to a lot of people it sounds like, oh my God, it's going to take
as long as learning the piano. But the good news is it doesn't because
you can learn the piano. Most people can if you play for ten minutes a
day. If you just do it every day. That's where I
think for developers who want to get really good at their jobs, and I don't
mean just observability, but their job in general, if they're getting
in, looking at their metrics, looking at their traces every day that they're
working, or at least once a week would be a huge improvement for a lot
of folks. They'll get that practice, the practice
of reasoning. It's not about the fingers, it's about
your thought process and burning that in. And to
me, this is what defines a senior engineer, someone who can
follow and understand their code in
the real-world environment. And it's not that hard with the
right tools. Another shift from 1.0 to 2.0, I think, is
shifting from tools where, like, the substrate was low-level
hardware counters.
Like, here's how much RAM we have, here's the CPUs we have.
Okay, cool. Now can't you just ship some code and tell me whether
it's working or not? It's like, I don't think
that's actually reasonable. But, like, when you have tooling for software engineers that speaks
in the language of endpoints and APIs and functions
and variables, then you should be able to understand whether or not your code
is working. Absolutely. I love that
topic too, of just getting them more comfortable and changing that developer mindset.
I know so many people that just code in a dark
room and they never figure out how it actually translates.
We like figuring things out, we like understanding things,
we like solving the puzzle, we like doing a good job. I feel
like you have to punish people so much
to, like, shake that curiosity
out of them, that it's such a pity whenever you see
that, you're like, oh, honey, what happened to you? Right?
Because we naturally want this stuff. It gives us great dopamine
hits. It's like, ah, I found it. Oh, I fixed it. People are using this
like, there's nothing better in the world.
Well, I'm gonna shift gears for us. I have some questions
around buzzwords and, like, new trends.
How is the AI movement helping or not helping people
in their observability journey?
Oh, that's a tough one. What are the first spicy takes on this?
I would say that I think, like, should I go?
Yeah, sorry.
Network lag. You can start, Adriana. I was going to say sorry
again. Am I?
Oh, my God. So sorry. Editing.
I was going to say, I think a lot of people,
like, pin AI as a salvation,
and we got to, like, chill.
I see AI as an assistive technology, your buddy,
where it's like, hey, you might want to look at this,
but ultimately it's you, the human, who's going to say,
all right, AI, interesting. But you're full
of crap, or, all right, AI, interesting.
This warrants, you know, further investigation. That's the
way I see it. But I think people trying to, like,
I don't know, just make AI,
like, front and center the star of
all the things. Like, gotta chill.
So my spiciest take is that there are a lot of observability
vendors in the space who are slathering AI on their
products to try and patch over the fact that they don't have any connective
data on the back end. They're like, well, shit. What we really want
is just to be able to compute outliers and stuff, but we didn't store that
data. So what if we just stab around in the dark and see
if we can guess? Like, when it
comes to. So there are determines. To be honest, that's some people's whole
personality charity. I mean,
yeah, you're right. There are
deterministic problems and there are non deterministic problems, and I think that
there's a lot of promise for generative AI when it comes to non deterministic problems.
Like. Like the sort of messy work of, you know,
every engineer knows their corner of the system intimately,
but you have to debug the entire thing. So maybe we can use generative
AI to sort of, like, learn how experts are
interacting with their corner of the system and use it to bring up everyone
to the level of the best debugger in every corner of the system. That seems
like a reasonable use. Or, like, the Honeycomb use case, which is natural
language: ask questions about your system in natural language. Awesome. But when
it comes to, like, deterministic things, like,
here's the thing I care about, how is it different from everything? You don't want
AI for that, you want computers for that. You want it to compute
and tell you what's different about the thing
you care about, which you'd know very quickly and easily if you
have simply captured the connective tissue. So I
also feel like there's a lot of... like, anything that's been shipped under the
AIOps banner, I don't think I've ever heard anyone say anything
genuinely good about any of it. So, like,
double exponential smoothing is a great forecasting algorithm,
and I can call it AI if I want to.
That's true. In that case, we use AI to
bubble up insights.
You got to get your marketing on that, Charity. Got to get our marketing on
that. I think that there's some good stuff there.
But the other thing about generative AI is, at this point, we're really
just using it to generate a lot of lines of code. Like, this guy I
work with has this great quote about, it's like having a really energetic junior engineer
who types really fast, which is like, okay,
but writing code has never been the hard part of software,
and it will never be the hard part of software. And in some ways,
we're just making the hard parts of software much harder, even harder,
by generating so much shitty code.
Yeah. I wanted to touch on something Adriana said,
which I think you said, as a salvation. And so what
I think people are missing is that, like, what are the AI people
desperate for? And what I think a lot of people are actually
desperate to solve is, I mentioned earlier, complexity,
right? Is, over the last few years, the complexity of everything that we
build and operate has gotten out of control.
Ten years of free money and whatever else happened, right? We just built these stacks.
We've got so much going on, and people
are struggling, people who maybe weren't there for the
whole journey, people just younger than me and Charity that
are just looking at how much stuff there is and going, well, where do I
even start, right? And I think leaders are
doing this, too, and that's why the sales pitches are so damn effective, is because
they're sitting there going, like, how do I grapple with this huge monstrosity
I built and that I'm responsible for and I'm accountable for,
which is what they really care about, and find answers.
And so this is the age old human behavior, right? We look for answers
anywhere we think we might find them, whether it's a guru or it's
a machine that's babbling out the next syllable that it thinks you want to hear.
Still the same human desire and drive to find answers
for, like, what the hell is going on? Makes a lot of
sense. It's easy to just slap AI on anything to bring
in more numbers or just bring in more clicks. So definitely it's
a proceed-with-caution, but I really like the example
of using natural language to answer debugging questions,
where it's like you just don't know where to look or what operation you need
to call. But there is a little helper. I keep thinking of AI kind of
like Clippy, so it's like always thinking of it as a little helper
that you're dragging along, but, like, it's not going to do everything for you.
I do think that, like, we need to be really mindful of the fact that
AI systems are more complex
than anything else that we've built. And when it comes to understanding
complex systems, there are two kinds of tools. There are tools that try
to help you grapple with that complexity, and there are tools that
obscure that complexity, which is why you need
to be really careful which end you're pointing this insanely
complex tool at. Because if you point a tool you can't understand in
a direction that makes everything else harder to understand, you've kind
of just, you should just give up. You know,
you really want to make sure that you're pointing tools that are designed to help
you deal with that complexity in
a way that leads you to answers, not impressions.
There are some workloads where impressions are fine,
right? Or where, like, close enough is fine and
it's good enough and it's, you know, we'll just hit reload and try again.
And then there are some that aren't. And it really behooves you to know the
difference. In terms of other topics that everyone's
buzzing about: platform engineering. How do you see
the relationship between observability and platform engineering evolving?
This is such a good question. I mean, I feel like...
I feel like platform engineering... So there are two
things that really predict a really good Honeycomb customer right now. One of them is
companies that can attach dollar values to outcomes like delivery companies,
banks, whatever, every request, you know, it matters. You know you need to understand
it. And the second one that I've been thinking about recently is
right now, we are not super successful with teams that are really
firefighting, you know, ones that can't put their
eyes up and plan over the long term. We certainly
hope to be able to help those teams, but right now, the ones that tend
to be really successful with Honeycomb are the ones who, a lot
of them are on the vanguard of the platform engineering thing
because they have enough of an ability to get their head up to
see what's coming, which is even more complexity, and to
try and grapple with the complexity they've got, you know,
and, like, the real innovation that platform
engineering has brought is applying the principles of product-
and design-based development to infrastructure.
And the idea that, okay, we're putting software engineers
on call for their code. Now someone needs to support them in making that
attractable problem. And so your customer
is internally other engineers, it's not customers out there.
And I think that those two things, not every team that calls themselves a platform
engineering team has adopted those two things. But the teams
that have, whether or not they call themselves platform engineering teams,
are really turning into force multipliers and
like accelerators for the companies that they're at. I think
I have a different theory of why that's working so well. It's my
favorite thing to tell people about what's missing from the Google SRE book, because people
come to me all the time. They're like, I've been trying to do SRE for years.
Like, why isn't it working? And the primary difference, the thing that wasn't
in the book, was there's a dude high up in
the Google leadership team who ran SRE, and he
sat with the VPs, he sat at the VP table and then the SVP table.
That's why it worked. Had nothing to do with any of that other crap,
right? Like, the good ideas, right? But, like,
there's good tech, there were good ideas. SLOs are a good idea, right? But, like,
the reason SLOs worked is because there was executive support
to go do the damn things, right? And so what I think we're seeing
with platform engineering is like, platform engineering often emerges when some leader
takes that position of power, right? And often that comes with the
responsibility to act like a product, you know, and that's how you get the buy
in. That's how you build up that social power that lets you go and do
the things that you want to do. Like, hey, let's go out to
all the teams and have them implement open telemetry. That would be great. That would
put us up above the waterline so we could actually start thinking
again. But actually getting that moving, like Charity was talking
about with those teams that are stuck in firefighting mode, they do not have the
power to escape it. As soon as
something in that organization clicks, things
in the leadership team click, and they go, oh, we actually have to care about
availability. Let's do platform engineering. We'll hire a director of platform
engineering. That person shows up and starts advocating and finally there's
advocacy for infrastructure up in the leadership table. And I think that's the
real reason why it works. That is such a fascinating theory
and I might actually buy it or a lot of it.
I think that maybe another way of looking at it is just that
historically it's so easy for senior leaders,
especially those who don't have engineering backgrounds, to feel like
the only valuable part of engineering is features and like
building product that we can sell. And there's this whole
iceberg of shit that if you don't do
it is going to bite your ass. And the
longer you go without doing it, the more it's going to bite you,
you know? And like if you don't have a leadership chain that understands
that and will give you air cover to do
it, then you're driving your company into a ditch.
So, like, I experienced
this firsthand at a couple of places. Like, I worked at a place where
they were doing a DevOps transformation, and
at the end of the day, you know, it looked like there
was executive buy-in, and it ended
up not being the case, because when we actually tried to do anything meaningful,
and especially like we got great at creating, you know, CI pipelines,
but when it came time to deliver the code, the folks in ops were like,
no, I want to do things manually. I want to
make overtime pay by restarting services manually
because, you know, I'm monitoring emails
all night long. And the executive in charge... Somebody's job depends
on it being bad. Exactly. And so the executive in charge of
this transformation was more worried about, like, protecting
her employees' feelings than actually executing the
change. Exactly. And it was,
you know, they're still kind of stuck as
a result. And, but this isn't an isolated story. I mean you see that so
much across our industry and so,
yeah, if you don't have that buy in from executives, no worky.
But also, like, if you don't have that convergence from, you know,
the ICs, right. There's so many
ways this can fuck up. There are so many ways.
Like it's like the Dostoevsky quote about how every
unhappy family is unhappy in its own special way, right?
Yeah, every transformation story can fuck up in its own
special way. It's really hard to do. There are countless
potholes and foot guns. The other one DevOps
transformations in particular run up against is that often
we sell the program and the transformation on, hey, you will get higher velocity out
the end. And so people are like, oh, I like higher velocity. That means more
features, right? So they don't actually understand the total cost.
Like Charity was saying, right, there's all this work under the water
that has to be done before you get that velocity. And in
the meantime, nobody's willing to give up the predictability of
the waterfall model. Yep.
Well, that brings us very close to time. And for the last question that I
have, I was going to see what advice you all have for senior leaders that
are starting on their observability journey, especially with that buy
in. Like, how do you be the leader that is a champion for observability
and really empower your team to do it? I want to end
on an optimistic note, which is that this
can be done. And in fact, one of the reasons it's been challenging at Honeycomb
is that it can be done by people from so many different...
Like, we've seen this championed by VPs and CTOs and directors
and managers and principal engineers and staff engineers, and senior engineers
and intermediate engineers, people who, you know,
on the IC side, who are willing to just, like, roll up their sleeves,
go prove people wrong, build something, try it, get buy in
by, you know, elbow grease. Because the good, the good
part is it fucking works.
And it's not that complicated and it pays off fast.
And there are people out there who are really excited about it who will help
you, you know. And so, like, while this can sound
really scary and demoralizing and everything, it's not like one of those things that takes
two or three years before you see sunlight. It's like
next week in some cases, you know,
it's like, you know... and the better your observability gets,
the easier everything gets, the easier it is to start
shortening your pipeline, to start making your tests run faster,
to start, you know, but it is that sort
of dark matter that glues the universe together.
And the better your observability is, it's a prerequisite and a dependency
for everything else you want to do. So if you're someone who's like, wants to
make a huge impact, wants to leave your mark on a company, wants to build
a career, I can't honestly think of a better,
like, more dependable place to start to just like, start knocking off
wins. I guess my, my advice would be is,
you know, well, I already said my thing about platform
engineering, right? So you need to empower the people running your infrastructure
somehow to get their voice to the table so that you understand,
like, why these things matter and how your customers are suffering,
because that's ultimately what happens when we don't pay attention to these things, is our
customers have a bad time. And then I guess in the business
world, like, what we really care about is attrition, right? But, like, if you want
to bring down attrition, or just maybe have
a delightful customer experience, then all this stuff that we just talked about matters
a lot. And the place to start, and this worked for
me, like, that's what I've done at Equinix, is put OpenTelemetry
in everything, and you can pick the rest of
the tools later. That's the beauty of it, right? You can lay it into all
the code, leave it inert, even, and you can turn it on when
you're ready for it, right. So you don't have to do it all at once.
It's not like a two-year project to roll out OpenTelemetry.
It's: pick a service, instrument it, point it
at something, see what it looks like, do another service, right? And you can chew
through this. And it's not that hard to break down from an engineering standpoint.
So the hardest part is that first service. And so just go
do it. Totally agree. I would also add,
like, be the squeaky wheel,
nag them to death, nag your leadership,
nag your fellow engineers and just show them the value
of observability,
honestly. Because if they don't get it, do you want to
work there anyway, right? Yeah, yeah,
exactly. Yeah. I completely agree.
The other thing I would add is, like, let's instrument
our pipelines, our CI/CD pipelines as well.
They should get some TLC. I wrote a ton.
Oh, you do. Oh, yeah, you do. You do.
You definitely.
Yeah. Because, I mean, if your CI/CD pipeline
is broken, then your code delivery velocity stopeth.
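As a hedged sketch of what instrumenting a pipeline can look like, assuming a tracer configured as in the earlier OpenTelemetry example, each stage can be wrapped in a span so slow or flaky steps show up in traces; the stage names and commands below are placeholders, not anyone's actual pipeline.

```python
# Sketch: wrapping CI/CD pipeline stages in spans so slow or flaky steps
# show up in your traces. Assumes a tracer provider configured as in the
# earlier OpenTelemetry example; stage names and commands are placeholders.
import subprocess
from opentelemetry import trace

tracer = trace.get_tracer("ci-pipeline")

def run_stage(name: str, command: list) -> None:
    with tracer.start_as_current_span(name) as span:
        span.set_attribute("ci.command", " ".join(command))
        result = subprocess.run(command, capture_output=True, text=True)
        span.set_attribute("ci.exit_code", result.returncode)
        if result.returncode != 0:
            span.set_attribute("error", True)
            raise RuntimeError(f"stage {name!r} failed")

with tracer.start_as_current_span("pipeline"):
    run_stage("lint", ["echo", "linting"])
    run_stage("test", ["echo", "testing"])
    run_stage("build", ["echo", "building"])
```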
You know, the last time we did this panel, you asked the same question
at the very end, and, which I love. It's a great question to end
on. And I think that time I said,
raise your expectations, raise your standards. Things can be
so much better. And I also want
to point out that that still holds true. Software can be so
magical, rewarding, fun.
Like, we get paid to solve puzzles all day,
and if you're not enjoying yourself,
something is wrong. And if you're working
for a company that makes you miserable, maybe leave it.
If you're working on a stack that is really frustrating to work
with, maybe instrument it because it's really
a smell. It shouldn't be awful.
I love that nice little piece of wisdom:
make it better.
Well, with that, we would love to thank all of our attendees for
coming out to our panel. Hopefully, you got to learn a little bit more about
observability and how you can take this into your journey,
your organization, and embark on observability 2.0 with us.
With that, thank you very much to all of our panelists for coming here today.
Thank you, Charity, Amy, and Adriana. With that, we'll see you next time.
Find them online.