Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey, y'all, thank you, Conf42 Observability, for having
us today. We're going to be having a panel around observability,
the future of it, what we can do with current tooling, and
what we want from the next generation of tooling. So to get started, I have an
amazing group of panelists here. Let's get started with Adriana
introducing herself. Who are you and what do you do?
Hey, I work with Ana.
My name is Adriana Villela. We work together at ServiceNow Cloud Observability.
Let's say that five times fast.
Yeah, I'm a senior staff developer advocate
and I talk about all things observability and I spend most of
my time in the OTel End User SIG, where I am
a maintainer there. How about you, Amy?
I'm Amy Tobey. I'm a senior principal engineer at Equinix.
I've been there for three and a half years or something like that. And I
kind of started off doing OpenTelemetry there.
And these days I do reliability work across the company
and write code and all kinds of weird stuff in between the cracks.
My name is Charity Majors. I am co-founder and CTO of
Honeycomb.io, a company that does observability tooling.
What an amazing group of folks we have here, from all walks of the observability
journey, from end user to doing the reliability work to leading companies in
this space. So the first question that we have for
getting folks a little bit more adjusted to who we are, how did you all
get started on observability? Well,
I was looking for a word that was
not monitoring. And I looked at Wikipedia and I was like,
observability, how well can you understand the inner workings
of a system just by observing the outsides?
That's a pretty cool idea. But, like, honestly, I come
from operations, and my entire career I have hated monitoring. Like,
I do not think graphically, I don't like. Like,
I've always been the person who would, like, bookmark other people's dashboards and
graphs. Because I only love debugging.
I do not come by this naturally. So the
idea is to make more people like me able to be more like Amy.
Well, I guess, like, it kind of starts with a long time ago in
a land far, far away kind of thing, right? Like Charity, I've been doing this
for 25 years and change, but I guess
observability started probably around the same time Charity started saying it.
But it's stuff I'd been doing for years before. And I think that's
an important point. This is all stuff we've been doing throughout,
but we're getting better at it. So the term
is relatively new in the consciousness, in tech, but I think
what we're actually doing is a better version of what we
used to do. So that's kind of my take on
it, right? I don't remember exactly when I started using it. It was probably the
first time I heard it from Charity or somebody else like that.
For me, I got into observability because of Charity. I stumbled across one
of your tweets randomly one night and started
following you on Twitter. And then I got like really into it when I was
hired to be a manager on an observability team at
Tucows. That Tucows,
if anyone remembers those days. I do remember.
It was that Tucows, but they had changed their business.
So obviously it wasn't like downloading Windows software.
They were doing domain wholesale
when I joined. So anyway, yeah, so I was hired
to manage this observability team and I was like, oh shit,
I better know what this thing is if I'm going to manage a team.
And so I started like educating myself
and asking lots of questions on the interwebs and blogging about it to
educate myself and discovered that I really like the stuff.
So that's how I got into it. And you were so good at it.
Like, I just remember these articles starting to pop out from out of
nowhere. Just like, you kept writing about all these things, but it was like so
succinct and so well informed and so grounded in actual experience.
And I was like, this person knows what they're talking about. This is cool.
We should be friends. Thank you.
Well, thanks for letting us know how y'all got started.
And now, in your journey, what is one thing that
you wish you could get across, if attendees could nail down only one
thing today? The way you're doing it now is the
hard way. Oh my God. I don't know how many times I've heard someone
be like, wow, our team hasn't really mastered the 1.0 stuff,
so I don't think we're ready for 2.0, which is something we would never say
about any other version. I'm not ready for MySQL 12, we haven't found
all the bugs in MySQL 8 yet. No. Nobody says we're
going back to Windows 3, it's all about Windows 3.
Yeah. Like, the level of difficulty
of dealing with metrics and cost management and trying
to... because, you know, metrics, time series databases store
no connective tissue. They store no context, no relational data.
And so when you're trying to understand your systems, the only thing
that connects your dashboards and your logs and your traces and your RUM and
your APM, the only thing, is you. You sitting here guessing
and, like, remembering past outages and
having instinctive, like, intuitive leaps.
It's really freaking hard. And it can be so much easier.
I'm with Charity on that, but I think I'll give it my own flavor,
which is, most of my career, at least the last 10 to 15 years,
A common thing people ask me is, hey, teach me to do that thing
you do in incidents, right? And they'll be like, you know that thing where
you show up and you pull up a couple of dashboards and you go,
it's this thing, right? Because. Because I have this
deep mental model of the system, right? And it's so freaking hard to
teach. And basically, like, what we end up saying a lot of times is like,
well, you kind of have to experience it to build that intuition.
That's a terrible answer for teaching anybody anything, right? Like, people need
something to start on. They need some way to start going down that journey instead
of being dumped in a sea of, as Charity put it, like,
contextless metrics. Now,
as we move into this easier mode and we go into the easier way,
obviously now I can dump my engineers into, hey, look, this is what your
system looks like, and you can click around and explore it and start to
see what it's actually doing instead of trying to reason backwards from a
bunch of meaningless logs, which actually, weirdly, gets worse
with structured logs. For me, I would say I definitely agree
with both of you. I think the one thing that I've noticed
is, like, people still insist on
continuing to build shitty
products and not take a moment to instrument their
code to understand why their products continue behaving
shittily. And so
I think that's like, the most frustrating thing that I've observed is
like, you know, the analogy I like to use is like, your house is burning
down. Are you going to continue building out the bedroom? Because it makes
no freaking sense. Okay. But my product manager said I need a
bedroom feature, and that's the most important thing I
could be working on right now. But do you have a front door
that works? Yeah. Right? Product said.
That's okay, right? I have to focus on the bedroom.
We need to get a new bedroom out fast.
The fire is the features that we already built. Those. Those are in the
past. We don't, we don't, we never look back.
That's quite fair. I heard some chatter about
observability 2.0. I had a question around where observability is headed.
Would any of you like to chime in?
Yeah. Yeah. Like, I think what's going to
happen over the next few years is people are going to go through all their
applications and they're going to log everything, and then
what's going to happen is their CFO is going to call and
say, what happened? And then
people have to come back to the drawing board and start over.
And then they're going to start to realize that we need to actually
put some context into this, bundle up more of these
attributes and edges and
cardinality, right? Like, the attributes
on there, and put them in a succinct way so
that we get value from that data. Right. Because our CFOs
aren't, like, afraid of spending money. That's their whole job. They love it, actually.
What they hate is when you spend money on dumb stuff.
And so if you want to make your CFO happy, like don't go put logs
on everything, go and use your judgment to put the right
signals in the right places. And usually that's
easier in this day and age with, like, OpenTelemetry
auto-instrumentation. So if you have an auto-instrumenter, you do a few
steps and now you've got high signal.
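To make that concrete, here is a minimal sketch of what this can look like with the OpenTelemetry Python SDK; the service name, span name, and attributes below are hypothetical illustrations, not something the panelists prescribed.

```python
# A minimal sketch of adding a few deliberately chosen, high-signal
# attributes with the OpenTelemetry Python SDK (pip install opentelemetry-sdk).
# The service name, attribute keys, and values are hypothetical.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export to the console here; in practice you would point an OTLP exporter
# at your collector or backend of choice.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def charge_card(user_id: str, cart_total_usd: float) -> None:
    with tracer.start_as_current_span("charge_card") as span:
        # The judgment call described above: a handful of attributes that
        # actually answer questions, instead of logging everything.
        span.set_attribute("user.id", user_id)
        span.set_attribute("cart.total_usd", cart_total_usd)
        span.set_attribute("payment.provider", "hypothetical-psp")
        # ... the real work would happen here ...

charge_card("user-123", 42.50)
```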
So I think that's where things are going, right? Is now that we've freed these
SDKs from being, like, vendor-specific, it's no
longer, like, vendor lock-in for me to actually integrate them with my application.
I can integrate OpenTelemetry under, usually, the Apache license.
So I don't have to worry about licensing, I don't have to worry about vendor
lock in. And so things are moving faster, the tools are
getting better, the tools you all make at ServiceNow
and at Honeycomb and even the open source stuff is all improving rather rapidly
over the last few years. So I think it's just going to keep growing more
and more towards tracing and hopefully we'll see a little less emphasis
on metrics and logs over the next, like, five-ish years.
Yeah, I mean, so big question.
Yeah. So, like, the difference between observability 1.0 and 2.0, for those
who are kind of new to this space, is: 1.0 tools have
many sources of truth, right? The traditional thing is there are three pillars,
metrics, logs and traces, which you probably know that none of us think very highly
of. But in reality, like, you've got way more than three, right? You've got your
APM, you've got your RUM, you've got your dashboards, you've got your logs, both
structured and unstructured, traces, profiling,
whatever, many sources of truth, and nothing
knits them together except you.
In your 2.0 world, you have a single source of
truth. It's these arbitrarily wide structured events. You can
call them logs, canonical logs, if you want. I'm not opposed to calling them logs,
but you have the information so that you can visualize them over time as a
trace. You can slice and dice them like logs,
you can derive metrics from them, you can derive your dashboards, you can derive
your SLOs, but because there's one source of truth,
you can actually ask any arbitrary question of them, which means
you can understand anything that's happening. And so, in the near term,
what this means is the most impactful thing you can probably
do is wherever you've got metrics,
stop. Stop sending money and engineering
cycles into the money hole that is metrics. Reallocate that to logs.
And where you've got logs, make sure they're structured. And where you've got structured logs,
make them as wide as you can. You want to have fewer loglines,
wider ones, because the wider they are, the more context
you have, the more connectivity you have, the more your ability to
just instantly compute and see outliers and weird
things and like, you know, correlate and all this stuff, the more powerful your
ability is to understand your systems in general. So while this sounds pretty complicated,
I think it's actually really simple. Fewer, wider log lines.
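To make "fewer, wider log lines" concrete, here is a rough sketch of the idea: accumulate context across one request and emit a single wide, structured event at the end. Every field name below is invented for illustration; it is not a prescribed schema.

```python
# A rough sketch of "one wide, structured event per request": accumulate
# context as the request runs and emit a single canonical log line at the end.
import json
import time
import uuid

def handle_request(user_id: str, route: str) -> None:
    start = time.time()
    # One dict accumulates everything we learn while serving the request.
    event = {
        "timestamp": start,
        "trace.id": str(uuid.uuid4()),
        "service.name": "checkout-service",   # hypothetical
        "http.route": route,
        "user.id": user_id,
    }
    try:
        # ... real work happens here; keep enriching the event as you learn things ...
        event["cart.item_count"] = 3
        event["payment.provider"] = "hypothetical-psp"
        event["status_code"] = 200
    except Exception as exc:
        event["status_code"] = 500
        event["error.message"] = str(exc)
        raise
    finally:
        event["duration_ms"] = (time.time() - start) * 1000
        # One wide line instead of dozens of narrow ones: traces, metrics,
        # dashboards, and SLOs can all be derived from events shaped like this.
        print(json.dumps(event))

handle_request("user-123", "/checkout")
```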
If I may add to that, because I love that,
to me, I interpret it as:
your main thing is the event, and the
event can be made
into whatever you need it to be. Anything can be derived from the event.
You can correlate on any of the attributes, right? Yeah,
exactly. You have your choice at query time what you're going to correlate on.
Yeah, exactly. But your base, like, everything
stems from this event. And then, you know, it can be
whatever. But the other thing that I wanted to mention as well
is, like, treating observability as not this, like,
afterthought, you know. I think a lot of people have schlepped
observability to the end of the SDLC, shifting it
as far right as possible. And we need to really start
thinking of observability as a more shift left, like an integrated part of
our SDLC. Right. Because guess who's going to instrument
the code? It's going to be the developers, but the QAs are going to take
advantage of that. Who do you have to convince? Who do you have
to convince? Oh, that's the trick. I mean you don't have
to convince the developers. The developers love this shit.
But, yeah, yeah, yeah. It's usually
their leadership team that doesn't understand the value. Right. It is obvious and
that's usually. Yeah. Anyway, I interrupted you. Oh no, that's fine.
No, that's a very fair point. And this is somewhere where
I feel like... So there's a super
basic level, you know, it's: you've got one source of truth or many. But there
are so many sociotechnical factors and consequences that flow from this fundamental
choice, the way that you arrange your data on disk. One of
them is I think that observability 1.0 is really about
how you operate your code. Right. It's about how you run it. It's kind of
the afterthought as you're getting ready to deploy it. Observability 2.0 is
about how you develop your code, which includes operating, of course,
but it's the substrate that allows you
to hook up these really tight feedback loops and
pick up speed. And there's so much of what I think of as, like, dark
matter in software engineering. It's just like, why is everything so slow?
Why? You can't do OODA without the first O.
It's just ODA: just orient, decide,
act. Like... I've worked at companies like that,
they haven't really done much. But being
able to see what you're doing, being able to move swiftly with confidence,
it's like a whole different profession when your development cycle includes
production. Like, when you look at the DORA metrics and how year over year
the good teams are, like, achieving liftoff and most of us
are just getting slightly worse year over year.
I attribute a lot of that to the quality of
your observability. I think that's how
I see it too, right? Because a lot of why it's happening is the increasing
complexity of our stacks too. Yeah. And so we got away with it before.
15 years ago, our stacks were limited to a few
servers and a few services. And now
you can easily have a single product that has 50 to 70 services
distributed across the world. And you can't grapple with the
complexity like that with disparate tools anymore. Good luck reading
lines of code and understanding what it's doing anymore. Like, you can't.
If you ever cut, you really can't. Now you can only understand
it by including production into your core
development loop. Sorry,
I was going to say also, like, this is why I get, like, so mad,
too, when you have, like, people at companies who are like, let's continue
doing dashboards the way that we used to before because it
worked for me 15 years ago when it was wildly successful.
But what they're saying, though, right, what these folks are reaching
for, is that dashboards are
a way that we've encoded what Charity is talking about, the mental model.
So the reason why people would often hire people like me is they'd say,
hey, come in, figure out our system and build us some
dashboards. And I have leaders right now that are asking for dashboards.
And I'm trying to explain, I'm saying, well, do you actually want
a dashboard, or are you asking me if the team understands their service?
They're not the same, but they actually do get conflated pretty frequently.
I think that's that real difference that's emerging.
That's an amazing point, bringing me to my next question: what are some
ways that we can validate the quality of our data? We can say, hey,
we're instrumenting, we're getting all of our different resources.
We have some attributes that we can look at, but how do you
know that the work is working?
Because you're not moving very slowly. I think it's nuts,
too. Another interesting thing that is sort of a characteristic of the 1.0
to 2.0 shift is that if all we do is stuff
the same data into a faster storage engine,
it will be better, but it will be kind of a lost
opportunity. Because, I feel like, you can't run a
business with metrics. Right? On the business side of the house, they have had nice
things for a very long time. They've had these columnar stores. You can slice and
dice, you can break down, group by, you can zoom in, you can zoom
out, you can ask all these fancy questions about your marketing or
your click-throughs or, like, your sales conversions and all that stuff. It's only,
it's like the cobbler's children have no shoes. It's only on the systems
and application side that we're like, we're just going to get by on starvation rations,
thank you very much. But, like, the future, I think,
as, you know, as it becomes increasingly uncool for engineers to just be
like, I don't care about the business. All I do is care about the tech,
right? Like, we know that doesn't work anymore, and yet we're still siloing
off business data from application data, from systems data,
when all of the interesting questions that you need to ask are a combination.
They're a combination of systems, user data,
how users are interacting with your systems, application data, and business.
Right? And, like, when you have all of these different
teams using different tools that are siloed off from each other.
That's why shit gets so slow. How are we going to know when we're doing
well? Because we can move fast. And that sounds kind of
hand-wavy, but it's actually the only thing that
matters. You can't run fast if you're dragging a
boat anchor behind you, right? And if
you don't have good tools to understand where you are and where you're going next,
might as well just tie that boat anchor up. And I'd say the
quality of the data, too. Like, you know, you can instrument your code
full of crap. I think, you know, a mistake that a lot of people make
when they start instrumenting their code is like, auto instrumentation
is salvation. Right? That's awesome. It's a great starting point, but, like,
you'll learn very fast that it produces a lot of crap that you don't need.
And now you have to wade through the crap. And I think when you get
to the point in your instrumentation where, like,
say, you write your code, you hand
it over to QA. If your QAs can
figure out why a problem is occurring when they're running
their tests and go back to the developer and say, hey, I know why, then you've
instrumented your code sufficiently. If they can't figure out what's going on,
then it's an indicator you haven't instrumented your code sufficiently,
so we can actually help to improve the process
earlier on before it gets to our SREs,
before they're like, what's going on? I don't know what's going
on. Yeah. And, well, post QA, where you want to be is
you want that observability loop. And actually, that's how
you tell if it's working is because every time you make a change,
if you go out and look at your data, you've now, one, gotten value from
that data right away. And then, two,
you know that your stuff is working not just the code you
wrote, but also your observability system. And so you're just kind of continually
evaluating that. And that's, I think, part of the job now.
Right. Unless you're still in like, a waterfall shop,
which is still like, honestly, the vast majority of tech.
Yeah. In a way that, like, 15 years
ago, we were all getting used to going from a model where my code
compiles, it must be good over to, like,
my test pass. That's how I know it's good, you know,
now I think we're moving into the world where you don't know if it's good
whether your tests pass or not. All you know is that your
code probably, logically, still has the magic smoke inside
the box. Right. That's what I tell people, even back at the beginning.
Like, the first CI/CD system I deployed, it was using smoke, or smoke
test. It was a Python thing, way early.
Right. But the reason it was called that is from an old EE term,
the smoke test, which is the first time you power up
a circuit, you put power on it. You don't care about the inputs
or outputs. You just see if the magic smoke comes out. Right. If the
chip goes and you get a little puff of smoke, you failed the smoke test.
Right. And most of our unit tests are basically smoke tests. Right. We're applying some
power and see if the magic smoke comes out. It gets a little bit higher
value. Once you get into functional testing or, or full integration testing,
maybe in a staging environment, nothing will ever behave
the same as production because there are so many freaking parameters
that we can never reproduce. Because production isn't production from
moment to moment either. People should
never ship a diff without validating it's
doing what I expect it to do and nothing else looks weird.
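As a small aside, here is what the "apply power, see if the magic smoke comes out" idea can look like as a deploy-time check; the base URL and health endpoint below are hypothetical, and this is a sketch of the idea rather than anyone's actual pipeline.

```python
# A minimal deploy-time smoke test in the "apply power, watch for smoke"
# spirit: it only checks that the service answers at all, not that its
# outputs are correct. BASE_URL and the /healthz endpoint are hypothetical.
import os
import urllib.request

BASE_URL = os.environ.get("BASE_URL", "http://localhost:8080")

def smoke_test() -> None:
    with urllib.request.urlopen(f"{BASE_URL}/healthz", timeout=5) as resp:
        assert resp.status == 200, f"unexpected status: {resp.status}"

if __name__ == "__main__":
    smoke_test()
    print("no magic smoke escaped")
```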
Do you all have some tips for folks to get comfortable on
that data that they're observing? Just to get more comfortable looking
around at it? Or, apart from doing the iteration of, I'm
going to go instrument something as code shows up. But, like,
how do you actually understand what that is and, like, how that plays into,
like, the bigger mental model of a system? Look at it.
Yeah, I was gonna say I have a product feature I use, which is every
once in a while, I go into our Honeycomb and I go into
the searches everybody else has run, and I go snoop in
on their searches. And so I can see, like, what other people are looking,
looking at. And then often they'll be like, oh, that one's interesting. Like, what problem
are they working on? I'll DM them later if it's really interesting. Be like,
hey, what was that? But that's, that's one of my favorites right
there. It's like because I can see what other teams are doing, too. Yeah.
I mean, so much of this is about knowing
what good looks like. And, you know, this is why I
always pair: is it doing what I expect it to do? And does anything else
look weird? And there's something that's irreducible about that,
you know, because you don't know what you don't know. But if
you make a practice of looking at this, you'll start to know if there's
something that you don't know there. And this is where,
like, I feel like another of the shifts from 1.0 to 2.0 is a shift
from known unknowns to unknown unknowns, which parallels
the shift from doing all of your aggregating at write time,
where you're like, okay, future me is only ever going to want to ask these
questions, so I'm defining them now, collecting that data. Versus:
I don't know
what's going to happen, right? So I'm just going to collect the telemetry I can,
I'm going to store the raw results and I'm going to have the ability to
slice and dice and breakdown and group by and explore and zoom in and zoom
out and all this other stuff because you don't know what you don't know.
And so much of debugging boils down to this very simple,
simple loop, which is: here's the thing I care about.
Why? Why do I care about it? Which means, how is it different from everything
that I don't care about?
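A toy sketch of that loop over wide events is below: take the events you care about, say the errors, and compare how an attribute is distributed there versus everywhere else. The sample events and field names are made up; in practice this is the kind of computation a columnar event store does for you at query time.

```python
# A toy version of "here's the thing I care about; how is it different from
# everything I don't care about?" over wide events.
from collections import Counter

# Hypothetical wide events; in real life these come from your event store.
events = [
    {"status_code": 500, "region": "us-east", "build_id": "abc123"},
    {"status_code": 500, "region": "us-east", "build_id": "abc123"},
    {"status_code": 200, "region": "us-east", "build_id": "abc122"},
    {"status_code": 200, "region": "eu-west", "build_id": "abc122"},
]

def compare(events, care_about, key):
    """Distribution of `key` among events we care about versus the rest."""
    care = Counter(e.get(key) for e in events if care_about(e))
    rest = Counter(e.get(key) for e in events if not care_about(e))
    return care, rest

care, rest = compare(events, lambda e: e["status_code"] >= 500, "build_id")
print("errors:", care)           # e.g. almost entirely one build
print("everything else:", rest)
```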
One other thing that I would add to this, and it happened to me as, like,
an aha moment: I think I added something to the OpenTelemetry
demo, the community demo.
And I'm like, how do I know what's going on? I'm like, wait,
I can instrument it.
And I think, like, if you can instrument your code
and then use whatever observability backend to
understand what's going on,
then I think that's already, like, you know... developers
should get into the habit of
debugging their own code by using the instrumentation that
they add to their code. Yep. Which is
right. The ultimate answer here is like,
it's just like playing piano or playing any music instrument or
really any kind of detailed skill, right? You have to do the
thing to get good at the thing. That's, it's neuroplasticity,
right? Like, your brain builds that ability to reason and make the intuitive
leaps by doing the practice.
And the other neat thing about this though is when I say it that way,
to a lot of people it sounds like, oh my God, it's going to take
as long as learning the piano. But the good news is it doesn't because
you can learn the piano. Most people can if you play for ten minutes a
day. If you just do it every day. That's where I
think for developers who want to get really good at their jobs, and I don't
mean just observability, but their job in general, if they're getting
in, looking at their metrics, looking at their traces every day that they're
working, or at least once a week would be a huge improvement for a lot
of folks. They'll get that practice, the practice
of reasoning. It's not about the fingers, it's about
your thought process and burning that in. And to
me, this is what defines a senior engineer, someone who can
follow and understand their code in
the real-world environment. And it's not that hard with the
right tools. Another shift from 1.0 to 2.0, I think, is
shifting from tools where, like, the substrate was low-level
hardware counters.
Like, here's how much RAM we have, here's the CPUs we have.
Okay, cool. Now can't you just ship some code and tell me whether
it's working or not? It's like, I don't think
that's actually reasonable. But, like, when you have tooling for software engineers that speaks
in the language of endpoints and APIs and functions
and variables, then you should be able to understand whether or not your code
is working. Absolutely. I love that
topic too, of just getting them more comfortable and changing that developer mindset.
I know so many people that just code in a dark
room and they never figure out how it actually translates.
We like figuring things out, we like understanding things,
we like solving the puzzle, we like doing a good job. I feel
like you have to punish people so much
to, like, shake that curiosity
out of them, that it's such a pity whenever you see
that, you're like, oh, honey, what happened to you? Right?
Because we naturally want this stuff. It gives us great dopamine
hits. It's like, ah, I found it. Oh, I fixed it. People are using this
like, there's nothing better in the world.
Well, I'm gonna shift gears for us. I have some questions
around buzzwords and, like, new trends.
How is the AI movement helping or not helping people
in their observability journey?
Oh, that's a tough one. What are the first spicy takes on this?
I would say that I think, like, should I go?
Yeah, sorry.
Network lag. You can start, Adriana. I was going to say sorry
again. Am I?
Oh, my God. So sorry. Editing.
I was going to say, I think a lot of people,
like, pin AI as a salvation,
and we got to, like, chill.
I see AI as an assistive technology, your buddy,
where it's like, hey, you might want to look at this,
but ultimately it's you, the human, who's going to say,
all right, AI, interesting. But you're full
of crap, or, all right, AI, interesting.
This warrants, you know, further investigation. That's the
way I see it. But I think people trying to, like,
I don't know, just make AI,
like, front and center the star of
all the things. Like, gotta chill.
So my spiciest take is that there are a lot of observability
vendors in the space who are slathering AI on their
products to try and patch over the fact that they don't have any connective
data on the back end. They're like, well, shit. What we really want
is just to be able to compute outliers and stuff, but we didn't store that
data. So what if we just stab around in the dark and see
if we can guess? Like, when it
comes to. So there are determines. To be honest, that's some people's whole
personality charity. I mean,
yeah, you're right. There are
deterministic problems and there are non deterministic problems, and I think that
there's a lot of promise for generative AI when it comes to non deterministic problems.
Like. Like the sort of messy work of, you know,
every engineer knows their corner of the system intimately,
but you have to debug the entire thing. So maybe we can use generative
AI to sort of, like, learn how experts are
interacting with their corner of the system and use it to bring up everyone
to the level of the best debugger in every corner of the system. That seems
like a reasonable use. Or, like, the Honeycomb use case, which is natural
language: ask questions about your system in natural language. Awesome. But when
it comes to, like, deterministic things, like,
here's the thing I care about, how is it different from everything? You don't want
AI for that, you want computers for that. You want it to compute
and tell you what's different about the thing
you care about, which you'd know very quickly and easily if you
have simply captured the connective tissue. So I
also feel like there's a lot of... like, anything that's been shipped under the
AIOps banner, I don't think I've ever heard anyone say anything
genuinely good about any of it. So, like,
double exponential smoothing is a great forecasting algorithm,
and I can call it AI if I want to.
That's true. In that case, we use AI to
bubble up insights.
You got to get your marketing on that, Charity. Got to get our marketing on
that. I think that there's some good stuff there.
But the other thing about generative AI is, at this point, we're really
just using it to generate a lot of lines of code. Like, this guy I
work with has this great quote about, it's like having a really energetic junior engineer
who types really fast, which is like, okay,
but writing code has never been the hard part of software,
and it will never be the hard part of software. And in some ways,
we're just making the hard parts of software much harder, even harder,
by generating so much shitty code.
Yeah. I wanted to touch on something Adriana said,
which I think you said, as a salvation. And so what
I think people are missing is that, like, what are the AI people
desperate for? And what I think a lot of people are actually
desperate to solve is, I mentioned earlier, complexity,
right? Is, over the last few years, the complexity of everything that we
build and operate has gotten out of control.
Ten years of free money and whatever else happened, right? We just built these stacks.
We've got so much going on, and people
are struggling, people who maybe weren't there for the
whole journey, people just younger than me and Charity that
are just looking at how much stuff there is and going, well, where do I
even start, right? And I think leaders are
doing this, too, and that's why the sales pitches are so damn effective, is because
they're sitting there going, like, how do I grapple with this huge monstrosity
I built and that I'm responsible for and I'm accountable for,
which is what they really care about, and find answers.
And so this is the age old human behavior, right? We look for answers
anywhere we think we might find them, whether it's a guru or it's
a machine that's babbling out the next syllable that it thinks you want to hear.
Still the same human desire and drive to find answers
for, like, what the hell is going on? Makes a lot of
sense. It's easy to just slap AI on anything to bring
in more numbers or just bring in more clicks. So definitely it's
a proceed-with-caution, but I really like the example
of using natural language to answer debugging questions,
where it's like you just don't know where to look or what operation you need
to call. But there is a little helper. I keep thinking of AI kind of
like Clippy, so it's like always thinking of it as a little helper
that you're dragging along, but, like, it's not going to do everything for you.
I do think that, like, we need to be really mindful of the fact that
AI systems are more complex
than anything else that we've built. And when it comes to understanding
complex systems, there are two kinds of tools. There are tools that try
to help you grapple with that complexity, and there are tools that
obscure that complexity, which is why you need
to be really careful which end you're pointing this insanely
complex tool at. Because if you point a tool you can't understand in
a direction that makes everything else harder to understand, you've kind
of just, you should just give up. You know,
you really want to make sure that you're pointing tools that are designed to help
you deal with that complexity in
a way that leads you to answers, not impressions.
There are some workloads where impressions are fine,
right? Or where, like, close enough is fine and
it's good enough and it's, you know, we'll just hit reload and try again.
And then there are some that aren't. And it really behooves you to know the
difference. In terms of other topics that everyone's
buzzing about: platform engineering. How do you see
the relationship between observability and platform engineering evolving?
This is such a good question. I mean, I feel like...
I feel like platform engineering... So there are two
things that really predict a really good Honeycomb customer right now. One of them is
companies that can attach dollar values to outcomes like delivery companies,
banks, whatever, every request, you know, it matters. You know you need to understand
it. And the second one that I've been thinking about recently is
right now, we are not super successful with teams that are really
firefighting, you know, ones that can't put their
eyes up and plan over the long term. We certainly
hope to be able to help those teams, but right now, the ones that tend
to be really successful with Honeycomb are the ones who, a lot
of them are on the vanguard of the platform engineering thing
because they have enough of an ability to get their head up to
see what's coming, which is even more complexity, and to
try and grapple with the complexity they've got, you know,
and, like, the real innovation that platform
engineering has brought is applying the principles of product-
and design-based development to infrastructure.
And the idea that, okay, we're putting software engineers
on call for their code. Now someone needs to support them in making that
attractable problem. And so your customer
is internally other engineers, it's not customers out there.
And I think that those two things, not every team that calls themselves a platform
engineering team has adopted those two things. But the teams
that have, whether or not they call themselves platform engineering teams,
are really turning into force multipliers and
like accelerators for the companies that they're at. I think
I have a different theory of why that's working so well. It's my
favorite thing to tell people about what's missing from the Google SRE book, because people
come to me all the time. They're like, I've been trying to do SRE for years.
Like, why isn't it working? And the primary difference, the thing that wasn't
in the book, was there's a dude high up in
the Google leadership team who ran SRE, and he
sat with the VPs, he sat at the VP table and then the SVP table.
That's why it worked. Had nothing to do with any of that other crap,
right? Like, the good ideas, right? But, like,
there's good tech, there were good ideas. SLOs are a good idea, right? But, like,
the reason SLOs worked is because there was executive support
to go do the damn things, right? And so what I think we're seeing
with platform engineering is like, platform engineering often emerges when some leader
takes that position of power, right? And often that comes with the
responsibility to act like a product, you know, and that's how you get the buy
in. That's how you build up that social power that lets you go and do
the things that you want to do. Like, hey, let's go out to
all the teams and have them implement open telemetry. That would be great. That would
put us up above the waterline so we could actually start thinking
again. But actually getting that moving, like Charity was talking
about with those teams that are stuck in firefighting mode, they do not have the
power to escape it. As soon as
something in that organization clicks, things
in the leadership team click, and they go, oh, we actually have to care about
availability. Let's do platform engineering. We'll hire a director of platform
engineering. That person shows up and starts advocating and finally there's
advocacy for infrastructure up in the leadership table. And I think that's the
real reason why it works. That is such a fascinating theory
and I might actually buy it or a lot of it.
I think that maybe another way of looking at it is just that
historically it's so easy for senior leaders,
especially those who don't have engineering backgrounds, to feel like
the only valuable part of engineering is features and like
building product that we can sell. And there's this whole
iceberg of shit that if you don't do
it is going to bite your ass. And the
longer you go without doing it, the more it's going to bite you,
you know? And like if you don't have a leadership chain that understands
that and will give you air cover to do
it, then you're driving your company into a ditch.
So, like, I experienced
this firsthand at a couple of places. Like, I worked at a place where
they were doing a DevOps transformation, and
at the end of the day, you know, it looked like there
was executive buy-in, and it ended
up not being the case, because when we actually tried to do anything meaningful,
and especially like we got great at creating, you know, CI pipelines,
but when it came time to deliver the code, the folks in ops were like,
no, I want to do things manually. I want to
make overtime pay by restarting services manually
because, you know, I'm monitoring emails
all night long. And the executive in charge... Somebody's job depends
on it being bad. Exactly. And so the executive in charge of
this transformation was more worried about, like, protecting
her employees' feelings than actually executing the
change. Exactly. And it was,
you know, they're still kind of stuck as
a result. And, but this isn't an isolated story. I mean you see that so
much across our industry and so,
yeah, if you don't have that buy in from executives, no worky.
But also, like, if you don't have that convergence from, you know,
the ICs, right. There's so many
ways this can fuck up. There are so many ways.
Like it's like the Dostoevsky quote about how every
unhappy family is unhappy in its own special way, right?
Yeah, every transformation story can fuck up in its own
special way. It's really hard to do. There are countless
potholes and foot guns. The other one DevOps
transformations in particular run up against is that often
we sell the program and the transformation on, hey, you will get higher velocity out
the end. And so people are like, oh, I like higher velocity. That means more
features, right? So they don't actually understand the total cost.
Like Charity was saying, right, there's all this work under the water
that has to be done before you get that velocity. And in
the meantime, nobody's willing to give up the predictability of
the waterfall model. Yep.
Well, that brings us very close to time. And for the last question that I
have, I was going to see what advice you all have for senior leaders that
are starting on their observability journey, especially with that buy
in. Like, how do you be the leader that is a champion for observability
and really empower your team to do it? I want to end
on an optimistic note, which is that this
can be done. And in fact, one of the reasons it's been challenging at Honeycomb
is that it can be done by people from so many different...
Like, we've seen this championed by VPs and CTOs and directors
and managers and principal engineers and staff engineers, and senior engineers
and intermediate engineers, people who, you know,
on the IC side, who are willing to just, like, roll up their sleeves,
go prove people wrong, build something, try it, get buy in
by, you know, elbow grease. Because the good, the good
part is it fucking works.
And it's not that complicated and it pays off fast.
And there are people out there who are really excited about it who will help
you, you know. And so, like, while this can sound
really scary and demoralizing and everything, it's not like one of those things that takes
two or three years before you see sunlight. It's like
next week in some cases, you know,
it's like, you know... and the better your observability gets,
the easier everything gets, the easier it is to start
shortening your pipeline, to start making your tests run faster,
to start, you know, but it is that sort
of dark matter that glues the universe together.
And the better your observability is, it's a prerequisite and a dependency
for everything else you want to do. So if you're someone who's like, wants to
make a huge impact, wants to leave your mark on a company, wants to build
a career, I can't honestly think of a better,
like, more dependable place to start to just like, start knocking off
wins. I guess my, my advice would be is,
you know, well, I already said my thing about platform
engineering, right? So you need to empower the people running your infrastructure
somehow to get their voice to the table so that you understand,
like, why these things matter and how your customers are suffering,
because that's ultimately what happens when we don't pay attention to these things, is our
customers have a bad time. And then I guess in the business
world, like, what we really care about is attrition, right? But, like, if you want
to bring down attrition, or just maybe have
a delightful customer experience, then all this stuff that we just talked about matters
a lot. And the place to start, and this worked for
me, like, that's what I've done at Equinix, is put OpenTelemetry
in everything, and you can pick the rest of
the tools later. That's the beauty of it, right? You can lay it into all
the code, leave it inert, even, and you can turn it on when
you're ready for it, right. So you don't have to do it all at once.
It's not like a two-year project to roll out OpenTelemetry.
It's: pick a service, instrument it, point it
at something, see what it looks like, do another service, right? And you can chew
through this. And it's not that hard to break down from an engineering standpoint.
So the hardest part is that first service. And so just go
do it. Totally agree. I would also add,
like, be the squeaky wheel,
nag them to death, nag your leadership,
nag your fellow engineers and just show them the value
of observability,
honestly. Because if they don't get it, do you want to
work there anyway, right? Yeah, yeah,
exactly. Yeah. I completely agree.
The other thing I would add is, like, let's instrument
our pipelines, our CI/CD pipelines as well.
They should get some TLC. I wrote a ton.
Oh, you do. Oh, yeah, you do. You do.
You definitely.
Yeah. Because, I mean, if your CI/CD pipeline
is broken, then your code delivery velocity stopeth.
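As a hedged sketch of what instrumenting a pipeline can look like, assuming a tracer configured as in the earlier OpenTelemetry example, each stage can be wrapped in a span so slow or flaky steps show up in traces; the stage names and commands below are placeholders, not anyone's actual pipeline.

```python
# Sketch: wrapping CI/CD pipeline stages in spans so slow or flaky steps
# show up in your traces. Assumes a tracer provider configured as in the
# earlier OpenTelemetry example; stage names and commands are placeholders.
import subprocess
from opentelemetry import trace

tracer = trace.get_tracer("ci-pipeline")

def run_stage(name: str, command: list) -> None:
    with tracer.start_as_current_span(name) as span:
        span.set_attribute("ci.command", " ".join(command))
        result = subprocess.run(command, capture_output=True, text=True)
        span.set_attribute("ci.exit_code", result.returncode)
        if result.returncode != 0:
            span.set_attribute("error", True)
            raise RuntimeError(f"stage {name!r} failed")

with tracer.start_as_current_span("pipeline"):
    run_stage("lint", ["echo", "linting"])
    run_stage("test", ["echo", "testing"])
    run_stage("build", ["echo", "building"])
```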
You know, the last time we did this panel, you asked the same question
at the very end, and, which I love. It's a great question to end
on. And I think that time I said,
raise your expectations, raise your standards. Things can be
so much better. And I also want
to point out that that still holds true. Software can be so
magical, rewarding, fun.
Like, we get paid to solve puzzles all day,
and if you're not enjoying yourself,
something is wrong. And if you're working
for a company that makes you miserable, maybe leave it.
If you're working on a stack that is really frustrating to work
with, maybe instrument it because it's really
a smell. It shouldn't be awful.
I love that nice little piece of wisdom:
make it better.
Well, with that, we would love to thank all of our attendees for
coming out to our panel. Hopefully, you got to learn a little bit more about
observability and how you can take this into your journey,
your organization, and embark on observability 2.0 with us.
With that, thank you very much to all of our panelists for coming here today.
Thank you, Charity, Amy, and Adriana. With that, we'll see you next time.
Find them online.