Conf42 Incident Management 2023 - Online

Leveraging SRE and Observability Techniques for the Wild World of Building on LLMs


Abstract

Building on LLMs is magical—but maintaining LLM-backed code is tricky: how do you ensure perf/correctness for something probabilistic?

It’s like building on any black box (eg APIs/DBs). We’ll cover instrumentation techniques, uses for observability best practices, and even using SLOs early in dev.

Summary

  • The techniques I'll describe today should be transferable to your tool of choice. They draw from our experience of building on LLMs, but should apply to whatever LLM and observability stack you're using today. There's suddenly a lot more demand for AI functionality than there are people who carry expertise for it.
  • RAG (retrieval-augmented generation) is a practice of pulling in additional context from your domain to help your LLMs return better results. Even product development and release practices are turned a little bit inside out. Observability is a way of comparing what you expect in your head against the actual behavior, but in live data.
  • We released our Query Assistant in May 2023. It took about six weeks of development, super fast, and we spent another eight weeks iterating on it. Six months later, it's still much more common for us to meet someone playing around with LLMs than someone whose product has actual LLM functionality deployed in production.
  • The concept most commonly associated with ensuring consistent performance of production systems is the service level objective (SLO). SLOs are often used to set a baseline and measure degradation over time of a key product workflow. Being able to track historical compliance is what allows the team to iterate fast and confidently.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Thanks for joining me today. I'm Christine, and I'm going to start with a disclaimer. Honeycomb is an observability tool, but the techniques that I'll describe today should be transferable to your tool of choice. A lot of this will draw from our experience of building on LLMs, but should apply to whatever LLM and observability stack you're using today. All right. Building software in 2023 feels more like magic than it ever has before. There are LLMs everywhere, available with a cheap API call to your provider of choice. It feels like every CEO, or even CTO, is now turning to their teams and asking how LLMs can be incorporated into their core product. Many are reaching to define an AI strategy, and there's lots to be excited about here. It's cool to be squarely in the middle of a phase change in progress, where everything is new to everyone all at once. But there's also a reality check to trying to suddenly incorporate all this new technology into our products. There's suddenly a lot more demand for AI functionality than there are people who carry expertise for it, and software engineering teams everywhere are often just diving in to figure it out because we're the only ones left. Which, to be clear, is just fine by me. As someone who used to identify as a generalist software engineer, the fewer silos we can build in this industry, the better. Because, on one hand, using a large language model through an API is like any other black box you interact with via API. There are lots of consistent expectations we can set about how we make sense of these APIs: how we send parameters to the API, what types and scopes those inputs will be, what we'll get back from those APIs, and it's usually done over a standard protocol. And so all of these properties make working with APIs, these black boxes of logic, into something that is testable and mockable, a pretty reliable component in our system. But there's one key difference between having your application behavior rely on an LLM versus, say, a payments provider. That difference is how predictable the behavior of that black box is, which in turn influences how testable or how mockable it is. And that difference ends up breaking apart all the different techniques that we've built up over the years for making sense of these complex systems. With normal APIs, you can write unit tests. With an API you can conceivably scope or predict the full range of inputs in a useful way. On the LLM side, you're not working with just the full range of negative and positive numbers; you've got a long tail, because literally what we're soliciting is free-form natural language input from users. We're not going to be able to have a reasonable test suite that we can run reliably. For reproducibility, APIs, again, especially if it's software as a service, are something very consistent. You have a payments service: typically when you say debit $5 from my bank account, the balance goes down by $5. It's predictable. Ideally it's idempotent, where if you're doing the same transaction, bank account aside, there are no additional strange side effects. On the LLM side, the way that many of these public APIs are set up, usage by the public is teaching the model itself additional behavior. And so you have these API-level regressions happening that you can't control, and as software engineers using that LLM, you need to adapt your prompts. So again, not mockable, not reproducible. 
And again, with a normal API, you can kind of reason about what it's supposed to be doing and whether the problem is on the API's side or your application logic side, because there's a spec, because it's explainable and you're able to fit it in your head. On the LLM side, it's really hard to make sense of some of these changes programmatically, because LLMs are meant to almost simulate human behaviors. That's kind of the point. And so a thing that we see is that very small changes to the prompt can yield very dramatic changes to the results, in ways that, again, make it hard for humans to explain and debug and sort of build a mental model of how it's supposed to behave. Now, these three techniques on the left are ways that we have traditionally tried to ensure correctness of our software. And if you ask an ML team, the right way to ensure correctness of something like an LLM feature is to build an evaluation system to evaluate the effectiveness of the model or the prompt. But most of us trying to make sense of LLMs aren't ML engineers, and the promise of LLMs exposed via APIs is that we shouldn't have to be in order to fold these new capabilities into our software. There's even one more layer of unpredictability that LLMs introduce. There's a concept here, and I don't know how familiar everyone is with this piece, but there's an acronym that is used in this world: RAG, or retrieval-augmented generation. Effectively, it's a practice of pulling in additional context from your domain to help your LLMs return better results. If you think about a ChatGPT prompt, it's where you say, oh, do this but in this style, or do this but in a certain voice. All that extra context helps make sure the LLM returns the result that you're looking for. But because of the way that these RAG pipelines end up being built, it means that your app is pulling in even more dynamic content and context that can again result in big changes in how the LLM responds, and so you have even more unpredictability in trying to figure out why my user is not having the experience that I want them to have.
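To make the RAG idea a bit more concrete, here is a bare-bones sketch of what that kind of context-augmented prompt assembly can look like. This is purely illustrative and not Honeycomb's actual pipeline; `retrieve_relevant_docs` is a hypothetical stand-in for whatever search index or vector store your retrieval layer really uses.

```python
# Illustrative RAG-style prompt assembly (hypothetical; not Honeycomb's pipeline).

def retrieve_relevant_docs(question: str, limit: int = 3) -> list[str]:
    # Stand-in for a real retrieval step (embedding + similarity search over
    # your own domain data). Here it just returns canned snippets.
    return ["domain snippet one", "domain snippet two", "domain snippet three"][:limit]

def build_prompt(question: str) -> str:
    # The retrieved context is folded into the prompt before the LLM call.
    context = "\n".join(f"- {doc}" for doc in retrieve_relevant_docs(question))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

The important part for this talk is that the retrieved context is itself dynamic: what gets pulled in depends on the user, the data, and the moment, which is exactly where that extra unpredictability comes from.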
So this turning upside down of our worldview is happening on a literal software engineering and systems engineering level. We know these black boxes aren't testable or debuggable in a traditional sense, so there's no solid sense of correct behavior that we can fall back to. It's also true at a meta level, where there's no environment within which we can conduct our tests and feel confident in the results. There's no creating a staging environment where we can be sure that the LLM experience or feature that we're building behaves correctly or does what the user wants. Going even one step further, even product development and release practices are turned a little bit inside out. Instead of being able to start with early access, put your product through its paces, and then feel confident in a later, broader release, early access programs are inherently going to fail to capture that full range of user behavior and edge cases. All these programs do is delay the inevitable failures that you'll encounter when you have an uncontrolled and unprompted group of users doing things that you never expected them to do. So at this point, do we just give up on everything we've learned about building and operating software systems and embrace the rise of prompt engineer as an entirely separate skill set?

Well, if you've been paying attention to the title of this talk, the answer is obviously not, because we already have a model for how to measure and debug and move the needle on an unpredictable, qualitative experience: observability. And I'll say this term has become so commonplace today that it's fallen out of fashion to define it. But as someone who's been talking about all of this since before it was cool, humor me. I think it'll help some pieces click into place. This here is the formal Wikipedia definition of observability. It comes from control theory. It's about looking at a system based on its inputs and outputs and using that to model what the system, the black box, is doing. It feels a little overly formal when talking about production systems. It still applies to software systems, but it feels especially applicable to a system like an LLM, this thing that's changing over time and can't be monitored or simulated with traditional techniques. Another way I like to think about this, less formally, is that observability is a way of comparing what you expect in your head against the actual behavior, but in live systems. And so let's take a look at what this means for a standard web app. Well, you're looking at this box that is your application. Because it's our application, we actually get to instrument it, and we can capture what arguments were sent to it on any given HTTP request, we can capture some metadata about how the app was running, and we can capture data about what was returned. This lets us reason about the behavior we can expect for a given user and endpoint and set of parameters. And it lets us debug and reproduce the issue if the actual behavior we see deviates from that expectation. Again, lots of parallels to tests, but on live data. What about this payment service over here on the right? It's that black box that the app depends on. It's out of my control, might be another company entirely, and even if I wanted to, I couldn't go and shove instrumentation inside of it. You can think of this like a database too, right? You're not going to go and fork MySQL and shove your own instrumentation in there. But I know what requests my app has sent to it. I know where those requests are coming from in the code and on behalf of which user. And then I know how long it took to respond from the app's perspective, whether it was successful, and probably some other metadata. By capturing all of that, I can again start to reason, or at least have a paper trail, to understand how these inputs impact the outputs of my black box, and then how the choices my application makes and the inputs into that application impact all of that. And that approach is the same for LLMs, as unpredictable and nondeterministic as they are. We know how a user interacts with the app, we know how the app turns that into parameters for the black box, and we know how it responds. It's a blanket statement that in complex systems, software usage patterns will become unpredictable and change over time. With LLMs, that assertion becomes a guarantee. If you use LLMs, as many of us are, your data set is going to be unpredictable and will absolutely change over time. So the key to operating sanely on top of that magical foundation is having a way of gathering, aggregating, and exploring that data in a way that captures what the user experienced as expressively as possible. 
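As a minimal sketch of that kind of paper trail, here's what wrapping a call to an uninstrumentable black box might look like, assuming OpenTelemetry's Python API; the attribute names and the stubbed client are invented for illustration rather than taken from Honeycomb's actual instrumentation.

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llm-feature")

def stub_llm_client(prompt: str) -> str:
    # Placeholder for whatever provider SDK you're actually calling.
    return "stubbed response"

def call_black_box(user_input: str, prompt: str) -> str:
    # Wrap the call to a dependency we can't instrument from the inside,
    # recording what we sent and what we got back from the app's perspective.
    with tracer.start_as_current_span("llm.request") as span:
        span.set_attribute("app.user_input", user_input)      # what the user asked for
        span.set_attribute("app.prompt_length", len(prompt))  # what we sent
        try:
            response = stub_llm_client(prompt)
            span.set_attribute("llm.response_length", len(response))  # what came back
            return response
        except Exception as exc:
            # The span's duration and status capture latency and success/failure
            # even though the box itself is opaque to us.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```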
That's what lets you build and reason and ensure a quality experience on top of LLMs: the ability to understand from the outside why your user got a certain response from your LLM-backed application. Observability creates these feedback loops to let you learn from what's really happening with your code, the same way we've all learned how to work iteratively with tests. Observability enables us all to ship sooner, observe those results in the wild, and then wrap those observations back into the development process. With LLMs rapidly becoming some piece of every software system, we all get to learn some new skills. SREs who are used to thinking of APIs as black boxes that can be modeled and asserted on now have to get used to drift and to peeling back a layer to examine that emergent behavior. Software engineers who are used to boolean logic and discrete math and correctness and test-driven development now have to think about data quality, probabilistic systems, and representativity, or how well your test environment, your staging environment, or your mental model represents the production system. And everyone in engineering needs to reorient themselves around what this LLM thing is trying to achieve, what the business goals are, what the product use cases are, what the ideal user experience is, instead of sterile concepts like correctness, reliability, or availability. Those last three are still important. But ultimately, when you bring in this thing that is so free-form that the human on the other end is going to have their own opinion of whether your LLM feature was useful or not, we all need to expand our mental models of what it means to provide a great service to include that definition as well. So, okay, why am I up here talking about this and why should you believe me? I'm going to tell you a little bit about a feature that we released and our experience building it, trying to ensure that it would be a great experience, and maintaining it going forward. Earlier this year, in May 2023, we released our Query Assistant. It took about six weeks of development, super fast, and we spent another eight weeks iterating on it. To give you a little bit of an overview of what it was trying to do: Honeycomb, as an observability tool, lets our users work with a lot of data. Our product has a visual query interface. We believe that point and click is always going to be easier for someone to learn and play around with than an open text box. But even so, there's a learning curve to the user interface, and we were really excited about being able to use LLMs as a translation layer from what the human is trying to do, over here on the right of this slide, into the UI. And so we added this little experimental piece to the query builder, collapsed most of the time, but people could expand it, and we let people type in, in English, what they were hoping to see. Another thing that was important to us is that we preserve the editability and explorability that's sort of inherent in our product. The same way that we as consumers have gotten used to being able to edit or iterate on a response with ChatGPT, we wanted users to be able to get the output, Honeycomb building the query for them, but be able to tweak and iterate on it. Because we wanted to encourage that iteration, we realized that there would be no concrete and quantitative result we could rely on that would cleanly describe whether the feature itself was good. 
If users ran more queries, maybe it was good, or maybe we were just consistently being not useful. Maybe fewer queries was good, or maybe they just weren't using the product or didn't understand what was going on. So we knew we would need to capture this qualitative feedback, the yes / no / I'm-not-sure buttons, so that we could understand from the user's perspective whether this thing that we tried to serve them was actually helpful or not. And then we could posit some higher-level product goals, like product retention for new users, to layer on top of that. As a spoiler, we hit these goals. We were thrilled, but we did a lot of stumbling around in the dark along the way. And today, six months later, it's still much more common for us to meet someone playing around with LLMs than someone whose product has actual LLM functionality deployed in production. We think that a lot of this is rooted in the fact that our teams have really embraced observability techniques in how we ship software, period. Those were key to building the confidence to ship this thing fast, iterate live, and really just understand that we were going to have to react based on how the broader user base used the product. These were some learnings that we had fairly early on. There's a great blog post that this is excerpted from; you should check it out if you're in the phase of building on LLMs. But it's all about how things are going to fall apart. It's not a question of how to prevent failures from happening, it's a question of whether you can detect them quickly enough. Because you just can't predict what a user is going to type into that free-form text box. You will ship something that breaks something else, and it's okay. And again, you can't predict, you can't rely on your test frameworks, you can't rely on your CI pipelines. So how do you react quickly enough? How do you capture the information that you need in order to come in and debug and improve going forward? So let's go one level deeper. How do we go forward? Well, I've talked a lot about capturing instrumentation, leaving this paper trail for how and why your code behaves a certain way. I think of instrumentation, frankly, like documentation and tests: they are all ways to try to get your code to explain itself back to you. And instrumentation is like capturing debug statements and breakpoints in your production code, as much in the language of your application and the unique business logic of your product and domain as possible. In a normal software system, this can let you do things as simple as quickly figuring out which individual user or account is associated with that unexpected behavior. It can let you do things as complex as deploying a few different implementations of a given NP-complete problem behind a feature flag, comparing the results of each approach, and picking the implementation that behaves best on live data. When you have rich data that lets you tease apart all the different parameters that you're varying in your experiment, you're able to validate your hypothesis much more quickly and flexibly along the way. And so in the LLM world, this is how we applied those principles. You want to capture as much as you can about what your users are doing in your system, in a format that lets you view overarching performance and then also debug any individual transaction. Over here on the right is actually a screenshot of a real trace that we have for how we are building up a request to our LLM provider. 
This goes from the user's click through the dynamic prompt building to the actual LLM request, response parsing, response validation, and the query execution in our product. Having this full trace, with lots of metadata on each of those individual spans, lets us ask high-level questions about the end-user experience. Here you can see the results of those yes / no / I'm-not-sure buttons in a way that lets us quantitatively ask questions and track progress, but always be able to get back to: okay, for this one interaction where someone said no, it didn't answer their question, what was their input? What did we try to do? How could we build up that prompt better to make sure that their intent gets passed to the LLM and reflected in our product as effectively as possible? It lets us ask high-level questions about things like trends in the latency of actual LLM request and response calls, and then lets us take those metrics and group them by really fine-grained characteristics of each request. And this lets us draw conclusions about how certain parameters, for a given team, for a given column or dataset, whatever, might impact the actual LLM operation. Again, if this were an e-commerce site, you can think of things like shopping cart ID or number of items in the cart as the parameters here. But by capturing all of this related to the LLM, I am now armed to deal with: whoa, something weird started happening with our LLM responses. What changed? Why? What's different about that one account that is having a dramatically different experience than everyone else, and then what was intended? We were also able to really closely capture and track errors, but in a flexible, not-everything-marked-an-error-is-necessarily-an-error kind of way. It's early; we don't know which errors to take seriously and which ones not to. A principle I go by is: not every exception is exceptional, and not everything exceptional is captured as an exception. So we wanted to capture things that were fairly open-ended, that always let us correlate back to, okay, well, what was the user actually trying to do? What did they see? And we captured this all in one trace, so we had the full context for what went into a given response to a user. This blue span I've highlighted at the bottom, it's tiny text, but if you squint, you can see that this, finally, is our call to OpenAI. All the spans above it are work that we are doing inside the application to build the best prompt that we can. Which also means there are that many possible things that could go wrong and result in a poor response from OpenAI or whatever LLM you're using. And so as we were building this feature, and as we knew we wanted to iterate, we'd need all this context if we had any hope of figuring out why things were going wrong and how to iterate towards a better future.
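To make that shape of trace concrete, here's a rough sketch of one trace per user interaction, with child spans for prompt building, the provider call, and response validation. It again assumes OpenTelemetry's Python API, and the span names, attributes, and tiny stub helpers are hypothetical, not the actual Query Assistant code.

```python
from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

# Tiny stand-ins so the sketch runs; swap in your real prompt builder,
# provider SDK call, and response validator.
def build_prompt(user_input: str) -> str:
    return f"Translate this into a query: {user_input}"

def call_provider(prompt: str) -> str:
    return '{"calculation": "COUNT"}'

def parse_response(raw: str) -> tuple[dict, bool]:
    return {"calculation": "COUNT"}, True

def handle_natural_language_query(user_input: str) -> dict:
    # One trace per user interaction: everything that went into the response
    # hangs off this root span, so a single "no, that wasn't helpful" report
    # can be traced back to the exact prompt and raw LLM output.
    with tracer.start_as_current_span("query_assistant.request") as root:
        root.set_attribute("app.user_input", user_input)

        with tracer.start_as_current_span("build_prompt"):
            prompt = build_prompt(user_input)

        with tracer.start_as_current_span("llm.call") as llm_span:
            raw = call_provider(prompt)
            llm_span.set_attribute("llm.response_length", len(raw))

        with tracer.start_as_current_span("parse_and_validate") as validate_span:
            query, ok = parse_response(raw)
            # Not every failure is an exception: record "soft" errors as
            # attributes so they can be sliced later without paging anyone.
            validate_span.set_attribute("app.response_valid", ok)

        return query
```

One way to fold in those yes / no / not-sure buttons is to send the feedback as a later event that carries the same request or trace ID, so the qualitative signal can be joined back onto everything that produced the response.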
Now, a lot of these behaviors have been on the rise for a while and may already be practiced by your team. I think that's an awesome thing. As a baby software engineer, I took a lot of pride in just shipping really fast, and I wrote lots of tests along the way, of course, because it was an accepted and celebrated part of shipping good code. But in the last decade or so, we've seen a bit of a shift in the conversation. Instead of just writing lots of code being the sign of a good developer, there are phrases like service ownership, putting developers on call, testing in production.

And as these phrases have entered our collective consciousness, the domain of a developer has shifted, I think, from purely thinking about development to also thinking about production. And I'm really excited about this, because a lot of this is a shift that is already underway: taking what we do in the TDD world and recognizing it can apply to production as well, through o11y, or observability. We're just taking these behaviors that we know as developers and applying them under a different name. In development or in the test environment, we're identifying the levers that impact logical branches in the code, for debuggability and reproducibility, and making sure to exercise those in a test; in observability, you're instrumenting code with intention so that you can do the same in production. When you're writing a test, you're thinking about what you expect and you're asserting on what you'll actually get; with observability, looking at your systems in production, you're inspecting results after the changes have been rolled out and you're watching for deviations. When you're writing tests, especially if you're practicing real TDD, and I know not everyone does, you're embracing these fast feedback loops. You are expecting to act on the output of these feedback loops to make your code better. And that's what observability is all about. It's shipping to production quickly, through your CI/CD pipeline or through feature flags, and then expecting to iterate even on code that you think is shipped. And it's exciting that these are guardrails that we've generalized for building and maintaining and supporting complex software systems that actually are pretty transferable to LLMs, and maybe to greater effect, given everything that we've talked about here with the unpredictability of LLMs. Test-driven development was all about the practice of helping software engineers build the habit of checking our mental models while we wrote code. Observability is all about the practice of helping software engineers and SREs or DevOps teams have a backstop to and sanity check for our mental models when we ship code. And this ability to sanity check is just so necessary for LLMs, where our mental models are never going to be accurate enough to rely on entirely. This is a truth I couldn't help but put in here. It has always been true that software behaves in unpredictable and emergent ways, especially as you put it out there in front of users that aren't you. But it's never been more true than with LLMs that the most important part is seeing and tracking and leveraging how your users are using it as it's running in production, in order to make it better incrementally. Now, before we wrap, I want to highlight one very specific example of a concept popularized through the rise of SRE, most commonly associated with ensuring consistent performance of production systems: service level objectives, or SLOs. Given the audience at this conference, I will assume that most of you are familiar with what they are, but in the hopes that this talk is shareable with a wider audience, I'm going to do a little bit of background. SLOs, I think, are frankly really good for forcing product and service owners to align on a definition of what it means to provide great service to users. And it's intentionally thought about from the client or user perspective rather than, oh, CPU or latency or the things that we are used to when we think from the systems perspective. 
SLOs are often used as a way to set a baseline and measure degradation over time of a key product workflow. You hear them associated a lot with uptime or performance or SRE metrics, and with being alerted and going and acting if SLOs burn through an error budget. But remember this slide: when the LLM landscape is moving this quickly and best practices are still emerging, that degradation is guaranteed. You will break one thing when you think you're fixing another, and SLOs over the top of your product, measuring that user experience, are especially well suited to helping with this. And so here's what our team did. After those six weeks, from first line of code to having the full feature out the door, the team chose to use SLOs to set a baseline at release and then track how their incremental work would move the needle. They expected this to go up over time because they were actively working on it, and they initially set this SLO to track the proportion of requests that complete without an error, because, again, in the early days we weren't sure what the LLM API would accept from us and what users would put in. And unlike most SLOs, which usually have to include lots of nines to be considered good, the team set their initial baseline at 75%. This was released as an experimental feature, after all, and they aimed to iterate upwards. Today we're closer to 95% compliance. This little inset here on the bottom right is an example of what you can do with SLOs once you start measuring them, once you are able to cleanly separate out the requests that did not complete successfully from the ones that did. You can go in and take all of this rich metadata you've captured along the way, find outliers, and then prioritize what work has the highest impact on your users having a great experience. This sort of telemetry and analysis happens over time; this is a seven-day view, there are 30-day views, and your tool will have its own time windows. But being able to track this historical compliance is what allows the team to iterate fast and confidently. Remember, the core of this is that LLMs are unpredictable and hard to model through traditional testing approaches. And so the team here chose to measure from the outside in, to start with the measurements that mattered, users being able to use the feature, period, and have a good experience, and then debug as necessary and improve iteratively.
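As a back-of-the-envelope illustration of that kind of SLO, here's the arithmetic behind a "completes without error" objective with a 75% target; the request counts are made up for the example.

```python
# Back-of-the-envelope SLO math for "requests that complete without an error".
# The 75% target mirrors the intentionally forgiving starting point described
# above; the counts are invented for illustration.

TARGET = 0.75          # initial objective
WINDOW_TOTAL = 12_000  # LLM-backed requests in the trailing window
WINDOW_GOOD = 10_500   # requests that produced a usable, error-free response

compliance = WINDOW_GOOD / WINDOW_TOTAL            # 0.875 -> 87.5% compliant
allowed_bad = WINDOW_TOTAL * (1 - TARGET)          # error budget: 3,000 bad requests
actual_bad = WINDOW_TOTAL - WINDOW_GOOD            # 1,500 bad requests so far
budget_remaining = 1 - actual_bad / allowed_bad    # 0.5 -> half the budget left

print(f"compliance={compliance:.1%}, error budget remaining={budget_remaining:.0%}")
```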
I'll leave you with two other stories, so you believe that it's not just us. As we were building our feature, we actually learned that two of our customers were using Honeycomb for a very similar thing. Duolingo, the language-learning app, cares a lot about latency. With their LLM features, and being heavily mobile, they really wanted to make sure that whatever they introduced felt fast. And so they captured all this metadata, I've only shown two examples here, and they wanted to really closely measure what would impact the LLM being slow and the overall user experience being slow. And what they found, actually, was that the total latency was influenced way more by the things that they controlled in that long trace: building up the prompt and capturing additional context. That was where the bulk of the time was being spent, not the LLM call itself. So again, their unpredictability showed up in a different way. But in using these new technologies, you won't know where the potholes will be.

And by capturing this rich data, by capturing telemetry from the user's perspective, they were able to be confident that, okay, this is where we need to focus to make the whole feature fast. The second story I have for you is Intercom. Intercom is a sort of messaging application for businesses to message with their users, and they were rapidly iterating on a few different approaches to their LLM-backed chatbot, I believe. They really wanted to keep tabs on the user experience, even though there was all this change to the plumbing going on underneath. And so they tracked tons of pieces of metadata for each user interaction. They captured what was happening in the application, they captured all these different timings: time to first token, time to first usable token, how long it took to get to the end user, how long the overall latency was, everything. Then they tracked everything that they were changing along the way: the version of the algorithm, which model they were using, the type of metadata they were getting back. And critically, this was traced with everything else happening inside their application. They needed the full picture of the user experience to be confident in understanding that when they pull one lever over here, they see the result over there, and they recognized that using an LLM is just one piece of understanding the user experience through telemetry of your application, not something to be siloed off over there with an ML team or something else. So in the end, LLMs break many of the existing tools and techniques that we used to rely on for ensuring correctness and a good user experience. Observability can help. Think about the problem from the outside in. Capture all the metadata so that you have that paper trail to debug and figure out what was going on with this weird LLM box, and embrace the unpredictability. Get out to production quickly, get in front of users, and plan to iterate fast. Plan to be reactive, and embrace that as a good thing instead of a stressful one. Thanks for your attention so far. If you want to learn more about this, we've got a bunch of blog posts that go into much greater detail than I was able to in the time we had together. But thanks for your time. Enjoy the rest of the conference. Bye.
...

Christine Yen

CEO @ Honeycomb



