Transcript
Thanks for joining me today. I'm Christine
and I'm going to start with a disclaimer. Honeycomb is
an observability tool, but the techniques that I'll describe today should
be transferable to your tool of choice. All of this will draw from our experience of building on LLMs,
but should apply to whatever LLM and observability stack
you're using today. All right,
building software in 2023 feels more like magic than it ever has before. There are LLMs everywhere, available with a cheap API call to your provider of choice.
It feels like every CTO, or even CEO, is now turning to their teams and asking how LLMs can be incorporated into their core product. Many are racing to define an AI strategy,
and there's lots to be excited about here. It's cool to be squarely in
the middle of a phase change in progress where everything is new to
everyone altogether. But there's also a reality check to trying
to suddenly incorporate all this new technology into our products.
There's suddenly a lot more demand for AI functionality than
there are people who carry expertise in it, and software engineering teams everywhere are often just diving in to figure it out
because we're the only ones left. Which to be clear,
is just fine by me. As someone who used to identify as a generalist software
engineer, the fewer silos we can build in this industry,
the better. Because on one hand,
using a large language model through an API is like any other
black box you interact with via API. There are lots of consistent expectations we can set about how we make sense of these
APIs, how we send parameters
to the API,
what types and scopes those inputs will
be, what we'll get back from those APIs,
and it's usually done over a standard protocol. And so all of these
properties make working with APIs,
these black boxes of logic,
into something that is testable and mockable,
a pretty reliable component in our system. But there's one key difference
between having your application behavior rely on an LLM versus,
say, a payments provider. That difference is how predictable
the behavior of that black box is, which then in turn influences
how testable or how mockable it is. And that
difference ends up breaking apart all the different techniques
that we've built up over the years for making sense of these complex systems.
With normal APIs, you can write unit tests. With an API you can conceivably scope or
predict the full range of inputs in a useful way. On the LLM side, you're not working with just the full range of negative and positive numbers. You've got a long tail, because what we're literally soliciting is free-form natural language input from users. We're not going to be able to have a reasonable test suite that we can run reliably. For reproducibility: with normal APIs, again, especially if it's software as a service, it's something very consistent. You have
a payments service, typically when you say debit $5
from my bank account, the balance goes down by $5. It's predictable.
Ideally it's idempotent, where if you're doing the same
transaction, bank account aside,
there's no additional strange side effects. On the LLM side, the way that many of these public
APIs are set up, usage by the public is
teaching the model itself additional behavior. And so
you have these API level regressions that
are happening that you can't control. And as
software engineers using that LLM, you need to adapt your prompts.
So again, not mockable, not reproducible.
And again, with a normal API, you can kind of reason
what it's supposed to be doing and whether the problem is on the API's
side or your application logic side, because there's
a spec, because it's explainable and you're
able to fit it in your head. On the LLM side,
it's really hard to make sense of some of these changes programmatically, because LLMs
are meant to almost simulate human behaviors.
It's kind of the point. And so a thing that we can see is
that very small changes to the prompt can yield very dramatic
changes to the results in ways that, again, make it hard
for humans to explain and debug,
and sort of build a mental model of how it's supposed to behave.
Now, these three techniques
on the left are ways that we have traditionally
tried to ensure correctness of our software.
And if you ask an ML team, the right
way to ensure correctness of something like
an LLM feature is to build an evaluation system to
evaluate the effectiveness of the model or the prompt.
But most of us trying to make sense of LLMs aren't ML engineers. And the promise of LLMs exposed via APIs is that we shouldn't
have to be to fold these new capabilities into our software.
There's even one more layer of unpredictability
that LLMs introduce. There's a concept here, and I don't know how familiar everyone is with this piece, but there's an acronym that is used in this world: RAG, or retrieval-augmented generation.
Effectively, it's a practice of pulling in
additional context within your domain to
help your LLMs return better results. If you think about using a ChatGPT prompt, it's where you
say, oh, do this but in this style, or do this
but in
a certain voice. All that extra context
helps make sure the LLM returns the result that you're looking for.
But because of the way that these RAG pipelines end up being built, it really means that your app is pulling in even more dynamic content and context, which can again result in big changes in how the LLM is responding. And so
you have even more unpredictability in trying
to figure out why is my user
not having the experience that I want them to have?
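To make that concrete, here is a minimal sketch of the kind of prompt assembly a RAG pipeline does. The retrieval step, field names, and prompt wording are hypothetical placeholders, not the actual implementation.

def retrieve_relevant_columns(dataset: str, question: str) -> list[str]:
    # Hypothetical retrieval step: a real pipeline might query a vector store
    # or the customer's live schema; this stub just returns a fixed list.
    return ["duration_ms", "http.status_code", "service.name"]

def build_prompt(user_question: str, dataset: str) -> str:
    # Retrieved context is interpolated straight into the prompt, so the
    # effective LLM input changes whenever the customer's data changes.
    columns = retrieve_relevant_columns(dataset, user_question)
    return (
        "You translate natural-language questions into queries.\n"
        f"Available columns: {', '.join(columns)}\n"
        f"Question: {user_question}\n"
    )

print(build_prompt("which endpoints are slowest?", "production"))

The point is just that two identical user questions can produce different prompts once the retrieved context differs, which is exactly that extra layer of unpredictability.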
So this turning upside down of our worldview is happening
on a literal software engineering and systems engineering level.
We know these black boxes aren't testable or debuggable in a traditional sense,
so there's no solid sense of correct behavior
that we can fall back to. It's also true from
a meta level where there's no environment within which we can conduct our
tests and feel confident in the results.
There's no creating a staging environment where we can be sure that the LLM experience
or feature that we're building behaves correctly or
does what the user wants. Going even
one step further, even product development
or release practices are turned a little bit inside out.
Instead of being able to start with early access and then
putting your product through its paces and then feeling confident in a later or broader
release, early access programs are inherently going to fail
to capture that full range of user behavior and edge cases.
All these programs do is delay the inevitable failures that you'll encounter
when you have an uncontrolled and unprompted group of users doing
things that you never expected them to do.
So at this point,
do we just give up on everything we've learned about building and operating software systems
and embrace the rise of the prompt engineer as an entirely separate skill set? Well, if you've been paying attention to the title of this talk,
the answer is obviously not,
because we already have a model for how to measure and debug and move the
needle on an unpredictable qualitative experience.
Observability. And I'll say this term
has become so commonplace today, it's fallen out of fashion to define
it. But as someone who's been talking about all of this since before it was
cool, humor me. I think it'll help some pieces click into place.
This here is the formal Wikipedia definition
of observability. It comes from control theory. It's about
looking at a system based on the inputs and outputs and using
that to model what the system is doing as a black box. It feels a little overly formal when talking about production software systems, but it still applies, and it feels really applicable to a system like an LLM, this thing that's changing over time, because
it can't be monitored or simulated with traditional techniques.
Another way I like to think about this is that less formally,
observability is a way of comparing what you expect in
your head versus the actual behavior,
but in live systems. And so let's
take a look at what this means for a standard web app.
Well, this box you're looking at is your application.
Because it's our application, we actually get to instrument it and we can capture what
arguments were sent to it. On any given HTTP request, we can
capture some metadata about how the app was running and we can
capture data about what was returned. This lets us reason about the behavior
we can expect for a given user and endpoint
and set of parameters. And it lets us debug and reproduce the issue
if the actual behavior we see deviates from that expectation.
Again, lots of parallels to tests, but on live
data. What about this payment service over here on the right?
It's that black box that the app depends on. It's out of my control.
Might be another company entirely. And even
if I wanted to, because of that, I couldn't go and shove instrumentation inside
of it. You can think of this like a database too, right? You're not going
to go and fork MySQL and shove your own instrumentation in there.
But I know what requests my app has sent to it.
I know where those requests are coming from in the code
and on behalf of which user. And then I know how long it took
to respond from the app's perspective, whether it was successful,
and probably some other metadata. By capturing all of
that I can again start to reason, or at least have a paper trail,
to understand how these inputs impact the outputs
of my black box and then how the choices my application makes and the
inputs into that application impacts all of that.
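As a sketch of what that paper trail can look like in code, here is one way to wrap a call to a downstream black box in an OpenTelemetry span. The payments client is a stub and the attribute names are illustrative, not any particular vendor's API.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")

class StubPaymentsClient:
    # Stand-in for a real provider SDK so this sketch runs on its own.
    def debit(self, user_id: str, amount_cents: int) -> str:
        return "ok"

payments_client = StubPaymentsClient()

def debit_account(user_id: str, amount_cents: int) -> str:
    # We can't instrument the provider itself, but we can record what we sent,
    # on whose behalf, and what came back; the span's duration captures how
    # long it took from the app's perspective.
    with tracer.start_as_current_span("payments.debit") as span:
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("payments.amount_cents", amount_cents)
        try:
            status = payments_client.debit(user_id, amount_cents)
            span.set_attribute("payments.status", status)
            return status
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise

None of this touches the provider's code, but it leaves exactly the trail of inputs and outputs described above.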
And that approach becomes the same for LLMs, as unpredictable and
nondeterministic as they are. We know how a user interacts
with the app, we know how the app turns that into
parameters for the black box, and we know
how they respond. It's a blanket statement that in complex systems,
software usage patterns will become unpredictable and change over time.
With LLMs, that assertion becomes a guarantee. If you use LLMs, as many of us are,
your data set is going to be unpredictable and will absolutely
change over time. So the key to operating sanely
on top of that magical foundation is having a way of
gathering, aggregating and exploring that data in a way
that captures what the user experienced as expressively
as possible. That's what lets you build and reason
and ensure a quality experience on top of LLMs: the ability to understand from the outside why your user got a certain response from your LLM-backed application.
Observability creates these feedback loops to let
you learn from what's really happening with your code,
the same way we've all learned how to work iteratively with tests.
Observability enables us all to ship sooner,
observe those results in the wild, and then wrap those observations back
into the development process. With LLMs rapidly
becoming some piece of every software
system, we all get to learn some new skills.
SREs who are used to thinking of APIs as black boxes that can be modeled
and asserted on, now have to get used to drift and peeling
back a layer to examine that emergent behavior.
Software engineers who are used to boolean logic and discrete math
and correctness and test driven development now
have to think about data quality, probabilistic systems and
representativeness, or how well your test environment, your staging environment, or your mental model of the code represents the production system.
And everyone in engineering needs to reorient themselves around
what this LLM thing is
trying to achieve, what the business goals are, what the product use cases are,
what the ideal user experience is, instead of
sterile concepts like correctness, reliability or availability.
Those last three are still important. But ultimately,
when you bring in this thing that is so free form that
the human on the other end is going to have their own opinion of whether
your LLM feature was useful or not, we all need to expand our mental models of what it means to provide a great
service to include that definition as well.
So, okay, why am I up here talking about this and why should
you believe me? I'm going to tell you a little bit about a
feature that we released and our experience building it,
trying to ensure that it would be a great experience, and maintaining it
going forward. So earlier this year we
released our query assistant in May 2023.
It took about six weeks of development, super fast,
and we spent another eight weeks iterating on it.
And to give you a little bit of an overview of what it was trying
to do, Honeycomb as an observability tool
lets our users work with a lot of data. Our product has a visual query
interface. We believe that point and click is always going to be easier for someone
to learn and play around with than an open text box. But even
so, there's a learning curve to the user interface, and we were really excited about being able to use LLMs as a translation layer from
what the human is trying to do over here on the right of this slide
into the UI. And so we added this little experimental
piece to the query builder. It's collapsed most of the time, but people could expand it, and we let people type in, in English, what they were hoping to see. Another thing that was important to us is that we preserve the editability and explorability that's sort of inherent in our product. The same way that we
all as consumers have gotten used to being able to edit or iterate on
our responses with ChatGPT, we wanted users to be able to get the output. Honeycomb would build the query for them, but they'd be able to tweak
and iterate on it. Because we wanted to encourage that iteration,
we realized that there would be no concrete and
quantitative result we could rely on that would cleanly
describe whether the feature itself was good. If users ran
more queries, maybe it was good, maybe we were just consistently
being not useful. Maybe fewer queries were good,
but maybe they just weren't using the product or they didn't understand what was going
on. So we knew we would need to capture this qualitative feedback,
the yes / no / I'm not sure buttons, so that we
could understand from the user's perspective whether
this thing that we tried to show them was actually helpful or not.
And then we could posit some higher level product goals, like product retention for
new users, to layer on top of that. As a spoiler, we hit these goals. We were thrilled, but we did a lot of
stumbling around in the dark along the way.
And today, six months later, it's so much more common for
us to meet someone playing around with LLMs than someone
whose product has actual LLM functionality deployed
in production. And we think that a lot of this is rooted
in the fact that our teams have really embraced observability techniques
in how we ship software, period. And those were key to
building the confidence to ship this thing fast
and iterate live and really just understand that
we were going to have to react
based on how the broader user base
used the product.
These were some learnings that we had fairly early on.
There's a great blog post that this is excerpted from. You should
check it out if you're, again, in the phase of building on LLMs. But it's all about how things are going to fall apart.
It's not a question of how to prevent failures from happening,
it's a question of: can you detect them quickly
enough? Because you just can't predict what a user is
going to type into that freeform text box. You will ship
something that breaks something else, and it's okay.
And again, you can't predict. You can't rely
on your test frameworks, you can't rely on your CI pipelines.
So how do you react quickly enough? How do you capture the information
that you need in order to come in and debug and improve
going forward? So let's go one level deeper. How do we go forward? Well, we've talked a lot about capturing
instrumentation, leaving this paper trail for how and why your
code behaves a certain way. I think of instrumentation,
frankly, like documentation and tests,
they are all ways to try to get your code
to explain itself back to you. And instrumentation is
like capturing debug statements and breakpoints in your production code,
as much in the language of your application and
the unique business logic of your product and domain
as possible. In a normal software system, this can let you
do things as simple as figure out quickly which
individual user or account is associated with that unexpected behavior.
It can let you do things as complex as deploy a
few different implementations of a given NP-complete problem, put them behind a feature flag, compare the results of each approach,
and pick the implementation that behaves best on live data.
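As a rough sketch of that flag-based comparison: annotate each request's span with the variant that served it plus the parameters that matter, and your observability tool can then group latency or result quality by variant on live traffic. The flag name, variants, and toy implementations below are made up for illustration.

import random
import time
from opentelemetry import trace

tracer = trace.get_tracer("solver-service")

def solve_v1(items: list[int]) -> list[int]:
    # Toy stand-ins for two competing implementations of the same problem.
    return sorted(items)

def solve_v2(items: list[int]) -> list[int]:
    return sorted(items, key=abs)

def handle_request(user_id: str, items: list[int]) -> list[int]:
    # Pretend flag decision; a real one would come from your feature flag service.
    variant = "v2" if random.random() < 0.5 else "v1"
    with tracer.start_as_current_span("solve") as span:
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.input_size", len(items))
        span.set_attribute("feature_flag.solver_variant", variant)
        start = time.monotonic()
        result = (solve_v2 if variant == "v2" else solve_v1)(items)
        span.set_attribute("solve.duration_ms", (time.monotonic() - start) * 1000)
        return result

Grouping spans by feature_flag.solver_variant then answers the "which implementation behaves best on live data" question directly.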
When you have rich data that you need to
tease apart all the different parameters that you're varying in your experiment,
you're able to then validate your hypothesis much more quickly and flexibly
along the way. And so in the LLM world, this is how
we applied those principles. You want to capture as much as you can about
what your users are doing in your system in a format that lets you view
overarching performance, and then also debug any
individual transaction. Over here on the right is actually a
screenshot of a real trace that we have for how we are building up a request to our
LLM provider. This goes from user click through
the dynamic prompt building to the actual LLM request
response parsing, response validation and the query execution in
our product. And having all of this
full trace and then lots of metadata on each of those individual spans
lets us ask high level questions about the end user
experience. Here you can see the results of those yes / no / I'm not sure buttons in a way that lets us quantitatively ask questions and track progress, but always be able to get back to: okay, for
this one interaction where someone said no,
it didn't answer their question, what was their input?
What did we try to do? How could we build up that prompt better, to make sure that their intent gets passed to the LLM and reflected in our product as effectively as possible?
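Here is a sketch of how that qualitative signal can land in telemetry: when someone clicks yes, no, or I'm not sure, emit an event carrying the rating plus a pointer back to the original interaction, so you can aggregate it and still drill into any single trace. The attribute names are invented for illustration.

from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def record_feedback(original_trace_id: str, rating: str, user_input: str) -> None:
    # rating is one of "yes", "no", "unsure" from the in-product buttons.
    with tracer.start_as_current_span("query_assistant.feedback") as span:
        span.set_attribute("feedback.rating", rating)
        span.set_attribute("feedback.original_trace_id", original_trace_id)
        span.set_attribute("query_assistant.user_input", user_input)

record_feedback("4bf92f3577b34da6a3ce929d0e0e4736", "no", "slowest endpoints by region")

Counting these events grouped by feedback.rating gives the quantitative view, and any single "no" still leads back to the exact prompt that was built.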
It also let us ask high-level questions about things like
trends in the latency of actual LLM request and
response calls, and then let
us take those metrics and group them on really fine
grained characteristics of each request. And this lets us then
draw conclusions about how certain parameters for
a given team, for a given column or data set,
whatever might impact the actual LLM operation.
Again, you can think of this as an e-commerce site having things like shopping cart ID or number of items in the cart as
parameters here. But by capturing all of this related
to the LLM, I am now armed to deal with: whoa, something weird started happening with our LLM responses.
What changed? Why? What's different
about that one account that is having a dramatically different experience
than everyone else, and then what was intended?
We were also able to really closely capture and track errors,
but in a flexible, "not everything marked as an error is necessarily an error" kind of way. It's early days; we don't know which errors to take seriously and which ones not to. I think a
principle I go by is not every exception is exceptional.
Not everything exceptional is captured as an exception.
And so we wanted to capture things that were fairly open ended, that always let
us correlate back to, okay, well, what was the user actually trying to do?
What did they see? And we captured this all in
one trace. So we had the full context for what
went into a given response to a user. This blue
span I've highlighted at the bottom, it's tiny text,
but if you squint, you can see that this finally is
our call to OpenAI. All the spans above it
are work that we are doing inside the application to build the best prompt that
we can. Which also means there are that many
possible things that could go wrong that could result in a poor response
from OpenAI or whatever LLM you're using.
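Putting that together, here is a minimal sketch of the shape of that trace: prompt building, the LLM call, and response validation as child spans under one request, with failures recorded rather than thrown away. The helper functions and attribute names are hypothetical stubs, not production code.

import json
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("query-assistant")

def build_prompt(question: str, dataset: str) -> str:
    return f"Dataset {dataset}. Question: {question}"   # stub; see the earlier RAG sketch

def call_llm(prompt: str) -> str:
    return '{"calculations": [{"op": "COUNT"}]}'        # stub standing in for the provider SDK

def answer_question(user_input: str, dataset: str):
    with tracer.start_as_current_span("query_assistant.request") as root:
        root.set_attribute("query_assistant.user_input", user_input)
        root.set_attribute("app.dataset", dataset)

        with tracer.start_as_current_span("query_assistant.build_prompt") as span:
            prompt = build_prompt(user_input, dataset)
            span.set_attribute("prompt.length", len(prompt))

        with tracer.start_as_current_span("llm.request") as span:
            raw = call_llm(prompt)
            span.set_attribute("llm.response_length", len(raw))

        with tracer.start_as_current_span("query_assistant.validate") as span:
            try:
                query = json.loads(raw)                 # parse and validate the model's output
            except ValueError as exc:
                # Not every exception is exceptional: record it, mark the span,
                # and let the product fall back gracefully instead of crashing.
                span.record_exception(exc)
                span.set_status(Status(StatusCode.ERROR))
                root.set_attribute("query_assistant.outcome", "unparseable_response")
                return None
        root.set_attribute("query_assistant.outcome", "ok")
        return query

Every step that could degrade the final answer gets its own span, so a bad response can be traced back to the prompt that produced it.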
And so as we were building this feature, and as we
knew we wanted to iterate, we'd need all this context if we had any
hope of figuring out why things were going to go wrong and
how to iterate towards a better future. Now,
a lot of these behaviors have been on the rise for a while,
and may already be practiced by your team.
I think that's an awesome thing. As a baby software engineer,
I took a lot of pride in just shipping really fast, and I wrote
lots of tests along the way, of course, because that was an accepted and celebrated
part of shipping good code. But in the last decade or
so, we've seen a bit of a shift in the conversation.
Instead of just writing lots of code being a sign of a good developer,
there are phrases like service ownership, putting developers on call,
testing in production. And as these phrases have entered our
collective consciousness, it has shifted
the domain, I think, of a developer from
purely thinking about development to also thinking about production.
And I'm really excited about this because of the shift that is already kind of underway: taking what we do in the TDD world and recognizing it can apply to production as well, through o11y, or observability. We're just taking these behaviors that we know as developers and applying them under a different name. In development, or in the test environment, we're identifying the levers that impact logical branches in the code for debuggability and reproducibility, and making sure to exercise those in a test. In observability, you're instrumenting code with intention so that you can do the same in production.
When you're writing a test, you're thinking about what you
expect and you're asserting on what you'll actually get. With observability, looking at your systems in production, you're just inspecting results after the changes have been rolled out and watching for deviations. When you're writing tests, especially if you're practicing real TDD (I know not everyone does), you're embracing these
fast feedback loops. You are expecting
to act on the output of these feedback loops to make your code better.
And that's what observability is all about.
It's shipping to production quickly through your
CI/CD pipeline or through feature flags, and then expecting
to iterate even on code that you think is shipped. And it's
exciting that these are guardrails that we've generalized for
building and maintaining and supporting complex software systems that
actually are pretty transferable to LLMs, and maybe to greater effect, for everything that we've talked about here with the unpredictability of LLMs. Test-driven development was all about
the practice of helping software engineers build the habit of
checking our mental models while we wrote code. Observability is
all about the practice of helping software engineers and SREs or DevOps teams have a backstop for, and sanity check of, our mental models when we ship code. And this ability to sanity check is just so necessary for LLMs,
where our mental models are never going to be accurate enough to rely on entirely.
This is a truth I couldn't help but put in here.
It has always been true that software behaves in unpredictable and emergent ways, especially as you put it out there in front of users who aren't you. But it's never been more true than with LLMs that the most important part is seeing and tracking and leveraging how your users are using it as it's running in production, in order
to make it better incrementally.
Now, before we wrap, I want to highlight one very specific example
of a concept popularized through the rise of SRE,
most commonly associated with ensuring consistent performance of production systems: service level objectives, or SLOs.
Given the audience and this conference, I will assume that most of you are familiar
with what they are. But in the hopes that this talk is shareable with a
wider audience, I'm going to do a little bit of background.
SLOs, I think, are frankly really good for
forcing product and service owners to align on a definition of what it means
to provide great service to users.
And it's intentionally thinking about this from the client or user perspective rather than, oh, CPU or latency or things that we are used to when we think from the
systems perspective. Often SLOs are used as a way to set a baseline
and measure degradation over time of a key product workflow. You hear
them associated a lot with uptime or performance or SRE metrics,
and being alerted and going and acting
if SLOs burn through
an error budget. But remember this slide: when
the LLM landscape is moving this quickly and best practices
are still emerging, that degradation is guaranteed.
You will break one thing when you think you're fixing another,
and having SLOs over the top of your product, measuring that user experience, is especially well suited to helping with this. And so,
after these six weeks, from, like, the first line of code to having the full feature out the door, the team chose to use SLOs
to set a baseline at release and then track how their
incremental work would move the needle. They expected this to go up over time because
they were actively working on it, and they initially set this SLO
to track the proportion of requests that complete without an error,
because, again, in the early days we weren't sure what the LLM API would accept from us and what users would put in.
And unlike most SLOs,
which usually have to include lots of nines to be considered good,
the team set their initial baseline at 75%.
This was released as an experimental feature, after all,
and they aimed to iterate upwards. Today we're closer
to 95% compliance.
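As a back-of-the-envelope illustration of the arithmetic behind that kind of SLO, here is a tiny sketch computing compliance and remaining error budget over a window of request outcomes. The numbers and event shape are made up; in practice your observability tool does this over your real telemetry.

# One boolean per query-assistant request in the window: did it complete without error?
events = [True] * 930 + [False] * 70      # pretend window: 1,000 requests, 70 failures

TARGET = 0.75                             # the deliberately modest initial target

total = len(events)
good = sum(events)
compliance = good / total                 # 93.0% for this pretend window
error_budget = total * (1 - TARGET)       # 250 failures allowed before the SLO is blown
budget_remaining = 1 - (total - good) / error_budget

print(f"compliance: {compliance:.1%}, error budget remaining: {budget_remaining:.1%}")

The same arithmetic works whether the target is 75% or several nines; only the size of the error budget changes.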
This little inset here on the bottom right is
an example of what you can do with SLOs once you start measuring them, once you are able to cleanly separate out the requests that did not complete successfully versus the ones that did.
You can go in and take all of this rich metadata
you've captured along the way, find outliers, and then prioritize what work has the highest impact on users having a great experience.
This sort of telemetry and analysis happens over time. This is a seven-day view; there are 30-day views; whatever your tool offers will have different time windows. But being able to
track this historical compliance is what allows the team to iterate
fast and confidently. Remember, the core
of this is that LLMs are unpredictable and hard to model through traditional testing approaches.
And so the team here chose to measure from the outside in
to start with the measurements that mattered: users being able to use the feature, period, and have a good experience,
and then debug as necessary and
improve iteratively. I'll leave you with two other stories, so you believe that it's not just us. As we were building our feature, we actually learned that two of our customers were using Honeycomb
for a very similar thing.
Duolingo, the language learning app, cares a lot about latency with their LLM features. Being heavily mobile, they really wanted to make sure that whatever they introduced felt fast. And so they captured all this metadata (I've only shown two examples), and they wanted to really closely measure what would impact the LLM being slow and the overall user experience being slow. And what they found, actually, was that the total latency was influenced way more by the things that they controlled in that long trace: building up that prompt and then capturing additional
context. That was where the bulk of the time was being spent,
not the LLM call itself. And so again,
their unpredictability happened in a different way. But in using
these new technologies, you won't know where the potholes will
be. And they were able to be confident
by capturing this rich data, by capturing telemetry
from the user's perspective that, okay, this is where we need to focus to
make the whole feature fast.
The second story I have for you is Intercom.
Intercom is a sort of a messaging application for
businesses to message with their users.
And they were rapidly iterating on
a few different approaches to their LLM-backed chatbot,
I believe. And they really wanted to keep tabs on the user experience,
even though there was all this change to the plumbing underneath.
And so they tracked tons of
pieces of metadata for each user interaction.
They captured what was happening in the application, they captured all these different
timings, time to first token, time to first usable token, how long it took
to get to the end user, how long the overall latency was, everything.
Then they tracked everything that they were changing along the way: the version of the algorithm, which model they were using, the type
of metadata they were getting back. And critically, this was traced
with everything else happening inside their application. They needed the
full picture of the user experience to be confident that when they pull one lever over here, they see the result over there. And they recognized that
using an LLM is just one piece of understanding this user experience
through telemetry of your application, not something to be siloed
over there with an ML team or something else.
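As one last sketch, here is roughly how timings like time to first token can be captured on the same trace as everything else when streaming from an LLM. The streaming function is a hypothetical stand-in for a provider's streaming API, and the attribute names are illustrative.

import time
from opentelemetry import trace

tracer = trace.get_tracer("assistant")

def stream_llm(prompt: str):
    # Hypothetical stand-in for a provider's streaming API: yields text chunks.
    for chunk in ["Hello", ", ", "world", "."]:
        time.sleep(0.05)
        yield chunk

def answer(prompt: str, model_version: str) -> str:
    with tracer.start_as_current_span("llm.stream") as span:
        span.set_attribute("llm.model_version", model_version)  # track whatever you're varying
        start = time.monotonic()
        first_token_at = None
        parts = []
        for chunk in stream_llm(prompt):
            if first_token_at is None:
                first_token_at = time.monotonic()
                span.set_attribute("llm.time_to_first_token_ms",
                                   (first_token_at - start) * 1000)
            parts.append(chunk)
        span.set_attribute("llm.total_duration_ms", (time.monotonic() - start) * 1000)
        return "".join(parts)

Because the timing and version attributes ride on the same span, a regression in time to first token can be tied back to whichever lever was pulled.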
So in the end, LLMs break many of
our existing tools and techniques that we used to rely on for ensuring correctness and a good user experience.
Observability can help. Think about the problem from the outside in.
Capture all the metadata so that you have that paper trail to debug and figure
out what was going on with this weird LLM box
and embrace the unpredictability.
Get out to production quickly, get in front of your users, and plan to iterate fast. Plan to be reactive, and embrace that as a good thing instead of a stressful one.
Thanks for your attention so far. If you want to learn
more about this, we've got a bunch of blog posts that go into much greater
detail than I was able to in the time we had together.
But thanks for your time. Enjoy the rest of the conference.
Bye.