Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to Conf42 LLMs 2024. My name is Indika Vimalasurier, and I'll walk you through how you can leverage an observability maturity model to improve the end user experience of the apps you develop using LLMs. We'll touch on how to start, which is the foundation, and then take it up to using AI to support your operations. As you may be aware, around 2022 the hype started with ChatGPT. ChatGPT was a hit, it went mainstream, and it led a lot of people who were not into AI to start creating generative AI apps. It has already taken over the world, and the world is looking at which use cases we can leverage. It's already mainstream.
There are a lot of developers building apps that connect to LLMs, so there is a need to ensure the apps we develop are capable of providing a good end user experience, because we all know how it can end otherwise. While generative AI is opening up a lot of new opportunities, we also want to ensure that the apps being developed are deployed properly in production environments and serve end users as expected, and we don't want this to become an ops problem. So we want to build solid observability into our LLM applications as well.
As part of today's presentation, I'll give a quick intro to what observability is, and we'll discuss what observability means for LLMs. There are two kinds of observability we are going to discuss: direct observability and indirect observability. I'll focus more on indirect observability when discussing the maturity model I'm going to walk you through. Then I'll cover some of the pillars, give a quick intro to what an LLM workflow looks like, and then we'll jump into my main focus, a maturity model for LLM observability. After that we'll look at some implementation guidelines and the services we can leverage. And of course, like every other maturity model, this should not be something people just follow blindly; we want to tie it to business outcomes so we have the ability to measure progress. We'll wrap up with some best practices and some pitfalls I think you should avoid.
Before we start, a quick intro about myself. My name is Indika Vimalasurier and I'm based out of Colombo. I'm a site reliability engineering advocate and practitioner, and a solution architect specializing in SRE, observability, AIOps and generative AI, working at Virtusa as a senior systems engineering manager. I'm a passionate technical trainer; I have trained hundreds of people on SRE, observability and AIOps. I'm an energetic technical blogger, a proud AWS Community Builder under cloud operations, and a proud ambassador at the DevOps Institute, which is now part of PeopleCert following the acquisition. So that's about me.
I am very passionate about this topic, observability. When it comes to distributed systems and LLMs, at the end of the day I look at things from a customer experience angle: how we can provide a better experience to our end users and how we can drive better business outcomes. In this presentation I'm mainly focused on AWS, so I'm looking at LLMs deployed in and accessed through AWS.
One of the fantastic services AWS offers is Amazon Bedrock, a managed service where you can use APIs to access foundation models. It's really fast and quick to get started; you just have to ensure you have the ability to connect. The key features are that it gives access to foundation models for use cases such as text generation, image generation and related scenarios. It also provides private customization with your own data using techniques like retrieval-augmented generation, which we call RAG, and it provides the ability to build agents that execute tasks against enterprise systems and other data sources. One good thing is that there's no infrastructure to worry about; AWS takes care of it, which is why we call it fully managed. It's secure, and it's a go-to tool if you want to develop generative AI apps. It already includes some of the most widely used foundation models from providers such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI and Amazon, and they are continuously adding models. With that, our observability maturity model and approach is mainly focused on applications developed using Amazon Bedrock.
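To ground the discussion, here is a minimal sketch of invoking a Bedrock-hosted model through the API with boto3. The region, model ID and request body shape are assumptions (the body format differs per model provider), so treat it as illustrative rather than a drop-in implementation.

```python
import json
import boto3

# Bedrock runtime client; region and model ID below are illustrative placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke_model(prompt: str, model_id: str = "anthropic.claude-v2") -> str:
    # The request body shape differs per provider; this follows the Anthropic
    # text-completion format as an example.
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 512,
        "temperature": 0.5,
    })
    response = bedrock.invoke_model(modelId=model_id, body=body)
    payload = json.loads(response["body"].read())
    return payload.get("completion", "")
```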
Moving on, I just want to give a quick idea of what a generative AI app looks like as a use case. A typical user enters a query, which comes into our query interface, for example through an API or a user interface. We start processing the user query and convert it into a vector encoding, which lets us find similar queries and patterns in our vector database. We then retrieve the top-k most relevant context from the vector database and combine it with the user input when calling the LLM. The key thing to note is that we generally combine the user input with the retrievals we receive from the vector database. With that, we start inference with the LLM: we send the request input to the LLM, take the output, combine it with our RAG integration, and finally, after customization, send it to the end user.
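A minimal sketch of that retrieval-augmented flow, under the assumption that you already have an embedding function, a vector store client and an LLM invocation helper (all hypothetical stand-ins here), might look like this:

```python
def answer_query(user_query: str, vector_store, embed, llm_invoke, top_k: int = 5) -> str:
    # 1. Encode the user query into a vector.
    query_vector = embed(user_query)

    # 2. Retrieve the top-k most relevant chunks from the vector database.
    contexts = vector_store.search(query_vector, k=top_k)

    # 3. Combine the retrieved context with the original query into one prompt.
    prompt = "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {user_query}\nAnswer:"

    # 4. Send the combined prompt to the LLM and return the output
    #    (post-processing / customization would happen before returning to the user).
    return llm_invoke(prompt)
```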
So this is typically the workflow of a generative AI application, and this is the path we want to make observable. What is observability? I'm sure most of you are aware, but just to make sure we are on the same page, I'll spend a short amount of time giving my perspective. Observability is nothing but the ability to infer or understand a system's internal state by looking at its external outputs. What are those external outputs? Typically logs, metrics and traces. I like to think of observability as looking at the entire iceberg, not only the part above the water. We are trying to answer questions like: what is happening in my system right now, how are the systems performing, what anomalies are there, how are the different components interacting with each other, and what caused a particular issue or failure? Compared with monitoring, observability brings a lot of good things: it is a proactive approach rather than a passive one, it looks at the big picture, and it works with both qualitative and quantitative data.
Now let's agree on what we mean when we say observability and LLMs. When it comes to observability in LLMs, or in apps developed using LLMs, we can divide it into two parts. One is what we call direct LLM observability, or observability of the LLM itself: in this scenario we monitor, evaluate and look at the large language model directly, so it is all about observability into the large language model. The other is indirect LLM observability, or observability of the applications and systems using the LLM: here we are not looking at the LLM directly, but at the applications and systems connecting to and utilizing the LLM. Either way, the goal is to provide real benefits to the end users. Both have their merits, and the techniques we use are pretty much the same standard observability approach: we leverage logs, metrics, traces and other signals. Now let's quickly look at what direct LLM observability means. Here we integrate observability capabilities during the training and deployment of the LLM and while it is being used, so it sits at the level of the LLM itself.
The main objective is to gain insight into how the LLM is functioning, identify anomalies and other issues directly related to the LLM, and understand the LLM's decision-making process. In terms of approach, we activate logging and look at things like attention weights and other internal states of the LLM during inference, we implement probes or instrumentation within the model architecture, so observability is implemented at the LLM level, and we track performance metrics such as latency and memory usage, along with techniques like attention visualization. As I said, this sits at the LLM level; it is a fully fledged look at how the LLM itself is performing. Indirect LLM observability, on the other hand, is mainly about the applications and systems we have developed that connect to the LLM. Here we are not looking at the LLM in isolation; we are fully focused on the application side. The goal is to understand how our application is behaving, what observability signals we can enable, and how we can interpret its internal state. This makes sense because, just like any other application, for GenAI we also want to understand how our application is performing, since any number of issues can come in, and at the end of the day it is about the experience of the users who rely on our solution. What we look at here is, again, logging of the inputs and outputs related to the LLM, monitoring metrics, enabling anomaly detection on some of the LLM outputs, and of course human feedback loops, plus metrics such as error rate and latency. The key objective, as you would have already guessed, is to understand how our application is behaving, how it is leveraging the LLM, and what quality of output we are providing to our end users.
In this presentation, when I say LLM observability, I mean indirect LLM observability. I am looking at a maturity model that caters to applications developed by connecting to Amazon Bedrock, because AWS is what I am focusing on and Bedrock is the relevant service. We are trying to see how we can integrate observability practices into generative AI applications: how we can understand these applications' internal state, while also covering some aspects of the LLM and prompt engineering. We take an indirect view of the LLM's functionality and try to make sure that our generative AI applications are reliable, that they deliver what they were designed to deliver, and that end users are happy with the performance. So why observability for LLMs? Just like any other application, generative apps developed using LLMs also require observability, because we need it to ensure correctness, accuracy and performance, and to provide a great customer experience. But LLMs also have their own challenges. They can be complex; we might have to look at anomalies, model bias or model drift. Model drift means the model can work fine during testing for a considerable period of time and then start failing, which can have an adverse impact on the end user experience. Sometimes models can develop bias, which again creates poor customer experiences. Then there are the other standard concerns: debugging, troubleshooting, how well we are using our resources, and ethics, data privacy and security. Looking at all of these, observability for LLM-based applications is very important, because it is what allows us to deliver great end user experiences.
Now let's focus on the pillars shaping LLM observability. I'd like to split them into a few parts. The first is LLM-specific metrics. One of these is LLM inference latency: here we track the latency of requests coming into the Bedrock-backed application and monitor it at different stages of the request, for example at the API Gateway, in Lambda functions and at the LLM itself, however we have structured it, so we can find potential bottlenecks and optimize performance. Then we look at LLM inference success rate: we monitor the success rate of requests going to and coming from the LLM, watch for increases in errors, understand the reasons for those errors, and cover the troubleshooting aspects. We also have LLM output quality, where we try to understand the quality of the LLM outputs, which again gives us the ability to improve in those areas. Another important one is LLM prompt effectiveness: it tracks how effective the prompts we send to the LLM are. We monitor the quality of LLM outputs across different kinds of prompts, see how the results deviate, and continuously refine them. Some of the other things are LLM model drift: we monitor the distribution of LLM outputs within the application to understand whether the output distribution shifts significantly over time, and we keep tracking performance. Of course we also have to look at cost, and at integration issues when connecting to LLMs, especially since we are integrating with Amazon Bedrock. Then we look at ethical considerations as well: we monitor LLM outputs from Bedrock for potential ethical violations, to ensure that the generative AI apps we develop are safe and produce no harmful, illegal or discriminatory content. Those are the key LLM-specific metrics.
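As a concrete illustration, a minimal sketch of capturing inference latency and success rate as CloudWatch custom metrics might look like the following; the namespace and metric names are assumptions of this example, not anything Bedrock emits for you.

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def invoke_with_metrics(invoke_fn, prompt: str, namespace: str = "GenAI/Bedrock"):
    """Wrap an LLM call and record latency and success as CloudWatch custom metrics."""
    start = time.time()
    success = True
    try:
        return invoke_fn(prompt)
    except Exception:
        success = False
        raise
    finally:
        latency_ms = (time.time() - start) * 1000
        cloudwatch.put_metric_data(
            Namespace=namespace,  # custom namespace; name it to suit your application
            MetricData=[
                {"MetricName": "InferenceLatency", "Value": latency_ms, "Unit": "Milliseconds"},
                {"MetricName": "InferenceSuccess", "Value": 1.0 if success else 0.0, "Unit": "Count"},
            ],
        )
```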
When it comes to prompt engineering properties, we look at temperature, which controls randomness in the model: the higher the temperature, the more diverse the outputs, and the lower the temperature, the more focused the outputs. We look at top-p sampling, which controls output diversity, at top-k sampling, and at things like max tokens and stop sequences, which signal the model to stop generating text when they are encountered. We also look at repetition penalties, presence penalties and batch sizes. All of these we can extract via logs, send to CloudWatch Logs, create custom metrics from, and then start visualizing.
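One way to do that, as a sketch, is to emit a structured JSON log line per request carrying the prompt properties; the field names here are illustrative, and a CloudWatch Logs metric filter or Logs Insights query can later turn them into metrics.

```python
import json
import logging

logger = logging.getLogger("prompt-properties")
logger.setLevel(logging.INFO)

def log_prompt_properties(model_id: str, temperature: float, top_p: float,
                          top_k: int, max_tokens: int, stop_sequences: list[str]) -> None:
    # One structured JSON line per request; CloudWatch Logs can aggregate
    # these fields into custom metrics downstream.
    logger.info(json.dumps({
        "event": "llm_prompt_properties",
        "model_id": model_id,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "max_tokens": max_tokens,
        "stop_sequences": stop_sequences,
    }))
```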
Two other things: for inference latency we can check the time taken for the model to generate output for a given input, and we can look at model accuracy metrics as well. For these we would typically use AWS X-Ray, publish the results into CloudWatch, and then create alarms and wrappers around them. A few other areas are more specific. When it comes to RAG, we again have metrics like query latency: we want to understand the time it takes for the RAG pipeline to process a query and generate the response. We look at the success rate, how successful these queries are and how often they fail, at resource utilization, and, if we are using caching, at cache hits as well. For logs, we look at query logs, error logs and audit logs, which give us a comprehensive way of auditing and troubleshooting. We also enable traces with X-Ray, which gives us end-to-end tracing so that we have complete observability into the data store and data retriever. Another pillar is tracing itself: we use X-Ray, which lets us integrate traces, and we look at integrations with other AWS services as well.
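As a sketch of what that instrumentation can look like with the X-Ray SDK for Python: this assumes the application runs somewhere with an active X-Ray segment (such as Lambda with tracing enabled), and the subsegment name and annotation are placeholders.

```python
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch boto3 (and other supported libraries) so calls to Bedrock, DynamoDB, etc.
# show up as subsegments in the X-Ray service map.
patch_all()

def traced_answer_query(user_query: str, answer_fn):
    # Custom subsegment wrapping the whole retrieval + inference path.
    with xray_recorder.in_subsegment("rag_pipeline") as subsegment:
        subsegment.put_annotation("query_length", len(user_query))
        return answer_fn(user_query)
```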
Then we use CloudWatch as the visualization tool; we can also use Amazon Managed Grafana or other options. Another key thing is to be mindful about alerting and incident management: we can use CloudWatch alarms and we can leverage AWS Systems Manager as well.
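For example, a minimal sketch of an alarm on the custom latency metric published earlier; the namespace, metric name, threshold, periods and SNS topic ARN are all hypothetical values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average inference latency stays above a threshold for three periods.
cloudwatch.put_metric_alarm(
    AlarmName="genai-inference-latency-high",
    Namespace="GenAI/Bedrock",
    MetricName="InferenceLatency",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2000.0,  # milliseconds; tune to your own SLO
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:genai-alerts"],  # hypothetical topic
)
```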
One important thing is security. We leverage AWS CloudTrail to audit and monitor API calls and to ensure that compliance with security and regulatory requirements is being tracked; we can integrate CloudTrail logs with CloudWatch Logs for centralization. We also use AWS Config to continuously monitor and assess the configuration of our AWS resources and to ensure we stay aligned with best practices and our compliance standards. One other key aspect is cost: the more we use our LLMs, the more the cost factor comes in, so we can leverage AWS Cost Explorer and AWS Budgets.
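As an illustration, a Cost Explorer query scoped to Bedrock spend might look like the sketch below; the exact SERVICE dimension value Cost Explorer uses for Bedrock should be verified in your own account before relying on it.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Daily unblended cost for Bedrock over a date range (dates and service name
# are placeholders; confirm the service name with a grouped query first).
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-05-31"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}},
)
for day in response["ResultsByTime"]:
    print(day["TimePeriod"]["Start"], day["Total"]["UnblendedCost"]["Amount"])
```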
Finally, one other important area is building AIOps capabilities. We have to ensure that for all the metrics, whether LLM-specific, application-specific or RAG-specific, we enable anomaly detection, and for all the logs we put into CloudWatch we enable log anomaly detection as well. We can also use Amazon DevOps Guru, a machine learning service from AWS that helps us detect and resolve issues in our systems, especially by identifying anomalies and other issues we might not be able to uncover manually.
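For the metric side of that, a minimal sketch of enabling CloudWatch anomaly detection on the custom latency metric (same assumed namespace and metric name as before) could be:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Train a CloudWatch anomaly detection model on the custom inference-latency metric;
# the namespace and metric name match whatever your application publishes.
cloudwatch.put_anomaly_detector(
    SingleMetricAnomalyDetector={
        "Namespace": "GenAI/Bedrock",
        "MetricName": "InferenceLatency",
        "Stat": "Average",
    }
)
```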
We also look at leveraging Amazon CodeGuru, because it integrates with the application so that we can do profiling and understand resource utilization in our applications. Another very important thing is to use forecasting: for all the metrics we are bringing to the table, forecasting lets us understand things in advance so that we can make better decisions and plan ahead. With that, you might ask why we need a maturity model at all. I am a big fan of maturity models because I think they act as a north star. We all want to start someplace and then take our systems on an observability journey. If you do that without a maturity model or framework, you may end up anywhere; by using one, you can make sure you start with the basic steps, finish with some of the advanced capabilities, and have better control over how you get there.
The indirect LLM observability maturity model I propose has three levels. Level one is foundational observability, level two is proactive observability, and at level three we are looking at advanced LLM observability with AIOps. At level one, we start capturing some of the basic LLM metrics, collecting logs, monitoring the basic prompt properties, implementing basic logging and distributed tracing, and putting up visualization and basic alerts. This gives you foundational observability into your generative AI application. The next step is making the system more proactive. Here we capture and analyze advanced LLM metrics, make better use of the logs and the more advanced prompt properties, and enhance alerts and incident management workflows so that we can identify and resolve issues much faster. We bring in the security and compliance aspects, and we start leveraging forecasting so we can look ahead at some of the LLM-specific metrics and prompt properties; for the logs, we can also set up log anomaly detection. Level three is the advanced level, the place where you ultimately want to be, but be mindful that it's a journey: you start at level one, go to level two, and then you can move into level three. At level three we integrate with DevOps Guru and CodeGuru, where DevOps Guru provides the AI and ML capabilities and CodeGuru provides insight into code quality, and then we implement AIOps capabilities such as noise reduction, intelligent root cause analysis and business impact assessment. The forecasting capability lets us understand in advance whether and when models might drift, whether and when they might start developing bias, and what the response time predictions look like. This level of AI-assisted tooling can give you full predictability over your generative AI application.
Now let's look at this from a more implementation-focused angle. At the foundational level, we can use CloudWatch metrics to capture the basic LLM metrics such as inference time, model size and prompt length. For the prompt properties, we send those logs into CloudWatch Logs so that we can start monitoring basic properties like prompt content and prompt sources, and we ship any other logs into CloudWatch as well, so that we can start getting the basic details.
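A sketch of turning those shipped log lines into a basic CloudWatch metric via a metric filter; the log group name, namespace and the JSON field in the filter pattern are placeholders.

```python
import boto3

logs = boto3.client("logs")

# Count the structured prompt-property log lines as a basic CloudWatch metric.
logs.put_metric_filter(
    logGroupName="/genai/app",
    filterName="llm-prompt-events",
    filterPattern='{ $.event = "llm_prompt_properties" }',
    metricTransformations=[
        {
            "metricName": "PromptCount",
            "metricNamespace": "GenAI/Bedrock",
            "metricValue": "1",
            "defaultValue": 0.0,
        }
    ],
)
```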
Then we integrate AWS X-Ray, depending on the technology we used to develop our generative AI app, so that we have the ability to look at traces. For visualization we can use CloudWatch dashboards and, if required, Amazon Managed Grafana dashboards as well. For alerting and incident management we leverage CloudWatch, which covers everything from basic to medium-complexity monitors, so that we have good control over how the LLMs are behaving, how successful our prompts are, how our generative application is behaving overall, and, most importantly, how our end users are feeling about it. We wrap this up with cost, using AWS Cost Explorer: because LLMs can be costly, we have to track usage and monitor it as well.
At level two, we go a little more advanced. For metrics, we look at advanced measures like model performance and output quality. For prompt properties, we look at advanced aspects like prompt performance and prompt versioning. We keep improving the incident workflows, we address security and compliance, and we continue improving the cost side. One of the key things we bring in here is forecasting: using it, we want the ability to forecast all the key metrics related to LLM performance, inference, accuracy, and the prompt properties. We also enable metric anomaly detection and log anomaly detection so that we start using those capabilities. Finally, at level three, we bring in Amazon DevOps Guru and CodeGuru, which add the AI and ML capabilities so that we can look at things holistically; DevOps Guru is a perfect tool for this. Then we bring in AIOps practices, ensuring our incident workflows move toward self-healing, along with the many other AI-driven improvements we can add.
While we do all of this, we want to ensure that we measure progress. Once we enable observability, we want to see how the LLM output quality is improving, how we are optimizing our prompt engineering, and that we can detect model drift in advance and take the necessary actions. We look at the ethical aspects and how our models behave against them, and we keep a close eye on interpretability and explainability. More generally, we look at end user experience: we clearly define some end-user-specific service level objectives, track the metrics and the improvements, and keep watching customer experience to ensure that whatever we do aligns and correlates with it, so that we see customer experience improving as well.
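As a toy illustration of what such an end-user SLO check could look like (the latency and availability targets here are made up for the example):

```python
def slo_report(latencies_ms: list[float], errors: int, total: int,
               latency_target_ms: float = 2000.0, availability_target: float = 0.99) -> dict:
    """Toy end-user SLO check: share of fast requests and remaining error budget."""
    fast = sum(1 for latency in latencies_ms if latency <= latency_target_ms)
    latency_sli = fast / len(latencies_ms) if latencies_ms else 1.0
    availability_sli = (total - errors) / total if total else 1.0
    return {
        "latency_sli": latency_sli,
        "availability_sli": availability_sli,
        "error_budget_remaining": availability_sli - availability_target,
    }
```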
Overall, the goal is that we develop and provide better, world-class services to our end users. Now for some best practices: use structured logging, and if you are heavily using Lambda, consider Powertools for AWS Lambda; instrument your code so that you capture all the critical LLM-specific metrics; and of course use X-Ray to enable traces as well.
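A minimal sketch of what that can look like with Powertools for AWS Lambda (Python); the service name, namespace and the hard-coded latency value are placeholders for whatever your handler actually measures.

```python
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="genai-app")          # structured JSON logging
tracer = Tracer(service="genai-app")          # X-Ray tracing
metrics = Metrics(namespace="GenAI/Bedrock")  # CloudWatch embedded metric format

@logger.inject_lambda_context
@tracer.capture_lambda_handler
@metrics.log_metrics
def handler(event, context):
    prompt = event.get("prompt", "")
    logger.info("llm_request", extra={"prompt_length": len(prompt)})
    # ... call Bedrock here and measure the actual latency ...
    metrics.add_metric(name="InferenceLatency", unit=MetricUnit.Milliseconds, value=850)
    return {"status": "ok"}
```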
The metrics we extract have to be meaningful and add value; they should be aligned with our business objectives. To wrap up, some pitfalls to avoid: make sure you plan security and compliance in advance, because that is a key concern these days when using generative applications; clearly define the goals and objectives you are going to achieve with this; and set some numbers, some measurable targets, so that you can track performance and actually capture the benefit.
With this, we are at the close, so thank you very much. Here I have taken AWS as the example, especially Amazon Bedrock. We have looked at the general architecture and workflow of a generative AI application, the key LLM-related observability pillars we have to enable, and the three levels: foundational observability, proactive observability, and advanced observability with AIOps. We have also looked at some best practices and pitfalls and, more importantly, at how to view this from an ROI perspective. Thank you very much for taking the time. I hope you enjoyed this and that you have taken away a few things you can apply to your generative AI application to make it observable and to deliver great customer experiences. So with this, thank you very much.