Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, welcome to the session. Large language
models have captured the imagination of software developers and customers alike,
who are now interested in integrating those models into their day-to-day workflows.
Today I'll be talking about Amazon Bedrock, a managed service
through which you can access different foundation models
using a single API, and about the operational
excellence best practices you should consider when
using Amazon Bedrock. Customers are often
looking for a turnkey solution that can help integrate these LLMs
into their existing applications. As part of
the session, we will introduce the term operational
excellence from a Well-Architected review perspective, talk about
LLMs, and then go in depth into Bedrock.
So let's start with operational excellence. If you look at the
Well-Architected Framework that AWS recommends,
there is a pillar in there called operational excellence.
Operational excellence is basically the ability to support the development
of your workloads, run your workloads, gain insight
into your workloads, and continuously improve your processes and procedures
to deliver business value. Operational excellence is
a practice you develop over the course of time.
It's not something you will get
done overnight, or just by adopting a particular solution.
It is about how your team is structured, and how your people, processes and
technologies work together.
Now, within operational excellence, there are different design principles
you should be considering. One of the key
principles is performing operations as code. We are all aware of infrastructure as
code and of the different tools and technologies
available in the market; try to adopt as much of that as possible
from an operations perspective, so that you can execute your operational procedures as code.
Making frequent, small, reversible changes
is another key design principle for operational excellence,
along with refining your operational procedures often, anticipating
failure, learning from your operational failures,
and finally observability, which can help you get actionable insights.
With respect to LLMs, let's say you're using Amazon Bedrock,
which gives you API access to different foundation models.
You still have to follow all of these design principles:
how do you deploy the API? How do you version
the API? How do the different operational procedures
work together? What kind of observability can be put in?
These are some of the factors we will be talking about as we
go further into the session. Now let's talk about some of
the key terms that we keep hearing day in, day out: DevOps,
MLOps and LLMOps.
DevOps as a term has been around for a pretty long time.
It basically encourages you to break down silos,
remove the organizational and functional separation
between teams, and take end-to-end ownership: whatever
you build, you also run and support.
MLOps is applying the same people, process and
technology best practices to machine
learning solutions. Consider that DevOps
is something you would often use for microservices written in Java,
Python or Golang. Now you are using the same
technology stack, but trying
to solve a machine learning problem where you suddenly have a model:
you need to train the model, run inference on the model,
perhaps host multi-model endpoints.
You want to incorporate these practices into how the model gets trained,
how it gets deployed, how the approval process works,
and ultimately how inference is served, be it real-time
inference or batch inference. That's the MLOps part.
What about LLMOps? So far, MLOps has
mostly been used for specific machine learning models you have
created to solve a single task. With large language
models, you have the capability of using a single
foundation model to solve different types of tasks.
For example, a foundation model
you access via Bedrock can be used for text summarization,
as a chatbot, or for
question answering; there are many different business scenarios where
you can use these models. Hence LLMOps as a term
is about using that single foundation model
for different aspects of your business.
The operational excellence best practices we spoke about
on the previous slide remain quite consistent;
only the nature of the specific problems you
are solving differs depending on whether you have a machine learning solution
or an LLM solution. At its core,
every solution you create still comes down to
people, process and technology.
Now, MLOps is essentially the productionization
of your ML solutions. Let's say I deploy
a solution into production: the model goes through its own
training, someone provides an approval, and then it
serves inference, be it batch inference or real-time inference.
There is a lot of overlap with foundation model operations,
that is, generative AI solutions using text, image,
video and audio. And finally, when you talk about
LLMs, you are again productionizing large language models.
Some of the attributes change,
in terms of the metrics
you're looking at, but the process more or less remains the same.
And then there are some additional customizations you would incorporate
for LLMs, with, say, RAG or fine
tuning, etcetera. We'll talk about those in a few slides.
At its core, it's still going to be people, process and technology,
and that overlap stays consistent
irrespective of what kind of operational excellence you're going for.
Even if you are using the best technology available in the market,
you still need to train your people so they can use that technology effectively
to derive business value. You need
close collaboration between the team that is training your model,
or maybe fine-tuning it, and ultimately the consumers who
are going to be using that model. That is the people,
process and technology aspect. And obviously, Conway's law doesn't change much
when it comes to deploying software, whether you're deploying it with an
LLM or in the traditional sense
with microservices or even a monolithic application.
Now, let's talk about foundation models, and the first thing we want
to talk about is the model lifecycle. In a
typical machine learning use case, you have a model
lifecycle where you start with a lot of data. Using that data,
you go into a processing stage where you process
all of your information; that data may have been labeled, for supervised
or unsupervised learning, whichever algorithm you've chosen.
Once training has been done,
you do hyperparameter tuning to ultimately
create a model. That model goes through validation
and testing, and once it is ready,
you use it for that specific task. That's the
important part here: you use it for this specific task, because
the model has been tuned and trained for that particular task.
Tomorrow you have a new set of data, you do another iteration,
retrain the model, and ultimately
redeploy it once the requisite approvals are in place.
So this is just one project. When it comes to
foundation models or large language
models, your dataset is no longer just one dataset;
the model is trained on every possible dataset.
For example, the Llama models from Meta
have been pre-trained on very large datasets,
with variants of up to 70 billion parameters, etcetera. Once that
model has been made available, from that foundation model
you can do fine tuning if you're interested in that;
that's project B in this picture. From the
same foundation model, you can also directly use it for some task-specific
deployments. And once you do fine tuning, or RAG,
or something else, you can use that same model for a different
use case. That's the key difference here: you're using a single model
for different projects and different scenarios with some
alterations, whereas in the previous case you had one model
for each task. Now, with LLMOps,
there can be different types of users you're interacting with,
and I want to talk about the generative AI user
types and the skills they need. Let's talk first about
the providers. A provider is someone building an
LLM from scratch. In this case, take a Llama model
which has been built from scratch; that work spans
NLP, data science, model deployment, inference,
etcetera. So that's a provider. Once you have got a model from a provider,
your team can decide to fine-tune it. Those are the tuners,
the people who fine-tune a model
to fit custom requirements. Maybe you have business-specific data
you want the model to be a little more aware of,
so you fine-tune the model
using that business data and the domain-specific knowledge you have.
And then the third group is the consumers.
They don't care what the model has been trained
on or how it has been fine-tuned; they just
use the model. Consider someone using
your chatbot: they ask a question and they would like to
get a response, and they want to ensure the response doesn't
carry any bias, toxicity or otherwise unwanted content.
They don't really have much
ML expertise, but they are using prompt engineering to get
a response from the model. Be mindful
that these roles are transferable: a provider
can also become a tuner, and a consumer
can also become a tuner.
Essentially, this is the spectrum you have,
with the MLOps side, where the model
gets created, at one end, and at the other end people
directly incorporating the model into their day-to-day workflows.
When it comes to LLM selection,
there are different aspects you would want to consider.
The three key ones we have seen from the field are speed,
precision and cost. Let's say you have three different LLMs
and each one is good at one particular thing:
LLM one is the best when it comes to precision,
LLM two is the best when it comes to cost,
and LLM three is the best when it comes to speed. Depending on
the business scenario and the priorities of a particular customer,
they can choose one of those LLMs. Some customers are
ready to sacrifice a bit of precision in
order to pick a low-cost LLM,
because of the number of tokens they'll be sending across and the volume
of usage they'll have; you always want a cost-effective solution for
any software you are deploying. Second is
the response time. There are different ways in which you can improve response
times; maybe you use text embeddings with a vector database
so you can cache responses, or something else.
But essentially these are the three key
factors I have seen different customers use when
they are evaluating LLMs. This slide is a summary of what
I just spoke about, how LLM one, two and three compare,
and then it's up to the customer which particular LLM
they pick and what they use it for.
Let's talk about customization. When it comes to customizing LLMs,
there are four different approaches I have seen customers use.
One of the most common is prompt engineering.
That's when you simply send a request to the LLM. For example, you're using
an Anthropic Claude model on Amazon Bedrock: you go into one of
the playgrounds and just send a request, something like
"give me details of the last major incident in software
engineering around best practices". That's prompt engineering:
just asking a question and expecting a response from the LLM.
A more nuanced one is retrieval augmented generation, or
RAG, which often gives you
a better solution and a better cost-benefit ratio
for customizing your LLMs.
After RAG comes fine tuning,
which is more time consuming and more complex;
a lot of data and other things are needed.
Compared to RAG, fine tuning is a special
case: I would say if you really want that level of
control over the responses, then maybe think about fine
tuning. And the last would be continued pre-training, where you
are essentially taking the model and customizing
it far more. Obviously the complexity increases as you go
from prompt engineering to RAG to fine tuning to,
ultimately, continued pre-training. One of
the most common patterns in the rush towards
LLMs is that everyone tries to
start with fine tuning, thinking that the LLM can be made
aware of specific knowledge and facts about the organization's
code base, domain knowledge, etcetera. What has been observed
is that in the majority of cases, RAG is good enough.
It offers a better solution and is more cost effective
in terms of the cost-benefit ratio between RAG
and fine tuning. Fine tuning requires considerably more computational
resources and expertise, and it introduces even more challenges around
sensitive and proprietary data than RAG does.
And there's obviously the risk of underfitting or overfitting if you
don't have enough data available for fine
tuning. So do clear benchmarking
to see how your model performs with prompt engineering versus
RAG, and only then think about whether fine tuning
is the right solution. Without that evaluation,
you may be jumping into a
technology solution that is much more difficult to manage
in the long term. Now, here we
talk about customizing the business responses. What's really going
to help drive your business in generative AI is what's important
for your customers and for the products
you're creating, and how you go about that. You can
leverage different mechanisms here, and this is where fine
tuning and continued pre-training come into the picture.
For fine tuning, the purpose is basically maximizing accuracy on
the specific tasks you have,
and it works with a comparatively small amount of
labeled data. Continued pre-training, on the other hand,
is where you want to maintain the model over
a longer duration on your specific domain:
that is hyper-customization, using a large amount of unlabeled
data. Now, as I mentioned before,
Amazon Bedrock can help remove the heavy lifting from these
kinds of model customization processes. But be
very clear on your use case: when would you use
RAG versus fine tuning versus prompt
engineering, and why would you want a more complex customization
than what you're already getting? Without that clarity at a business
level, it will be quite difficult to adopt an LLM
and make sure it is viable in the long term.
Now let's talk about Amazon Bedrock.
Amazon Bedrock is basically a way of simplifying
access to foundation models, providing an integration
layer for you via a single API, the InvokeModel API.
You get access to the different models available within Amazon Bedrock:
the Stability AI models, Amazon Titan, Claude,
the Llama models, etcetera. Customers have often told us
that one of the most important features of Bedrock is how easy it
makes it to experiment with, select and combine a range of
foundation models. It's still very early days, we are
all just getting started, and customers are moving extremely fast.
The key aspect is that customers want to
experiment, deploy, and iterate on whatever
they have done. Today Bedrock provides access to
a wide range of foundation models from different organizations,
as well as the Amazon Titan models. Once you have
access to the Bedrock API itself, invoking one
of these models is extremely straightforward; I'll talk about it
in a bit. Now let's talk about the architectural patterns
you have when using Amazon Bedrock.
The first pattern is built around Knowledge Bases for Amazon Bedrock.
To equip a foundation model with up-to-date, proprietary
information, organizations often talk about retrieval augmented
generation; we spoke about it a little earlier on
the customization slide. It's basically a technique where
you fetch data from the company's data sources,
enrich the prompt with that data, and deliver
more relevant and accurate responses. Knowledge
Bases for Amazon Bedrock gives you
a fully managed RAG capability and allows
you to customize foundation model responses with contextual,
relevant company data. Essentially, it
helps you securely connect your company data sources to foundation models,
it's fully managed RAG, it has built-in session context management for
multi-turn conversations, and you
also get automated citations with retrievals to improve the transparency
of the responses. So how does it work for you?
Let's say you have a user query; someone has asked something like
"how can I get the latest details about my statement".
That query goes into Amazon Bedrock,
which has knowledge bases associated
with it. It looks up
the knowledge bases for Amazon Bedrock,
augments the prompt you received based on what it retrieves,
and ultimately passes it to one of the foundation models,
be it Claude, Llama,
Titan or the Jurassic models, to provide a response
to your customer. All the information retrieved
as part of this process comes from the
sources you have within the knowledge base,
and the response includes citations back to the knowledge base, to improve
transparency. You also have Amazon Q, which
takes a similar approach when it comes to integrating with Amazon Connect;
that's not something we are covering in this session, but it
has the same idea of being able to use your
knowledge bases to give you customized responses.
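Coming back to the knowledge base flow, here is a minimal sketch of what that retrieval-augmented call can look like with boto3. The knowledge base ID and model ARN are placeholders you would replace with your own, and the exact request shape is worth double-checking against the current Bedrock documentation.

```python
import boto3

# The Bedrock agent runtime client handles the knowledge base retrieve-and-generate flow
client = boto3.client("bedrock-agent-runtime")

# Hypothetical knowledge base ID and model ARN -- replace with your own values
response = client.retrieve_and_generate(
    input={"text": "How can I get the latest details about my statement?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
        },
    },
)

# The generated answer, plus citations pointing back to the knowledge base sources
print(response["output"]["text"])
print(response.get("citations", []))
```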
Another architectural pattern is fine tuning, which we spoke about
earlier. Let's say you have a very specific
task for which you need fine tuning. You simply point
to labeled examples of that data sitting in
Amazon S3, and Amazon
Bedrock makes a copy of the base model,
trains it, and creates a private fine-tuned model, so you
can get tailored responses. How does that
work? Essentially you're making use of one of
the foundation models, be it a Llama 2 model or a Titan
model. For your specific tasks you keep all of
your labeled datasets in Amazon
S3, and you use that
data to make the model
better at giving tailored responses. Today, fine tuning is
available for the Llama models, Cohere Command,
Titan Text Express, Titan Multimodal Embeddings and
Titan Image Generator; fine tuning for the Anthropic Claude models
is coming very soon, but today it is not available.
So Bedrock creates a copy of the base model, you have the labeled
dataset in Amazon S3, and from there you are able
to fine-tune the model and get the generated responses.
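As a rough sketch of what kicking off that customization can look like with boto3, assuming hypothetical job names, an IAM role ARN and S3 locations you would substitute with your own (and with parameter names worth verifying against the current API docs):

```python
import boto3

# The control-plane Bedrock client (not bedrock-runtime) owns model customization jobs
bedrock = boto3.client("bedrock")

# Placeholder names, role ARN and S3 locations -- replace with your own
response = bedrock.create_model_customization_job(
    jobName="my-fine-tuning-job",
    customModelName="my-fine-tuned-model",
    roleArn="arn:aws:iam::111122223333:role/BedrockCustomizationRole",
    baseModelIdentifier="amazon.titan-text-express-v1",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-training-bucket/labeled-data.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-output-bucket/fine-tuning-output/"},
    hyperParameters={"epochCount": "1", "batchSize": "1", "learningRate": "0.00001"},
)

print(response["jobArn"])  # track the job; the custom model appears once it completes
```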
Now let's talk about how you invoke these models.
One of the most common patterns for invoking
them is through Amazon API Gateway. It's
a very well tested serverless pattern that existed
long before Amazon Bedrock; instead of Bedrock, you would
have had ECS or EKS or
just something running on compute somewhere behind it.
You can use AWS Lambda to do that invocation
with Bedrock as well: you use the same pattern, and
it leverages the event-driven architecture you may
already be using with Amazon API Gateway.
And it doesn't always have to be Amazon API Gateway; any
integration layer that can invoke AWS Lambda can be used to call
the Bedrock APIs. Finally, instead of AWS
Lambda, you can get the same behavior from
a long-running compute on EC2,
ECS or EKS, and invoke the Bedrock API in exactly the same way.
in the exact same way. For this particular
example, let's consider that you are having two models which you have created
for your request and response. Payload request is saying that
you need to have a prompt which is going in and response is saying that
you have a response that is coming back and a status code that is coming
back. When you want to invoke the Amazon
bedrock endpoint, you're going to be writing a very simple
lambda code which is going to be using the boto three API.
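The exact code on the slide isn't reproduced in this transcript, but a minimal sketch of such a handler looks roughly like this, assuming the legacy Claude text-completion request format and the anthropic.claude-v2 model ID; the payload shape changes per model.

```python
import json
import boto3

# Runtime client used for model invocation (the control-plane client is "bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")


def lambda_handler(event, context):
    # The request model carries the prompt; API Gateway passes it through in the event body
    prompt = json.loads(event["body"])["prompt"]

    # Claude's (legacy) text-completion payload: prompt, max tokens to sample, temperature
    body = json.dumps({
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": 300,
        "temperature": 0.5,
    })

    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-v2",   # swap in a Titan or Llama model ID as needed
        contentType="application/json",
        accept="application/json",
        body=body,
    )

    completion = json.loads(response["body"].read())["completion"]

    # The response model: a body and a status code going back to the caller
    return {"statusCode": 200, "body": json.dumps({"response": completion})}
```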
So let's walk through it. You basically
create a Bedrock client using the bedrock-runtime service
with boto3. You create the body, which
contains the prompt, the maximum number of tokens you want back
in the sample response, the temperature, etcetera.
Then you select the model ID; here I have
selected an Anthropic Claude model, but you can also select any
of the other models, like a Titan model or a
Llama model. Once you have
the model ID and the payload structure
you are sending, be mindful that this payload structure can change
depending on the model you are invoking. Then you just call
InvokeModel, it gives you a response, and you return
that response using the same response model
structure defined earlier. Again, the request and response
payload structures will differ based on the
model you are using, and the model ID will also change based
on the model you intend to use. So that's one
way of invoking it, if you're using Lambda and API Gateway; and even if
you're not using API Gateway, anything else that can integrate with Lambda
will work. Now let's say you're not using Lambda at all and you just
want to call Bedrock from a generic application. You can still use
boto3, with temporary credentials to gain access,
and invoke the Bedrock API directly. And if for any reason
the AWS SDK is not available to you,
you can also use AWS SigV4 signing to
construct a valid request and invoke the Bedrock
API. The slide shows an example
quite similar to the one shown earlier;
the only difference is that we don't have the Lambda handler with
the event and the context. Here we are directly using
boto3 and getting the response.
You can embed this in any application that has
access to temporary credentials, and you should be able to access the Bedrock API;
a small sketch of that follows below.
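A rough sketch of that non-Lambda variant, assuming you already have a set of temporary credentials (for example from AWS STS) to plug into the session; the credential values and region here are placeholders.

```python
import json
import boto3

# Build a session from temporary credentials (e.g. obtained via AWS STS) -- placeholders here
session = boto3.Session(
    aws_access_key_id="ASIA...",
    aws_secret_access_key="...",
    aws_session_token="...",
    region_name="us-east-1",
)

bedrock_runtime = session.client("bedrock-runtime")

body = json.dumps({
    "prompt": "\n\nHuman: Explain the three body problem.\n\nAssistant:",
    "max_tokens_to_sample": 300,
    "temperature": 0.5,
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=body,
)

print(json.loads(response["body"].read())["completion"])
```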
Talking about operational excellence, one of the things I spoke about earlier
is having good insight into your application.
We've covered how you invoke the application and how an
API-driven approach gives you versioning,
visibility into what is invoking what,
and temporary-credential best practices, etcetera.
Now let's talk about the observability you
get with Amazon Bedrock, starting with invocation logging.
Customers want to know what the invocation was, what prompt
was sent, and what kind of response came back.
You can enable invocation logging at the Bedrock level, and all of these
logs can go into Amazon S3, CloudWatch Logs, or both.
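Enabling it can also be done programmatically; here is a rough sketch using the Bedrock control-plane client, with placeholder log group, role and bucket names, and with the exact parameter shape worth confirming against the current boto3 documentation.

```python
import boto3

bedrock = boto3.client("bedrock")

# Placeholder log group, role ARN and bucket -- replace with your own resources
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/model-invocations",
            "roleArn": "arn:aws:iam::111122223333:role/BedrockLoggingRole",
        },
        "s3Config": {
            "bucketName": "my-bedrock-invocation-logs",
            "keyPrefix": "invocations/",
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": True,
        "embeddingDataDeliveryEnabled": True,
    }
)
```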
Here is a sample of the log structure. You
have the input body sent by the requester,
whether via Lambda or any other way the API was invoked;
you can see the input here is someone asking
"explain the three body problem". And here is the response,
along with the number of tokens generated: you'll notice
that because we set a maximum of 300 tokens, the response
token count is 296. For the purpose of the presentation I've
truncated the completion text,
but you would see the full response from the model here, in this case
a Claude model. This logging is
available directly within CloudWatch, and from CloudWatch
you can export it to, say, S3, or
use it for any kind of future analysis. Talking about metrics,
you have a set of metrics available out of the box in CloudWatch
with Amazon Bedrock: the number of invocations, the latency,
any client-side and server-side errors, any throttling,
and of course the input and output token counts,
which you saw a sample of in the previous log structure.
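As an illustration, pulling one of those out-of-the-box numbers with boto3 might look like the sketch below; it assumes the AWS/Bedrock CloudWatch namespace and the Invocations metric with a ModelId dimension, which are worth confirming in the CloudWatch console for your account.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Sum of invocations for one model over the last 24 hours, in one-hour buckets
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-v2"}],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```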
Now, talking about model evaluation: Bedrock currently has,
in preview I believe, a way for you to evaluate
models. Models can be evaluated for robustness,
toxicity and accuracy.
On the AWS console,
you can evaluate a model using recommended metrics.
There's an automated evaluation, but you can also choose
what kind of task you're evaluating for. For example,
this particular screenshot from the AWS console shows
evaluating a question-and-answer scenario on Amazon Bedrock;
we are using the Anthropic Claude model, and
these were the scores we received for the
accuracy and toxicity it was evaluated
against. You can also bring your own prompt dataset
or use built-in, curated prompt datasets for this purpose.
So those are some of the observability and insight
capabilities you can use when you
are thinking about Bedrock as your single API for different
foundation models. And finally, we want to talk about
guardrails. As we talk about generative
AI, there are different challenges around undesirable
or off-topic responses, controversial queries or
responses, toxicity,
privacy protection, bias and stereotype propagation,
and all of those things. So as we talk about
these new challenges, you also want to talk about what
kind of guardrails you will apply to your models.
One open source solution you have is NVIDIA NeMo
Guardrails. This is basically for building
trustworthy, safe and secure LLM applications. You define
guardrails, or rails, to guide and safeguard
the conversation, and you can
also define the behavior of your LLM-based application
for specific topics and prevent it from engaging in
discussions on unwanted topics. You can also
connect different models using LangChain and
other services. It's kind of a shim
layer sitting between your application and the LLM it
is going to invoke. Here you can define all your
programmable guardrails and
steer your LLMs to follow a predefined
conversation path and enforce standard operating
procedures. These kinds of standard operating procedures
are a core part
of building an operational excellence practice,
especially when you are building out LLM applications. These are some
of the same points I have mentioned, and you
can have a look at GitHub, I'll give you
a link towards the end of the session as well,
where Amazon Bedrock samples
are available; you can take a look at how
the guardrails have been incorporated. It is basically driven by a config
YAML file. You define
input rails, which are applied to
the inputs from the user and can reject the input or
stop any additional processing. Then you have the dialog
rails, which influence how the LLM is prompted
and operate on canonical form messages. You have the
retrieval rails, which are applied to the retrieved chunks; in a
RAG scenario, a retrieval rail can reject
a chunk or prevent it from being used to prompt the LLM. You have
the execution rails, and finally you have the output rails. So those
are five different levers you can control,
and you write your configuration in a config YAML.
If you go into the GitHub repository for NeMo Guardrails,
you will find more details
there. This is just an introduction to the kinds of guardrails you can add
around your LLM invocations, so that you
ensure they are safe and you follow
responsible AI best practices when using LLMs.
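As a quick illustration of how the library is typically wired into an application, here is a rough sketch; it assumes a ./config directory holding your config.yml and rail definitions, and the exact configuration layout should be checked against the NeMo Guardrails documentation.

```python
from nemoguardrails import LLMRails, RailsConfig

# Load the rails configuration (config.yml plus any Colang rail definitions) from ./config
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# The rails sit between your application and the LLM: inputs and outputs pass through them
response = rails.generate(messages=[
    {"role": "user", "content": "Explain the three body problem."}
])
print(response["content"])
```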
And finally, this is how it would look when you use it with Amazon
Bedrock: you have a central layer holding all of your guardrails,
with the invoker on one side and the Bedrock model
on the other, and the NeMo guardrails apply at that central layer.
That is the shim layer sitting between the LLM
exposed via Bedrock and the invoker
calling it. Finally, here is
the GitHub repository where you can find the Amazon Bedrock workshop,
along with a screenshot of its UI. That
closes out all the topics I wanted to cover for this session:
what Bedrock is and what
it offers, how you invoke Bedrock, what kind
of observability you get out of the box, and finally the guardrails
you can apply around Bedrock. Hope this helps.
Thank you so much for your time. It's been a pleasure.