Transcript
Hi everyone. I'm Gayathri Shivraj, and I'm honored to be a speaker at Conf 42. I'm a senior program manager at Amazon in fulfillment services. I primarily focus on program and product excellence to provide a best-in-class seller experience by optimizing storage and fulfillment capabilities.
Large language models are a big part of the products we build as we
deal with large datasets of seller communication across
different modalities worldwide.
Before we dive into the details, let's take a quick look at the agenda
for today's presentation. We have a lot of ground to cover,
and I want to ensure we have a structured approach to understanding how
large language models can be leveraged for advanced AI applications.
We will start with an introduction to large language models,
or LLMs. This section will provide a foundational
understanding of what LLMs are, their significance in the field of AI,
and why they have become so prominent in the recent years.
Next, we will delve into the architecture of LLMs.
We will explore how these models are built, the underlying technologies
that power them, and the key components that make them effective at processing
and generating human like text.
Next, we will talk about methods for leveraging LLMs.
In this section, I will discuss how to use LLMs
effectively by leveraging APIs and interactive playgrounds.
I'll explain how to deploy these models for production use cases,
ensuring scalability and reliability.
Additionally, we will cover how to customize LLMs to
meet specific needs and how to deploy these customized versions,
and how to create and use effective prompts to get
the best results from LLMs.
Next, the limitations of using LLMs. While LLMs are powerful, they come with their own set of constraints and challenges.
We will cover the limitations, potential pitfalls, and ethical considerations
when deploying these models in real world scenarios.
And finally, we will look at some real world success stories. I will
share case studies and examples of how organizations,
including Amazon, have successfully implemented LLMs
to solve complex problems, improve efficiency,
and enhance customer experiences.
Let's take a closer look at what LLMs are, their key
components, capabilities, and applications across various industries.
What are LLMs? Large language models are advanced
AI models trained on extensive datasets to understand
and generate human like language. These models are
designed to perform a wide range of language related tasks,
making them incredibly versatile and powerful tools in the
field of AI. The key components of LLMs are as follows. First, the transformer architecture: at the heart of LLMs is the transformer architecture, which allows the model to handle long-range dependencies in text, making it possible to generate coherent and contextually relevant responses.
Next, pre-trained parameters: LLMs come with millions, billions, and sometimes trillions of pre-trained parameters. These parameters are learned from vast amounts of text data, enabling the model to understand language nuances and context. Finally, fine tuning: after pre-training, LLMs can be fine tuned on specific datasets to adapt to particular tasks or domains. This fine tuning process tailors the model's capabilities to meet specific needs more effectively.
Now, the capabilities of LLMs. First, content generation and comprehension: LLMs excel at text generation, allowing them to create human-like text based on given prompts. They can also perform question answering, providing relevant and accurate responses to user queries. Second, language processing: these models are capable of language translation and summarization, breaking down language barriers and condensing information into more digestible formats. Third, analysis and recognition: LLMs can analyze sentiments, classify text, and recognize named entities, making them useful for tasks such as sentiment analysis, text classification, and named entity recognition.
Next, applications across industries. In software development, LLMs facilitate code summarization, natural code search, and automated documentation generation. These capabilities enhance developer productivity and improve code understanding. In learning, LLMs can serve as education tools for learning programming languages. They provide personalized feedback and tutoring to aspiring developers and support the creation of interactive coding exercises and adaptive learning platforms.
Thank you, Gayathri. Hello, I'm Satya, and thank you for the opportunity to speak at Conf 42.
I am a senior engineer at Amazon in the brand protection organization.
In my role, my team and I build systems to protect the integrity of
our website by monitoring and preventing infringements and counterfeits.
We focus on preventing the misuse of IP of brands,
ensuring that our customers can shop with confidence. In building these systems, we leverage multiple LLMs and multimodal LLMs to accomplish this goal.
Let's quickly delve into the architecture of transformers.
The transformers architecture represents a significant breakthrough
in the field of natural language processing and serves as a backbone
for many state of the art LLMs.
The transformer architecture is a game changer in the field of LLMs
because it overcomes the limitations of previous architectures like
RNNs and LSTM networks. The problem with RNNs is that they tend to forget important information from earlier in a sequence because they process words one by one, making them very slow and less accurate for very long texts. LSTMs improve memory retention,
but they are still slow since they also handle words
sequentially. Some of the key components of the transformer architecture are the self-attention mechanism, positional encodings, feed-forward neural networks, the encoder-decoder structure, multi-head attention, layer normalization, and residual connections. The fundamental parts are basically the encoder and the decoder. And it all started when a paper was released in 2017 with the title "Attention Is All You Need," from Vaswani and others. Going into the architecture: first, the self-attention mechanism. Think of the self-attention mechanism as a way for the model to look at all the words in a sentence and decide which ones are most important.
For example, in the sentence "The cat sat on the mat," the word "cat" may pay more attention to the words "sat" and "mat" because they are closely related.
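To make that idea concrete, here is a minimal sketch of scaled dot-product self-attention, assuming NumPy; the sentence, embedding size, and random vectors are illustrative assumptions, not real model internals, and the learned query/key/value projections of an actual transformer are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of token vectors X (seq_len x d)."""
    d = X.shape[-1]
    # For simplicity, X itself is used as queries, keys, and values
    # (a real transformer applies learned projection matrices W_q, W_k, W_v).
    scores = X @ X.T / np.sqrt(d)        # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)   # each row sums to 1: an attention distribution per token
    return weights @ X, weights          # weighted mix of token vectors, plus the weights

# Toy example: 6 tokens ("the cat sat on the mat"), each a random 8-dim embedding.
np.random.seed(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
X = np.random.randn(len(tokens), 8)
_, attn = self_attention(X)
print(np.round(attn[1], 2))  # attention weights of "cat" over all six tokens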
Now, the second component of the transformer architecture is positional encoding. Because transformers don't naturally understand the order in which words appear, positional encoding helps the model know the position of each word in the sentence.
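As a hedged illustration, here is the sinusoidal positional encoding used in the original transformer paper; the sequence length and embedding size below are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: each position gets a unique pattern of sines and cosines."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions use cosine
    return pe

# These encodings are simply added to the word embeddings before the first transformer layer.
pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8)
```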
Then come the feed-forward neural networks. After applying self-attention and positional encoding, the processed tokens are passed through the feed-forward neural networks within each layer of the transformer. These feed-forward neural networks consist of multiple fully connected layers with nonlinear activation functions, for example ReLU, and they enable the model to learn complex patterns and representations from the input data.
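Here is a minimal sketch of that position-wise feed-forward sublayer, assuming PyTorch; the 4x hidden expansion follows a common convention and is an assumption here, not something stated in the talk.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network applied independently to each token."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # expand
            nn.ReLU(),                     # nonlinear activation
            nn.Linear(d_hidden, d_model),  # project back to the model dimension
        )

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        return self.net(x)

ffn = FeedForward()
print(ffn(torch.randn(1, 6, 512)).shape)   # torch.Size([1, 6, 512])
```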
The next part is the encoder-decoder. The encoder component is responsible for processing the input sequence, while the decoder generates the output sequence.
The next one is multi-head attention. To enhance the learning capabilities of the LLM and capture different types of information, transformers typically employ multi-head attention mechanisms.
This feature allows the model to focus on different parts of the input sentence simultaneously, enhancing its understanding of the text. For example, in a translation task, one part of the model might focus on nouns, while another part focuses on verbs. These points of focus can be referred to as heads.
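As a sketch, PyTorch's built-in multi-head attention module illustrates several attention heads running in parallel; the dimensions and head count below are example values.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8               # 8 heads, each attending over a 64-dim slice
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, 6, d_model)            # one sentence of 6 tokens
out, attn_weights = mha(x, x, x)          # self-attention: queries, keys, and values all come from x
print(out.shape)                          # torch.Size([1, 6, 512])
print(attn_weights.shape)                 # attention averaged over heads: (1, 6, 6)
```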
Then the last part is layer normalization and residual connections, which stabilize the training process and ensure that the model learns efficiently by allowing information to flow smoothly between the layers. The layer normalization technique is used to normalize the activations within each layer.
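Putting these pieces together, here is a hedged sketch of one encoder block, assuming PyTorch, showing how residual connections and layer normalization wrap the attention and feed-forward sublayers; the post-norm ordering and the sizes are illustrative choices.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder layer: attention and feed-forward, each with residual + layer norm."""
    def __init__(self, d_model=512, num_heads=8, d_hidden=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual connection, then layer normalization
        x = self.norm2(x + self.ffn(x))     # same pattern around the feed-forward sublayer
        return x

block = EncoderBlock()
print(block(torch.randn(1, 6, 512)).shape)  # torch.Size([1, 6, 512])
```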
Pre trained parameters are the numerical values associated
with the connections between neurons in the neural network architecture
of an LLM. These parameters represent the learned knowledge and
patterns extracted from the training data during the pre training
phase. As the model processes the input text,
it adjusts its parameters, that is, weights and biases,
to minimize a predefined loss function, such as cross entropy
loss, applied to the pre-training objective. The components of pre-trained parameters are: word embeddings, parameters that represent the initial numerical representations of words or subwords in the vocabulary (word embeddings capture semantic similarities between words based on their contextual usage in the training data); transformer layers, parameters associated with the multiple layers of the transformer architecture used in LLMs, including the self-attention mechanisms and feed-forward neural networks; and output layer parameters, which map the final hidden states of the model to predictions for specific tasks such as classification and text generation.
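As a small sketch of these ideas, the Hugging Face transformers library lets you load a pre-trained model and inspect its word embeddings and parameter count; GPT-2 is used here only as a convenient public example, not a model mentioned in the talk.

```python
# pip install transformers torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("gpt2")          # loads pre-trained weights and biases
tokenizer = AutoTokenizer.from_pretrained("gpt2")

embeddings = model.get_input_embeddings()          # the word (token) embedding matrix
print(embeddings.weight.shape)                     # (vocab_size, hidden_size)

total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} pre-trained parameters")  # roughly 124M for GPT-2 small
```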
Fine tuning is the process of adjusting the parameters
of a pre trained large language model to a specific task or
domain. Although pre trained language models possess vast
language knowledge, they lack specialization in specific areas.
Fine tuning addresses this limitation by allowing the model to learn from domain-specific data so that it is more accurate for targeted applications. Some of the commonly used fine tuning techniques are: hyperparameter tuning, a simple approach that involves manually adjusting the model hyperparameters, such as the learning rate, batch size, and number of epochs, until you achieve the desired performance.
One or few shot learning enables a model to
adapt to a new task with little task specific data.
In this technique, the model is given one or few examples
during inference time to learn a new task.
The idea behind this approach is to guide the model's predictions by
providing context and examples directly in the prompt.
This approach is beneficial when task-specific labeled data is scarce or expensive.
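Here is a hedged illustration of the one/few-shot idea as a plain prompt string; the labels and review texts are made up for the example, and no weights are updated.

```python
# Few-shot prompt: two labeled examples guide the model before the real query.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The delivery was fast and the product works great."
Sentiment: Positive

Review: "The item arrived broken and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# Sent to any LLM at inference time; the in-prompt examples alone
# steer the prediction toward "Positive" for the last review.
print(prompt)
```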
Then, domain adaptation. Domain adaptation is particularly
valuable when you want to optimize the model's performance for a single
well defined task, ensuring that the model excels
in generating task specific content with precision and accuracy.
Now let's look at how we can leverage large language models in our day-to-day life. For developers, you have multiple ways to leverage LLMs, starting from directly using their APIs or playgrounds with standalone models, to developing your own customized models for your own domain or use case and deploying them in your own custom environments or hosts.
The first and easiest way to interact with LLMs is by using playgrounds or direct API integration. Here is one such example where you can use AWS Bedrock to load the Anthropic Claude v2 model and then invoke the model with a specific prompt.
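The slide's code is not captured in the transcript, so here is a hedged reconstruction using boto3's bedrock-runtime client with the Claude v2 model ID; the region, prompt, and parameter values are illustrative, and the request body follows Anthropic's text-completions format on Bedrock.

```python
import json
import boto3

# Assumes AWS credentials and Bedrock model access are already set up.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: Summarize the benefits of large language models in two sentences.\n\nAssistant:",
    "max_tokens_to_sample": 300,
    "temperature": 0.5,
})

response = client.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=body,
)

result = json.loads(response["body"].read())
print(result["completion"])
```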
There are other UIs or playgrounds that don't need any coding, and you would be able to interact with those. Everyone knows about ChatGPT, and AWS Bedrock also has a playground where you can basically give your prompts and get responses. Some of the models that are supported by Bedrock are listed here. You have Jurassic, Titan, Command, Llama, and Mistral for text generation. Then for image generation there are Titan Image Generator and Stable Diffusion from Stability AI. There are multimodal models as well on Bedrock, like Claude 3 Haiku and Claude 3 Sonnet. These two are the popular ones, and there is another one from Anthropic, which is Claude 3 Opus. And apart from this, for similarity search based use cases, you have a couple of embedding based models on Bedrock as well.
This is an example of a screen taken from AWS Bedrock. Here is a playground based on Claude. This is one of the prompts that you can use, and you can tune some of the parameters here to get your inference response. Here is a question and here is the answer. You can format the question in such a way that your answer is well structured. We'll get into this in the prompt engineering section that will be explained by Gayathri in the upcoming slides. One of the easier ways that I have discussed is using playgrounds or APIs. You can use Hugging Face for it. Let me show a quick demo of how you can use Hugging Face. Here is the model hub for Hugging Face. You can see that there are a bunch of models listed on Hugging Face. Let's search for Mistral.
Let's go with the Mistral 7B Instruct model. Here is an example of the playground that they have. This is a serverless one. It's free and can be used for experimentation. It is hosted by Hugging Face and gives you an opportunity to test out a few models before using them for production use cases. Even though this is free, they throttle you based on the API key that you provide. It cannot be used for production use cases because you will not be able to get a guarantee on availability.
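For reference, here is a hedged sketch of calling that serverless endpoint over HTTP; the token placeholder, model revision, and generation parameters are assumptions, and the URL follows the Hugging Face Inference API pattern.

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2"
headers = {"Authorization": "Bearer hf_your_token_here"}  # free-tier tokens are rate limited

payload = {
    "inputs": "[INST] Explain retrieval augmented generation in one paragraph. [/INST]",
    "parameters": {"max_new_tokens": 200, "temperature": 0.7},
}

response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
print(response.json())  # on success, a list of dicts containing "generated_text"
```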
They give you options around deploying the model as a dedicated endpoint. Here's one such option where you can deploy the model. If you want a standalone version of the model, you can deploy it on one of the cloud service providers. Here are the costs, and you can choose one of the instance types; this is a very seamless integration with the endpoint. You'll be charged based on usage per hour. There is another way where you can take control of the host as well. If you want to deploy it on your AWS SageMaker account for your service or your application, you can do that as well. They provide you the code on how to deploy it. This is one such example.
They also give options to deploy it in Azure and Google Cloud, and they also give options to train or fine tune the model.
Here is serverless inference for prototyping. This is an image segmentation kind of use case, and this is very simple code that you can use to call any model hosted on Hugging Face.
This is the code for deploying the Hugging Face model directly onto your AWS account on SageMaker, or you can choose to deploy it there. This is the same code that you get once you click this button.
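The deploy-to-SageMaker snippet Hugging Face generates looks roughly like the following; the IAM role, framework versions, instance type, and model ID are placeholders you would adjust for your own account, and the version combination shown is illustrative rather than prescriptive.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

huggingface_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",  # model pulled from the Hub at startup
        "HF_TASK": "text-generation",
    },
    role=role,
    transformers_version="4.37",   # framework versions are illustrative; use a supported combination
    pytorch_version="2.1",
    py_version="py310",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # GPU instance large enough for a 7B model
)

print(predictor.predict({"inputs": "Write a haiku about warehouses."}))
```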
This is what we have seen in the demo, and you have another option of deploying using AWS SageMaker Studio to find foundation models from Hugging Face or from other repositories and then deploy them on a SageMaker instance, train them, or evaluate and compare them with other models. They have good tools to do that.
You can also deploy a version customized or fine tuned for your domain. AWS Bedrock offers you easier ways to fine tune based on the foundation models that you choose. They also allow you to custom import a model, but as of now that is supported just for Mistral, Flan-T5, and Llama.
You can also write your own custom inference code, bring your own model artifacts, and deploy them on a GPU, a CPU, or AWS Inferentia chips, along with the custom images that AWS provides, and then host it yourself.
And this is a sample inference code that you can use to
deploy your own custom model.
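The sample on the slide is not in the transcript, so here is a hedged sketch of a custom inference script in the SageMaker model_fn/predict_fn style; the model class, file name, and payload shape are assumptions for illustration.

```python
# inference.py -- custom inference handlers picked up by the SageMaker inference toolkit.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def model_fn(model_dir):
    """Load the model artifacts that were packaged into model.tar.gz."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return {"model": model, "tokenizer": tokenizer}

def predict_fn(input_data, model_artifacts):
    """Run inference on a JSON payload like {"text": "..."}."""
    tokenizer = model_artifacts["tokenizer"]
    model = model_artifacts["model"]
    inputs = tokenizer(input_data["text"], return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return {"predicted_label": int(logits.argmax(dim=-1))}
```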
Now let's talk about some of the limitations of standalone LLMs.
First, LLMs can sometimes produce content that is inaccurate
or completely fabricated, known as hallucinations.
This can be problematic, especially in applications requiring precise
and reliable information. Secondly, LLMs struggle
with providing up to date information because they are trained on data
available up to a certain cutoff point.
Any developments or changes that occur after this point
won't be reflected in their responses.
Another challenge is that general purpose LLMs often
have difficulty handling domain specific queries effectively.
They might not have the specialized knowledge needed for
specific industries or fields without further customization.
Limited contextual understanding is also a concern.
LLMs may not always grasp the full context of complex
queries or conversations, leading to responses that are
off target or incomplete. Ethical and
bias issues are significant as well. These models
can sometimes produce biased or ethically questionable
outputs reflecting biases present in the training data.
Fine tuning large LLMs to improve their performance
for specific tasks requires substantial computational resources,
which can be costly and time consuming.
Lastly, handling of potentially sensitive data underscores
the importance of stringent data governance.
For the limitations discussed in the previous slide, we can use a system called RAG to reduce the problems caused by hallucinations. What is RAG? RAG, retrieval augmented generation, is basically an advanced AI approach that combines the strength of retrieval systems with generative models. It aims to enhance the capabilities of LLMs by grounding generated responses in factual information retrieved from knowledge bases. How does RAG work? It has two components, a retrieval component and a generative component. The retrieval system fetches relevant documents or pieces of information from a predefined knowledge base based on the user's query. Techniques such as keyword matching, semantic search, or vector based retrieval ensure accurate and contextually relevant information is retrieved.
The second component is the generative component. The generative model, typically an LLM, uses the information retrieved by the retrieval component to generate coherent and contextually enriched responses. This integration allows the LLM to provide answers that are not only fluent but also contextually relevant.
Coming to the benefits of RAG, the first one is improved accuracy. By incorporating retrieved factual information, RAG significantly reduces the likelihood of generating incorrect responses. And by providing more context from the information retrieved by the retrieval component, you get more contextually relevant responses. Regarding the knowledge cutoff limitation discussed earlier, you can populate the RAG knowledge base with up-to-date information and ask queries based on that up-to-date knowledge.
And how do you implement RAG? Typically it contains four steps. The first one is selecting a knowledge database. This is a company's internal database. You can have it as a vector database, a keyword store, or anything where you can comprehensively put all the documents that are relevant to your company or domain. Then the next step is data preparation. You clean up the data, have the data structured, and choose a good storage solution where you can efficiently retrieve the data on demand. Techniques such as vector based search, or techniques to store the knowledge as embeddings, help a lot in this particular step.
There are custom solutions available in the market, like Amazon OpenSearch and Amazon DocumentDB, to store these documents. There is another system called Pinecone, which is a popular vector database. You can index all the documents that are relevant to your company's knowledge into that vector database as embeddings, using the Faiss engine and one of its storage techniques, and then use k-NN to retrieve those documents. Then the third part is developing the retrieval system. This system usually needs to be very fast, so we try to do a k-NN search on the database, or you can do a semantic search or a keyword based search. The retrieval system has to be fast and return the most relevant documents so that you can plug them into the LLM as part of its context.
The fourth is combining the retrieval responses and adding them as input to the LLM along with your question.
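Tying the four steps together, here is a hedged sketch of a minimal RAG flow using sentence-transformers embeddings and a flat Faiss index; the documents are invented, and call_llm is a hypothetical placeholder standing in for whichever Bedrock, SageMaker, or Hugging Face invocation you use.

```python
# pip install sentence-transformers faiss-cpu
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "Returns are accepted within 30 days of delivery.",
    "Standard fulfillment fees changed in February 2024.",
    "Sellers must provide proof of trademark ownership for brand registry.",
]

# Steps 1-2: prepare the knowledge base as embeddings inside a vector index.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = encoder.encode(documents)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# Step 3: fast k-NN retrieval for the user's query.
query = "How long do customers have to return an item?"
query_vector = encoder.encode([query])
_, ids = index.search(query_vector, 2)
context = "\n".join(documents[i] for i in ids[0])

# Step 4: combine the retrieved context with the question in the LLM prompt.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
# print(call_llm(prompt))  # call_llm is a placeholder for your model endpoint
print(prompt)
```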
Apart from the knowledge limitations, there are also typically cost concerns around LLMs. These LLMs are resource intensive, with high computational requirements. For example, take the Llama 3 8 billion parameter and 70 billion parameter models. You would need a minimum of a g5.12xlarge instance for the 8 billion parameter model and a larger P-family instance for the 70 billion parameter model, and you're looking at a cost of roughly $7 per hour for the 12xlarge and $37 per hour for the 24xlarge. These instances have four to eight NVIDIA GPUs of different configurations.
Second, you are looking at a high cost for maintaining these knowledge bases. If your knowledge base is huge, like the one on my team, a knowledge base of infringements of around 2 billion documents, it costs around a million dollars a year. So you should choose optimized ways of storing these documents, like the IVF flat format or IVF product quantization indexing strategies applied to the documents, because such indexing will help a lot in reducing the cost.
The third is operational maintenance costs. The maintenance of LLMs is a significant factor because you have to scale the LLM according to your needs and you have to fine tune it. The fine tuning process is also somewhat expensive, because you would need to procure more hosts for fine tuning, and then you typically run into availability issues.
Now, some of the cost reduction strategies that we can look at. If your use cases do not warrant the deployment of a fine tuned model, then you can use pre trained models and interact with them through APIs and other offerings by the cloud service providers. Typically they charge you by the request, so you don't have to bear the upfront cost of hosting the model and keeping it alive.
Then you can also leverage foundation model offerings by cloud service providers like AWS Bedrock and SageMaker. They have a good set of popular models that you can use directly without having to host them yourself. You can also optimize a large model into a smaller model by model distillation: transfer the knowledge of the larger model to a smaller model, distill that knowledge, and then have the smaller model process your requests. You can also do quantization by changing the precision of the model from FP32 to FP16, which will bring down the memory. And you can also prune the model to remove unnecessary weights or layers and probably reduce the size of the model significantly.
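As a hedged sketch of those last two ideas, assuming PyTorch, you can prune a fraction of the smallest-magnitude weights and then cast the model to FP16 to roughly halve its weight memory; the layer sizes here are arbitrary example values.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Pruning: zero out the 30% smallest-magnitude weights in the first layer.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
prune.remove(model[0], "weight")  # make the pruning permanent

# Quantization to lower precision: FP32 -> FP16 halves the memory used by the weights.
model_fp16 = model.half()

zeros = (model_fp16[0].weight == 0).float().mean()
print(f"fraction of pruned weights: {zeros:.2f}")  # roughly 0.30
```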
Then, for efficient resource utilization, you can choose to configure auto scaling, automated scale-out and scale-in based on your traffic patterns. You can batch more requests and go with asynchronous invocation where you don't need the response immediately. You can reserve some of the instances on SageMaker and other cloud service providers so that you can procure the hosts at a cheaper cost. You can cache your responses. These are some of the strategies that you can employ. Then for data management, for hosting knowledge databases or indexing solutions, you have IVF flat and IVF PQ. You can prefer these indexing techniques instead of storing the documents in HNSW format, to reduce the memory and thereby reduce your costs.
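Here is a hedged sketch of building an IVF-PQ index with Faiss instead of a flat or HNSW index; the vector count, dimension, and quantization settings are example values chosen only to show the memory/recall trade-off knobs.

```python
import faiss
import numpy as np

d, n = 128, 100_000                       # embedding dimension and number of documents
np.random.seed(0)
vectors = np.random.rand(n, d).astype("float32")

nlist, m, nbits = 1024, 16, 8             # clusters, sub-quantizers, bits per code
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer used to assign vectors to clusters
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(vectors)                      # learn cluster centroids and PQ codebooks
index.add(vectors)                        # each vector is stored as a compact m*nbits-bit code

index.nprobe = 16                         # how many clusters to scan at query time
distances, ids = index.search(vectors[:1], 5)
print(ids)                                # 5 approximate nearest neighbors of the first vector
```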
You can also use model cascading. You can deploy smaller versions of the model or low precision models at a cheaper cost as a filter, and then for those requests that come out of these smaller models, you can use a more complex model to look at some of the complex patterns. So, just like a filtering technique, you can do model cascading as well.
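As a hedged sketch of the cascading idea, a cheap model screens every request and only the suspicious ones are escalated to the expensive model; small_model_score and large_model_review are hypothetical placeholders for your own endpoints, not real APIs from the talk.

```python
def small_model_score(listing: str) -> float:
    """Placeholder: a cheap, low-precision model returning a suspicion score in [0, 1]."""
    return 0.2 if "authentic" in listing.lower() else 0.8

def large_model_review(listing: str) -> str:
    """Placeholder: an expensive LLM call used only for listings the filter escalates."""
    return f"LLM deep review of: {listing!r}"

def cascade(listing: str, threshold: float = 0.5) -> str:
    score = small_model_score(listing)       # every request hits the cheap filter first
    if score < threshold:
        return "cleared by small model"      # most traffic stops here, keeping cost low
    return large_model_review(listing)       # only suspicious cases pay for the big model

print(cascade("Brand-new authentic sneakers"))
print(cascade("N1KE shoes, huge discount"))
```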
Prompt engineering is about crafting inputs that guide the model towards the desired output. An effective prompt should contain contextual information about the task, reference text for the task, a clear and complete instruction at the end of the prompt, and, optionally, the format of the output you want. For a task like text classification, here is a good example for Anthropic Claude, where you have the description of the task, reference text for the task, and the classification labels. Another example is a question-answer based prompt: you provide the instruction, the reference text, and at the end a clear and concise question. For the text summarization task, you have the reference text and a clear instruction to summarize it in the format you choose. For code generation, a clear instruction on what you want, and the specific programming language that you need the code to be in.
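To illustrate the structure just described, here is a hedged example of a text classification prompt with the task description, reference text, labels, and an output format hint; the customer message and label set are invented for the example.

```python
prompt = """You are classifying customer messages for a seller support team.

Task: Assign exactly one of these labels: Shipping, Billing, Product Quality, Other.

Reference text:
"I was charged twice for the same order last week and still have not received a refund."

Instruction: Respond with only the label name, nothing else.
Label:"""
print(prompt)
```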
Large language models offer a myriad of applications for
both software engineers and tech professionals. Let's explore
some of these practical uses. As a software engineer,
automated code generation can significantly speed up development by
handling repetitive tasks and providing code suggestions.
For instance, GitHub copilot can generate code snippets
based on comments. LLMs assist
in code review and debugging by identifying potential bugs and suggesting fixes, similar to tools like DeepCode and CodeGuru. Generating documentation becomes easier with LLMs,
which can create detailed doc strings, readme files,
and API documentation from the code base.
Natural language interfaces allow for more intuitive software
interactions, enabling users to perform tasks using chatbots
or voice assistants. As a tech professional,
technical support is enhanced with AI driven chatbots that
provide first level support, reducing the burden on human teams
and improving response times.
LLMs can analyze data, generate reports, and extract insights
from textual data, aiding in decision making and strategy
formulation. Content creation for marketing documentation
and internal communications can be automated,
streamlining workflows and ensuring consistency.
Training programs powered by LLMs offer personalized learning experiences,
making knowledge sharing more efficient and interactive.
In conclusion, by leveraging LLMs, both software engineers
and tech professionals can enhance productivity,
improve efficiency and innovate in their respective fields.
Coming to how we do it in our brand protection organization and how we leverage LLMs: we leverage LLMs for trademark and copyright violation detection. We analyze the brand names, logos, and other intellectual property on product listings, and we try to identify the brands to whom the trademarks belong. We have a corpus of trademarks and copyrights belonging to the brands, trademarks for around 100k brands and copyrights and logos for another 50,000 brands.
For counterfeit detection, we use LLMs to recognize subtle differences between genuine and fake product listings. The LLMs are also very helpful in detecting obfuscations, like people who use N1KE instead of Nike, and for analyzing seller behavior. Here are some of the real world examples that we have on our site, the last one being ours. The first three are public now.
I guess everybody now can see review highlights on the product listings page of Amazon, where you see a summary of what customers say. Then there is an early access offering for sellers when they create listings on Amazon: the LLMs can generate content based on a very small description of the product that you are selling, and they can fill the gaps or add more details about the product. Then Amazon Pharmacy started using LLMs recently to answer questions more quickly, because the LLMs can now look at the whole corpus of internal wikis and provide more information on the drugs much more quickly. Then, in our space, we have reduced human audits for detecting infringements by 80% for famous brands like Apple, et cetera.
For hard to find copyright violations, we run the LLM on around 1 million products a day, and the final output coming out of the LLMs that is flagged for a deeper look is around 20% of that 1 million, so around 200k.
And finally, thank you for this opportunity.