Conf42 Observability 2024 - Online

Leveraging Large Language Models for Advanced AI Applications: A Comprehensive Guide

Abstract

Unveil how Large Language Models (LLMs) like GPT and Turing NLG have transformed trademark identification, reducing audits by 80% and doubling productivity in content creation. Explore their impact across industries, with a 50% boost in material recovery and a 40% improvement in customer service.

Summary

  • Gayathri Shivaraj is a senior program manager at Amazon. She focuses on program and product excellence to provide a best-in-class seller experience. Large language models are a big part of the products her team builds. She is a speaker at Conf42.
  • Large language models can be leveraged for advanced AI applications. While LLMs are powerful, they come with their own set of constraints and challenges. We will cover the limitations, potential pitfalls, and ethical considerations when deploying these models in real-world scenarios. Finally, we will look at some real-world success stories.
  • Large language models are advanced AI models trained on extensive datasets to understand and generate human-like language. These models are designed to perform a wide range of language-related tasks, making them incredibly versatile and powerful tools. Let's take a closer look at what LLMs are, their key components, capabilities, and applications across various industries.
  • Satya is a senior engineer at Amazon in the brand protection organization. The transformer architecture represents a significant breakthrough in the field of natural language processing. It overcomes the limitations of previous architectures like RNNs and LSTM networks. Fine-tuning is the process of adjusting the parameters of a pre-trained large language model.
  • Developers have multiple ways to leverage LLMs, from directly using their APIs or playgrounds with standalone models to developing and deploying customized models for their own domain or use case. The easiest way to interact with LLMs is through playgrounds or direct API integration.
  • You can use Hugging Face for experimentation. It gives you an opportunity to test out a few models before using them for production use cases. It offers options for deploying a model as a dedicated endpoint, as well as for training or fine-tuning a model.
  • RAG is an advanced AI approach that combines the strengths of retrieval systems with generative models. It aims to enhance the capabilities of LLMs by grounding generated responses in factual information retrieved from knowledge bases. The benefits of RAG include improved accuracy and contextual relevance.
  • We leverage LLMs for trademark and copyright violation detection. We analyze the brand names, logos, and other intellectual property on product listings. For hard-to-find copyright violations, we run the LLM on around 1 million products a day. And finally, thank you for this opportunity.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. I'm Gayathri Shivaraj, and I'm honored to be a speaker at Conf42. I'm a senior program manager at Amazon. In fulfillment services, I primarily focus on program and product excellence to provide a best-in-class seller experience by optimizing storage and fulfillment capabilities. Large language models are a big part of the products we build, as we deal with large datasets of seller communication across different modalities worldwide. Before we dive into the details, let's take a quick look at the agenda for today's presentation. We have a lot of ground to cover, and I want to ensure we have a structured approach to understanding how large language models can be leveraged for advanced AI applications. We will start with an introduction to large language models, or LLMs. This section will provide a foundational understanding of what LLMs are, their significance in the field of AI, and why they have become so prominent in recent years. Next, we will delve into the architecture of LLMs. We will explore how these models are built, the underlying technologies that power them, and the key components that make them effective at processing and generating human-like text. Next, we will talk about methods for leveraging LLMs. In this section, I will discuss how to use LLMs effectively by leveraging APIs and interactive playgrounds. I'll explain how to deploy these models for production use cases, ensuring scalability and reliability. Additionally, we will cover how to customize LLMs to meet specific needs, how to deploy these customized versions, and how to create and use effective prompts to get the best results from LLMs. Next, the limitations of using LLMs. While LLMs are powerful, they come with their own set of constraints and challenges. We will cover the limitations, potential pitfalls, and ethical considerations when deploying these models in real-world scenarios. And finally, we will look at some real-world success stories. I will share case studies and examples of how organizations, including Amazon, have successfully implemented LLMs to solve complex problems, improve efficiency, and enhance customer experiences.

Let's take a closer look at what LLMs are, their key components, capabilities, and applications across various industries. What are LLMs? Large language models are advanced AI models trained on extensive datasets to understand and generate human-like language. These models are designed to perform a wide range of language-related tasks, making them incredibly versatile and powerful tools in the field of AI. The key components of LLMs are the transformer architecture, pre-trained parameters, and fine-tuning. At the heart of LLMs is the transformer architecture. This architecture allows the model to handle long-range dependencies in text, making it possible to generate coherent and contextually relevant responses. LLMs come with millions, billions, and sometimes trillions of pre-trained parameters. These parameters are learned from vast amounts of text data, enabling the model to understand language nuances and context. Finally, after pre-training, LLMs can be fine-tuned on specific datasets to adapt to particular tasks or domains. This fine-tuning process tailors the model's capabilities to meet specific needs more effectively. Turning to the capabilities of LLMs, starting with content generation and comprehension: LLMs excel at text generation, allowing them to create human-like text based on given prompts.
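As a rough illustration of that text-generation capability, the Hugging Face transformers pipeline can be used as in the sketch below; the model name is only an assumption for illustration, and any instruction-tuned model available to you can be substituted.

# A minimal sketch of text generation with a pre-trained LLM (model name is an assumption).
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

prompt = "Explain in two sentences what a large language model is."
result = generator(prompt, max_new_tokens=80, do_sample=False)

# The pipeline returns a list of dicts containing the generated text.
print(result[0]["generated_text"])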
LLMs can also perform question answering, providing relevant and accurate responses to user queries. In language processing, these models are capable of language translation and summarization, breaking down language barriers and condensing information into more digestible formats. In analysis and recognition, LLMs can analyze sentiments, classify text, and recognize named entities, making them useful for tasks such as sentiment analysis, text classification, and named entity recognition. They also have applications across industries. In software development, LLMs facilitate code summarization, natural-language code search, and automated documentation generation. These capabilities enhance developer productivity and improve code understanding. In learning, LLMs can serve as education tools for learning programming languages. They provide personalized feedback and tutoring to aspiring developers and support the creation of interactive coding exercises and adaptive learning platforms. Thank you, Gayathri.

Hello, I'm Satya, and thank you for the opportunity to speak at Conf42. I am a senior engineer at Amazon in the brand protection organization. In my role, my team and I build systems to protect the integrity of our website by monitoring and preventing infringements and counterfeits. We focus on preventing the misuse of brands' intellectual property, ensuring that our customers can shop with confidence. In building these systems, we leverage multiple LLMs and multimodal LLMs to accomplish this goal. Let's quickly delve into the architecture of transformers. The transformer architecture represents a significant breakthrough in the field of natural language processing and serves as the backbone for many state-of-the-art LLMs. It is a game changer in the field of LLMs because it overcomes the limitations of previous architectures like RNNs and LSTM networks. The problem with RNNs is that they tend to forget important information from earlier in a sequence because they process words one by one, making them slow and less accurate for very long texts. LSTMs improve memory retention, but they are still slow since they also handle words sequentially. Some of the key components of the transformer architecture are the self-attention mechanism, positional encodings, feed-forward neural networks, the encoder-decoder structure, multi-head attention, layer normalization, and residual connections. The fundamental parts are the encoder and the decoder, and it all started when the paper titled "Attention Is All You Need" was released in 2017 by Vaswani and others. Going into the architecture: think of the self-attention mechanism as a way for the model to look at all the words in a sentence and decide which ones are most important. For example, in the sentence "the cat sat on the mat," the word "cat" may pay more attention to the words "sat" and "mat" because they are closely related. The second component of the transformer architecture, positional encoding, exists because transformers don't naturally understand the order in which words appear; positional encoding helps the model know the position of each word in the sentence.
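To make the self-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention; it illustrates the mechanism described above rather than the implementation of any particular LLM, and the small random matrices stand in for learned projection weights.

# Minimal sketch of scaled dot-product self-attention (illustrative only).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # x: (sequence_length, d_model) token embeddings
    q = x @ w_q          # queries
    k = x @ w_k          # keys
    v = x @ w_v          # values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                            # how much each word attends to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ v   # weighted mix of value vectors, one contextualized vector per token

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                                       # e.g. the six tokens of "the cat sat on the mat"
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)                  # (6, 16)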
Then come the feed-forward neural networks. After applying self-attention and positional encoding, the processed tokens are passed through feed-forward networks within each layer of the transformer. These consist of multiple fully connected layers with nonlinear activation functions, for example ReLU, and they enable the model to learn complex patterns and representations from the input data. The next part is the encoder-decoder structure: the encoder processes the input sequence, while the decoder generates the output sequence, for example a translation. The next one is multi-head attention. To enhance the learning capabilities of the LLM and capture different types of information, transformers typically employ multi-head attention mechanisms. This feature allows the model to focus on several parts of the input sentence simultaneously, enhancing its understanding of the text. For example, in a translation task, one part of the model might focus on nouns while another part focuses on verbs; these points of focus can be referred to as heads. The last parts are layer normalization and residual connections, which stabilize the training process and ensure that the model learns efficiently by allowing information to flow smoothly between the layers; layer normalization is the technique used to normalize the activations at each layer.

Pre-trained parameters are the numerical values associated with the connections between neurons in the neural network architecture of an LLM. These parameters represent the learned knowledge and patterns extracted from the training data during the pre-training phase. As the model processes the input text, it adjusts its parameters, that is, weights and biases, to minimize a predefined loss function, such as cross-entropy loss, applied to the pre-training objective. The components of pre-trained parameters are word embeddings, transformer layers, and the output layer. Word embeddings are the parameters that represent the initial numerical representations of words or subwords in the vocabulary; they capture semantic similarities between words based on their contextual usage in the training data. Transformer layer parameters are associated with the multiple layers of the transformer architecture used in LLMs; these layers include self-attention mechanisms and feed-forward neural networks. Output layer parameters map the final hidden states of the model to predictions for specific tasks such as classification and text generation.

Fine-tuning is the process of adapting the parameters of a pre-trained large language model to a specific task or domain. Although pre-trained language models possess vast language knowledge, they lack specialization in specific areas. Fine-tuning addresses this limitation by allowing the model to learn from domain-specific data, making it more accurate and effective for targeted applications. Some of the commonly used techniques are hyperparameter tuning and one- or few-shot learning. Hyperparameter tuning is a simple approach that involves manually adjusting model hyperparameters such as the learning rate, batch size, and number of epochs until you achieve the desired performance. One- or few-shot learning enables a model to adapt to a new task with little task-specific data. In this technique, the model is given one or a few examples at inference time to learn a new task. The idea behind this approach is to guide the model's predictions by providing context and examples directly in the prompt. This approach is beneficial when task-specific labeled data is scarce or expensive.
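As an illustration of the few-shot idea, a prompt can embed a couple of labeled examples directly in the input, as in the sketch below; the reviews and labels are invented purely for demonstration.

# A minimal few-shot prompt: the examples below are invented purely for illustration.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and it just works."
Sentiment:"""

# This string would be sent as-is to any chat or completion API;
# the model is expected to continue with "Positive" based on the examples given.
print(few_shot_prompt)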
Domain adaptation is particularly valuable when you want to optimize the model's performance for a single, well-defined task, ensuring that the model excels at generating task-specific content with precision and accuracy. Now let's look at how we can leverage large language models in our day-to-day work. As developers, you have multiple ways to leverage LLMs, ranging from directly using their APIs or playgrounds with standalone models to developing your own customized models for your own domain or use case and deploying them in your own environments or on your own hosts. The easiest way to interact with LLMs is by using playgrounds or direct API integration. One such example is using AWS Bedrock to load the Anthropic Claude v2 model and then invoke it with a specific prompt (a sketch of such a call appears below). There are other UIs and playgrounds that don't need any coding and that you can interact with directly; everyone knows about ChatGPT, and AWS Bedrock also has a playground where you can enter your prompts and get responses. Some of the models supported by Bedrock are listed here: you have Jurassic, Titan, Command, Llama, and Mistral for text generation. For image generation there is the Titan Image Generator and Stable Diffusion from Stability AI. There are multimodal models on Bedrock as well, like Claude 3 Haiku and Claude 3 Sonnet; these two are the popular ones, and there is another one from Anthropic, which is Claude 3 Opus. Apart from this, for similarity-search-based use cases you have a couple of embedding models on Bedrock as well. Here is an example of a screen taken from AWS Bedrock: a playground based on Claude. This is one of the prompts you can use, and you can tune some of the parameters here to get your inference response. Here is a question and here is the answer; you can format the question in such a way that the answer is well structured. We'll get into this in the prompt engineering section that will be explained by Gayathri in the upcoming slides.

One of the easier ways I have discussed is using playgrounds or APIs, and you can use Hugging Face for this. Let me show a quick demo of how you can use Hugging Face. Here is the Hugging Face model hub; you can see that there are a bunch of models listed. Let's search for Mistral and go with the Mistral 7B Instruct model. Here is an example of the playground they offer. This is a serverless one; it's free and can be used for experimentation. Hugging Face gives you an opportunity to test out a few models before using them for production use cases. Even though this is free, they throttle you based on the API key that you provide, and it cannot be used for production use cases because you will not get guarantees on availability. They give you options for deploying the model as a dedicated endpoint. Here's one such option where you can deploy the model: if it is a standalone model and you want a standalone version of it, you can deploy it on one of the cloud service providers. Here are the costs; you can choose one of the instance types, and this is a very seamless integration with the endpoint. You'll be charged based on usage per hour. There is another way, where you take control of the host as well.
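Before moving from hosted APIs to self-managed deployment, here is a rough sketch of the Bedrock call mentioned above, using boto3 to invoke the Anthropic Claude v2 model; the model ID, region, and request body fields follow the Claude v2 text-completion convention on Bedrock and should be treated as assumptions to verify against the current documentation.

# Sketch: invoking Anthropic Claude v2 through the AWS Bedrock runtime with boto3.
# Model ID, region, and body fields are assumptions to check against current Bedrock docs.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: Summarize what retrieval augmented generation is.\n\nAssistant:",
    "max_tokens_to_sample": 300,
    "temperature": 0.2,
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",
    body=body,
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(result.get("completion", ""))   # Claude v2 returns its text under "completion"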
If you want to deploy the model on your own AWS SageMaker account for your service or application, you can do that as well; they provide you the code for how to deploy it, and this is one such example. They also give options to deploy it on Azure and Google Cloud, and options to train or fine-tune the model. Here is serverless inference for prototyping: this is an image segmentation kind of use case, and this is very simple code that you can use to call any model hosted on Hugging Face. This is the code for deploying the Hugging Face model directly onto your AWS account on SageMaker; it is the same code you get once you click this button, and it is what we saw in the demo. You also have the option of deploying through AWS SageMaker Studio, where you can find foundation models from Hugging Face or from other repositories and then deploy them on a SageMaker instance, train them, or evaluate and compare them with other models; they have good tools for that. You can also deploy a version customized or fine-tuned for your domain. AWS Bedrock offers easier ways to fine-tune based on the foundation models that you choose. They also allow you to custom import a model, but as of now that is supported just for Mistral, Flan-T5, and Llama. You can also write your own custom inference code, bring your own model artifacts, and deploy them on a GPU, a CPU, or AWS Inferentia chips along with the custom images that AWS provides, and then host it yourself. And this is a sample inference code that you can use to deploy your own custom model.
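For reference, deploying a Hugging Face model to a SageMaker real-time endpoint with the sagemaker Python SDK typically looks something like the sketch below; the model ID, container versions, and instance type are assumptions to adjust for your account and for whichever container combinations SageMaker currently supports.

# Sketch: deploying a Hugging Face Hub model to a SageMaker real-time endpoint.
# Model ID, container versions, and instance type are assumptions; adjust to what your account supports.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()   # IAM role with SageMaker permissions

hub_config = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.2",  # model pulled from the Hugging Face Hub
    "HF_TASK": "text-generation",
}

model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.37",   # pick a combination listed for the SageMaker Hugging Face containers
    pytorch_version="2.1",
    py_version="py310",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

print(predictor.predict({"inputs": "Write a one-line product description for a water bottle."}))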
Now let's talk about some of the limitations of standalone LLMs. First, LLMs can sometimes produce content that is inaccurate or completely fabricated, known as hallucinations. This can be problematic, especially in applications requiring precise and reliable information. Second, LLMs struggle with providing up-to-date information because they are trained on data available up to a certain cutoff point; any developments or changes that occur after this point won't be reflected in their responses. Another challenge is that general-purpose LLMs often have difficulty handling domain-specific queries effectively. They might not have the specialized knowledge needed for specific industries or fields without further customization. Limited contextual understanding is also a concern: LLMs may not always grasp the full context of complex queries or conversations, leading to responses that are off target or incomplete. Ethical and bias issues are significant as well. These models can sometimes produce biased or ethically questionable outputs, reflecting biases present in the training data. Fine-tuning large LLMs to improve their performance for specific tasks requires substantial computational resources, which can be costly and time consuming. Lastly, the handling of potentially sensitive data underscores the importance of stringent data governance.

For the limitations discussed on the previous slide, we can use a system called RAG, retrieval augmented generation, to reduce the problems caused by hallucinations. What is RAG? RAG is an advanced AI approach that combines the strengths of retrieval systems with generative models. It aims to enhance the capabilities of LLMs by grounding generated responses in factual information retrieved from knowledge bases. How does RAG work? It has two components, a retrieval component and a generative component. The retrieval system fetches relevant documents or pieces of information from a predefined knowledge base based on the user's query. Techniques such as keyword matching, semantic search, or vector-based retrieval ensure that accurate and contextually relevant information is retrieved. The second component is the generative component. The generative model, typically an LLM, uses the information retrieved by the retrieval component to generate coherent and contextually enriched responses. This integration allows the LLM to provide answers that are not only fluent but also contextually relevant.

Coming to the benefits of RAG, the first one is improved accuracy: by incorporating retrieved factual information, RAG significantly reduces the likelihood of generating incorrect responses. By providing more context from the information returned by the retrieval component, you also get more contextual relevance in the responses. And for the knowledge cutoff limitation discussed earlier, you can populate the RAG knowledge base with up-to-date information and ask queries against that up-to-date knowledge. How do you implement RAG? Typically it involves four steps. The first is selecting a knowledge database. This could be a company's internal database; you can have it as a vector database, a keyword store, or anything where you can comprehensively put all the documents relevant to your company or domain. The next step is data preparation: you clean up the data, structure it, and choose a good storage solution where you can efficiently retrieve the data on demand. Techniques such as vector-based search, or storing the knowledge as embeddings, help a lot in this particular step. There are solutions available in the market like AWS OpenSearch and AWS DocumentDB to store these documents, and there is another system called Pinecone, which is a popular vector database. You can index all the documents relevant to your company's knowledge into such a vector database as embeddings, for example using the FAISS engine and its storage techniques, and then use KNN to retrieve those documents. The third part is retrieval: develop the retrieval system. We usually aim for this system to be very fast, so we typically do a KNN search on the database, or possibly a semantic search or a keyword-based search. The retrieval system has to be fast at returning the most relevant documents so that you can plug them into the LLM as part of its context. The fourth step is combining the retrieved responses and adding them as input to the LLM along with your question.
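To make these four steps concrete, here is a minimal sketch of a RAG flow using sentence-transformers for the embeddings and FAISS for the KNN index; the model name, documents, and the final LLM call are placeholders for whatever embedding model, knowledge base, and endpoint you actually use.

# Minimal RAG sketch: embed documents, retrieve nearest neighbours, build a grounded prompt.
# Model name and documents are placeholders; the final LLM call stands in for your own endpoint.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Brand X registered its logo as a trademark in 2015.",
    "Counterfeit listings often misspell brand names to evade filters.",
    "Copyright covers original creative works such as product photography.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")          # any sentence embedding model works here
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vectors.shape[1])             # inner product on normalized vectors = cosine KNN
index.add(np.asarray(doc_vectors, dtype="float32"))

query = "Why do counterfeiters alter brand spellings?"
query_vec = embedder.encode([query], normalize_embeddings=True)
_, neighbours = index.search(np.asarray(query_vec, dtype="float32"), k=2)

context = "\n".join(documents[i] for i in neighbours[0])    # step 3: retrieved context
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
# Step 4: send the grounded prompt to your LLM of choice, e.g. the Bedrock call sketched earlier.
print(prompt)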
Apart from the knowledge limitations, there are also cost concerns around LLMs. First, these models are resource intensive and have high computational requirements. For example, for the Llama 3 8 billion and 70 billion parameter models, you would need a minimum of a g5.12xlarge instance for the 8 billion parameter model and a larger P-type instance for the 70 billion parameter model, and you're looking at a cost of around $7 per hour for the 12xlarge and around $37 per hour for the 24xlarge; these instances have four to eight Nvidia GPUs of different configurations. Second, you are looking at the high cost of maintaining the knowledge bases. If your knowledge base is huge, like the one we have on my team, a knowledge base of infringements of around 2 billion documents, it costs around a million dollars a year, unless you choose optimized ways of storing these documents, such as IVF flat or IVF product quantization (IVF-PQ) indexing strategies; the right indexing helps a lot in reducing the cost. The third is operational maintenance cost. The maintenance of LLMs is a significant factor because you have to scale the LLM according to your needs and fine-tune it, and the fine-tuning process is itself somewhat expensive because you need to procure more hosts for fine-tuning and then typically run into availability issues.

Some of the cost reduction strategies we can look at: if your use cases do not warrant deployment of a fine-tuned model, you can use pre-trained models and interact with them through APIs and other offerings from the cloud service providers. Typically they charge you by the request, so you don't have to bear the upfront cost of hosting the model and keeping it alive. You can also leverage the foundation model offerings of cloud service providers like AWS Bedrock and SageMaker; they have a good set of popular models you can use directly without having to host them yourself. You can optimize a large model into a smaller one by model distillation: transfer the knowledge of the larger model to a smaller model, distill that knowledge, and then have the smaller model process your requests. You can also apply quantization by changing the precision of the model from FP32 to FP16, which brings down the memory footprint, and you can prune the model to remove unnecessary weights or layers and reduce its size significantly. For efficient resource utilization, you can configure auto scaling, with automated scale out and scale in based on your traffic patterns, batch more requests, and go with asynchronous invocation when you don't need the response immediately. You can reserve instances on SageMaker and other cloud service providers to procure hosts at a cheaper cost, and you can cache your responses. For data management, when hosting knowledge databases or indexing solutions, you can prefer indexing techniques such as IVF flat and IVF-PQ instead of storing the documents in HNSW format, reducing memory and thereby reducing your costs. You can also use model cascading: deploy smaller or lower-precision versions of the model at a cheaper cost as a filter, and then, for the requests that come out of these smaller models, use a more complex model to look at the more complex patterns. So, just like a filtering technique, you can do model cascading as well.
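As a rough illustration of model cascading, the sketch below lets a small, cheap model screen requests and escalates only uncertain cases to a larger model; invoke_small_model and invoke_large_model are hypothetical placeholders for whatever endpoints you actually deploy.

# Sketch of model cascading: a cheap model filters, a larger model handles uncertain cases.
# invoke_small_model / invoke_large_model are hypothetical placeholders for your own endpoints.
from typing import Callable, Dict

def cascade(listing_text: str,
            invoke_small_model: Callable[[str], Dict],
            invoke_large_model: Callable[[str], Dict],
            confidence_threshold: float = 0.9) -> Dict:
    """Return the small model's verdict when it is confident, otherwise escalate."""
    cheap_result = invoke_small_model(listing_text)          # e.g. a distilled or FP16 model
    if cheap_result["confidence"] >= confidence_threshold:
        return {**cheap_result, "decided_by": "small_model"}
    expensive_result = invoke_large_model(listing_text)      # e.g. a full-precision LLM
    return {**expensive_result, "decided_by": "large_model"}

# Example with stubbed-out model calls:
small = lambda text: {"label": "ok", "confidence": 0.62}
large = lambda text: {"label": "possible_infringement", "confidence": 0.97}
print(cascade("Genuine N1KE running shoes, brand new", small, large))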
Prompt engineering is about crafting inputs that guide the model toward the desired output. An effective prompt should contain contextual information about the task, reference text for the task, clear and complete instructions (with a clear instruction at the end of the prompt), and, optionally, the format you want for the output. For a task like text classification, there is a good example from Anthropic Claude where you have the description of the task, reference text for the task, and the classification labels. In another example, for a question-answering prompt, you provide the instruction and the reference text, and at the end you have a clear and concise question. For the text summarization task, you have the reference text and a clear instruction to summarize it in the format you choose. For code generation, you give a clear instruction on what you want and the specific programming language you need the code to be in.

Large language models offer a myriad of applications for both software engineers and tech professionals. Let's explore some of these practical uses. As a software engineer, automated code generation can significantly speed up development by handling repetitive tasks and providing code suggestions; for instance, GitHub Copilot can generate code snippets based on comments. LLMs assist in code review and debugging by identifying potential bugs and suggesting fixes, similar to tools like DeepCode and CodeGuru. Generating documentation becomes easier with LLMs, which can create detailed docstrings, readme files, and API documentation from the code base. Natural language interfaces allow for more intuitive software interactions, enabling users to perform tasks using chatbots or voice assistants. As a tech professional, technical support is enhanced with AI-driven chatbots that provide first-level support, reducing the burden on human teams and improving response times. LLMs can analyze data, generate reports, and extract insights from textual data, aiding in decision making and strategy formulation. Content creation for marketing, documentation, and internal communications can be automated, streamlining workflows and ensuring consistency. Training programs powered by LLMs offer personalized learning experiences, making knowledge sharing more efficient and interactive. In conclusion, by leveraging LLMs, both software engineers and tech professionals can enhance productivity, improve efficiency, and innovate in their respective fields.

Coming to how we do it in our brand protection organization: we leverage LLMs for trademark and copyright violation detection. We analyze the brand names, logos, and other intellectual property on product listings, and we try to identify the brands to whom the trademarks belong. We have a corpus of trademarks and copyrights belonging to brands, with trademarks for around 100,000 brands and copyrights and logos for another 50,000 brands. For counterfeit detection, we use LLMs to recognize subtle differences between genuine and fake product listings, and the LLMs are also very helpful in detecting obfuscations, like sellers who write "N1KE" instead of "Nike", and in analyzing seller behavior.
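Tying the earlier prompt-engineering guidance to this use case, a classification-style prompt for spotting brand-name obfuscations might look like the sketch below; the brand list, labels, and listing title are invented for illustration, and the prompt would be sent to whichever model endpoint you use.

# Sketch: a classification-style prompt for brand obfuscation detection (illustrative only).
protected_brands = ["Nike", "Apple", "Adidas"]          # hypothetical excerpt of a brand corpus

listing_title = "Brand new N1KE Air running shoes, original box"

obfuscation_prompt = f"""You are reviewing product listings for trademark obfuscation.
Protected brand names: {', '.join(protected_brands)}

Task: Decide whether the listing title below disguises one of the protected brand names
(for example by swapping letters with digits or adding punctuation).
Answer with the label OBFUSCATED or CLEAN, followed by the brand name if one is disguised.

Listing title: "{listing_title}"
Answer:"""

# Send obfuscation_prompt to your LLM endpoint (e.g. the Bedrock call sketched earlier)
# and parse the returned label before routing the listing for deeper review.
print(obfuscation_prompt)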
Here are some real-world examples that we have on our site, the last one being ours; the first three are public now. Everybody can now see review highlights on the product listing pages of Amazon, where you see a summary of what customers say. Then there is an early-access feature offered to sellers when they create listings on Amazon: the LLMs can generate content based on a very small description of the product you are selling, filling the gaps or adding more details about the product. Amazon Pharmacy also started using LLMs recently to answer questions more quickly, because the LLMs can look at the whole corpus of internal wikis and provide more information on the drugs, and much more, quickly. Then, in our space, we reduced the human audits for detecting infringements by 80% for famous brands like Apple. For hard-to-find copyright violations, we run the LLM on around 1 million products a day, and the final output from the LLMs that is flagged for a deeper look is around 20% of that 1 million, so around 200,000. And finally, thank you for this opportunity.
...

Satyanand Kale

Senior Software Development Engineer @ Amazon

Satyanand Kale's LinkedIn account

Gayathri Shivaraj

Senior Program Manager @ Amazon

Gayathri Shivaraj's LinkedIn account


