Conf42 Large Language Models (LLMs) 2024 - Online

AI Chats: What Nobody Told You - The Conundrums of Business Integration


Abstract

Explore the intricacies of integrating ChatGPT into business systems with data privacy challenges, cost implications, and the strategic analysis required for successful integration. Learn about direct API integration, cost optimization, and customizing privately hosted models to fit your resources and needs.

Summary

  • Today I am going to share with you some of the lessons learned from multiple AI chatbot projects. By the end of the presentation you will have a list of what to pay attention to. And at the end I'll briefly describe privacy issues related to LLMs.
  • Marcin will share lessons learned from some of our projects. Be aware that the entire area of GenAI is moving incredibly fast. Some of the tools I'm referring to might be outdated by the time you listen to this. After the presentation, I would be really happy to hear from you about your findings.
  • Vector embeddings and vector databases are well suited to natural language problems. The idea is that the vectors are close to each other if the texts have similar meaning. Here are some examples of when they do not work that well and what we can do.
  • So far we've been focusing on how we can improve vector search by splitting the documents. But what else can we do to improve vector search? You can use something else instead of vector search. The last technique I wanted to mention, which is actually quite simple but still quite powerful, is preprocessing using large language models.
  • Even if you provide it with very relevant information, it can still make a mistake. There is no guarantee that the information produced by the chatbot is correct. Here is one technique which you could use in your project to prevent hallucinations.
  • Another important consideration when starting a project and deciding which LLM to use is cost. Input tokens are cheaper than output tokens. Be prepared for extra effort if you want to tune a specific use case with an open-source LLM.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Today I am going to share with you some of the lessons learned from multiple AI chatbot projects where we utilized large language models, and doing that is actually quite tricky. So by the end of the presentation you will have a list of what to pay attention to: sometimes critical issues, sometimes tiny little details which are still important to the project's success. We will start with an introduction to RAG, what it is and what kind of challenges you can expect when building complex applications. Then we will talk about hallucinations, but also how we control the scope of the conversation, so if we are dealing with a customer issue, we don't start talking about the US presidential election or any other topic which is not relevant. We will also cover the cost, how to calculate it and what's important in various scenarios. And at the end I'll briefly describe privacy issues related to LLMs and the consequences of various decisions. My name is Marcin, my background is in data engineering and MLOps, and I'm running a team specialized in everything data at TantusData. At TantusData we help our customers with setting up data infrastructure, building data pipelines, and machine learning and GenAI driven applications. So during this presentation I will share lessons learned from some of our projects. And a little disclaimer before we get started: we need to be aware that the entire area of GenAI is moving incredibly fast. The models improve over time, the libraries and tools improve, some of them die. So it's really hard to keep track of all that. Because of that, be aware that some of the tools I'm referring to might be outdated by the time you listen to this. I'll try not to focus on specific tools, but more on problems, solutions, techniques and general ideas. But since there is so much going on in the area of GenAI, after the presentation I would be really happy to hear from you about your findings and your experience. So don't be shy and let's connect on LinkedIn. Okay, so let's get started. Let's think about a concrete business problem you would like to solve, and let's think about a chat which is a travel assistant on a vacation rental website. Let's say the customer comes and asks: I need an apartment in London with an elevator. How do we know what the customer is asking for? How do we come up with specific information and use that in the chat? One of the very common answers to these kinds of questions is vector embeddings and vector databases. So let's quickly define what they are and why they are good at natural language problems, but then I will show you some examples of when they do not work that well and what we can do. The promise of vector embeddings is very simple. First of all, you transform the text into a vector, and the vector represents the semantic meaning of the text. So two texts which have similar meaning will be transformed into two vectors which are also close to each other. Let's have a look at examples. This is one of the very classical examples: king and queen are somewhat the same role, you can say, and the only difference is gender. So in the perfect vector space, the distance between king and queen should be the same as between man and woman, and you should even be able to do this kind of math, like queen equals king minus man plus woman. And this is another example: the words red, orange and yellow represent colors, so they are close to each other; king and queen are also close to each other; and car is somewhere completely else.
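A minimal sketch of that idea, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model purely as examples (not tools used in any particular project): related words end up with nearby vectors. Note that the king/queen arithmetic is only approximate with modern sentence embedding models.

```python
# Toy illustration: words with related meanings get nearby vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["king", "queen", "man", "woman", "red", "orange", "car"]
vec = dict(zip(words, model.encode(words, normalize_embeddings=True)))

def cos(a, b):
    # Cosine similarity; vectors are already normalized, so a plain dot product.
    return float(np.dot(a, b))

print(cos(vec["red"], vec["orange"]))   # colors: relatively high
print(cos(vec["king"], vec["queen"]))   # related roles: relatively high
print(cos(vec["king"], vec["car"]))     # unrelated: lower

# The classic analogy, approximately: king - man + woman ≈ queen.
analogy = vec["king"] - vec["man"] + vec["woman"]
print(cos(analogy / np.linalg.norm(analogy), vec["queen"]))
```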
These are very flat, simplified dummy examples of vector embeddings, because in reality the vectors have hundreds or thousands of dimensions. But the idea, the promise of embeddings, is that the vectors are close to each other if the texts have similar meaning. So it's not a surprise that for searching information needed by an LLM in a chatbot, we likely want to try a vector database. The super basic idea is that you transform your documents into vectors, you store them in a vector database, and you serve the relevant documents to the LLM. A more general version of this diagram is the one where we provide an LLM with access to our documents, databases, APIs, basically everything needed in order to understand our domain information. This technique is called RAG, which stands for retrieval-augmented generation. But once again, why are we doing this? We need to remember that the main ability of an LLM is not really the knowledge it comes with, but the ability to work with texts written in natural language, and the ability to follow instructions related to these texts. So an LLM has a chance to know only about the information it was trained on, and we need to provide it with our specific domain knowledge. Let's get back to our example, our business problem. How do we use that technique? How do we use vector databases for our "I need an apartment with elevator in London" query? If we have our apartment descriptions in the vector database, we could just check if the vector representing our query is close to any of our property descriptions. What we hope for is that the "I need an apartment with elevator in London" vector will be close to a property described as "apartment with elevator in London" and not that close to one described as "apartment in London", with no elevator mentioned. But once again, this example is way too simplistic. It is a perfectly valid technique, and it describes what vector embeddings and vector databases can be used for, but I would like to focus on the challenges which you might face. So first of all, the apartment descriptions will never be just a single sentence. They will look more like this. This is an example cottage in Cornwall, western England, and it's not that expensive. We have one more very similar one. So you can see the descriptions are quite long and you have some extra information about them. But we also have some completely different properties. We have another one in London. It's not a cottage anymore, it's an apartment that is much more expensive. So a completely different type of property, and one more similar one. And then if we take all four properties I have just shown, get the descriptions and take the vector embeddings, we end up with something like this. I do understand that this is not very readable, but it actually represents very well the problem an engineer is struggling with. We have four different properties, but in the world of vectors they are very close to each other, simply because the wording used in the descriptions is very similar. They are very far away from the word banana, very far away from some sentence about the constitution, but it doesn't really help us, because the property descriptions themselves are very close to each other. So it's very hard to distinguish what is what, and very hard to make a good proposal for the customer.
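To make that concrete, here is a minimal sketch of the problem, again assuming sentence-transformers as a stand-in embedding model and made-up listing texts: the pairwise similarities between the long, similarly worded descriptions all come out high, so the query cannot cleanly separate them.

```python
# Long, similarly worded listings end up with vectors that are all close
# to each other, which is exactly what makes retrieval hard here.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

listings = [
    "Charming cottage in Cornwall, western England, two bedrooms, garden",
    "Cosy cottage in Cornwall, western England, sea view, two bedrooms",
    "Modern apartment in central London, two bedrooms, lift access",
    "Spacious apartment in London, close to the underground, two bedrooms",
]
vectors = model.encode(listings, normalize_embeddings=True)

# Pairwise cosine similarities: the listings cluster tightly together.
print(np.round(vectors @ vectors.T, 2))

# The query lands close to all of them, not just the relevant one.
query = model.encode(["I need an apartment in London with elevator"],
                     normalize_embeddings=True)[0]
print(np.round(vectors @ query, 2))
```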
And the reason for that is that if we are using general vector embeddings, they are good at general language, but they are not specialized in a specific domain, and specialization in a specific domain is usually what we want. So maybe this will come as a surprise, but magic does not exist. There is no such thing as a silver bullet. Vector databases are useful, but you need to test what will work for you. Maybe you will need to fine-tune the embeddings so they are specialized to the domain. For sure, you will need to split long documents because of the context length limitation of vector embedding models: they can accept text only up to a specific limit, a few hundred tokens, up to a few thousand at most. But even if you can fit a long text in the embedding model, it does not mean the longer text will work better for you. This is something to be tested. One of the secret ingredients making your chat better is splitting the documents you are working with into digestible chunks, that's for sure. But if I just tell you, get the document, get the apartment description, get a PDF document, and split it into chunks, it will be a bit too simplistic. It will be somewhat like saying, just draw two circles, then complete the owl. Something is missing. So what do we do, what do we pay attention to when splitting documents? Let's have a look at some of the solutions. When you split the document, what will matter for sure is the size. And all I can say for sure is that a very big chunk will not work very well, and it's kind of intuitive: the vector size is static, and if you try to squeeze too much information into it, you will lose some of it. But other than that, when you split the document, you need to know something about the context. A good example would be a large PDF: having just a chunk of it without knowing which chapter or which section it comes from will not be very helpful. That's why it's important to keep the relevant information as part of the chunk or as part of the metadata. And if you Google "what if the data is too large for LLM context", or if you just scan the QR code, you will get to one of our articles describing these kinds of problems. We also describe there a mechanism called the self-query retriever, and it's super useful in situations when you have a granular split with all the details necessary, but the vector similarity of multiple chunks is still too close and it's hard to distinguish which one is the best in a given situation. In such cases it's good to try this mechanism. It's basically a tool in LangChain which allows us to come up with a structured query for specific attributes you predefined. So let's say from a PDF chunk you will extract a price or an offer name. If you predefined them, you can have another LLM call for better understanding of the values of these attributes, so you can make a better decision about what answer to present to the user. It's very useful, and I recommend you to read up on that. But let's move on. One more disclaimer about the PDF files I mentioned: a lot will depend on the format and how exactly you parse the PDF. Sometimes you need to just find a specific parser for a specific document, but sometimes maybe it's worth looking around. Maybe you have a chance to get the data you need from a source which has structure just better than the PDF file. Maybe the same data exists in a better format.
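A minimal sketch of that kind of splitting, assuming LangChain's RecursiveCharacterTextSplitter (the import path has moved between LangChain releases, so treat this as an illustration rather than a pinned API): each chunk carries the listing id and section in its metadata, so a retrieved chunk can be traced back to its context. The file name and property id are hypothetical.

```python
# Split a long listing/PDF text into digestible chunks, keeping context
# (which property, which section) in metadata rather than losing it.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

long_description = open("cornwall_cottage.txt").read()  # hypothetical source file
chunks = splitter.create_documents(
    [long_description],
    metadatas=[{"property_id": "cornwall-cottage-01", "section": "description"}],
)

for chunk in chunks:
    # Every chunk keeps the metadata, so search results stay traceable.
    print(chunk.metadata, chunk.page_content[:60])
```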
So far we've been focusing on how we can improve the vector search by splitting the documents. But what else can we do in order to improve the vector search? You can use something else instead of vector search. I just wanted to say that vector databases are very popular, they are growing, and they are very natural to be used in the context of natural language processing. But just the fact that they are popular, just the fact that they are very much connected with LLMs, does not mean this is the only tool you can use. So for instance, if you have an Elasticsearch, or if you have some search API in your company, there is really no reason not to use it, not to try it, if it can provide you with relevant info. And at the same time, most of the vector databases come not only with vector search ability, but with hybrid search ability. So on top of vector search, you can enable more traditional keyword search, for example BM25, and you can verify which results are better. Maybe you can mix them together, maybe you can use both results. And once you mix them together, once you utilize data from multiple search methods, what you can do is re-rank the responses you received. So in many of the cases we have implemented, we realized that it makes a lot of sense to blend multiple sources, multiple results, together. And what you can consider, besides vector databases, is data coming directly from a backend database, from a data lake, from a data warehouse, from internal APIs, but also from external APIs like panel data, or from Google search. And then on top of a quite aggressive query, which provides us with many results, what we do is re-rank and select the best candidates, the candidates which are the most promising, so the chatbot can utilize the information from the most promising ones in coming up with the most relevant answer.
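As a minimal sketch of that blending idea in plain Python, assuming the rank_bm25 package for the keyword side and sentence-transformers for the vector side (most vector databases expose a hybrid mode natively, so this only shows the concept):

```python
# Hybrid search: blend BM25 keyword scores with vector similarity,
# then pick the best candidates (a naive stand-in for proper re-ranking).
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Apartment with elevator in London",
    "Apartment in London",
    "Cottage in Cornwall, western England",
]
query = "apartment with elevator in London"

# Keyword side.
bm25 = BM25Okapi([d.lower().split() for d in docs])
keyword = np.array(bm25.get_scores(query.lower().split()))
keyword = keyword / max(keyword.max(), 1e-9)  # normalize to [0, 1]

# Vector side (cosine similarity on normalized embeddings).
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(docs, normalize_embeddings=True)
semantic = vectors @ model.encode([query], normalize_embeddings=True)[0]

# Naive blend; real systems tune the weights or use a dedicated re-ranker.
blended = 0.5 * keyword + 0.5 * semantic
for idx in np.argsort(-blended):
    print(f"{blended[idx]:.3f}  {docs[idx]}")
```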
The last technique I wanted to mention, and it's actually quite simple but still quite powerful, is preprocessing using large language models. So let's say you have some metadata, but in your metadata you don't have any information about whether an apartment has an elevator or not. The customers are looking for this kind of information, and you do have it in the description, in free text. So what you can do is batch preprocessing using an LLM, in search of specific metadata you know users are often looking for. And then once you extract the metadata, you can just save it. You can enrich your database and use it in your queries. So basically you are utilizing the fact that LLMs are very, very good at tasks like sentiment analysis, text categorization and so on. You just tell them which category you are looking for, what information you are looking for, and they do it for you, basically out of the box. They are good at these kinds of tasks out of the box, so there is really no reason not to use that fact. Okay, so we've been talking about techniques which lead us to providing the most relevant information to the chatbot. But even if you provide it with very, very relevant information, it can still make a mistake. It can still hallucinate. So yes, one way of preventing or limiting hallucination is to provide it with relevant info, but there is really no guarantee that the answer the chat comes up with, based on the prompt and the data you provided it with, is correct. There is no guarantee that the information produced by the chatbot is correct. So I will show you a very quick demo of what a hallucination looks like and one specific technique which you could use in your project in order to prevent hallucination. So let's have a look at the demo I recorded. What we have is Python code where we import a tool called NeMo Guardrails. It's a tool created by Nvidia. And we have a text file with some questions; we'll have a look at it in a second. Then we define that we want to use an old OpenAI model, text-davinci-003. And then in the file we define some questions. The first question we define is: when did the Roman Empire collapse? We want to ask that question to the model, and I am asking the question about the Roman Empire because it's common knowledge. The second question I'm asking is: how many goals have been scored in the Polish Ekstraklasa in a specific season? Since the first question is common knowledge and the second one is not, I expect one of the answers not to be a hallucination, and for the other one I do expect the model to hallucinate. And let's see if the tool can spot what is a hallucination and what is not. So we run the code and we get a lot of logs. Once we scroll all the way up after it completes, we can see the first question, when did the Roman Empire collapse? We get a bot response and it's getting flagged as not a hallucination. But how exactly did the tool spot that? Let's have a look into the details, using the second question as an example: how many goals have been scored in the Polish Ekstraklasa? The bot response we are receiving is 1800.
I have no idea if it's correct or not, but the whole point is what the tool is doing: it's asking exactly the same question for the second time, and then we get a completely different response, and then the tool is asking the same question for the third time and we're getting, once again, a different response. And then what the tool is doing is actually checking if the answers we are getting are in sync, if the meaning of them is exactly the same. So it's actually doing another prompt to the model, and the prompt is: you are given a task to identify if the hypothesis is in agreement with the context below. The hypothesis is the original answer we received, so the answer from the first time we asked that question, and the context is the two extra responses we have received, because the tool was asking the same question three times. And the answer from the model is no, the pieces of information are not in agreement, which means we flag it as a hallucination. So yeah, there are ways of detecting hallucinations. It's good to be aware of them, but at the same time it's good to be aware of the consequences of these kinds of techniques, because there is no such thing as a free lunch. First of all, you need to be aware of the costs associated with that: the cost in US dollars you pay for the extra API calls, the cost of a slower system, because the extra API calls introduce an extra delay, but also the cost of false positives, because there is really no guarantee that this kind of technique always works. But all that, the existence of hallucinations, the fact that we have to deal with them, but also how we have to experiment with cutting the documents, how we have to tune the search engine, all of that can lead to the conclusion that we are back to square one to some extent, and that there is really no shortcut. Even though LLMs are really impressive, you cannot avoid working on the data quality or just careful engineering. Tools like LLMs are impressive, but you still have to do your homework. The good news is that there are many tools which could help you to some extent. I mentioned NeMo Guardrails, but it's worth looking into MemGPT or Weaviate. But at the same time, do not expect that some tool will solve all your problems. Do not expect that you can buy some tool which will magically solve everything. The "shut up and take my money" approach will probably not work. It's not gonna happen. The tools might be helpful, but they come with their own problems. The tools themselves are quite immature, because basically the entire area of large language models, chatbots and so on is quite new, quite fresh. And just to show you an example of how the tools are changing, this is the history of code in the LangChain project. There are tons of changes, which on one hand is a good thing, because the project is evolving and it's actually impressive how fast it's growing. But on the other hand, that means you have to be aware of the updates and upcoming changes; there will be some bugs introduced, there will be some breaking changes over time, and you just need to be ready for that, you just need to be aware of that. So we have all these tools which are helpful, but not very stable yet, and we are working in a completely new area where there is a lot of unknown. And that is why it is really important that you do the testing. And testing of an LLM project is really, really tricky. So what you can do for sure, and what you should do, is testing of the retrieval, because this is fully under your control and this is quite predictable.
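For the retrieval part, a test can be as simple as asserting that a known query brings back the expected chunk within the top results. A minimal pytest-style sketch, where search_listings is a hypothetical wrapper around your real retriever, stubbed here with a trivial keyword match just so the example runs end to end:

```python
# Hypothetical retrieval test: the retriever is deterministic enough to
# assert on, unlike the LLM output generated downstream.
LISTINGS = [
    "Apartment with elevator in London",
    "Apartment in London",
    "Cottage in Cornwall, western England",
]

def search_listings(query: str, k: int = 5) -> list[str]:
    # Stand-in for the real vector / hybrid search behind the chatbot.
    overlap = lambda d: -len(set(query.lower().split()) & set(d.lower().split()))
    return sorted(LISTINGS, key=overlap)[:k]

def test_elevator_query_returns_elevator_listing():
    results = search_listings("apartment in London with elevator", k=2)
    assert any("elevator" in doc.lower() for doc in results)
```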
So it's easy to define the test conditions, but you should also test the LLM actions wherever you can. And I say wherever you can because it's actually quite tricky, and it's very hard to define reliable tests, tests which cover most of the possibilities. One of the problems with testing LLMs is that even if you have exactly the same input in your test, the output can vary. There is a post on the OpenAI forum about the question of determinism, and I really recommend you to read it. The bottom line is that large language model output is not really deterministic. So yes, you have parameters like temperature; you can set it, and this should control how creative the model is. But there is this misconception that if you set it to zero, the LLM will behave in exactly the same way every time. In reality it will just be kind of less creative, but it still might provide you with varying results, mostly because of the hardware it's being physically run on, but also because you can always end up with two tokens which have exactly the same probability, so one or the other will be randomly selected in your result. So keep that in mind when you write the tests, and it's always worth checking the LangChain utils for testing, because they take this kind of lack of determinism into consideration and they aim to mitigate it during testing. But what is critical when you move to production is that you collect the data from your runs with real users, because that is really something which gives you real feedback about how it is going, how the users are using the application, and whether they are happy with it or not. Make sure you collect the data. Make sure you analyze it, especially in the early phases of the project. Let's have a look at the legal and privacy aspects of LLMs. What we need to understand is that whenever we pull the data from any database, then process the data and eventually pass it to the LLM, our data is being sent to the LLM provider, to OpenAI, to Microsoft, to Google. In some cases that's perfectly fine, but there are cases where you don't want to send the data anywhere because it's too sensitive. And that means you might want to use an open-source LLM installed in a data center you own. Keep in mind that in situations where an LLM over an API is not possible, you not only have to have a private LLM installation, you also need to have your private embeddings, private vector DB and so on. Installing all that is not rocket science, but at the same time it increases the complexity of your ecosystem and there is a lot more that you have to maintain. And let's keep in mind that privacy, and where the data is being sent, is just one aspect of the legal concerns when it comes to LLMs. I would really recommend reading the license terms of the models you plan to use. For instance, you should not get misled by the term open source. Open source does not automatically mean that you can do everything with it. Some open-source licenses limit how you can use the data produced by the LLM. So for instance, you won't be able to use the data you collected for training another LLM in case you decide to change the model. So you collect the data from the chatbot, and you cannot use that in the future for training purposes. Similarly, generating synthetic data for a machine learning model is a very blurry area when it gets to LLMs. So once again, don't assume too much and make sure you don't run into unpleasant surprises.
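For the privately hosted setup mentioned above, here is a minimal sketch of keeping both generation and embeddings in-house, assuming the transformers and sentence-transformers packages; the model names are only examples, and the license and hardware caveats above still apply.

```python
# Everything runs locally: no document or prompt leaves your infrastructure.
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Local embeddings for your private vector database.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vector = embedder.encode("Apartment with elevator in London")

# Local open-source LLM instead of a hosted API (model name is an example).
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
prompt = (
    "Answer using only the context.\n"
    "Context: Apartment with elevator in London.\n"
    "Question: Does this apartment have an elevator?\nAnswer:"
)
print(generator(prompt, max_new_tokens=50)[0]["generated_text"])
```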
Another very important consideration when starting a project and deciding which LLM to use is cost. You might think open source is cheaper because you basically don't pay for the API calls, but in the context of LLMs it's not that obvious. And why is that? First of all, because the simple math is not that simple anymore. What do I mean by the simple math? Let's start with the API calls. For instance, when you are using GPT-3.5, you pay half a dollar per million tokens of input and then $1.50 per million tokens of output. But for GPT-4 you pay $30 and $60 respectively, so already an order of magnitude more. In general you have a price list, and based on that you can estimate how much a single interaction with a user can cost, and then you can multiply it by the number of expected interactions. But there will be a few small asterisks to remember. First of all, the math will depend not only on the number of tokens in general, but also on the balance between input and output: input tokens are cheaper, output tokens are more expensive. And in most cases it's a good enough assumption that a token is a word. But if you are in a situation where a small difference matters, then it's worth looking closer at the tokenizers, because the models use different tokenizers, and the number of tokens consumed for the same text by Claude is different, actually a bit larger than with ChatGPT. To make it even more confusing, Google Gemini charges not per token but per character. So the math is a little bit tricky already, but doing a back-of-the-envelope calculation should give us a close enough number. It becomes much more complex when we try to do the math for an open-source LLM we host ourselves. Then you don't calculate the cost per tokens or characters produced; you start with the price of the machine, the price of the GPU, the price of maintenance, and then you need to estimate the expected traffic. If your traffic is low, the cost per request will be extremely high. So it's not obvious math, it's prone to errors, and in many cases it will be more expensive than using APIs, or at least the return on investment won't be obvious. I briefly mentioned open-source models, and I'm actually coming from a background where I've always been using open source: open-source databases, open-source data tools, and I really like them. But it was kind of comfortable working with the open-source products because open source was actually ahead; it was leading the innovation, and then at some point the cloud providers came and were, to some extent, wrapping the open-source innovation into a more convenient way of using it. But now, and I'm a bit sad to say that, the open-source LLMs are still behind and they don't perform as well as the commercial ones. They are good, they are improving, but be prepared for extra effort if you want to tune a specific use case with an open-source LLM. And of course you can fine-tune the model, but before you even do that, make sure that your data is in good shape. Data will be the starting point for you anyway, and the easiest way to start is with a RAG application instead of fine-tuning. Starting with a simple RAG can give you much faster results and much faster feedback from the customer. But if at some point you decide to tune the model itself, beware that there are various types of tuning and they differ.
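Going back to the back-of-the-envelope API cost math above, here is a minimal sketch using the per-million-token prices quoted in the talk; the prices change often and the interaction sizes are assumptions, so treat the numbers as purely illustrative.

```python
# Rough monthly cost estimate for an API-hosted model.
PRICES_PER_1M_TOKENS = {            # (input, output) in USD, illustrative
    "gpt-3.5": (0.50, 1.50),
    "gpt-4":   (30.00, 60.00),
}

def monthly_cost(model, interactions, input_tokens=1500, output_tokens=300):
    """Assume ~1500 prompt tokens (question + retrieved context) and ~300 answer tokens."""
    in_price, out_price = PRICES_PER_1M_TOKENS[model]
    per_interaction = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_interaction * interactions

for model in PRICES_PER_1M_TOKENS:
    print(model, round(monthly_cost(model, interactions=100_000), 2), "USD")
```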
These types of tuning differ in how much data you need, what kind of results you can expect, and whether they introduce extra latency. All things considered, building chatbots is an area where you need to experiment a lot. But when you experiment, make sure you don't get overwhelmed by it. Make sure you have the business goal in mind all the time, because it's very easy to get lost and end up in never-ending experiments. In most cases, you are not creating a research company; in most cases you want to solve some specific business problem. So keep that in mind. Working with LLMs is a very nice, interesting job, but at the same time you need to stay focused on the business goal and make sure you are pragmatic. Thanks a lot. If you have any questions, drop me an email or drop me a line on LinkedIn. I'm always happy to chat. Thank you.
...

Marcin Szymaniuk

CEO @ TantusData

Marcin Szymaniuk's LinkedIn account


