Conf42 Prompt Engineering 2024 - Online

- Premiere: 5 PM GMT

From Proof of Concept to Production: Practical Strategies for Implementing Domain-Specific LLMs


Abstract

Unlock the power of Large Language Models! I’ll share insights on fine-tuning vs. Retrieval-Augmented Generation (RAG), reveal best practices, and discuss key lessons from real-world use cases. Learn how to take models like LLaMA 2 and GPT-4 from POC to production for real impact.


Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. My name is Chaitanya Pathak. I am currently working as the Chief Product and Technology Officer for Analyttica Data Labs. In today's session I intend to spend 25 to 30 minutes talking through some of the generative AI abilities that we have been building into our product LEAPS, which is essentially a low-code, no-code platform; I'll give more context about that in just a few minutes.

Before I get started, a little background about myself. I have been working in AI, machine learning, and deep learning for the last 12 years. By education I hold two master's degrees: an MBA from a business school in Paris and a master's in advanced computer science with a focus on AI. I started off in computer vision, went on to build deep learning models for price prediction engines, and then a few other models in the fintech space, which were really about doing evaluations with machine learning models for both insurance and loans. For the last 18 months or so the focus has been on building generative abilities into the product. Today's discussion revolves around one of the abilities we have built and rolled out, which is both stable and scaling rapidly. I will talk through some of our experiences and some of the best practices that have worked for us as we moved from POC to production, which is a very apt topic for discussion today.

Moving on, there are four essential talking points today. I'll start by giving the context of the app, then move on to the various components that go into the app architecture, then a discussion on prompts and RAG, and finally a quick note on what we are working on next. That is the agenda for this discussion.

Coming to the context of the app: like I mentioned, the app is really built for citizen data scientists, for people who are non-technical. LEAPS is a low-code, no-code platform that does end-to-end data science. You can not only build models; it has pre-modeling abilities, the modeling abilities themselves, and of course the post-modeling abilities, which are deploying and serving the models. We are continuously enhancing that as we integrate with the larger ecosystem in this space.

The app I am talking about today is a RAG-based inference engine. In fact, it is multiple RAG pipelines: there is a parent RAG pipeline that integrates into child pipelines, and the parent pipeline is really the orchestration layer. This inference engine is akin to a copilot, and it tries to take the user through various aspects of the platform, bringing down the barriers to learning and the barriers to working with the platform at large. We are trying to address three broad umbrella questions, or abilities, with this copilot. The first is "What can I do next?", which is essentially the recommendation engine, again a RAG pipeline. The second is "What else can I do?", which is basically the scenarios engine, where the LLM and the associated RAG pipeline come to an agreement with the user, in an assisted mode, as to what the intent is in terms of the scenario, and then the backend workflow builds those scenarios for the user.
And then lastly there is "Do it for me", which is our processing engine — again, different abilities in the product coming together in a RAG pipeline. The processing engine essentially lets the user give very few inputs and then either generates models, does any kind of preprocessing, or even builds charts and dashboards. So those are the three broad abilities of the copilot, and there are further abilities that sit under these larger three. That is the context of the app. It is this app we have been focused on for the last year, and we have been evolving our RAG architectures to enable these copilot abilities.

How did we get started? Obviously we started like everybody else did, building simple RAG architectures and pipelines. We quickly had to make sure that whatever we were building was relevant to the user and produced the right responses, so we moved to evaluation-driven, or metric-driven, development. Gradually we built out our evaluation landscape, which keeps growing as more tools and more metrics become available for every component of the RAG pipeline, and we are gradually adopting those. As of today, there are three key moving parts to our metrics-driven development and to our evaluation landscape.

In terms of approach we are very reference-based, which means everything we do is grounded in ground truth. To do that you need rock-solid, gold-standard eval data. We are continuously building that, completely labeled and annotated by humans. In terms of the scope of evaluation, we do end-to-end and, of course, component-based evaluation. For component-based evaluation we currently focus more on the retrieval and generative components, but we are moving towards more evaluation, including unit testing, as we bring more complexity and more components into our pipeline. The generative and retrieval metrics are built using the Ragas framework. For retrieval we use the standard F1 score, which is a good balance between precision and recall, and for generation we base our evaluation on faithfulness and answer relevancy (a small evaluation sketch follows below).

The other important element of our metrics-driven development is the design considerations. These are the guiding principles that help us figure out what kinds of metrics we need to bring in and which components we need to enhance or what additional layers are needed. We use three design considerations. The first is deterministic versus non-deterministic, which basically means how much of each component is driven by a large language model. Currently two of our components use large language models, and we evaluate that continuously; the mix of deterministic and non-deterministic components drives our evaluation landscape as well. The second, very important, consideration is turns. When we started off we were very bounded and did not allow an open conversation, but gradually we have moved from single-turn conversations to multiple turns, and eventually we will open it up into very free-flowing conversation.
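As an illustration of the evaluation approach described above, here is a minimal sketch of computing retrieval and generation metrics with the Ragas library. The example row, column names, and the assumption of an OpenAI key for the judge model are illustrative and not taken from the talk; exact APIs and column names vary across Ragas versions.

```python
# pip install ragas datasets
# Ragas uses an LLM under the hood for faithfulness / answer relevancy,
# so an OPENAI_API_KEY (or another configured judge model) is assumed.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # is the answer grounded in the retrieved context?
    answer_relevancy,    # does the answer address the question?
    context_precision,   # retrieval precision against the ground truth
    context_recall,      # retrieval recall against the ground truth
)

# One hand-labelled row of gold-standard eval data (illustrative content).
rows = {
    "question": ["How do I deploy a model I trained in the platform?"],
    "answer": ["Use the post-modeling abilities to deploy and serve the model."],
    "contexts": [[
        "The platform's post-modeling abilities let users deploy and serve trained models."
    ]],
    "ground_truth": ["Deploy and serve the model through the post-modeling abilities."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores; an F1 can be derived from precision and recall
```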
We are still a little bounded today, but we are a multi-turn conversational copilot, and we are continuously testing that. The third important design consideration is the prompt flow. Depending on whether a conversation is single-turn or multi-turn, on the complexity of the use case, and on the abilities we build into the copilot — I have spoken about the three umbrellas, but within those there are many smaller abilities — the prompt flow determines what kind of prompt engineering is needed: whether queries are complex enough to need decomposition or some other query-enhancement treatment. Based on that we decide what kind of enhancement or component we will need and what metrics we will need. So that is our evaluation landscape and how we use these three pillars to drive metrics-driven development.

In terms of how we progressed: we started off with eyeballing at the POC stage, and throughout the POC stage that was all the teams were doing. As we eyeballed, our experts — our analytics team — started using those outputs, the pairs of query, context, and response, to build the eval data, so we started building eval data right from the POC stage. We also used synthetic data generators to build those pairs. Then, as we developed a larger eval data set, we moved on to structured, supervised evaluation. Now, of course, we are trying to move to LLM-as-a-judge; this needs very sophisticated prompt engineering, and once we achieve it we will have full instrumentation, or automation, of our evaluation. Those are the three prominent stages in our evaluation journey. We are just starting to use LLM-as-a-judge, but we continue to use eyeballing, and our supervised, ground-truth evaluation model is working well for us. That was evaluation, which is really the big piece and takes a lot of our time.

The next important pillar is data preprocessing. We started off with various chunking methodologies or strategies — fixed chunking, content-based chunking — but we settled on semantic chunking because, when we did our comparisons, semantic chunking worked best for us. It depends on the kind of use case and source data you have, though; you could potentially use a hybrid chunking strategy that works better. Parsing is of course the most fundamental thing — that is where everything starts. Depending on what kind of documents you have, whether they are code files, web pages, PDFs, Word documents, or even images, you would use different parsing algorithms or tools. We have tried many, and our current tech stack leans more on the parsers available from the Llama (LlamaIndex) ecosystem. On sparse versus dense retrievers: sparse retrievers have worked for us — TF-IDF and BM25 are what we have used so far — but we have started to hit vocabulary-mismatch problems, so we are starting to feel we will need dense retrievers, and our teams have now started the next POC of putting dense retriever components into our RAG pipeline (a small sketch contrasting the two follows below). We have also been focusing on metadata from day one.
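To make the sparse-versus-dense point above concrete, here is a minimal sketch contrasting BM25 keyword retrieval with dense embedding retrieval on a toy vocabulary-mismatch example. The packages, embedding model, documents, and query below are illustrative assumptions, not the speaker's stack.

```python
# pip install rank_bm25 sentence-transformers
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "The platform lets you build charts and dashboards without writing code.",
    "The processing engine can generate models from a few user inputs.",
    "Trained models are served through the post-modeling abilities.",
]
# Vocabulary mismatch: "visualisation" never appears in the corpus.
query = "How do I create a visualisation?"

# Sparse retrieval: BM25 scores depend on exact term overlap.
bm25 = BM25Okapi([d.lower().split() for d in docs])
print("BM25 scores:", bm25.get_scores(query.lower().split()))  # near zero everywhere

# Dense retrieval: embeddings can bridge the vocabulary gap.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
print("Dense scores:", util.cos_sim(query_emb, doc_emb))  # the charts doc scores highest
```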
A lot of our abilities are driven by that metadata. When we say metadata, in the context of RAG it is obviously the metadata of the documents and of the chunks, and potentially the metadata around prompts if you are doing good prompt versioning. But we also capture metadata around the context of the user, and that context is built into what we refer to as the organizational business map and the KPIs. All of this lives in our metadata database, from where we pull in the context, which then gets fed into the orchestration layer I mentioned earlier, and from there we route to the various RAG pipelines. Those are the four essential elements of the data preprocessing we are currently engaged in, and we are now starting to experiment with bringing dense retrieval capabilities into the platform.

Query enhancement: we did not do much of it initially, but we wanted the best practices in place to make sure the intent is fully understood. As we opened up and things became more conversational and the conversation boundaries were loosened, it became clear that enhancements would be needed to break down complex queries — to decompose them — so that the intent becomes very clear. So we do query decomposition, and we use a large language model there. The other thing we have started to do is HyDE, which again uses a large language model. It goes a little further than using a supervised encoder: it builds hypothetical documents, those documents can then be classified, and from there we generate pairs of queries that are very relevant to a particular chunk. So chunk classification is something we intend to build, not only from the document chunks we get through semantic chunking but also using the HyDE technique (a minimal HyDE sketch follows below).

Routing we have done since day one, because we have different databases. We have a routing layer, and it does its job based on which database the query needs to go to. The routing layer is responsible for understanding the query — again, decomposing it as queries become more complex, understanding the intent — and then routing it to the correct place. As we do all of this, one of our challenges is to ensure that the complexity we are building in does not impact latency. So our DevOps and infrastructure teams are an integral part of everything we do on the RAG pipeline and in building the RAG abilities, including deciding what deployment architectures and infrastructure we can use. We are currently on AWS — the product is on AWS — and we are increasingly evaluating every GenAI-related service on AWS, including its serverless abilities, to make sure we offer good latency. In terms of go-to-market we are very B2B, a high-touch model, but our abilities are low-touch, and we do not run a hundred percent SaaS model: the application sometimes sits behind the customer's firewalls. So we have to be very clear about the various architectures that are needed based on usage and concurrency, and also the associated cost.
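Returning to the query-enhancement step above, here is a minimal sketch of the HyDE idea: have an LLM write a hypothetical answer and retrieve with its embedding instead of the raw query. The OpenAI client, model name, and embedding model are illustrative assumptions, not the production implementation.

```python
# pip install openai sentence-transformers
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()                                   # assumes OPENAI_API_KEY is set
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def hyde_retrieve(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    # 1. Ask an LLM to write a short hypothetical passage that answers the query.
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f"Write a short passage that would answer this question: {query}",
        }],
    ).choices[0].message.content

    # 2. Embed the hypothetical passage instead of the raw query.
    query_emb = embedder.encode(hypothetical, convert_to_tensor=True)
    doc_emb = embedder.encode(docs, convert_to_tensor=True)

    # 3. Return the real chunks closest to the hypothetical answer.
    scores = util.cos_sim(query_emb, doc_emb)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [docs[int(i)] for i in ranked]
```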
Understanding those trade-offs around usage, concurrency, and cost is very important, and we spend a lot of time on that and on building the infrastructure architecture around it. So that is the other important piece of our larger pipeline: continuous enhancement of the queries.

The next obvious candidate after query enhancement, which we have worked on significantly and continue to work on, is the retriever and re-ranking. We had re-ranking from very early days, but more retriever abilities are gradually being built. We generally classify these abilities into three broad categories: retrieval, post-retrieval, and generation. While there are many more techniques, I have only mentioned on the slide the ones we are either using or intend to start building immediately.

Under the retrieval category we have had filtered vector search since day one, which is simply a filter on top of your vector search; that is standard in most RAG architectures now, and we continue to use it. For re-ranking we have tried both bi-encoders and cross-encoders and settled on a cross-encoder: we found that the cross-encoder works better in our use cases, simply because it scores the two sentences as a pair. So as long as you have good training data of pairs, a cross-encoder is the way to go (a small re-ranking sketch follows at the end of this section). We currently use Cohere Rerank, which is built on a cross-encoder. We still have certain bi-encoders in use, but I think we will gradually phase them out.

Hierarchical indexing, again, has been adopted widely and has become a standard; it significantly enhances the precision of a RAG application. In hierarchical indexing you organize the data in a hierarchical structure, as the name suggests, with categories and subcategories based on relevance and relationships, and the retrieval process follows those relationships, traversing parent and child nodes. We use hierarchical indexing extensively, and it is working for us.

CRAG — corrective RAG, however you pronounce it — is a newer technique. We have started to experiment with it, so we have POCs currently running. CRAG brings in another component, a lightweight retrieval evaluator, which self-evaluates the quality of the retrieved documents and provides a confidence score. We are hopeful that this component, together with LLM-as-a-judge, will help us fully instrument our evaluation landscape. Like I said, we want to move towards as much automated evaluation as possible; we will still focus on making sure our eval data sets are robust, but having produced those data sets, we want the rest of the pieces to be as automated as possible.

Fine-tuning, again, is very helpful; this sits on the generation side. We keep the LLMs we use fine-tuned based on the context of each LLM, whether it is used for generation, for query decomposition, or for some of the other abilities we are building that require LLMs. That is a standard practice: wherever we have data, we fine-tune on it; wherever we do not, we generate synthetic data and fine-tune on that.
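As an illustration of the cross-encoder re-ranking discussed above, here is a minimal sketch using an open-source cross-encoder; a hosted reranker such as Cohere Rerank plays the same role behind an API. The model name and candidate passages are illustrative assumptions.

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Illustrative open-source cross-encoder trained on MS MARCO query/passage pairs.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I deploy a model I just trained?"
candidates = [
    "Trained models are deployed and served through the post-modeling abilities.",
    "Semantic chunking splits documents wherever embedding similarity drops.",
    "The scenarios engine helps users explore what else they can do next.",
]

# Unlike a bi-encoder, the cross-encoder scores each (query, passage) pair jointly.
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [passage for _, passage in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the deployment passage should rank first
```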
So those are the retrieval and re-ranking abilities we currently use, along with CRAG, which I spoke about and which we intend to use going forward. That covers the principal components of our RAG pipeline. At a very high level, the pipeline looks like the one I have on the deck: we have the queries, then query transformation, and we route across three different databases — the product docs database, the user history database, and the context DB. The context DB is really the context of the business, where we hold the business maps, the processes, and the features, all of which are used to develop analytical models. We currently use large language models in two places: one for query decomposition, where we use chain-of-thought prompt engineering, and one where we use prompting for the augmentation step (a small decomposition-prompt sketch follows below).

The key takeaway is that prompt engineering and RAG are tied together. Obviously we do not want to over-engineer prompts where it is not required, but wherever a component uses a large language model, that is where we use sophisticated prompt engineering, and that is how the two are married to each other. A simple RAG probably uses prompt engineering only in the augmentation step, but as you bring in more LLMs for different components, as I have described, sophisticated prompt engineering is absolutely the bedrock of a relevant and correct RAG pipeline.

In terms of next steps, we are trying to move to a modular RAG architecture. This is also driven by need from our customers, as we are going to be doing subscription-based pricing: there are different abilities, and these abilities need to be Dockerized and deployed in a microservices architecture, so we are making sure we are ready for that. As each of these components and services grows — not only in usage but also in the data it handles and the kinds of responses it needs to generate — these services need to take on a life of their own, and we will need to move towards a modular RAG architecture. One of our challenges has been versioning of prompts, so we are evaluating different products that give us end-to-end prompt versioning, especially because, as I said, we are putting more LLMs into different components of the RAG architecture. We are almost ready with unit testing; we are using Promptfoo for that, and it has become part of our development lifecycle — it is almost a practice now. And lastly, we are looking at agentic abilities wherever they are needed. As I described earlier, the application has a parent orchestration inference engine that is currently built on custom rules, but we intend to bring in other tools as well to make sure context is brought in from other data sources, and to have agentic abilities there. So those are some of the next steps we intend to take.
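As a final illustration, here is a minimal sketch of the kind of chain-of-thought query-decomposition step described above, wrapped in a small helper. The prompt wording, model name, and OpenAI client are illustrative assumptions, not the production prompt.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

DECOMPOSITION_PROMPT = """You are the query-planning step of a RAG copilot
for a low-code data science platform.
Think step by step about what the user actually wants, then list the minimal
set of simpler sub-questions the retriever should answer, one per line.

User query: {query}

Step-by-step reasoning, then the sub-questions:"""

def decompose(query: str) -> str:
    # Single LLM call that returns the reasoning and the sub-questions as text.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": DECOMPOSITION_PROMPT.format(query=query)}],
        temperature=0,
    )
    return response.choices[0].message.content

print(decompose(
    "Compare last quarter's churn model with the new one and build a dashboard of the differences"
))
```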
That was a high-level view of what our application does, the current architecture of our RAG pipelines, and some of the learnings we have had working on these abilities over the last year. I think I am back on time. I hope this session was useful and that the audience picked up some useful pointers. I will be very happy to take questions whenever the opportunity arises. Thank you so much for being here and listening.
...

Chaitanya Pathak

Chief Product and Technology Officer @ LEAPS by Analyttica



