Conf42 Python 2025 - Online

- premiere 5PM GMT

Building AI Applications with LLMs: Practical Tips and Best Practices


Abstract

Understand the problems and trade-offs you'll encounter when building an AI application on top of LLMs, based on my learnings from building KushoAI, an AI agent used by 5000+ engineers to make API testing completely autonomous by leveraging LLMs.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone. My name is Sourabh, and I'm the co-founder and CTO of KushoAI. Today I'm going to talk about practical tips for building AI applications and AI agents using LLMs. At KushoAI we've been building AI applications and agents for the last 18 months, and during that journey we've run into a set of problems that are fairly specific to building on top of LLMs. The agenda for today's talk is to make devs aware of these problems before they hit them, to discuss the solutions that worked for us and the tooling we use, and, by sharing all of this, to save you time when you're building applications on top of LLMs for the first time.

The first thing you'll need to solve when you start building apps on top of LLMs is how to handle LLM inconsistencies. If you have some experience building applications with normal APIs, you know they don't fail that often, so you don't usually worry much about inconsistencies or failures. If an API call fails, you just let it fail; when the user refreshes the page, you make another call and it will most probably succeed. With LLMs, this is the first thing you'll need to solve, because LLMs have a much higher error rate than normal APIs. Unless you handle it, your application will have a terrible user experience: you'll generally use the LLM response somewhere else in your application, and every time the LLM gives you a wrong output, your application will also crash.

Before we get into how to solve this, let's talk about why it even happens. Like I mentioned, with fairly stable normal APIs you don't worry much about the error rate, but even the most stable LLMs give you a much higher error rate than normal APIs. The reason is that LLMs are inherently non-deterministic. What does that mean? If you look at an LLM under the hood, it's essentially a statistical machine that produces token after token based on the input prompt and whatever tokens have been generated previously. Statistical machines are probabilistic, and as soon as you bring probability into software, you get non-deterministic behavior: you'll get a different output for the same input, every time you ask the LLM for a response. I'm pretty sure you've seen this while using the various chatbots out there, like ChatGPT, DeepSeek, or Claude: for the same input, every time you hit retry, you get a different output. The same thing happens with the LLM responses inside your application; most of the time you don't have a lot of control over whether you get the exact output you want.
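You can see this for yourself with a couple of lines of code. The sketch below is my own illustration rather than something from the talk: it assumes the openai Python package with an API key set, and gpt-4o-mini is just an example model. Sending the same prompt twice will almost always print two differently worded answers, because the output tokens are sampled.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = "Describe an API test case for a login endpoint in one sentence."
for attempt in range(2):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model, swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    # The two printed answers will usually differ, even though the prompt is identical.
    print(f"attempt {attempt + 1}: {resp.choices[0].message.content}")
```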
Now, because of this non-determinism, every time you give the LLM an input, you won't always get the response you want. For example, if you ask an LLM API to generate JSON, which is a structured output, you might get more fields than you asked for, sometimes fewer fields, and sometimes a bracket might be missing. Based on what we've seen, a normal, stable API gives you an error rate of something like 0.1%, but even the most stable LLMs, the ones that have been around the longest, will give you an error rate of somewhere between 1 and 5% depending on the kind of task you ask them to perform. And if you're working with chained LLM calls, where you send the LLM API a prompt, take the response, and use it to build the next prompt, that error rate gets compounded.

This problem will probably never be fully solved inside the LLMs themselves because, like I mentioned, they are inherently non-deterministic; that's just how the architecture works. So it's something that needs to be solved in your application. You can't wait for LLMs to get better and start giving more reliable responses. They will definitely improve and error rates will come down, but as an application developer it's your responsibility to take care of this issue inside your application as well.

So what are your options? The first thing you should definitely try is retries and timeouts. These are not new concepts. If you've worked in software development for a while, you know what a retry is: you make an API call, the API fails, you wait for a while, with or without a cool-down period depending on your rate limits, and then you try again. As simple as that. When you're building general applications, retries and timeouts usually aren't the first thing you implement, because you go with the assumption that the API failure rate will be fairly low; the calls will work most of the time, and not adding retries won't really degrade the application. The exception is critical applications, like something in finance or health where the operation has to complete, in which case you would definitely start with retries and timeouts. But because LLM APIs specifically have a higher error rate, retries and timeouts are something you need to implement from day one.

Timeouts, again, don't need much introduction. A timeout means you make an API call and wait X seconds for it to return; if it doesn't return in X seconds, for whatever reason, you terminate that call and try again. This is basically protection against the API server being down or slow. If you don't do this and the API takes a minute to respond, your application is also stuck for a minute, and so are your users. So a timeout is basically protection.
If an API doesn't return within a reasonable amount of time, you cancel that API call and retry it; that's where timeouts come into the picture.

So how do you implement this in your application? I would suggest not writing the boilerplate for retries and timeouts from scratch, because there are battle-tested libraries available in every language that let you add this behavior with a few lines of code. Let's look at a few examples. The one we actually use in production is a Python library called Tenacity. It lets you add retries to your functions by simply adding a decorator provided by the library; the decorated function will then be retried whenever it raises an exception or error. Ideally you want more control over how many retries to do, how long to wait after each retry, and so on, and all of those options are present in the library: you can give it stopping conditions, for example stop after three retries, and you can add a wait time before every retry, either fixed or random. All of these behaviors can be added with a few lines of code. If you're working in Python, this is our choice; it has been working very well for us in production for a long time, so I'd recommend it. By the way, Tenacity also has timeout-related decorators that work the same way: you add the decorator to a particular function and specify the timeout.

If you're working in JS, there is a similar library, conveniently called retry, available on NPM. It gives you the same kind of functionality for retries and timeouts.

The third option: a lot of people developing LLM applications are using LLM frameworks to handle API calls, retries, and a bunch of other things. The most popular LLM framework is LangChain, and if you're building on top of LangChain, it gives you a retry mechanism out of the box called the retry output parser. You use it to make the LLM calls, and whenever a call fails, the parser handles the retry on your behalf by passing the prompt again along with the previous output, so the LLM API has a better idea that the last output failed and it's not supposed to give that response again. So if you're on LangChain, this is already sorted out for you: use the RetryOutputParser.

All right, so that sorts out how to implement retries and timeouts.
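To make that concrete, here's a minimal sketch of the Tenacity setup described above: a retry decorator with a stop condition and a random wait, wrapped around an LLM call with a per-request timeout. The model name, retry count, and wait times are just example values, and the openai client is assumed to be configured with an API key.

```python
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random

client = OpenAI()

# Retry up to 3 times, waiting 1-3 seconds between attempts.
@retry(stop=stop_after_attempt(3), wait=wait_random(min=1, max=3))
def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[{"role": "user", "content": prompt}],
        timeout=30,  # don't let a hung request block the user indefinitely
    )
    return resp.choices[0].message.content

print(generate("Generate one test case name for a login API."))
```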
The next most common reason for a failure or LLM inconsistency is structured outputs. By structured output I mean something like asking the LLM to generate JSON, XML, CSV, or even lists and arrays. Whenever you ask an LLM for a structured output, there's a chance something will be wrong with the structure: some fields may be missing, there may be extra fields, and in the case of JSON or XML there might be brackets missing.

So how do you handle that? The simplest way is to integrate a schema library instead of hand-rolling checks every time. A schema library could be something like Pydantic, which is what we use in production. Pydantic is the most commonly used data validation library in Python. It lets you create classes in which you describe the structure of your response, and then you use that class to check whether the LLM response fits the structure or not. It checks for missing fields, extra fields, data types, and a bunch of other things. If you're on Python, just go for Pydantic; it's a tried and tested library and it makes the data validation part of working with structured outputs hassle-free. Similarly, if you're in the NPM ecosystem, there's something called Yup, which does the same thing as Pydantic: you define the shape of your output as a JS class or object and use it to check or enforce the structure of your LLM responses.

The idea is to use these data validation libraries together with retries and timeouts. When you make an LLM API call and get a response, you pass it through Pydantic or Yup or whatever validation library you're using, and if you get an error, you use the retry mechanism to let the LLM generate the structured output again. Most of the time a couple of retries sorts it out; it's not like every API call fails in the same way, so if a few things are missing in your structured output the first time, the retry will usually give you the correct output.

One piece of general advice, though: if you see a particular kind of issue happening again and again, mention it as an instruction in the prompt itself. When you retry, an API call that was supposed to take five seconds might end up taking 15 to 20 seconds, which makes your application feel laggy, because at the end of that call your users are waiting for an output. For example, if you're generating JSON with an LLM and it keeps using single quotes instead of double quotes, which will generally cause issues, specify that as an important point in your prompt so you get the correct output on the first attempt. The validation is an additional level of checking, but the goal is for the first response to be correct, so anything that is a known problem should be mentioned in the prompt as a special instruction rather than relying on retries while your user waits.
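Here's a rough sketch of combining Pydantic validation with retries, as described above. The TestCase schema is purely illustrative (not KushoAI's actual schema), and generate() is the Tenacity-wrapped LLM call from the earlier sketch.

```python
import json

from pydantic import BaseModel
from tenacity import retry, stop_after_attempt

class TestCase(BaseModel):  # illustrative schema
    name: str
    method: str
    expected_status: int

@retry(stop=stop_after_attempt(3))
def generate_test_case(prompt: str) -> TestCase:
    raw = generate(prompt)                # the LLM call from the previous sketch
    data = json.loads(raw)                # malformed JSON raises -> Tenacity retries
    return TestCase.model_validate(data)  # wrong shape raises -> Tenacity retries
```

In practice you'd also tell the model in the prompt to return only JSON matching this schema, so that most calls pass validation on the first attempt instead of burning time on retries.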
One additional option worth a special mention here is the structured output capability provided by OpenAI. If you're using GPT models and the OpenAI APIs, there is a response format field where you can specify a Pydantic class, and the OpenAI APIs themselves will try to enforce the structure. The one problem is that if you want to switch out the model and use something else like Claude or Grok, you've basically lost the structured output capabilities, because those aren't available right now in the other LLM APIs. So my suggestion is to handle the schema enforcement and checking in your application itself, so that it's easy for you to switch models around.

That's all for handling LLM inconsistencies. Two main things: use retries and timeouts from the start, and if you're working with structured outputs, use a data validation library to check the structure. Any of the options mentioned here are good.

The next thing you should start thinking about is how to implement streaming in your LLM application. Generally when you develop APIs, you implement normal request-response: you get an API call, the server does some work, and then you return the entire response in one go. With LLMs, it can sometimes take a long time to generate a response, and that's where streaming comes into the picture. Streaming allows you to start returning partial responses to the client even while the LLM is not done with the generation.

Let's look at why streaming is so important when building LLM applications. Like I mentioned, LLMs can take a long time to finish generating, and most users are very impatient; you can't ask them to wait tens of seconds, let alone minutes. If you have a 10-second delay before showing the response, you might see a lot of drop-off, and most of the LLMs you'll work with take 5 to 10 seconds even for the simplest prompts. So how do you improve the UX and make sure your users don't drop off? That's where streaming comes in. LLMs generate the response token by token, essentially word by word, and with streaming you don't need to wait for the full output: as soon as a few words have been generated, you can send them to the client and start displaying them on the UI, or whatever client you're using. This way the user doesn't really feel the lag that LLM generation introduces: as soon as they type out a prompt, they immediately start seeing some response and can start reading it. This is a very common pattern in any chat or LLM application you've used; as soon as you type something or take an action, you start seeing partial results on the UI. That's implemented through streaming. So how do you actually implement it?
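Before looking at the transport options, here's roughly what the model side looks like, as a hedged sketch using the OpenAI SDK's streaming mode (the model name is again just an example). Each chunk carries a few characters of the answer, which you can forward to the client as soon as it arrives.

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain API testing in two sentences."}],
    stream=True,  # receive the answer token by token instead of in one shot
)
for chunk in stream:
    token = chunk.choices[0].delta.content or ""
    print(token, end="", flush=True)  # in a real app, push this to the client instead
```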
One way to deliver those chunks to the client is WebSockets. WebSockets let you send the generated tokens or words in real time: a connection is established between client and server, and until the generation is complete, or for as long as the user is live on the UI, you can reuse that connection to keep sending the response as it gets generated. It's also a bidirectional communication method, so you can use the same connection to get input from the client. One drawback of WebSockets is that they need custom implementation. You can't just take your simple HTTP REST server and convert it to WebSockets; you'll need to redo your implementation, use new libraries, and probably even a new language. For example, Python is not a very efficient way to implement WebSockets, and you'd probably want to move to a language that handles threads or multiprocessing much better, like Golang, Java, or even C. So a WebSocket implementation is generally a considerable effort, and if all you want to do is stream LLM responses, it's probably not the best way to do it.

There's another solution for streaming over HTTP called server-sent events. It basically uses your existing server: if you're on Python and using Flask or FastAPI, you won't need a lot of changes to start streaming using server-sent events; code-wise it's minimal effort. It uses the same HTTP connection that your API call uses, but instead of sending the entire response in one shot, you send it in chunks, and on the client side you receive it in chunks and start displaying them. This is a unidirectional flow. It works exactly like a REST API call, except that instead of the client waiting for the entire response to come back, it starts showing the chunks sent from the server. Implementation-wise it's very simple: in Python you basically just implement a generator and maybe add a couple of headers. We won't get into the specific details because these are things you can easily Google, but our recommendation is that if you want to implement streaming and you already have a REST setup on the backend, just go for server-sent events; it's a much faster and easier implementation. WebSockets is a bit heavy, and unless you have a specific use case for it, I wouldn't recommend going down that path.
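Here's a minimal sketch of the server-sent events approach described above, assuming a FastAPI backend (the talk mentions Flask or FastAPI; FastAPI is used here purely for illustration). The endpoint wraps the streaming loop from the previous sketch in a generator and returns it with the text/event-stream media type.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/generate")
def generate_stream(prompt: str):
    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            token = chunk.choices[0].delta.content or ""
            if token:
                yield f"data: {token}\n\n"  # each SSE frame is a "data:" line + blank line
        yield "data: [DONE]\n\n"            # tell the client we're finished

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```

On the client side, the browser's built-in EventSource API (or a fetch reader) consumes these frames and appends them to the UI as they arrive.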
Streaming is a good solution if the task the LLM is handling finishes in a few seconds, say 5 to 10. But if your task is going to take minutes, streaming is probably not a good option; that's where background jobs come into the picture. So what are the use cases where you'd want background jobs instead of streaming? Think of it this way: say you're building something like an essay generator and you allow the user to enter essay topics in bulk. If someone gives you a single essay topic, you'll probably finish the generation in a few seconds and streaming is the way to go. But if someone gives you a hundred essay topics, that task is going to take at least a few minutes no matter how fast the LLM is. If you use streaming for this, all the work happens in your backend server, and until the task is completed, your backend server resources are tied up in it. That's very inefficient, because your backend server's job is basically to take a request, process it in a few seconds, and send it back to the client. If you start doing things that take minutes, then with a lot of concurrent users your backend server will be busy, it won't be able to handle the tasks that only take a few seconds, your APIs will start getting blocked, and your application performance will degrade.

So what's the solution? You don't handle long-running tasks synchronously in the backend server; you handle them asynchronously in background jobs. Basically, when a user gives you a task that's going to take minutes, you log it in a database, and a background job picks it up. In the meantime, you communicate to the user that this is going to take a few minutes and that they'll get a notification, probably as an email or on Slack, once the task is completed. The background job picks up the task, processes it, and sends out the notification once it's ready.

The easiest way to implement this is cron jobs. Cron jobs have been around for a very long time and are very easy to set up on any Unix-based server, which is what most production backend servers will be running. All you need to do is set up a cron job that does the processing and runs every few minutes, checking the database for pending tasks. When your user comes to you with a task, you just put it in the DB and mark it as pending; when the cron job wakes up a few minutes later, it checks for pending tasks and starts processing. On the UI side, you can optionally implement some sort of polling to check whether the task is completed, and display the result once it is. But ideally, if you're using background jobs, you should also separately communicate to the user that the task is completed, because when a task takes a few minutes, users will typically submit it and move away from your platform; they're not looking at your application's UI anymore. So you should tell them the task is done through an email or a Slack notification, so that even the users who have moved away know the generation has completed.

This works very well: minimal setup, nothing new to learn, nothing new to install. For the initial stages of your LLM application, just go for a cron job. We started with cron jobs ourselves and still use them for some simple tasks.
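As a sketch of the cron-job pattern above: a small script that you schedule every few minutes, which picks up pending tasks from a table, processes them, and marks them done. The tasks table, its columns, and the process() function are all made up for illustration; in a real setup you'd also fire the email or Slack notification at the end.

```python
import sqlite3  # stand-in for whatever database you actually use

def process(task_input: str) -> str:
    # placeholder for the long-running LLM work (e.g. generating 100 essays)
    return f"processed: {task_input}"

def run_pending_tasks(db_path: str = "tasks.db") -> None:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, input FROM tasks WHERE status = 'pending'"
    ).fetchall()
    for task_id, task_input in rows:
        result = process(task_input)
        conn.execute(
            "UPDATE tasks SET status = 'done', output = ? WHERE id = ?",
            (result, task_id),
        )
        conn.commit()
        # notify the user here (email / Slack) that their task is ready
    conn.close()

if __name__ == "__main__":
    # scheduled via crontab, e.g.:  */5 * * * *  python run_pending_tasks.py
    run_pending_tasks()
```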
But there will come a stage when you need to move away from cron jobs, for scalability and for better retry mechanisms. If you run multiple cron jobs, you have to handle which cron job picks up which task, implement some sort of distributed locking, and all those complexities come into the picture. That's where task queues come in.

Basically, think of task queues as cron jobs with more intelligence, where all the task management is handled by the task queue itself. At a very high level, that means you submit a task to the queue; the queue is generally backed by some storage like Redis or another cache, and the task is stored there; the queue has a bunch of workers running, and it decides how to allocate work to which worker based on different mechanisms. You can have priority queues, different retry mechanisms, and so on.

There are two good things about task queues. First, they're much easier to scale. With cron jobs, going from one to two to ten means handling a bunch of locking-related stuff yourself; with task queues that's already implemented for you, so all you do is increase the number of workers, and if you start getting more workload, it's as easy as changing a number on a dashboard. All the additional handling of race conditions, retries, and timeouts is already taken care of; you just provide some configuration. Second, you get better monitoring. Every task queue comes with some monitoring mechanism or dashboard where you can see which tasks are currently running, how many resources they're eating up, which tasks are failing, and where you can start or restart tasks. So once you start scaling your application, go for task queues.

The task queue we use in production is called RQ, which stands for Redis Queue. As the name suggests, it's backed by Redis, and it's a very simple library for queuing and processing background jobs with workers. It's a very easy setup, hardly 15 minutes, and if you already have a Redis, you don't even need to set one up for RQ. The queuing and processing mechanism is simple: you create a queue and give it a Redis connection so it has a place to store tasks; when you get a task, you enqueue the function that will get called in the worker to process it, along with its arguments. For the worker, you just start it from the command line, and it consumes tasks from Redis and processes them. If you want to increase the number of workers, you just start ten different workers, connect them to the same Redis, and RQ itself handles all the complexity of managing which worker gets which task. So if you're on Python, RQ is the way to go. Celery provides similar functionality, but we found there were a bunch of things in Celery we didn't really need and it seemed like overkill, so we went with RQ, which was much simpler to set up on our end.
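Here's roughly what the RQ setup described above looks like. The queue name and the generate_essays task are illustrative; note that RQ expects the task function to live in a module the worker can import, which is why it's enqueued by its dotted path here.

```python
from redis import Redis
from rq import Queue

# The task itself lives in a module the worker can import, e.g. tasks.py:
#
#   def generate_essays(topics):
#       # placeholder for the long-running LLM work
#       return [f"essay about {topic}" for topic in topics]

q = Queue("llm-jobs", connection=Redis(host="localhost", port=6379))

# Enqueue by dotted path so the worker knows what to import; this returns immediately.
job = q.enqueue("tasks.generate_essays", ["topic one", "topic two"])
print(job.id)  # keep this around if you want to poll the job status later

# Start one or more workers in a separate shell:
#   rq worker llm-jobs
# Adding capacity is just a matter of starting more workers against the same Redis.
```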
The last tool I want to talk about is evals. Evals give you visibility into which prompts and inputs are working and which are not, which models are working for your task and which are not, and things like that. If you want an analogy, think of evals as unit testing for your prompts. They allow you to take a prompt template and individually test that template against a bunch of different values.

There are a bunch of reasons why you should use evals with your prompt templates. One, they let you test your prompts in isolation, which makes them very fast, the same way unit tests are fast, because you're just checking one function against different types of inputs. Using evals, you'll be able to figure out which inputs work and which don't, which model works for a particular task and which doesn't, and you'll be able to compare the costs of different models for different types of inputs, and so on. An additional benefit is that you can integrate evals directly into your CI/CD pipeline, so you don't have to manually check before every release whether your prompts still work, just like unit tests: you hook them up to your pipeline, and after every commit or build you run the evals. Similar to assertions in unit tests, evals have assertions or checks where you can verify the response, specify whether it is as expected, and pass or fail the eval. That's how evals work at a very high level.

We've tried a bunch of different eval libraries, and the one we like the most is promptfoo. It's very easy to set up and works using YAML files: you create a YAML file where you specify your prompt template and a bunch of inputs for that template. Promptfoo is an open-source tool, so you can install it straight from NPM and run it in your CLI, and at the end of the eval you get a nice graph like this, which shows you, for different types of inputs, whether the output passed the conditions. It also lets you compare different models, and there's a way to compare cost as well; I don't think it's displayed here, but cost comparison is also something you get in the same dashboard. You can start with the open-source version of promptfoo, but there's also a cloud-hosted version if you want more reliability or don't want to manage your own instance.

Before we end the talk, let's do a quick walkthrough of the different foundational models, or foundational model APIs, that are available for public use. The reason for doing this is that the landscape is changing very fast, so since the last time you went over the available models, I'm pretty sure the list of models and how they compare has changed; models you thought were not that great may have become very good, and so on. So let's do a quick run-through of the available models: what they're good at, what they're not good at, and which kinds of use cases work with which kind of model.

Let's start with the oldest player, OpenAI. OpenAI has three main families of models available for public use: GPT-4o, GPT-4o mini, and o1. I think they've deprecated their GPT-3 and 3.5 models, so these are the models available right now. If you don't know what to use, just go with OpenAI. These are the most versatile models.
They work very well across a wide variety of tasks. Within these models, the difference between 4o and 4o mini is mainly the trade-off between cost and latency versus accuracy. If you have a complex task or something that requires a bit more reasoning, go for 4o. If you're worried about cost, or about how fast the response is going to be, go for 4o mini, but it will give you somewhat lower accuracy. o1 is something I haven't tried out; these are supposed to be OpenAI's flagship models, but from what I've heard they're fairly new, so test them thoroughly before you put them in production. 4o and 4o mini have been around for a while now, so you shouldn't see a lot of problems with them. Reliability-wise, in our experience OpenAI's APIs have been the most reliable, so you don't need to worry much about downtime or having to switch models because the provider isn't working.

The next provider is Anthropic. For a while they were focused mostly on the chat product and, as far as I know, the APIs were not publicly available, but that has changed in the last few months. The APIs are available and completely self-serve: you can go to Anthropic's console, create an API key, load up some credits, and get started. If you have any coding-related use case, the Claude APIs are your best choice. As far as coding is concerned, Claude works much better than all the other models, which is also why you see everyone using Claude with their code editors, like Cursor. So if code is what you want, work with Claude.

Next up is Groq, not to be confused with xAI's Grok. Groq is a company building special-purpose chips, which they call LPUs, for running LLMs, and that makes their inference time very low, and probably the inference cost too. So if latency is what you're trying to optimize, try GroqCloud, which is their LLM API. They generally host most of the commonly used open-source models, so you have Llama, Mixtral, and Gemma available, along with a bunch of other things. Latency-wise they are much faster than the other model providers, so if you're optimizing for latency and these models work for your task, go for it.

AWS Bedrock works kind of like Groq in that it hosts a lot of open-source models, and along with that I think they also have their own models, which we haven't tried yet. The biggest USP of AWS Bedrock is if you're already in the AWS ecosystem and you're worried about sensitive data leaving your infra and don't want to send it to OpenAI, Claude, or any other model provider; in that case, Bedrock should be your choice. One good thing is that Bedrock also hosts the Claude APIs. The limits are lower, as far as I know, and I think you'll need to talk to support to get your service quotas increased, but if you're worried about sensitive data and you're okay with Claude, Bedrock should work very well for you. Along with that, they also host Llama, Mixtral, and a few other multimodal APIs.

Azure, the last time I checked, is hosting GPT models separately; the hosting OpenAI does is separate from Azure's.
The last time we checked, Azure's GPT APIs were a bit faster than OpenAI's, so if you want to use the OpenAI APIs with slightly better latency, try Azure. But they'll make you fill out a bunch of forms; I don't think these models are publicly available on Azure for everyone. GCP I haven't tried; the setup was a bit complex, so we didn't get a chance to give it a try, but from what we've heard the developer experience is much better now, so someday we'll try it again. GCP has Gemini. And the newest kid on the block is DeepSeek. If you're active on Twitter, you've already heard about the DeepSeek APIs; from the chatter, it seems they're on par with OpenAI's APIs. Again, I haven't tried them, but give them a try. One concern could be the hosting, which is in China, but definitely give it a shot; you might find it to be a good fit for your use case. And one more thing: DeepSeek's models are also open source, so you can host them on your own.

And that's all from me. I hope you find the information shared in this talk useful and that it speeds up your development process when you're building LLM applications and AI agents. If you have any queries or want to talk more about this, drop us an email; you can find our email on KushoAI's landing page, or just send me a message on LinkedIn. I'm happy to chat about this. Go build something awesome. Bye.
...

Sourabh Gawande

Co-founder @ Kusho

Sourabh Gawande's LinkedIn account


