Conf42 Prompt Engineering 2024 - Online

- premiere 5PM GMT

Prompt Engineering: An Art, a Science, or Your Next Job Title?

Abstract

I will introduce prompt engineering as an emerging discipline with its own methodologies, tools, and best practices. Expect lots of examples that will help you write ideal prompts for all occasions.

Summary

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone! Let's talk about prompt engineering. What is it: art, science, or maybe your next job title? I'm Maxim Salnikov, based in Oslo, Norway, where I work at Microsoft helping developers succeed with our cloud technologies, our developer tools, and everything related to AI. I'm a developer myself and have been building applications since the late 90s of the last century. I'm also a big fan of developer communities: in Oslo, where I'm based, I founded a couple of conferences and run a few meetups. My favorite topics to present are web development, all kinds of cloud development, and of course AI: everything related to generative AI, including prompt engineering.

I found multiple estimates of the number of people who use generative AI on our planet, and the most conservative ones say it's well over 1 billion people. That's a number, right? What do these people use generative AI for? Just to start the discussion, I identified three large application areas. First of all, everything related to productivity: your personal productivity, your business productivity, the use of generative AI in education and science, and many other areas where we can really get superpowered by gen AI. Next, creativity. After all, it's called generative AI, so it can generate many very interesting things for us and make us feel truly creative. But there has to be someone who builds these applications for us, someone who actually constructs the UIs we use for both productivity and creativity. Here I'm talking about the people who are building these AI-infused, or intelligent, applications, and I'll do my best to make this particular session the most useful for exactly this category of people. At the same time, I'm sure that even if you have no intention of building your own application, you can still get lots of useful information from this session, because it contains general recommendations and general practices for improved prompting. After this session, when you open any kind of generative-AI-powered service like ChatGPT or similar, you will know better how to communicate with it.

What is the common thing in all these scenarios? Look at how we ask Midjourney to create a new image for us, how we start a conversation with Microsoft 365 Copilot in Word to create a smart, solid template for our document, or how we communicate with GitHub Copilot to improve our code, not to mention ChatGPT itself, where we literally chat with the model. The common thing is the way we interact: it all starts with the prompt, with some text that we write. The model landscape keeps evolving: the next amazing model is being released while I speak and while you watch this session, and there are already multiple models available. It's becoming more and more complex: there are large language models, there is a new generation of small language models, and multimodality is really available now. But even though there are different kinds and types of models, some of them general and some of them specialized, we all still come back to prompts. This is why I consider prompt engineering a separate discipline, or at least an essential skill for many people: not only developers, but many technical people and many people in business positions.
So what is prompt engineering, after all? It's the process of designing prompts, tuning them, and further optimizing them, while staying satisfied with the results we get back from the large language models. It's also important to keep an eye on cost efficiency, because in many cases we're talking about paid services when we work with hosted large language models.

Let's look closer at the prompt components, or, if you wish, we can call it prompt anatomy. When we're just users who chat with ChatGPT, we don't really think too much about how we structure our prompt; we simply communicate, and as long as we're happy with the results we don't want to dive deeper into all the nitty-gritty details. That's fine. But when you build your own AI-infused application, it's a really good idea to know how this works from the inside, and the same ideas and techniques still apply to your day-to-day conversations with any generative-AI-powered service. So let me introduce how it works.

In many cases we start with a clear, concise, and straightforward instruction: what we want to get from this particular call to our generative AI service, or, to be more technically precise, to the large language model exposed by one of these services. Let me illustrate it with the example on the screen. Imagine we're building an application for marketing automation that gives us nice drafts, or maybe ready-to-go emails, sharing details about some new product that we either produce or sell. (Again, you don't have to be in the developer role; you can use the same ideas and techniques when you simply use ready-made products that accept prompts.) Besides the instruction, we also have to provide some primary data about the product itself. We can also provide some context, or secondary data, about the tone we expect: in this particular situation we want to be friendly and exciting, but different scenarios may call for a different narrative from the model. We also define the format we want the answer back in. In this particular situation, the output sits in the middle of an overall development chain for this product: we expect not just text but JSON, because it will be processed by the next steps of, for example, our backend. So we say we want back not just a subject and a body, but a JSON object with those fields. Finally, we provide an example, and in this particular situation it plays at least two roles. First, it demonstrates to the model the kind of text, maybe the length of the text, and again the tone and overall structure that might be a good fit for us. Second, we double down on the format, which is why we put the example in exactly the format we described on the line above. Why do we need this duplication? Stay tuned, I will explain why that might be useful. So this is how we, humans and developers, look at the prompt.
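To make those components concrete, here is a minimal sketch in Python of how such a prompt could be assembled. The product name, wording, and field values are hypothetical; only the structure (instruction, primary data, context, format, example) follows the anatomy described above.

```python
# Hypothetical marketing-email prompt assembled from the components above.
# Keeping each component separate makes it easy to tune them individually.
instruction = "Write a marketing email announcing the new product described below."

primary_data = (
    "Product: NoiseAway X2 wireless headphones (hypothetical product)\n"
    "Key facts: 40-hour battery, active noise cancellation, ships next month"
)

context = "Tone: friendly and exciting. Audience: existing newsletter subscribers."

output_format = 'Return only a JSON object with the fields "subject" and "body".'

example = (
    'Example: {"subject": "Meet the NoiseAway X2!", '
    '"body": "Say hello to your new favorite headphones..."}'
)

# The final prompt is simply the components joined in a deliberate order.
prompt = "\n\n".join([instruction, primary_data, context, output_format, example])
print(prompt)
```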
And this is how the large language model, or LLM, understands the same prompt on its end: it's split into tokens. Tokenization is actually the first procedure that happens when you send a prompt to an LLM, so the model understands your prompt in the form of tokens. When the time comes to generate an answer for you, it also generates it recursively, token by token. In this example it looks like many words map to exactly one token each, but the real situation is much more complex: some words take multiple tokens, and exactly how the text is split is up to the implementation of the large language model, or rather its tokenizer. There is no strict rule for how many words or how many characters equal how many tokens, but very approximately, at least for this generation of LLMs and for English text, 100 tokens is roughly equal to 75 words, which means one token is around four characters of English text. For other languages the ratio is completely different.

You might ask why we ever need this knowledge about tokens. When we're in the user role, it's hidden completely under the hood: we communicate in sentences and get sentences back. In the developer role it's also not that visible when you start building your AI application, but very soon you will understand why tokenization matters so much. The number of tokens in your input and expected output determines, first of all, the cost of each call, if we're talking about LLMs hosted by a vendor, and there is also a technical limitation on the number of tokens you can send and receive.

On this slide there is a screenshot of the Azure OpenAI pricing for the GPT-4o model and the o1-preview model, and for legacy purposes I also listed the price for GPT-4. Notice that we're in a good situation as developers, because prices are going down. First of all, everything is priced per 1 million tokens, so for a small number of calls a single token isn't really crucial; the price becomes really different at scale, which is why it's still important to keep an eye on it. GPT-4 started at around $60 per million tokens, and suddenly GPT-4o, a more capable and more performant model, is many times cheaper. This is the general trend: providers of LLM services keep shipping newer technology, and that's very good news, especially for startups, which get to recalculate their economics in a positive way. The next column I want to emphasize on this screenshot is context, and this is exactly the number of tokens you can send in one particular request. Note that it's not measured in characters, words, or bytes; it's calculated in tokens, which is why it's important to at least roughly estimate the number of tokens in your prompts. As you see, modern models are quite capable and we're talking about a hundred thousand plus tokens, so we're not counting every single word or whitespace, and not even sentences or paragraphs: you can currently send pages of text as a prompt to these models. Again, the trend works in our favor and models accept more and more tokens, but there is still a limit, so you cannot send a full book, at least at this moment. This is why the tokenization concept is extremely important, and we'll have a few more slides on this topic.
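If you want to see those numbers for yourself, one option is the open-source tiktoken tokenizer. This is a minimal sketch assuming a GPT-4o-style encoding; other models and vendors use different tokenizers, so treat the encoding name as an assumption.

```python
# Rough token counting with the tiktoken library.
import tiktoken

# "o200k_base" is the encoding used by the GPT-4o family; for other models
# you can look up the encoding with tiktoken.encoding_for_model(...).
enc = tiktoken.get_encoding("o200k_base")

prompt = "Write a friendly marketing email about our new wireless headphones."
tokens = enc.encode(prompt)

print(f"{len(prompt.split())} words -> {len(tokens)} tokens")
# This estimate feeds directly into cost (priced per million tokens) and into
# checking that the prompt fits inside the model's context window.
```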
And of course, different providers of these LLM services give you different ways to save on usage: all kinds of caching, all kinds of batch processing for cases when you don't need the output right now and can wait a bit in exchange for a cheaper price, and various dedicated capacities. It really depends on the vendor. You've also noticed that model selection makes a real difference, first on capabilities and quality of the output, and second on the price. I offer you this simple strategy for model selection. First, try the most capable, most performant, and in many cases also the most expensive model on the market; maybe try different providers. Identify your best prompt, the one that gives you the best possible results. Then try downgrading to the next cheapest model. In some cases you might need to fine-tune the prompt slightly, but check the results: if the result, or completion in technical terms, is the same or better, maybe you can take the next step down and try an even cheaper model. Maybe the results will again be the same or even better; sometimes that's possible, and it gives you a chance to save some dollars. If not, you just go back up to the step you downgraded from. It's a simple and efficient model selection strategy.

You can also use multiple models. In many cases, if we're not talking about a hello-world AI-infused application, a feature is not just one call to the LLM but a chain of calls, maybe to multiple models, maybe to multiple providers, maybe a mix of a model hosted by an external vendor, a model hosted by yourself, and even a model running straight on your device or your customer's device. It really depends on the business scenario. If you have the chance to use multiple models in a particular feature of your AI-infused application, you can follow a simple rule: for complex tasks like generation, use the expensive models. This is where they really shine, and every new generation of models provides better and better generation results. Summarization, classification, and categorization, on the other hand, are not that complex anymore, at least for large language models, and the cheap ones do them pretty well. Another strategy is chaining. For example, you want to send a large amount of text to a large, expensive model. Instead, you can leverage a cheaper and very fast one, maybe your own fine-tuned model, to summarize that large amount of text before sending it to the expensive one. This might work well for you with a minimal performance decrease, and you will save a lot of time and some budget.
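As an illustration of that chaining idea, here is a minimal sketch assuming an OpenAI-style chat client; the model names, the input file, and the prompts are placeholders, and other providers expose equivalent APIs.

```python
# Sketch: a cheap, fast model condenses a long document, then only the summary
# is sent to the more capable (and more expensive) model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

long_document = open("quarterly_report.txt").read()  # hypothetical input file

# Step 1: the cheaper model summarizes the bulky input.
summary = complete("gpt-4o-mini", f"Summarize the key points:\n\n{long_document}")

# Step 2: the expensive model does the genuinely hard generative task.
email = complete("gpt-4o", f"Write a customer update email based on:\n\n{summary}")
print(email)
```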
Let's go back to the conversation about prompts. I hope I've convinced you that keeping an eye on the number of tokens you send is crucial for the economics of your application, and not only the economics: remember there is also a technical limit on the number of tokens you can send, and, after all, the shorter the prompt, the faster you normally get the completion. So there are multiple reasons to minimize your prompts. How do you do it exactly? The first, very simple rule: have a closer look at the whitespace in your prompt. It's easy to overlook because it's barely visible. A couple of extra spaces, we can just ignore them, right? But in reality some large language models treat every single whitespace as one extra token. Not a big deal for one short request, but if we're talking about hundreds, thousands, millions of requests, that might make a difference to your final bill at the end of the month.

Next, for different kinds of data, try different ways of representing them. What I mean can be easily illustrated by the example of how we supply a date in our prompts. I have a technical mind, and my first impression was: the shorter the date string, the better the chance it takes fewer tokens. In reality, not at all. You see on the bottom line that the short date format takes six tokens, while the one on the top line takes only three. Sometimes it's counterintuitive. Again, this is an example for one particular large language model, I don't even remember which one, and different models tokenize differently; experimentation is the only way to identify the optimal format for this or that type of data. If we're talking about tabular data, a straightforward tabular format is pretty space-efficient and, very importantly, understood by the LLM. There's no need to always reproduce a JSON-like format where you supply a caption for every piece of data. Just provide some table headers separated by pipes, or tabs, or whatever separator you prefer, and then rows with the data; in the vast majority of situations the LLM will understand what you mean. Language also makes a real difference. I already mentioned that English is the most efficient language for prompt engineering when we talk about tokenization, at least for the mainstream large language models, because the vast amount of data used for LLM training, all that Wikipedia, public books, and so on, was in English. The models still understand other languages perfectly well, but the same concept, the same sentence, in a different language might take more tokens than it does in English. I'm not saying you have to translate everything every single time, but again, experiment.

Now I want to start introducing some tools, and on the next slides you will see more and more useful tools, frameworks, and libraries that can simplify your life as a prompt engineer. The first one is called LLMLingua, and it's nothing but a prompt compressor. It's created by my colleagues at Microsoft, and it's open source, so you can take this tool and host it locally or on your own server. As the description says, it takes your prompt and compresses it, but it does so in a very smart way. Of course, we could try to do this ourselves, but we don't always understand which parts of our prompt are crucial for the LLM to really understand what we mean and which parts we can skip. Who can help us? Another language model, of course. In this particular case it's a compact one, maybe we can call it a small language model, and it's used to identify and remove non-essential tokens in your prompt. With some playing around and fine-tuning, you can get up to 20 times compression with either zero or minimal performance loss.
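Trying the tool out looks roughly like this. This is a sketch based on LLMLingua's published usage, so treat the exact arguments, the default model, and the result keys as assumptions and check the project's README for your installed version; the prompt file is hypothetical.

```python
# Minimal LLMLingua sketch; argument names and result keys may differ slightly
# between versions, so verify against the project's documentation.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads the default compression model on first use

long_prompt = open("marketing_prompt.txt").read()  # hypothetical prompt file

result = compressor.compress_prompt(
    long_prompt,
    target_token=200,  # ask the compressor to aim for roughly 200 tokens
)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"], "tokens")
```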
Of course, some time is needed for that first step, the LLM or SLM compressing your prompt, but on the other hand the prompt becomes shorter, and it might happen that you win those milliseconds back just because of that. If we take the prompt from our first example, the email about the new headphones, and run it through LLMLingua, we can see the effect. I did it without any fine-tuning, just as-is, and it took me literally a couple of minutes to set everything up locally on my machine. I immediately got 17% compression of this prompt, and the prompt looks approximately the same, but something has definitely been removed. The devil is in the details: LLMLingua knows exactly what is non-essential for other LLMs. So I really encourage you to try this tool.

Now, some general recommendations about prompts. Be specific and clear: the more concrete your order, your request, your ask in the prompt, the better the chance you get nice results. At the same time, be descriptive and, if possible, use examples. Again, you might need to educate the LLM a bit about what you expect as a completion. The order of the components of your prompt matters; there are no strict rules for it, only some recommendations, and they're on the next slide. Sometimes you need to double down and repeat either the instruction, the format, or any other component of the prompt. Again, this is absolutely not mandatory, but if you're unhappy with the results you get back, try it, experiment. And for the cases where we talk about classification and categorization, it's better to explain explicitly in your prompt: if you don't know which category to put this text into, say "I don't know" rather than force it into one or another bucket provided in the prompt. That will save you time on validating the results.

Based on these general recommendations, here are some more technical, more concrete ones. Back to "order matters": normally, an optimal prompt starts with clear instructions, and in some cases you might want to repeat the instructions at the end. Again, this is not a requirement; you just need to experiment. Don't be shy about clear syntax in your prompt: provide separators between the different sections, and you can even name those sections. The better you explain to the LLM what is what in your prompt, the better the result you get back. Don't try to multitask: one prompt for one task. If you need to perform something complex, you'd better organize a set of calls and chain them. Also, many providers and many models are capable of accepting extra parameters, not only the prompt itself. In the case of the GPT family, two of these parameters are called temperature and top_p (top probabilities), and they both affect how creative versus how deterministic you want the answers to be. Model output is non-deterministic, of course, but if you, for example, set temperature to zero and top_p towards zero, there's a better chance you will get the same result for the same prompt. If you set everything to the maximum, the output will be as creative as possible.
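As a concrete illustration of those parameters, here is a minimal sketch assuming an OpenAI-style chat completions client; the model name and the prompt are placeholders, and other providers expose equivalent knobs under similar names.

```python
# Low temperature and top_p push the output towards deterministic; raising them
# makes the completions more varied and "creative".
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": "Suggest a subject line for our headphones launch email."}],
    temperature=0,  # 0 = as deterministic as the model allows
    top_p=1,        # lower values narrow the sampled vocabulary
)
print(response.choices[0].message.content)
```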
Let me list a couple of techniques that are widely used in prompt engineering; they might be very useful in your career as a prompt engineer, an AI engineer, or just a developer. First, zero-shot versus few-shot prompts; by "shot" we mean an example here. Let's imagine that we are building an automation tool for an insurance company's first line of support. We gather a question from our customer via email automation, or maybe a transcribed phone conversation, and we want to pass it to the right department of our company: either auto insurance or home insurance. The prompt is pretty straightforward: we ask to categorize as 1, 2, or 3, and this prompt illustrates one of the techniques I mentioned, giving the model an out. If the question is not relevant, it's better to say it's not relevant and just mark it as 3 in this particular case, rather than force-push it into either auto or home insurance. Again, we'll save some time on validating the results. This zero-shot prompt might work fine. If it still fails in some situations, you can educate it a bit: use the same prompt, but somewhere in the middle inject a couple of examples, maybe examples that use specific words from your specific field or your geographic area, or real use cases where the model failed last time. Supply those as examples and you will definitely get better results. You also have to find the balance: your prompt becomes a bit longer, which means a bit more expensive and maybe a slightly longer time to produce the completion, but content, the result, is king after all; we want perfect output.

Another super powerful technique is called chain of thought. This particular example is solving a math problem. First of all, I'd say that LLMs are not perfect mathematicians at all, and normally you don't use them for solving math problems, but it's simply the easiest way to illustrate the technique. Imagine we ask for the answer to some simple math operation. In white is the prompt, and in light blue is the output, or completion in more technical terms, and it's wrong: 8 million liters per year is completely incorrect. I don't want to dive too deep into why that happens; in very simple words, LLMs try to find the lowest-hanging fruit in this kind of calculation (of course, it's not a real calculation). But I want to explain how to fix this, and possibly many other situations, not limited to math. If we add one simple sentence, "Let's think step by step and explain the calculations step by step," the completion becomes a bit longer, and that's fine, but the most important thing is that this time the answer is correct. By the way, this is not exact wording you have to use every time; you can experiment with a longer, shorter, or customized phrasing, but you have to force the model to literally think step by step, to provide reasoning behind every step, and the results will be better. You will clearly see it.
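Putting the routing prompt, the "give the model an out" advice, and a couple of injected examples together could look roughly like this; the categories and sample questions are made up for illustration.

```python
# Few-shot classification prompt with an explicit "out" category.
classification_prompt = """Classify the customer question into exactly one category:
1 - auto insurance
2 - home insurance
3 - not relevant / cannot tell

If you are not sure, answer 3 instead of guessing.

Examples:
Question: "My windshield cracked on the highway, is that covered?" -> 1
Question: "A pipe burst and flooded my kitchen, what do I do?" -> 2

Question: "{question}"
Answer with the category number only."""

print(classification_prompt.format(question="Does my policy cover a rental car abroad?"))
```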
Another technique is not exactly about prompting itself, but about how we construct the overall communication with LLMs. In many cases we need to send multiple calls to complete one particular task. I already told you it's not a good idea to multitask, one prompt for one operation, but you can easily organize prompt chaining by orchestrating it with some backend tooling. You can use multiple models for multiple tasks: a hosted model from one provider or another, your own model, a local model, a fine-tuned model, a general model. You decide what works best for you. The whole idea is very simple: you use the output of a previous call to the model, or part of that output, as part of the input for your next call. This way you get what you want with minimal effort.

Now I want to introduce the Cambridge Dictionary word of the year 2023, and it is hallucination: hallucination in the context of our interactions with large language models. This is something really annoying in all kinds of AI engineering, in all kinds of prompt engineering, in all kinds of building AI-infused applications. It is, again, an outcome of how LLMs were designed, invented, and how they work. In summary, they are very good at making up facts and doing it in a very convincing way, so it's sometimes very hard to identify what in the output is correct and what is wrong when we ask about facts. Fortunately, there are multiple ways, if not to remove hallucination completely, then to reduce it and mitigate its consequences. For example, you can explain to the model not only what you want but also what you don't want to receive back, limiting the space of possible answers. Also, a recommendation I already introduced: give the model an out. "If you're not sure, say 'I don't know'" can simply be part of your prompt. So can "Don't make up facts"; it's a super naive technique, of course, but sometimes it works. If you have a chance to organize a multi-turn, chat-like conversation with your LLM, you can ask each time: "Are you sure you have all the information to answer this question? If not, please request the extra information you need." Step-by-step reasoning, asking the model to explain along with the answer, which is basically the chain-of-thought technique, also helps. But all the points on this picture are nothing compared to this one: you can dynamically find and inject relevant contextual information straight into your prompt and say, "Use only this information for the answer." In other words: forget all your knowledge from Wikipedia and the public books, discard it, and use only the data injected into the prompt. I can illustrate this with an example where we, in the role of developers, build an internal application for, say, our employees to investigate what their medical insurance covers. Imagine we get a question from an employee via a chatbot, or again via some automated email engine: "Does my health plan cover annual eye exams?" At the top of the prompt we explain all the prerequisites: "You are an intelligent assistant, you can answer questions about healthcare, and you use only the sources below for the answer." And we also inject directly into this prompt the three sources that are relevant for answering this question. Sounds very simple, right? And the completion in that case will be nice and 100% correct.
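Assembled as plain text, that grounded prompt could look like this minimal sketch; the sources and the plan details are invented for illustration.

```python
# Grounding: inject the retrieved snippets and restrict the model to them.
sources = [
    "Source 1: The standard health plan covers one routine eye exam per calendar year.",
    "Source 2: Vision hardware (glasses, lenses) is covered up to 150 USD per year.",
    "Source 3: Specialist referrals require pre-approval from the plan administrator.",
]

question = "Does my health plan cover annual eye exams?"

grounded_prompt = (
    "You are an intelligent assistant answering employee healthcare questions.\n"
    "Use ONLY the sources below. If the answer is not in the sources, say you don't know.\n\n"
    + "\n".join(sources)
    + f"\n\nQuestion: {question}\nAnswer:"
)
print(grounded_prompt)  # send this to the LLM of your choice
```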
But how exactly do we identify these sources? How do we find these few sentences in, for example, dozens of PDFs or documents, or somewhere in a database? How do we shorten these sources, and how do we rank them, after all, to provide the top 3, or 5, or 10 items, depending on your use case? The answer to this question is the Retrieval-Augmented Generation pattern, or RAG. As the name says, it's about retrieving these data sources, augmenting your prompt with them, and, after all, generating the completion. "Generate" is trivial: it's about sending your final prompt to the LLM. "Augment" is even simpler: it's just a string operation, literally inserting the sources of information into your prompt, like we've seen in the previous example. "Retrieve" is where the real magic happens: how, based on this particular question, or on the question plus the previous conversation, do we identify the data we need, back in our use case with dozens of PDFs full of insurance information? Here we can leverage one more large language model. It could be the same one we use for the final call, but most likely, in the vast majority of situations, it will be a completely different, specialized one, and you might want to vectorize your request and send it to a vector database. Unfortunately, that is out of today's scope; consider it your homework to learn more about the retrieve component of RAG. In the real world it's even more complex, because there could be multiple sources of this information: before constructing the final prompt that contains everything, you might want to send requests to your internal database, to some external API, and who knows where else, so you may want some orchestration support.

Surprise, surprise: in the real world it's even more complex than just this orchestration. If we talk not only about this particular call, or chain of calls, to the LLM, but about the full life cycle of our AI-infused application, or at least its generative AI part, we want tools and frameworks for ideation, for building, for making everything real, for deploying, and for monitoring. It's a very specialized flavor of DevOps, specifically for your LLM interactions; we can call it LLMOps. So let me introduce more tools that can help us both with orchestrating our calls and with operationalizing the whole flow. The first couple of tools I want to mention are LangChain and Semantic Kernel, and these are amazing orchestrators of everything you need to interact with any kind of LLM. These two libraries, or frameworks, you name it, are very similar; they support slightly different sets of programming languages and slightly different sets of platforms, but in a nutshell they give you a very nice abstraction over all your interactions with LLMs, so you don't need to write all this call-and-response ping-pong from scratch. There are also many more very interesting features included in both frameworks. So when you start your new AI-infused application, don't ever start your low-level communication with the LLM from scratch; find your preferred level of abstraction in these two or any other frameworks.
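As a rough, deliberately naive illustration of the retrieve and augment steps (the part that frameworks such as LangChain or Semantic Kernel, together with embeddings and a vector database, handle properly in production), here is a self-contained sketch that uses made-up documents and keyword overlap instead of real vector search.

```python
# Toy retrieve -> augment flow; "generate" is simply sending the final prompt
# to an LLM, as shown earlier.
documents = [
    "The standard health plan covers one routine eye exam per calendar year.",
    "Dental cleanings are covered twice per year under the premium plan.",
    "Travel insurance add-ons must be purchased before departure.",
]

def retrieve(question: str, docs: list[str], top_k: int = 2) -> list[str]:
    # Naive ranking by keyword overlap; a real system would use embeddings.
    words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def augment(question: str, sources: list[str]) -> str:
    joined = "\n".join(f"- {s}" for s in sources)
    return f"Answer using only these sources:\n{joined}\n\nQuestion: {question}"

question = "Does my health plan cover annual eye exams?"
final_prompt = augment(question, retrieve(question, documents))
print(final_prompt)
```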
And for overall orchestration and operationalization of your LLM application, I recommend you have a look at PromptFlow, another open-source tool, this one also created by my colleagues at Microsoft. In a code-first way it helps you keep full control over the full cycle of your LLM development, starting from experimentation and going all the way up to monitoring. The flow UI could look like this: this particular short video is from the hosted version on Azure, but you can get the same UI straight in your VS Code IDE via the respective PromptFlow extension. Basically, it represents your interaction with LLMs as a graph, where some nodes are pieces of Python code, some nodes are calls to LLMs, and some nodes represent prompts where you can iterate through multiple variants, for example to identify which one works best. And after all, you can host everything on the cloud of your choice. There are many learning resources available about prompt engineering, and I hope you will get the PDF of this session with all the active links.

What is the future of prompt engineering? I have a set of questions that I will not answer; I encourage you to find the answers yourselves. For example: will it be a separate job title or just an essential skill? Will it become simpler with tools like LangChain, Semantic Kernel, and PromptFlow, or will it become more complex, because there is multimodality and you have to work closely with vectors if we talk about RAG, and so on? Will it be democratized, so that pretty much everyone who has ever sent a request to ChatGPT will call themselves a prompt engineer and there will be title inflation, or will it be a more gated discipline? Who will benefit the most: people who understand how our language is constructed, the linguists; people who understand how the technology works; or maybe people who know everything about the particular domain the application comes from and are able to formulate the problem? And will we survive the competition with LLM-based prompt engineers? Because LLMs are also perfect at creating prompts.

I invite you all to the Prompt Engineering Conference I organize in November; this will be the second edition, and the first edition last year was super popular. If you watch this session before November 20, just go and register for your ticket. If you watch it after, still go to the same URL: you will find all the sessions recorded right there. It's online, it's free, and it's open for everyone. Thank you very much for watching this session, and my last prompt for everyone: let's stay in touch. This is my LinkedIn profile; just scan this QR code or find Maxim Salnikov, Microsoft, on LinkedIn. It's my great pleasure to stay connected with you. Ask me about prompt engineering, AI engineering, or any questions about web development and cloud in general. Thank you very much.

Maxim Salnikov

Digital and App Innovation Business Lead, Western Europe @ Microsoft



