Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone.
How's it going?
My name is Dan Cleary.
I'm the co-founder of PromptHub, and today we will be talking
all about prompt engineering.
We'll talk about why prompt engineering matters, and whether it's dead; the system message versus the prompt or user message; how different models require different types of prompts and prompting methods; a variety of ways you can try to get better outputs through prompt engineering methods; and best practices.
Does persona prompting even work?
Meta prompting, and a whole bunch of templates and takeaways throughout.
So I like to start with this question of, why prompt engineering?
It was a pretty popular opinion early on that prompt engineering wasn't really a thing, but over the months and years since ChatGPT came out, I think people have found that it can actually be quite hard to get the type of output you're looking for from a model consistently.
And that's where prompt engineering comes into play.
Small changes end up making a big difference here because of how a model's latent space works.
So if you say, write code to render this image, versus, write secure code as if you were John Carmack, a famous software engineer, you'll get drastically different outputs just from those small tweaks.
And I think that will always be there in some capacity.
You may have to do less of the engineering and method-type work in the future, potentially, but I think there's always going to be at least a small place for this.
Another reason why it's important is that it's one of the three major ways to get better outputs from LLMs, and it's the starting point.
So you do a bunch of prompt engineering, you see where you're at, what problems
you're running into and whether you need to turn to other methods
to solve those remaining problems.
It's just a starting point for a lot of teams, and it's very accessible.
Whether you're technical or non-technical, you can get up and running very quickly with prompt engineering.
And you really can't avoid it.
So for all those reasons, that's why I believe it's important, at least for the time being and into the near future.
And lastly, in the same way that having a good UI/UX product experience is a competitive advantage, having prompts that work well is a similarly important competitive advantage for AI teams.
So now we'll talk about the different types of messages that models can
support, namely system versus user.
So the system message, as you can see here, is "You are a helpful assistant."
Then we have the user message, which is the prompt you're sending to the model to get an output, and so on.
The system message is optional when you're doing these things via the API.
It's the stuff behind ChatGPT's interface, what OpenAI has programmed the chatbot to sound like and think like.
It's used to set context and rules, the higher-level things versus low-level instructions: setting the role, providing context, guiding the model's behavior, controlling format, things like that.
The prompt, by contrast, is where you get more specific: more of the contextual info, the focus, and so on.
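Just to make that concrete, here's a minimal sketch of what this looks like through the API, assuming the OpenAI Python SDK; the model name and message wording are placeholders for illustration, not from the talk.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        # System message (optional): sets high-level context and rules
        {
            "role": "system",
            "content": "You are a helpful assistant. Keep answers under 100 words.",
        },
        # User message: the prompt itself, with the task-specific details
        {
            "role": "user",
            "content": "Summarize this feedback: 'The app is fast, but onboarding was confusing.'",
        },
    ],
)

print(response.choices[0].message.content)
```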
And we have a couple of examples in the wild here from companies like OpenAI.
Anthropic also published theirs in their documentation, so you can see the system messages that power their Claude chatbots.
And so now we'll talk about how different models require different prompts.
If you're interchanging between providers, and even between models from the same provider, you've probably run into this experience where each of them has its own differences in the way it handles tasks and the way it sounds and responds.
So, for example, let's take chain of thought.
This is thinking step by step: prompting the model to do some reasoning before giving an output.
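In its simplest zero-shot form, chain of thought is just an added instruction, something like this sketch (the prompt wording is my own illustration):

```python
# A zero-shot chain-of-thought prompt: the last line asks the model
# to reason before answering instead of jumping straight to a result
prompt = (
    "A store sells pens at $2 each and notebooks at $5 each. "
    "If I buy 3 pens and 2 notebooks, how much do I spend?\n\n"
    "Let's think step by step before giving the final answer."
)
```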
There was an experiment run where they found that chain of thought actually reduced the performance of PaLM 2, which is an older Google model at this point, but it just goes to show that doing what is considered a best practice didn't help performance.
In this case, it actually made things worse.
And that just goes to show that one size is really not going to fit all, which brings us to the second paper we'll talk about, from VMware, where they basically tested a wide variety of different prompt openers, descriptions, and closers.
They put prompts together using all of these different parts.
Here are a couple of examples from a math dataset with the various system messages here.
Some have a role, some don't, and so on.
And what they did was have the model generate the best prompt for that specific task.
So they let the model decide and do its own meta prompting based on the task and the outputs it was receiving, and so on.
This was what Llama 2 created, and you can see it's including some Star Trek language here.
This is after running hundreds of prompts and hundreds of rounds of meta prompting in an experimental setting with a high degree of statistical significance.
Versus Llama 2 13B, same family, same provider: there's no mention of any Star Trek things; it's all very cut and dry.
And then this was pretty popular last year, the "take a deep breath and work on this problem step by step" finding.
This was a similar experiment where they had the model figure out a top instruction for itself, and they ran a bunch of tests to see which one came out on top.
This is the top instruction for a variety of different models here, and we can see GPT-4's is, you know, five times larger than PaLM 2's.
So it just goes to show that everything is going to be a little bit different depending on the model, and that's why it's important to test these things.
And if you're looking for more information on models, like max tokens, context windows, costs, features, and functionality, we have a directory that we just launched with all that information for basically all of the most popular model providers.
All right, so moving on to my two favorite prompt engineering practices, the ones we tell the teams we work with to focus on.
The first is giving the model room to think.
This is a popular one and relates to chain of thought to a degree: you want to let the model think; you don't want to force it to give an answer or overly constrain it.
You want it to go through some sort of reasoning process and come to an answer on its own.
The second is using delimiters, or some other way to better structure your prompt.
I can't tell you how many times I've had a team say, hey, we're struggling with this prompt, can you look at it?
I look at it, and I can't even see what's going on.
So that's a good litmus test: have someone else look at your prompt and see if they can understand how it's organized.
If not, start to provide some of that structure via delimiters, backticks, quotes, whatever is going to help structure the prompt better.
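Here's a rough sketch of what that structure can look like. The headers and XML-style tags are arbitrary choices; the point is just that each part of the prompt is clearly separated:

```python
# A structured prompt using markdown-style headers and XML-style tags
# as delimiters; any consistent delimiter scheme works
prompt = """### Task
Summarize the customer feedback below in one sentence.

### Rules
- Be neutral in tone.
- Do not mention employee names.

<feedback>
The checkout page kept timing out, but the support rep fixed it quickly.
</feedback>
"""
```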
Now we'll look at a few more methods.
So just to set the stage, a zero-shot prompt is just a normal prompt.
If you ever hear that referenced, it's basically just a typical prompt like this.
A few-shot prompt is when you include examples inside your prompt.
So this would be all one prompt, but we're going to show the model examples in line.
Here we're classifying the sentiment of feedback: this person's positive, then negative, then positive, and then we let the model fill in the blank.
You could also do this via multiple messages, sending an array of messages through the API.
And that's typically, I think, what few-shot technically is, rather than having it all be in one message, because the model will handle it differently if it's in its history versus reading it all in one prompt.
Both ways are effective, and both ways are worth testing.
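Here's a minimal sketch of both versions for the sentiment task, with made-up examples; the messages-array version uses the OpenAI-style chat format:

```python
# Version 1: all examples in line, as a single few-shot prompt
single_prompt = """Classify the sentiment of each piece of feedback.

Feedback: "Setup took two minutes, amazing." -> positive
Feedback: "It crashes every time I export." -> negative
Feedback: "The new dashboard is so much cleaner." -> positive
Feedback: "I waited a week for a support reply." ->"""

# Version 2: the same examples as an array of messages, so the model
# sees them as prior turns in its conversation history
messages = [
    {"role": "system", "content": "Classify feedback sentiment as positive or negative."},
    {"role": "user", "content": "Setup took two minutes, amazing."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "It crashes every time I export."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "I waited a week for a support reply."},
]
```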
Few-shot prompting is really helpful in a variety of domains.
It can help with structure, format, content, style, and tone, and those are really the big areas where I've seen it be helpful.
How many examples?
The great benefit here is that you get a lot of the gains from just having one or two examples.
Then it plateaus and can even degrade in a lot of situations.
We say, hey, start with anywhere from two to five examples.
If you're still not getting the performance you're looking for, you might need to look elsewhere, because adding more has a chance of starting to degrade performance.
A couple of other important best practices here.
Use diverse examples.
So if you're doing sentiment analysis, don't use only positive ones; use a combination.
Have them cover a wide range of what you're going to be expecting in your application, so cover those edge cases.
Randomly order them, so you don't have all the positives in one section and then all the negatives.
And then make sure they follow a common format so the model can better learn in context.
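A small sketch that puts those last few practices together, diverse examples, random ordering, and one common format (the feedback lines are made up):

```python
import random

# Diverse examples covering both labels, including less obvious cases
examples = [
    ("Setup took two minutes, amazing.", "positive"),
    ("It crashes every time I export.", "negative"),
    ("Decent app, though the ads are getting out of hand.", "negative"),
    ("Support actually called me back the same day.", "positive"),
]

# Randomly order them so all the positives aren't grouped together
random.shuffle(examples)

# Keep every example in the same format so the model can learn the pattern
shots = "\n".join(f'Feedback: "{text}" -> {label}' for text, label in examples)
print(shots)
```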
And we have a whole guide on this as well, with lots of examples and templates.
Alright, next up is "according to" prompting, which is basically just trying to ground the model in a specific set of information.
You can see in this original prompt, it's asking a question, and then it just adds in "according to Wikipedia."
So it's trying to guide the model to the source of the answer you're looking for.
And this can be helpful, especially if you've done some fine-tuning.
And here's an example of that.
Again, all the templates are available in PromptHub under the templates tab, for free too.
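Just to show how small the change is, the whole method can be as simple as appending a grounding phrase (my own wording, not the exact template):

```python
question = "Who discovered penicillin?"

# The "according to" phrase steers the model toward a trusted source,
# grounding its answer in that set of information
prompt = f"{question} Respond using information that can be attributed to Wikipedia."
```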
And then the last one, I believe, is called step back prompting.
This is a very similar, I'd say, variant of chain of thought prompting, where you send a question and have the model first think through abstract concepts, and then use those abstractions to reason through the question or the task.
So you prompt it to do this thinking first, and this is what you see with o1-preview and o1-mini, where the model thinks about the steps, reasons about them, and then solves the problem.
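Here's a rough two-stage sketch of what step back prompting can look like, again assuming the OpenAI Python SDK; the prompt wording and model name are just illustrative:

```python
from openai import OpenAI

client = OpenAI()
question = "What happens to the boiling point of water at high altitude?"

# Step 1: step back and ask for the abstract principles behind the question
abstraction = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{
        "role": "user",
        "content": f"What general principles are relevant to this question?\n\n{question}",
    }],
).choices[0].message.content

# Step 2: use those abstractions to reason through the original question
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Using these principles:\n{abstraction}\n\nNow answer: {question}",
    }],
).choices[0].message.content

print(answer)
```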
So again, these are all linked in PromptHub, at app.prompthub.us/templates.
Last up, and my favorite one, is persona prompting.
This has been very popular for a long time now.
It's giving the model a persona to solve a certain task.
And there are a lot of papers on both sides of this in terms of how effective it is.
I've come out on the other side, thinking it's actually not that effective in certain use cases.
And the main reason comes from a learnprompting.org paper, which is linked here.
I had this intuition that it wasn't great for accuracy-based tasks, and this really reinforced that.
So basically, they set up an experiment.
They ran 2,000 prompts on MMLU, so a knowledge-based task, and they gave the model a bunch of different roles.
And really, the long and short of it is that when they told the model it was a genius, it got a lower percentage than when they told it it was, like, an idiot.
Yeah, the genius was actually the worst performing one here.
And so how can you reconcile that and still think that role prompting works?
I don't know.
We'd have to look at other data, but this seems like pretty strong evidence to rule out that conclusion.
It may just be anecdotal, but persona prompting is helpful in terms of, I would say, tone and style.
So if you're doing content generation, things like that, it can help, but not for increasing accuracy.
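So a sketch of where I'd still reach for it, a persona shaping tone rather than accuracy (a made-up example):

```python
messages = [
    # The persona here shapes tone and style, not factual accuracy
    {
        "role": "system",
        "content": "You are a witty, slightly sarcastic copywriter for a consumer tech brand.",
    },
    {
        "role": "user",
        "content": "Write a two-sentence product description for noise-canceling headphones.",
    },
]
```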
Last up, we'll talk a little bit about meta prompting.
So what's meta prompting?
It's a prompt engineering method that uses the LLM to help you write your prompt, so using ChatGPT to help write your prompt.
And we are big proponents of this.
We think this is how prompt engineering should really be done: alongside the model that you're working with.
In the same way you use AI and LLMs for writing and coding, you should use them for prompt engineering as well.
Work together to form a good prompt for your use case, then go and test it, and continue that iterative loop.
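At its simplest, meta prompting is just a prompt about a prompt. Here's a minimal sketch; the wording is my own illustration, not the meta prompt from our tool:

```python
task = (
    "Summarize long customer support tickets into three bullet points "
    "for an internal dashboard."
)

# Ask the LLM to act as the prompt engineer for your task
meta_prompt = f"""You are an expert prompt engineer.
Write a prompt I can give to an LLM for the following task:

{task}

Include a role, clear instructions, output format requirements,
and one or two illustrative examples."""
```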
And there are a bunch of tools out there to do this.
We have one that we launched, and specifically, we will run a different prompt to generate your prompt, so the meta prompt is based on whatever provider you're using, because as we saw before, every provider is different.
So we've baked those differences into the meta prompt for each one of the providers.
It leverages best practices, it's free, and you can use it in our app without any account, so that's good to go.
Anthropic was one of the first ones to do this, and I think they have a really great grasp on prompt engineering in general.
So you can use this in the Anthropic console.
It has a bunch of best practices built in, and it's open source.
It does charge, but that's nominal.
And then OpenAI actually just released one in the past month or so.
You can use it in the playground, and it generates system messages only, but it's still usable and fun.
And we did a little bit of prompt injection to get the prompt behind it, because it wasn't open source.
And so that was really cool to see.
It's always interesting to see how the model providers are writing prompts.
So that's available in PromptHub as well.
Wrapping up, four things you can do today.
Structure your prompts with headers and delimiters; that's going to be a big help.
The more specific you are in your instructions, the better your prompt will be.
You can throw out all the other methods; if you can just nail that part, that's great.
And if you don't nail it, all the other stuff really isn't going to help you.
Including examples via few-shot prompting is a great way to improve outputs.
I think meta prompting plus few-shot plus chain of thought is really going to be the winning formula going forward.
And then, don't overly constrain the model.
And thank you, I hope you enjoyed this.
If you want to talk about this, feel free to reach out.
We're active on LinkedIn, and yeah, have a great rest of your day.