Transcript
Hi, today I'm going to talk about a creative approach I've devised that harnesses multiple LLMs in order to achieve a high-quality output. It's something I call collaborative AI, and my hope today is that you will be able to unlock its potential to really push the upper limits of what is possible in terms of content quality and LLM output. Today I'm going to use ChatGPT-4 and Claude 3 Opus to showcase the power of collaborative AI. Before I do so, I wanted to talk a little bit about how content creation pre-AI has typically played out in my field: online learning. The bar for creating educational content is usually very, very high when it comes to factuality. We hire what are known as SMEs, or subject matter experts, and their job is to be as accurate as possible, both in writing and in reviewing content. They are essentially the domain experts of their field, be it art history or upper-division calculus. What they are oftentimes not is professionally trained writers. As a result, their writing, while grammatically sound and factually accurate, can sometimes come across as a little bit dry, unengaging, and a bit repetitive. This isn't helped by the fact that the amount of content often needed by online providers is staggering, and SMEs have to work under tight deadlines. Engaging writing with memorable examples, smooth transitions, and that writerly touch is oftentimes out of reach for SMEs, even those with professional training in writing.
With AI, we are now able to get content out much faster, but not without potential pitfalls. For one, there are hallucinations, where the LLM generates inaccuracies and even outright falsehoods. There's this fear that first-time learners might end up thinking the Civil War happened only last century. While such glaring falsehoods aren't necessarily that common, smaller inaccuracies do occur. Then there's the question of writing: do AI models have the ability to take otherwise potentially dry educational content and make it exciting and interesting while still being accurate and able to convey a sense of authority? In the PowerPoint presentation that follows, I'm going to take a piece of educational content, have Claude 3 Opus and ChatGPT-4 evaluate it, and then have them generate their own versions. But I'll go further than that, leveraging collaborative AI as I take inputs and outputs from one LLM and feed them into the other, using collaborative AI to improve upon those outputs so that the final product is better than anything that either LLM could have generated by itself.
So let's dive in and see collaborative AI in action. Here we are with "Transforming Content Creation with Collaborative AI." The first thing I did was create a little experiment. The idea here was that we needed some baseline, some standard content that the LLMs could improve upon, and we needed to make sure that there was some scoring around that baseline sample; otherwise it would be difficult to say whether, and by how much, the other LLM-generated outputs improved. So first off, we needed a piece of educational content, something that could serve as our baseline, and of course we needed to choose a topic as well. Then we needed to define the criteria of quality: what made this a strong or not-so-strong piece of writing, and what did we want the two LLMs, in this case ChatGPT-4 and Claude 3 Opus, to focus on when generating a high-quality sample?
That again speaks to the idea of establishing a quality baseline. So what I did is I had ChatGPT-4 actually write a chapter, and then later Claude 3 scored that chapter. Now I'm going to go through each one of these parts, starting with the piece of educational content and ending with more details about establishing that quality baseline. First off, the piece of content: I decided it was going to be a 500-to-600-word article on large language models, and I think that makes sense given the target audience. But I didn't just have it write the large language model article. Instead, I fed Claude 3 three online learning excerpts from different topics, different areas, something that it could model when it actually generated its own article, and I chose samples that were indicative of more average online learning content. There's a lot of great online learning content out there, and I definitely don't want to cast aspersions upon the field, but I was going for something a little bit more average, something that someone under a deadline might end up creating. So I fed that to Claude 3 and had it actually characterize the writing, which is a step I like to do with LLMs. It's a sort of reflection step in between, before they actually generate an article. Now, you don't have to do this, but it's something that I did before it actually generated the writing. In doing so, it identified eight characteristics from these excerpts, and it also served as a sanity check, just to make sure that what I thought wasn't great writing, it could back me up on. And indeed it did. It came up with a total of eight characteristics; I've only posted six and a half here, but that gives you an idea. The point here isn't to read through each one of these, but that there definitely are lapses in quality.
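
In practice, this reflection step is just an extra prompt that runs before any generation. Here is a minimal sketch of what it can look like, assuming the Anthropic Python SDK; the excerpt file names and prompt wording are placeholders, not the exact ones used in the talk.

```python
# A minimal sketch of the reflection step, assuming the Anthropic Python SDK.
# The excerpt file names and prompt wording are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

excerpts = [open(path).read() for path in
            ("excerpt_1.txt", "excerpt_2.txt", "excerpt_3.txt")]

reflection_prompt = (
    "Here are three excerpts from online learning courses:\n\n"
    + "\n\n---\n\n".join(excerpts)
    + "\n\nBefore writing anything new, characterize the writing in these "
      "excerpts. List the main characteristics you observe, including any "
      "lapses in quality such as repetition, dryness, or weak transitions."
)

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{"role": "user", "content": reflection_prompt}],
)
print(response.content[0].text)  # the characterization, used as a sanity check
```
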
Now, once the LLM has that, it can generate this piece of content here, which again is a 500-to-600-word chapter on LLMs. I actually used my editorial eye just a little bit and looked at those eight characteristics as well, courtesy of Claude 3, and I tweaked just a few things, but nothing major. And this is what we ended up with.
Again, I'm not going to pause here too long; the point isn't really to read this. In fact, I've only excerpted it, because this is clearly not 500 to 600 words, but the text from which this is excerpted was around the 500 mark. LLMs usually aren't that great at counting, but they did a pretty good job here. But what is mediocre about this? Let's just quickly look at that first sentence, where it says LLMs are a type of AI, artificial intelligence. Then notice the second sentence: LLMs use. It repeats that exact same noun, and that gives rise to a repetitive, dry kind of writing. And if you dive in here a little bit more, you'll see that as well: the third paragraph also starts off with LLMs. The idea here is that it's the quality of writing we're going for, and this just doesn't hit the mark.
Next, we wanted to define the criteria we were looking for in good writing. When we have the LLMs create quality output, what are we defining as quality? So we marked out some criteria here. The LLM, in this case Claude 3, was able to come up with five categories: engaging language and storytelling, relatable examples, thought-provoking questions, sentence structure and clarity, et cetera. This is what it identified, and what I agreed were hallmarks of strong, engaging educational content.
So again, in establishing the baseline, we got a score of four out of ten. But I didn't want to just stop there. I asked myself: what if we just asked the LLM, in a one-shot prompt, to come up with a chapter on LLMs for an online learning course? What would it come out with? It came out with something from that one-shot prompt, and it got a seven out of ten. Now, I'm not going to paste that here, but I'll say it was high-level, generic, typical AI stuff.
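
The talk doesn't show the exact one-shot prompt or say which model produced that chapter, so treat the following as an illustrative sketch only, here using GPT-4 through the OpenAI Python SDK.

```python
# An illustrative sketch of the one-shot baseline, assuming the OpenAI Python SDK.
# The prompt wording is a guess at the spirit of the request, not the actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Write a 500-600 word chapter on large language models "
                   "for an online learning course.",
    }],
)
print(completion.choices[0].message.content)  # this baseline scored a seven out of ten
```
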
This was a good baseline for me, because if I use collaborative AI and it turns out I also get a seven, then there doesn't seem to be much point in collaborative AI when a one-shot prompt will get you a decent seven out of ten. But let's see what actually happens when we use collaborative AI.
Now, you'll notice it says pre-step one, so we're not quite there yet, and sorry to be teasing you on this, but we're almost there in the next slide. For now, though, the pre-step prompt was to ask Claude 3 and ChatGPT-4 to identify characteristics of the original sample and score it. That's the reflection piece that we did earlier on, so it isn't an integral part of collaborative AI, just something nice to do. And then the second pre-step was to ask each LLM to generate a sample as close to a ten out of ten as possible.
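
To make the two pre-steps concrete, here is a sketch of what generating each model's version one can look like. It assumes the OpenAI and Anthropic Python SDKs; the prompt wording and the helper names gpt4_generate and claude_generate are my own, not from the talk, and the later sketches below reuse them.

```python
# A sketch of pre-step two: each model writes its own version one, aiming for a 10/10.
# Assumes the OpenAI and Anthropic Python SDKs; prompt wording is paraphrased.
import anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

GENERATION_PROMPT = (
    "Write a 500-600 word chapter introducing large language models for an "
    "online learning course. Aim for a 10/10 on engaging language and "
    "storytelling, relatable examples, thought-provoking questions, and "
    "sentence structure and clarity."
)

def gpt4_generate(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def claude_generate(prompt: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1500,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

gpt4_v1 = gpt4_generate(GENERATION_PROMPT)      # ChatGPT-4's version one
claude_v1 = claude_generate(GENERATION_PROMPT)  # Claude 3 Opus's version one
```
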
And this is where the collaborative AI process and machinery really starts, with step one.
Here, what I did was input version one from each LLM into the other one, asking it to evaluate that version on a scale from one through ten. So, for example, Claude 3 created a version one in that pre-step number two a second ago, and then I fed that version into ChatGPT. But look at that second part there, where it says "other LLM": that's the part asking it to evaluate. So I didn't just input the version; I actually asked it to rate it and score it, much the way a teacher or a professional would do. With that evaluation in hand, we go to the next step, which is to take the version one evaluation from one of the LLMs and put it back into the other LLM. I know this can get a little criss-crossy, but to give you an example: ChatGPT's evaluation, which was of Claude 3's version one, I then put back into Claude 3. There's a second part to step two, which is inputting it into the first LLM for a rewrite. And that brings us to step number three, where I take the rewrite, which we're now calling version two, and input it back into the other LLM for evaluation and scoring.
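
Put together, steps one through three form a small cross-evaluation loop. Here is a rough sketch of that loop, reusing the gpt4_generate and claude_generate helpers and the version-one drafts from the previous sketch; the prompts are paraphrased, not the exact ones from the talk.

```python
# A rough sketch of steps one through three. Assumes gpt4_generate, claude_generate,
# gpt4_v1, and claude_v1 from the previous sketch; prompts are paraphrased.

def evaluate(evaluator, draft: str) -> str:
    """Steps one and three: ask the *other* LLM to score a draft from 1 to 10."""
    prompt = (
        "Another LLM wrote this chapter for an online learning course on "
        "large language models:\n\n" + draft +
        "\n\nScore it from 1 to 10 against these criteria: engaging language "
        "and storytelling, relatable examples, thought-provoking questions, "
        "and sentence structure and clarity. Explain your score in detail."
    )
    return evaluator(prompt)

def rewrite(author, draft: str, evaluation: str) -> str:
    """Step two: feed the evaluation back to the original author for a rewrite."""
    prompt = (
        "Here is your draft chapter:\n\n" + draft +
        "\n\nHere is detailed feedback from a reviewer:\n\n" + evaluation +
        "\n\nRewrite the chapter to address every point and get as close to a "
        "10 out of 10 as possible, staying at 500-600 words."
    )
    return author(prompt)

# Claude's draft is scored by GPT-4, Claude rewrites, GPT-4 scores the rewrite...
claude_v1_eval = evaluate(gpt4_generate, claude_v1)              # step one
claude_v2 = rewrite(claude_generate, claude_v1, claude_v1_eval)  # step two
claude_v2_eval = evaluate(gpt4_generate, claude_v2)              # step three

# ...and the same criss-cross in the other direction.
gpt4_v1_eval = evaluate(claude_generate, gpt4_v1)
gpt4_v2 = rewrite(gpt4_generate, gpt4_v1, gpt4_v1_eval)
gpt4_v2_eval = evaluate(claude_generate, gpt4_v2)
```
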
To step back for a moment, we can think of it like this: I gave Claude 3 an opportunity to do a rewrite the way we would in a classroom, where we get feedback from a teacher, and version two is its rewrite based on that evaluation and scoring. At that point, was there a difference between version one and version two in terms of score? Now, you could carry this process on and on. You could have a version three, a version four, a version five. But I think at a certain point there are diminishing returns. So what we're trying to see in this little experiment is: was there an improvement between version one and version two?
So let's see what happens. But before we get too excited, we have a step number four, which I think is very important: checking for hallucinations by inputting version two into the other LLM. So essentially, we're using collaborative AI to do hallucination checks.
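
Step four can be sketched the same way as the earlier steps, again reusing the helpers from the generation sketch; the fact-checking prompt below is illustrative, not the exact one from the talk.

```python
# A sketch of step four: using the other LLM as a hallucination checker.
# Assumes gpt4_generate, claude_generate, gpt4_v2, and claude_v2 from the earlier sketches.

def check_hallucinations(checker, draft: str) -> str:
    prompt = (
        "Fact-check the following chapter about large language models. "
        "List any statements that are inaccurate, misleading, or unverifiable, "
        "or reply 'No factual issues found.'\n\n" + draft
    )
    return checker(prompt)

print(check_hallucinations(gpt4_generate, claude_v2))    # GPT-4 checks Claude's version two
print(check_hallucinations(claude_generate, gpt4_v2))    # Claude checks GPT-4's version two
```
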
I know I threw a lot of text and words at you, but if you pause here for a moment, you can see the collaborative AI structure spread out across each one of the steps. Again, there could be more steps if you wanted to do more than just two versions, but this is the bare-bones, basic little experiment version that we are doing here. So maybe you're curious now: what were ChatGPT's and Claude 3's first versions, and how were they scored? I'm happy you asked.
Let's dive in. For ChatGPT's first version, we can see that it gets a seven out of ten based on the criteria of engaging writing, et cetera, which isn't great, given that the one-shot prompt also got us a seven out of ten. But at least it's a starting point, and hopefully the second version will be better. And how did Claude 3 do? Let's see what teacher ChatGPT-4 has to say. If you look at the bottom of the first paragraph, it says, "I would rate this version an eight out of ten." So it did a little bit better. For now, this is enough to give you an idea of how this works.
So here's the second step: we feed the evaluations from one LLM back into the other for a rewrite. Now, in this case, I actually included the entire evaluation, and I did that for a reason: I think it's important to see just how detailed these evaluations are. When the LLM is getting its feedback, you can think of that feedback as a prompt. Imagine writing a prompt that is this long, or even longer. That's not necessarily a bad thing, given that LLMs often thrive off this level of specificity, and there is a lot of specificity going on. But does it actually amount to anything in version two? Meaning, will the LLMs write a better version of the chapter? I'm happy you asked, because now we're at the point where we can ask each one to come up with a version that gets a perfect ten. So we've definitely raised the bar, but there's a lot of specificity, and that, of course, is what makes collaborative AI so powerful.
So how do these rate? Here is version two from Claude, and here is version two from ChatGPT. And, drumroll, their scores: a 9.5 out of ten. You can see this is Claude rating ChatGPT, so even though ChatGPT's first attempt was a measly baseline seven, this one got close to a ten. And Claude's version got a solid nine, if you look at the bottom of the first paragraph. In both cases, version two was much better.
So we can see the step three scores here: version one, a seven for ChatGPT and an eight for Claude; version two, a 9.5 and a nine. Both are marked improvements, simply by using just one round of collaborative AI. Now, you might be asking: well, what about the human in the loop, in this case me? Did I agree with these versions? Yes, I did read them, and I agreed with them in terms of the improvement; they were both much better than the first versions. In fact, based on the criteria we established, I essentially agreed with the scores. The reason I hesitate to say I wholeheartedly agree with them is that I felt at times they were maybe trying to be a little too engaging, a little bit too fun. But that wasn't necessarily part of the evaluation criteria, so that's something that, as the human in the loop, I can ask for in a version three, just to tweak it so it doesn't sound like it's trying to be someone's pal or trying too hard to be relatable. Again, everything besides that was at a much higher level, making it solid educational content that I think would really pull in audiences and make learning so much more fun and enjoyable.
Before we go on, though, I want us to compare to the baseline scores one last time, just to see where we came from, speaking of making educational content more fun and enjoyable. Again, not all human-created educational content is a four out of ten, but if the average-ish piece is, then to go from a four to a nine or a 9.5 is a huge improvement. And collaborative AI, at least with just this one round, doesn't take long at all compared to some of these editorial and content creation processes that involve multiple SMEs, both creators and reviewers, several rounds, and someone oftentimes overseeing the entire process. You can see that that can be costly, and again, if that standard is only a four, then we're also getting a huge, huge bump in quality. Finally, there's that hallucination check: both pieces passed. I think, though, it's always super important to have a human in the loop, especially for something like educational content, where you do not want facts that are incorrect, no matter what. Now for the broader implications.
For organizations where quality of content and speed are vital, collaborative AI can be a huge addition to whatever AI they're currently using. And if they're not using any AI, then they enter AI and LLMs at a much higher level than they would with one-shot prompting. Digital marketing, PR, and corporate communications are just a few of the areas where high-quality, engaging content can really help people learn something. Take the case of healthcare communication: if something is dry and dull, patients aren't likely to remember it. Make it engaging at that nine or 9.5 level, and suddenly it's something that's a lot easier for them to pay attention to and learn from, which is really important in health. But again, coming back to that hallucination issue, always keep a human in the loop in many of these different industries.
So now for the closing thoughts. This is an interesting one: the idea that a 9.5 out of ten for one article isn't quite the same as for a corpus, or a body of articles. Why? Well, imagine that 9.5 we saw, and I believe that was the one that ChatGPT output. Imagine that it was the exact same thing over and over again. Now, I just used the word imagine twice in a row, almost as a joke, and the reason it's a joke is that back earlier in the presentation, it might not have been obvious because I didn't focus on it, but two of the articles used "imagine a world" or "imagine something" to start off, and that alone is a little concerning. It shows what could happen if you had, say, 50 articles and maybe 20 of them had that phraseology. So correcting for something like this at scale is important to keep in mind early on when you are creating these pieces. And again, having that human in the loop, someone who can really wield the AI and wield something like collaborative AI, will make it less likely that you're going to see very common or similar opening lines or really similar writing throughout. That said, similar writing is part of AI, and there's almost this AI-speak. After all, we have the GPTZeros of the world that can identify AI-generated language or text for a reason: there is a certain pattern that makes it slightly different from human-generated text.
One way around this is to actually feed the models high-caliber human text for inspiration. If you want to make the output sound even more like a person, more relatable, and perhaps make it so that it's not always saying "imagine a world where" and using some of those other giveaways of AI-speak, then giving it high-caliber human prose that you want it to model is a good way around that.
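
In practice this just means adding exemplar passages to the prompt. Here is a small sketch of that idea, again reusing the claude_generate helper and the claude_v2 draft from the earlier sketches; the exemplar file names and prompt wording are placeholders.

```python
# A sketch of steering the rewrite with high-caliber human prose as style exemplars.
# Assumes claude_generate and claude_v2 from the earlier sketches; file names are placeholders.

exemplars = [open(path).read() for path in
             ("exemplar_essay.txt", "exemplar_feature.txt")]

style_prompt = (
    "Here are examples of the kind of prose I want you to model:\n\n"
    + "\n\n---\n\n".join(exemplars)
    + "\n\nRewrite the chapter below in that register. Avoid stock AI openers "
      "such as 'Imagine a world where...' and keep all the facts unchanged.\n\n"
    + claude_v2
)

styled_version = claude_generate(style_prompt)
```
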
These are just a few ideas for improving the output using collaborative AI, but in general there are lots of different ways we can use collaborative AI: not just coming up with subsequent versions, one after the other, but maybe even leveraging more than two models, having three models, or having a model judge its own writing in a different thread and then comparing that to what the other LLM said, to see how similar they are. There's so much you can do here. And, to use AI-speak again, in a world where the possibilities are limitless, collaborative AI could be a game changer.