Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, thank you for your interest in this session. In this session, Unlocking Reasoning and Planning Abilities in Large Language Models, I would like to take you through the different methodologies and techniques to elicit reasoning abilities from LLMs, walking through the recent research work related to this.
About myself: I'm Logesh Kumar Umapathi, a lead machine learning research engineer at Sama. My research interests include biomedical NLP, large language models, and code generation. You can reach me through these social media channels. I'm also involved in maintaining an open source package called mutate, which is about synthesizing data from large language models.
The agenda for this session: we'll start by understanding what reasoning is and how it is measured in the research literature. In the bulk of the session we will discuss how to elicit reasoning. We'll cover different techniques, like direct prompting, that is, direct one-shot generation of the solution, and then recursive and iterative prompting, in which we let the LLM generate the solution recursively and iteratively. And then we will discuss tool usage, which is the most popular approach now with the advent of HuggingGPT, Hugging Face Agents, and so on.
So what is reasoning? Reasoning can be defined as the ability to make inferences from given evidence and logic. There are different types of reasoning, like commonsense, mathematical, and symbolic reasoning. Reasoning can also be defined as the ability to break down a bigger problem into smaller solvable problems and then recursively solve those sub-problems to finally solve the bigger problem. This can be considered a broad definition of reasoning. Now that we know what reasoning is, or at least have a broad definition of what we are aiming at, let's see how it's measured in the literature.
In the literature, reasoning is usually measured in separate categories: mathematical reasoning, commonsense reasoning, and symbolic reasoning. Mathematical reasoning is usually measured with math word problems, typically math word problem datasets that are available online. GSM8K, for example, consists of grade school math word problems; this is from OpenAI. There are other datasets and benchmarks related to that as well. For commonsense reasoning we have ARC, the AI2 Reasoning Challenge from Allen AI, where science question answering is used to measure the commonsense reasoning of the models. Then we have CSQA, which is CommonsenseQA, and StrategyQA, also from Allen AI. These datasets, or benchmarks, help measure the commonsense ability of these models. To give an example, one of the questions could be: would Aristotle have used a keyboard? For such a question, the model has to work out when the keyboard was invented and when Aristotle lived, and then deduce whether it's possible or not. That type of reasoning is covered under commonsense reasoning. Then we have symbolic reasoning, which was mostly introduced by Jason Wei in his chain-of-thought paper, with last-letter concatenation and coin flip type problems. So this is how reasoning is mostly measured in the literature. To give an example of a benchmark snapshot, we have taken the numbers reported in the GPT-4 technical report. This gives us an overview of how reasoning is measured and the current state of things. We can see that for reasoning tasks like GSM8K and the AI2 Reasoning Challenge, the state of the art is currently around 96% and 92%.
Now that we know what reasoning is and how it is measured, let's see what the methodologies are to elicit reasoning. Before going into that, let's see why there is a need to elicit reasoning at all. Given the size of these models, huge 175 billion or 540 billion parameter models, one might think: why wouldn't reasoning emerge by default from these models? Why is there a need to elicit it? We'll first try to address that, and then come to the different methodologies for eliciting reasoning.
Yesterday I tried this prompt in ChatGPT for this session. As you can see, I asked ChatGPT to take the last letters of the words in "Augusta Ada King" and concatenate them using a space. You can see that the model detected the last letters incorrectly, as "a", "g", and "g", and because of that the final answer is also wrong. But when we break that same problem down into three smaller problems, it works: first, what are the words in "Augusta Ada King"? The model is able to come up with the words. Then, what are the last letters of these words? It is able to come up with the last letters, "a a g", correctly. And when I ask it to concatenate them, it is able to concatenate them. This is why we need techniques to elicit reasoning. These models, by the objective of their training, are not trained to do reasoning, or at least as far as I understand they are not trained to do reasoning. They still have the tendency to do text completion, and that's why we need the methodologies that we are going to discuss in the further slides to elicit reasoning.
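To make this concrete, here is a minimal sketch of that kind of manual decomposition, assuming a hypothetical call_llm helper that stands in for whatever completion API you use (it is not part of any specific library):

    # Minimal sketch of decomposing the last-letter task into sub-prompts.
    # call_llm is a hypothetical placeholder for any text-completion API.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM API of choice here")

    def last_letters_decomposed(phrase: str) -> str:
        # Step 1: list the words; step 2: last letter of each; step 3: concatenate.
        words = call_llm(f"What are the words in '{phrase}'? List them.")
        letters = call_llm(f"What is the last letter of each of these words? {words}")
        return call_llm(f"Concatenate these letters with a single space between them: {letters}")

    # last_letters_decomposed("Augusta Ada King")  # expected: "a a g"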
Let's start with probably the most popular methodology, and, to my understanding, the one that kick-started all the different prompting techniques for eliciting reasoning: chain-of-thought prompting. In chain-of-thought prompting, the authors, Jason Wei et al., found that for mathematical and other reasoning-related questions, instead of asking the model for the answer directly, if we ask the model to generate step-by-step reasoning and only then the final answer, the model tends to generate answers more accurately. Here, if you look at the question "Roger has five tennis balls. He buys two more cans of tennis balls. Each can has three tennis balls. How many tennis balls does he have now?", the answer in the one-shot example spells out the different steps that can be deduced from the question and then the final answer. So for a new question that the model sees, the cafeteria example here, the model comes up with a similar chain of steps and then generates an answer from it. This methodology has been shown to give better results: across all the reasoning datasets and across all the different models, this approach gives better results.
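As a rough illustration, a few-shot chain-of-thought prompt can be assembled like this; it is only a sketch, and call_llm is again a hypothetical placeholder for your model API, not the authors' code:

    # Sketch of few-shot chain-of-thought prompting (hypothetical call_llm helper).
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    COT_EXAMPLE = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n"
    )

    def chain_of_thought(question: str) -> str:
        # The worked example shows the model the step-by-step format to imitate.
        return call_llm(COT_EXAMPLE + f"Q: {question}\nA:")

    # chain_of_thought("The cafeteria had 23 apples. They used 20 to make lunch "
    #                  "and bought 6 more. How many apples do they have?")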
There is another variation of it from Wang et al., called self-consistency. In the initial chain-of-thought prompting, the answers are generated using greedy decoding, so there is only one generation for a given prompt. In self-consistency, the authors use sampling-based decoding to generate multiple generations for a given prompt, and then they take the majority-voted answer, the answer that the most generations agree on, and evaluate that answer against the evaluation set. This way the model performs even better than chain-of-thought prompting. Their intuition is that if a model comes up with many reasoning paths that arrive at the same solution, then that solution is most likely a proper and correct answer. That's the intuition behind it.
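A minimal sketch of the majority-vote idea could look like the following; the sample_llm helper (sampling with a non-zero temperature) and the extract_answer parser are assumed placeholders, not the paper's implementation:

    from collections import Counter

    # Hypothetical helpers: sample_llm draws one completion with temperature > 0,
    # extract_answer pulls the final answer out of the generated reasoning.
    def sample_llm(prompt: str, temperature: float = 0.7) -> str:
        raise NotImplementedError

    def extract_answer(generation: str) -> str:
        return generation.strip().splitlines()[-1]

    def self_consistency(prompt: str, n_samples: int = 10) -> str:
        # Sample several chains of thought and keep the most frequent final answer.
        answers = [extract_answer(sample_llm(prompt)) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]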
An extension to chain of thought addresses the following challenge: with chain of thought, when it comes to mathematical reasoning, we are leaving the arithmetic operations to the LLMs, and we all know that LLMs lack even simple arithmetic abilities. So in this paper, Program-Aided Language Models (PAL), the authors tried a clever methodology where they offloaded the arithmetic calculations to the Python interpreter. The way they did it is they created few-shot prompts, each with a question like we saw in chain-of-thought prompting, but they represented the solution as a Python program. Here you can see the tennis ball example: they deduce the initial number of tennis balls from the question, then the number of balls that were bought, and then the answer. This way they converted the solution into a Pythonic one. So for a new problem, based on the few-shot examples, the model generates a similar Python program, deducing the quantities from the question and then deriving the final answer. To get the final answer, the generated solution is executed in the Python interpreter, and that is taken as the final answer. This performed better than chain of thought, as you can see in the results across all the mathematical reasoning benchmarks. The main reason is that we are using the LLM for its strengths and offloading the weakness of LLMs to the Python interpreter.
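Here is a minimal sketch of that pattern, where the model is prompted to emit a Python solution that is then executed; call_llm is a hypothetical placeholder, and in practice you would sandbox the execution rather than exec arbitrary generated code:

    # Sketch of program-aided prompting: the LLM writes Python, Python does the math.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    PAL_EXAMPLE = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
        "How many tennis balls does he have now?\n"
        "# solution in Python:\n"
        "tennis_balls = 5\n"
        "bought_balls = 2 * 3\n"
        "answer = tennis_balls + bought_balls\n"
    )

    def program_aided(question: str):
        code = call_llm(PAL_EXAMPLE + f"Q: {question}\n# solution in Python:\n")
        scope: dict = {}
        exec(code, scope)          # caution: run untrusted generated code in a sandbox
        return scope["answer"]     # convention: the generated program sets `answer`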
Another variation of this type of prompting is plan-and-solve prompting. This is mostly meant to address the performance of zero-shot chain-of-thought prompting. Zero-shot chain-of-thought prompting is usually done with a prompt where, given a question, we ask "Let's think step by step", and the model comes up with a step-by-step thought process and the final answer. But this wasn't working that well, for a few different reasons. One is the arithmetic ability, as we saw before. Another is that some inference steps are missed by the model, and some inference steps are not converted into solution steps, in this zero-shot chain-of-thought prompting. This is rectified by a methodology called plan-and-solve prompting. Here the authors try a different style of prompt: "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step." So they ask the model to come up with the plan first, and then the solution based on the plan it has derived. They show that this way the model performs better than zero-shot chain-of-thought prompting, and they have even shown that it performs better than few-shot chain-of-thought prompting in some cases.
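A minimal zero-shot sketch of the two prompt styles side by side, again with a hypothetical call_llm placeholder:

    # Zero-shot CoT versus plan-and-solve style trigger phrases (sketch only).
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    ZERO_SHOT_COT = "Let's think step by step."
    PLAN_AND_SOLVE = (
        "Let's first understand the problem and devise a plan to solve the problem. "
        "Then, let's carry out the plan and solve the problem step by step."
    )

    def answer(question: str, trigger: str = PLAN_AND_SOLVE) -> str:
        return call_llm(f"Q: {question}\nA: {trigger}")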
Until now we have seen inference-time methodologies, in-context learning, to elicit reasoning abilities from LLMs. But there are also techniques that can be used to fine-tune large language models to elicit or improve reasoning abilities, and we'll be seeing those in this section of the talk. One paper that does this is Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions.
Here the authors use an LLM, in this case I think a GPT-Neo 2.7 billion parameter model. For a given set of questions, they ask the model to generate a solution, a Pythonic solution, and then they evaluate the answer. When the answer matches the ground truth or gold answer they have, they use that solution in the fine-tuning dataset. So they filter the generations that produced correct answers and iteratively fine-tune the same model on them. They not only use fully correct solutions, they also introduce a methodology to utilize partially correct solutions. The way they handle partially correct solutions is that they have gold solutions with outputs for the individual steps, as you see here, and similarly a generated solution with outputs from each of its steps. Whenever there is a match between these individual steps in the gold and the generated solution, they consider it a partially correct solution and use it to further fine-tune the model. They have shown in their paper that with this type of iterative fine-tuning on model-generated solutions, there is an improvement in the mathematical reasoning abilities of the model. If you look here, the green ones are fine-tuned only on fully correct solutions, and the orange ones are self-sampled with fully correct and partially correct solutions. We can also observe that the pass@1 rate does not improve. The authors comment that this is mainly because the nature of this training encourages the model to generate a diverse set of solutions and does not make the model favor any one particular solution. That's why pass@1 does not improve much, but you can see improvements in the other pass@k values.
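A very rough sketch of the self-sampling loop (fully correct filtering only), assuming hypothetical sample_solutions, run_and_check, and fine_tune helpers rather than the paper's actual code:

    # Sketch of iterative fine-tuning on self-sampled correct solutions.
    # sample_solutions, run_and_check and fine_tune are assumed placeholders.
    def sample_solutions(model, question: str, k: int = 8) -> list[str]:
        raise NotImplementedError

    def run_and_check(solution: str, gold_answer: str) -> bool:
        raise NotImplementedError  # execute the generated program, compare to gold

    def fine_tune(model, pairs: list[tuple[str, str]]):
        raise NotImplementedError

    def self_sample_round(model, dataset: list[tuple[str, str]]):
        buffer = []
        for question, gold_answer in dataset:
            for solution in sample_solutions(model, question):
                if run_and_check(solution, gold_answer):   # keep only verified solutions
                    buffer.append((question, solution))
        return fine_tune(model, buffer)                    # repeat for several rounds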
Another paper that does something similar is Self-Taught Reasoner (STaR): Bootstrapping Reasoning with Reasoning. In this methodology, the authors generate rationales and answers from an existing large language model. Whenever the answer is correct when compared against the gold ground-truth dataset, they put that example into a fine-tuning corpus as a triplet of question, rationale, and answer. Whenever the answer is wrong, they hint the model to generate a correct rationale by giving it the correct answer from the ground truth; the model generates a rationale, and they put that back into the fine-tuning mixture. They then fine-tune the model again, so that the fine-tuned model has a better ability to generate rationales. They do this iteratively and end up with a final model. This has been shown to improve mathematical reasoning abilities even when using only part of the training dataset. If you look here, the few-shot and then fine-tuned performance of the GPT-J model has improved from 5.8 to 10.7.
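A compressed sketch of one STaR-style iteration, with the rationalization hint for wrong answers; generate_rationale, answer_of, and fine_tune are assumed helpers, not the released code:

    # One STaR-style bootstrap iteration (sketch).
    def generate_rationale(model, question: str, hint=None) -> str:
        raise NotImplementedError  # few-shot prompt; optionally include the gold answer as a hint

    def answer_of(rationale: str) -> str:
        raise NotImplementedError  # parse the final answer out of the rationale

    def fine_tune(model, triplets: list[tuple[str, str, str]]):
        raise NotImplementedError

    def star_iteration(model, dataset: list[tuple[str, str]]):
        corpus = []
        for question, gold in dataset:
            rationale = generate_rationale(model, question)
            if answer_of(rationale) != gold:
                # Rationalization: regenerate with the correct answer given as a hint.
                rationale = generate_rationale(model, question, hint=gold)
                if answer_of(rationale) != gold:
                    continue
            corpus.append((question, rationale, gold))
        return fine_tune(model, corpus)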
Another variation of this approach, which is more about distilling from a large language model, is a paper called Specializing Smaller Language Models towards Multi-Step Reasoning by Fu et al. Here the authors try to distill the reasoning steps as well as the solution from a bigger model like GPT-3, and then they fine-tune smaller models, different T5 versions: 250 million, 760 million, and 3 billion parameters. They tried two different variations: one is fine-tuning only on answers, and the other is fine-tuning on both answers and chain-of-thought steps. They found that fine-tuning on chain of thought plus answers gives better accuracy, as the model also tries to understand the rationale behind the answers, and we can see the improvement here. Similarly, they tried this not only with vanilla T5 but also with Flan-T5, and Flan-T5 shows a bigger improvement compared to the vanilla T5 models.
Another recent approach to distillation is the Distilling Step-by-Step paper from Hsieh et al. Here, for an unlabeled dataset, they use a large language model like PaLM or a GPT-3 model to generate labels. And not only labels: they also ask the model to generate the rationale, or chain of thought, for each particular answer. Then, when they distill this and train a smaller model, they train it with a multi-task objective. They ask the model to predict the label as well as the rationale for it, instead of concatenating the rationale and label as one chunk; they approach it as a multi-task problem. They have shown that this gives a better improvement in the smaller model's reasoning abilities. The loss they use is a weighted sum of the label loss and the rationale generation loss. They show that the fine-tuned model in this distilling step-by-step setup performs even better than a 540 billion parameter model's few-shot generations. You can see the T5 220 million, 770 million, and 11 billion models doing a better job on the mathematical reasoning and CommonsenseQA tasks.
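A small sketch of that weighted multi-task objective, where label_loss and rationale_loss stand in for the usual sequence-to-sequence cross-entropy losses of the two prefixed tasks; the exact weighting is an assumption for illustration:

    # Sketch of a distilling-step-by-step style multi-task loss.
    # label_loss: cross-entropy on the "[label] ..." task,
    # rationale_loss: cross-entropy on the "[rationale] ..." task, same input.
    def multitask_loss(label_loss: float, rationale_loss: float, lam: float = 0.5) -> float:
        # lam trades off label prediction against rationale generation.
        return label_loss + lam * rationale_loss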
Until now we saw how a generation is made in one shot: given a prompt with few-shot or zero-shot examples, the model generates the reasoning as well as the answer for the given problem in a single pass. But just as humans have a better chance of solving a problem accurately if they approach it iteratively, a similar intuition has been tried with these LLMs. We will be seeing those methodologies in the papers in this section.
One paper that implements this is least-to-most prompting. The idea of this paper is to break the approach into two stages. In the first stage, they prompt the model to come up with sub-questions: given a broader question, the model decomposes it into sub-questions. In stage two, they sequentially ask the model to solve the questions one by one. For example, for the question given here, in the first stage the model comes up with sub-questions, and in the second stage the original question is appended with the first sub-question and the model has to answer it. Then the second sub-question is appended to the first one, along with its answer and the overall question, and the model has to answer that, and so on until it comes up with the final answer. The authors have shown that this way the model does better compared to the vanilla chain-of-thought prompting that we saw before. Here is an example of the prompts implemented in this paper: for the decomposition stage, where they ask the model to decompose the question into different sub-questions, these were the few-shot examples that were given, and for a new example following those few-shot examples the model has to come up with sub-questions. And in the second stage, for problem solving, these are the few-shot examples that were given for the model to solve the sub-problems.
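A compact sketch of the two-stage loop; the decompose and call_llm helpers are assumed placeholders for the paper's few-shot prompts:

    # Sketch of least-to-most prompting: decompose, then solve sub-questions in order.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    def decompose(question: str) -> list[str]:
        # Stage 1: few-shot prompt asking the model to list sub-questions.
        raise NotImplementedError

    def least_to_most(question: str) -> str:
        context = question
        answer = ""
        for sub_question in decompose(question):
            # Stage 2: each sub-question is answered with the previous Q/A appended.
            answer = call_llm(f"{context}\nQ: {sub_question}\nA:")
            context += f"\nQ: {sub_question}\nA: {answer}"
        return answer   # the last sub-question's answer is the final answer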
Another paper that implements recursive or iterative prompting is Plan, Eliminate, and Track. They have done this in an interesting setting, where they evaluate an embodied agent on a dataset called ALFWorld, which is about evaluating the ability of an agent to follow a given task in a text-world environment with a visual equivalent of it. They have broken their approach down into different modules. One is a planner module, which is also an LLM: it takes in the instruction and tries to convert it into a plan that the agent needs to follow. Then there is an eliminator: based on the visual input, on what is there in the environment, the eliminator tries to remove from what the agent sees whatever is irrelevant to what it needs to focus on. Then the actor performs the action, and the tracker tracks whether a given task is finished or not; once it's finished, it updates the progress. This overall approach is, in a way, similar to what has been followed in AutoGPT, BabyAGI, and all those applications. If you see here, for the task "heat some apple and put it in the fridge", the LLM first comes up with a plan: take an apple, heat the apple, place the apple in the fridge. Then the eliminator removes the things that are not important for this particular task to be completed, the actor picks the action that is most suitable for that particular step, and the tracker tracks the progress of it.
Another interesting paper that does this iterative prompting is Describe, Explain, Plan and Select (DEPS). In this paper, the authors have tried to use an LLM to play Minecraft, which, as you'd know, is an open-ended game. Here again, similar to what we saw before, they have split the approach into different modules: a planner module, a selector module, an explainer module, and a describer module. For example, for the task of how to mine one diamond from scratch, the planner module comes up with a set of plans for the tasks that need to be done by the agent. From the few-shot examples of ground-truth plans, the planner comes up with an actual plan that needs to be executed. Then the selector, based on its knowledge of the environment, selects which goal needs to be achieved first at that given step, by prioritizing the different tasks involved. That particular task is executed by an executor, and the result of the executor is given as a description by the describer: if it finishes a goal, it says "I finished goal g0", and then the selector goes on to the next task or goal, and so on. This is done recursively, and since it is driven by an LLM, it is prone to failures. When a particular plan has failed, the describer says "I failed on this particular goal" and also gives the details of the environment. Based on that, an explainer explains what could actually have gone wrong and what needs to be done, and that goes back to the planner, which then does the replanning. This process continues until the final objective is met. This is again a very interesting paper; I would urge the audience to go to the GitHub repo and go through it. They have done a wonderful implementation of their approach.
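As a toy sketch of that describe-explain-plan-select loop (the plan, select_goal, execute, describe, and explain helpers are assumed placeholders, not the repo's code):

    # Sketch of a DEPS-style replanning loop.
    def plan(task: str, feedback: str = "") -> list[str]:
        raise NotImplementedError   # LLM planner, optionally conditioned on an explanation

    def select_goal(goals: list[str], state) -> str:
        raise NotImplementedError   # selector: pick the next achievable goal

    def execute(goal: str, state):
        raise NotImplementedError   # controller acting in the environment

    def describe(result) -> tuple[bool, str]:
        raise NotImplementedError   # describer: success flag plus a text description

    def explain(description: str) -> str:
        raise NotImplementedError   # explainer: why did the step fail?

    def run_agent(task: str, state, max_rounds: int = 20):
        goals = plan(task)
        for _ in range(max_rounds):
            if not goals:
                return                          # final objective met
            goal = select_goal(goals, state)
            done, description = describe(execute(goal, state))
            if done:
                goals.remove(goal)
            else:
                goals = plan(task, feedback=explain(description))   # replan on failure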
Until now we have seen how recursive and iterative prompting can be used to elicit reasoning. Recent advancements have also enabled these LLMs to use tools, which comes in handy to make the model even more accurate when it comes to reasoning and planning. We'll see a few examples of how tool usage is implemented in some of the literature.
The first paper here is ReAct: Reason and Act. In this paper, the authors, Yao et al., break a reasoning question down into steps that use a tool, like searching or looking up Wikipedia, and then come up with an answer based on that. If you see here, for this reasoning question, when the model tries to come up with a direct answer, it gets it wrong. Even with chain of thought it gets it wrong, because for this particular question, "Aside from the Apple Remote, what other devices can control the program the Apple Remote was originally designed to interact with?", the model needs external information to rely on. So both of those approaches fail there. They have two variations. One is the act-only approach, where the model comes up with the different actions that need to be taken and then produces an answer. Then there is ReAct, which stands for reason and act, which is the actual paper: the model first comes up with a thought about what action needs to be done, takes the action, and records the observation that results from it. That observation feeds into the next thought, which is used to decide the second action, then the next observation, and so on, until finally it comes up with an answer. So they prompt it iteratively: first, thought one is generated, and action one is generated based on that. Once the model generates, say, "search Apple Remote", that keyword is used to search Wikipedia, and the observation is appended to the generation. Based on that, thought two is produced, and this continues until the model generates "finish" as one of the actions. So that is ReAct, reason and act.
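A bare-bones sketch of that thought/action/observation loop; call_llm, parse_action, and wikipedia_search are hypothetical placeholders rather than the paper's tooling:

    # Sketch of a ReAct-style loop: Thought -> Action -> Observation, repeated.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    def parse_action(text: str) -> tuple[str, str]:
        raise NotImplementedError   # e.g. ("search", "Apple Remote") or ("finish", answer)

    def wikipedia_search(query: str) -> str:
        raise NotImplementedError   # external tool call

    def react(question: str, max_steps: int = 6) -> str:
        trace = f"Question: {question}\n"
        for _ in range(max_steps):
            step = call_llm(trace)                        # model appends Thought + Action
            action, argument = parse_action(step)
            if action == "finish":
                return argument                           # the model decided it has the answer
            observation = wikipedia_search(argument)      # run the external tool
            trace += f"{step}\nObservation: {observation}\n"
        return "no answer found"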
Another paper that uses tools is Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. Here the authors use tool-based reasoning and composition for answering science questions as well as table-based word problems. In the science question answering setting, an image is given and there is a question related to it, for example "What is the direction of this push?". The LLM first decomposes the problem into the set of tools it needs to call. There is a separate set of prompts available for each of these tools, which get executed sequentially to invoke that particular tool, and the answers get appended to the original prompt; the process continues like that to get the final answer. For example, take this question, where we have an image and then a set of options: "Which is the main persuasive appeal used in this ad?" This particular image is of an ad: "Paper plates now carry the Sierra Club seal of approval." And then we have different options: whether this ad conveys pathos, ethos, or logos. What Chameleon does is it first calls the text detector as a tool, and the text detector extracts the text from the image. Then it calls knowledge retrieval: based on the input it has, knowledge retrieval tries to come up with its inference about the overall context of the question and the information that is available at that point; this is a call to an OpenAI API. Then there is a solution generator, which creates a descriptive solution of what needs to be done, and an answer generator, which could be a rule-based approach, to produce the answer from the solution that was generated by the model. In this way, the paper uses different tools, from Hugging Face models to OpenAI GPT models, called iteratively, along with other modules like the text detector, to come up with a final solution.
Another example is the TabMWP dataset, tabular math word problem solving, wherein for a given table and a question asked about it, the model has to come up with the answer. Here the model again uses different tools: a knowledge retriever to retrieve the knowledge it has related to the question that was asked, then a table verbalizer to verbalize what is in the table, and then all the calculation is offloaded to a Python program and interpreter. There is also a program verifier, which verifies whether the program is correct; then the program is executed and the answer is generated from it. Let's see what a prompt looks like for this. Here, as we can see, the instruction, or the prompt, lists the different tools that the model can use, and it also has the context: what the question is, what all the options are for the question, and the metadata of the image. The model has to generate the set of modules it has to call, the steps it has to execute, for example whether it has to execute the text detector first, then knowledge retrieval, the solution generator, and the answer generator. This way the model comes up with the steps, then each of these separate tools is prompted to get its output, and finally all the outputs from the individual tools are chained together sequentially to generate the final answer.
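A toy sketch of that plug-and-play pattern: an LLM planner names the modules to run, and each module's output is appended to a shared context. The plan_modules helper and the module registry are assumptions for illustration, not Chameleon's actual interfaces:

    # Sketch of Chameleon-style module composition.
    def plan_modules(query: str) -> list[str]:
        # LLM planner returns a module sequence, e.g.
        # ["text_detector", "knowledge_retrieval", "solution_generator", "answer_generator"]
        raise NotImplementedError

    def text_detector(ctx: dict) -> str: raise NotImplementedError
    def knowledge_retrieval(ctx: dict) -> str: raise NotImplementedError
    def solution_generator(ctx: dict) -> str: raise NotImplementedError
    def answer_generator(ctx: dict) -> str: raise NotImplementedError

    MODULES = {
        "text_detector": text_detector,
        "knowledge_retrieval": knowledge_retrieval,
        "solution_generator": solution_generator,
        "answer_generator": answer_generator,
    }

    def chameleon_style(query: str, image=None) -> str:
        ctx = {"query": query, "image": image}
        for name in plan_modules(query):
            ctx[name] = MODULES[name](ctx)   # each module sees the accumulated context
        return ctx["answer_generator"]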
So yeah, that's pretty much all I had for today. There is a lot that has come out recently that I might not have had a chance to include here, like Toolformer, HuggingGPT, and so on. I would like to acknowledge the sources for this presentation: the Augmented Language Models survey from Mialon et al., the Towards Reasoning in Large Language Models survey from Huang et al., and these blog posts. I would also urge the audience to go through these papers and blog posts if you'd like to learn more about this topic. Thank you very much for your attention; I'm looking forward to hearing your feedback on this session and to having discussions on this topic today. Thank you.