Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, thank you for your interest in this session. In this session, Unlocking Reasoning and Planning Abilities in Large Language Models, I would like to take you through the different methodologies and techniques to elicit reasoning abilities from LLMs, walking through the recent research work related to this.
About myself: I'm Logesh Kumar Umapathi, a lead machine learning research engineer at Sama. My research interests include biomedical NLP, large language models, and code generation. You can reach me through these social media channels. I'm also involved in maintaining an open source package called mutate, which is about synthesizing data from large language models.
The agenda for this session: we'll start by understanding what reasoning is and how it is measured in the research literature. In the bulk of the session we will discuss how to elicit reasoning. We'll cover different techniques, like direct prompting, that is, direct one-shot generation of the solution, and then recursive and iterative prompting, in which we let the LLM generate the solution recursively and iteratively. And then we will discuss tool usage, which is the most popular approach now with the advent of HuggingGPT, Hugging Face Agents, and so on.
So what is reasoning? Reasoning can be defined as the ability to make inferences from given evidence and logic. There are different types of reasoning, like commonsense, mathematical, and symbolic reasoning. Reasoning can also be defined as the ability to break down a bigger problem into smaller solvable problems and then recursively solve those sub-problems to finally solve the bigger problem. This can be considered a broad definition of reasoning. Now that we know what reasoning is, or at least have a broad definition of what we are aiming at, let's see how it's measured in the literature.
In the literature, reasoning is usually measured in separate categories: mathematical reasoning, commonsense reasoning, and symbolic reasoning. Mathematical reasoning is usually measured with math word problems, typically math word problem datasets that are available online. GSM8K, for example, consists of grade school math word problems; this is from OpenAI. There are other datasets and benchmarks related to that as well. For commonsense reasoning we have ARC, the AI2 Reasoning Challenge from Allen AI, where science question answering is used to measure the commonsense reasoning of the models. Then we have CSQA, which is CommonsenseQA, and StrategyQA, also from Allen AI. These datasets, or benchmarks, help measure the commonsense ability of these models. To give an example, one of the questions could be: would Aristotle have used a keyboard? For such a question, the model has to work out when the keyboard was invented and when Aristotle lived, and then deduce whether it's possible or not. That type of reasoning is covered under commonsense reasoning. Then we have symbolic reasoning, which was mostly introduced by Jason Wei in his chain-of-thought paper, with last-letter concatenation and coin flip type problems. So this is how reasoning is mostly measured in the literature. To give an example of a benchmark snapshot, we have taken the numbers reported in the GPT-4 technical report. This gives us an overview of how reasoning is measured and the current state of things. We can see that for reasoning tasks like GSM8K and the AI2 Reasoning Challenge, the state of the art is currently around 96% and 92%.
Now that we know what reasoning is and how it is measured, let's see what the methodologies are to elicit reasoning. Before going into that, let's see why there is a need to elicit reasoning at all. Given the size of these models, huge 175 billion or 540 billion parameter models, one might think: why wouldn't reasoning emerge by default from these models? Why is there a need to elicit it? We'll first try to address that, and then come to the different methodologies for eliciting reasoning.
Yesterday I tried this prompt in ChatGPT for this session. As you can see, I asked ChatGPT to take the last letters of the words in "Augusta Ada King" and concatenate them using a space. You can see that the model detected the last letters incorrectly, as "a", "g", and "g", and because of that the final answer is also wrong. But when we break that same problem down into three smaller problems, it works: first, what are the words in "Augusta Ada King"? The model is able to come up with the words. Then, what are the last letters of these words? It is able to come up with the last letters, "a a g", correctly. And when I ask it to concatenate them, it is able to concatenate them. This is why we need techniques to elicit reasoning. These models, by the objective of their training, are not trained to do reasoning, or at least as far as I understand they are not trained to do reasoning. They still have the tendency to do text completion, and that's why we need the methodologies that we are going to discuss in the further slides to elicit reasoning.
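To make this concrete, here is a minimal sketch of that kind of manual decomposition, assuming a hypothetical call_llm helper that stands in for whatever completion API you use (it is not part of any specific library):

    # Minimal sketch of decomposing the last-letter task into sub-prompts.
    # call_llm is a hypothetical placeholder for any text-completion API.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM API of choice here")

    def last_letters_decomposed(phrase: str) -> str:
        # Step 1: list the words; step 2: last letter of each; step 3: concatenate.
        words = call_llm(f"What are the words in '{phrase}'? List them.")
        letters = call_llm(f"What is the last letter of each of these words? {words}")
        return call_llm(f"Concatenate these letters with a single space between them: {letters}")

    # last_letters_decomposed("Augusta Ada King")  # expected: "a a g"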
Let's start with probably the most popular methodology, and, to my understanding, the one that kick-started all the different prompting techniques for eliciting reasoning: chain-of-thought prompting. In chain-of-thought prompting, the authors, Jason Wei et al., found that for mathematical and other reasoning-related questions, instead of asking the model for the answer directly, if we ask the model to generate step-by-step reasoning and only then the final answer, the model tends to generate answers more accurately. Here, if you look at the question "Roger has five tennis balls. He buys two more cans of tennis balls. Each can has three tennis balls. How many tennis balls does he have now?", the answer in the one-shot example spells out the different steps that can be deduced from the question and then the final answer. So for a new question that the model sees, the cafeteria example here, the model comes up with a similar chain of steps and then generates an answer from it. This methodology has been shown to give better results: across all the reasoning datasets and across all the different models, this approach gives better results.
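As a rough illustration, a few-shot chain-of-thought prompt can be assembled like this; it is only a sketch, and call_llm is again a hypothetical placeholder for your model API, not the authors' code:

    # Sketch of few-shot chain-of-thought prompting (hypothetical call_llm helper).
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    COT_EXAMPLE = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n"
    )

    def chain_of_thought(question: str) -> str:
        # The worked example shows the model the step-by-step format to imitate.
        return call_llm(COT_EXAMPLE + f"Q: {question}\nA:")

    # chain_of_thought("The cafeteria had 23 apples. They used 20 to make lunch "
    #                  "and bought 6 more. How many apples do they have?")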
There is another variation of it from Wang et al., called self-consistency. In the initial chain-of-thought prompting, the answers are generated using greedy decoding, so there is only one generation for a given prompt. In self-consistency, the authors use sampling-based decoding to generate multiple generations for a given prompt, and then they take the majority-voted answer, the answer that the most generations agree on, and evaluate that answer against the evaluation set. This way the model performs even better than chain-of-thought prompting. Their intuition is that if a model comes up with many reasoning paths that arrive at the same solution, then that solution is most likely a proper and correct answer. That's the intuition behind it.
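A minimal sketch of the majority-vote idea could look like the following; the sample_llm helper (sampling with a non-zero temperature) and the extract_answer parser are assumed placeholders, not the paper's implementation:

    from collections import Counter

    # Hypothetical helpers: sample_llm draws one completion with temperature > 0,
    # extract_answer pulls the final answer out of the generated reasoning.
    def sample_llm(prompt: str, temperature: float = 0.7) -> str:
        raise NotImplementedError

    def extract_answer(generation: str) -> str:
        return generation.strip().splitlines()[-1]

    def self_consistency(prompt: str, n_samples: int = 10) -> str:
        # Sample several chains of thought and keep the most frequent final answer.
        answers = [extract_answer(sample_llm(prompt)) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]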
An extension to chain of thought addresses the following challenge: with chain of thought, when it comes to mathematical reasoning, we are leaving the arithmetic operations to the LLMs, and we all know that LLMs lack even simple arithmetic abilities. So in this paper, Program-Aided Language Models (PAL), the authors tried a clever methodology where they offloaded the arithmetic calculations to the Python interpreter. The way they did it is they created few-shot prompts, each with a question like we saw in chain-of-thought prompting, but they represented the solution as a Python program. Here you can see the tennis ball example: they deduce the initial number of tennis balls from the question, then the number of balls that were bought, and then the answer. This way they converted the solution into a Pythonic one. So for a new problem, based on the few-shot examples, the model generates a similar Python program, deducing the quantities from the question and then deriving the final answer. To get the final answer, the generated solution is executed in the Python interpreter, and that is taken as the final answer. This performed better than chain of thought, as you can see in the results across all the mathematical reasoning benchmarks. The main reason is that we are using the LLM for its strengths and offloading the weakness of LLMs to the Python interpreter.
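Here is a minimal sketch of that pattern, where the model is prompted to emit a Python solution that is then executed; call_llm is a hypothetical placeholder, and in practice you would sandbox the execution rather than exec arbitrary generated code:

    # Sketch of program-aided prompting: the LLM writes Python, Python does the math.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    PAL_EXAMPLE = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
        "How many tennis balls does he have now?\n"
        "# solution in Python:\n"
        "tennis_balls = 5\n"
        "bought_balls = 2 * 3\n"
        "answer = tennis_balls + bought_balls\n"
    )

    def program_aided(question: str):
        code = call_llm(PAL_EXAMPLE + f"Q: {question}\n# solution in Python:\n")
        scope: dict = {}
        exec(code, scope)          # caution: run untrusted generated code in a sandbox
        return scope["answer"]     # convention: the generated program sets `answer`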
Another variation of this type of prompting is plan-and-solve prompting. This is mostly meant to address the performance of zero-shot chain-of-thought prompting. Zero-shot chain-of-thought prompting is usually done with a prompt where, given a question, we ask "Let's think step by step", and the model comes up with a step-by-step thought process and the final answer. But this wasn't working that well, for a few different reasons. One is the arithmetic ability, as we saw before. Another is that some inference steps are missed by the model, and some inference steps are not converted into solution steps, in this zero-shot chain-of-thought prompting. This is rectified by a methodology called plan-and-solve prompting. Here the authors try a different style of prompt: "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step." So they ask the model to come up with the plan first, and then the solution based on the plan it has derived. They show that this way the model performs better than zero-shot chain-of-thought prompting, and they have even shown that it performs better than few-shot chain-of-thought prompting in some cases.
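A minimal zero-shot sketch of the two prompt styles side by side, again with a hypothetical call_llm placeholder:

    # Zero-shot CoT versus plan-and-solve style trigger phrases (sketch only).
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    ZERO_SHOT_COT = "Let's think step by step."
    PLAN_AND_SOLVE = (
        "Let's first understand the problem and devise a plan to solve the problem. "
        "Then, let's carry out the plan and solve the problem step by step."
    )

    def answer(question: str, trigger: str = PLAN_AND_SOLVE) -> str:
        return call_llm(f"Q: {question}\nA: {trigger}")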
Until now we have seen inference-time methodologies, in-context learning, to elicit reasoning abilities from LLMs. But there are also techniques that can be used to fine-tune large language models to elicit or improve reasoning abilities, and we'll be seeing those in this section of the talk. One paper that does this is Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions.
Here the authors use an LLM, in this case I think a GPT-Neo 2.7 billion parameter model. For a given set of questions, they ask the model to generate a solution, a Pythonic solution, and then they evaluate the answer. When the answer matches the ground truth or gold answer they have, they use that solution in the fine-tuning dataset. So they filter the generations that produced correct answers and iteratively fine-tune the same model on them. They not only use fully correct solutions, they also introduce a methodology to utilize partially correct solutions. The way they handle partially correct solutions is that they have gold solutions with outputs for the individual steps, as you see here, and similarly a generated solution with outputs from each of its steps. Whenever there is a match between these individual steps in the gold and the generated solution, they consider it a partially correct solution and use it to further fine-tune the model. They have shown in their paper that with this type of iterative fine-tuning on model-generated solutions, there is an improvement in the mathematical reasoning abilities of the model. If you look here, the green ones are fine-tuned only on fully correct solutions, and the orange ones are self-sampled with fully correct and partially correct solutions. We can also observe that the pass@1 rate does not improve. The authors comment that this is mainly because the nature of this training encourages the model to generate a diverse set of solutions and does not make the model favor any one particular solution. That's why pass@1 does not improve much, but you can see improvements in the other pass@k values.
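A very rough sketch of the self-sampling loop (fully correct filtering only), assuming hypothetical sample_solutions, run_and_check, and fine_tune helpers rather than the paper's actual code:

    # Sketch of iterative fine-tuning on self-sampled correct solutions.
    # sample_solutions, run_and_check and fine_tune are assumed placeholders.
    def sample_solutions(model, question: str, k: int = 8) -> list[str]:
        raise NotImplementedError

    def run_and_check(solution: str, gold_answer: str) -> bool:
        raise NotImplementedError  # execute the generated program, compare to gold

    def fine_tune(model, pairs: list[tuple[str, str]]):
        raise NotImplementedError

    def self_sample_round(model, dataset: list[tuple[str, str]]):
        buffer = []
        for question, gold_answer in dataset:
            for solution in sample_solutions(model, question):
                if run_and_check(solution, gold_answer):   # keep only verified solutions
                    buffer.append((question, solution))
        return fine_tune(model, buffer)                    # repeat for several rounds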
Another paper that does something similar is Self-Taught Reasoner (STaR): Bootstrapping Reasoning with Reasoning. In this methodology, the authors generate rationales and answers from an existing large language model. Whenever the answer is correct when compared against the gold ground-truth dataset, they put that example into a fine-tuning corpus as a triplet of question, rationale, and answer. Whenever the answer is wrong, they hint the model to generate a correct rationale by giving it the correct answer from the ground truth; the model generates a rationale, and they put that back into the fine-tuning mixture. They then fine-tune the model again, so that the fine-tuned model has a better ability to generate rationales. They do this iteratively and end up with a final model. This has been shown to improve mathematical reasoning abilities even when using only part of the training dataset. If you look here, the few-shot and then fine-tuned performance of the GPT-J model has improved from 5.8 to 10.7.
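A compressed sketch of one STaR-style iteration, with the rationalization hint for wrong answers; generate_rationale, answer_of, and fine_tune are assumed helpers, not the released code:

    # One STaR-style bootstrap iteration (sketch).
    def generate_rationale(model, question: str, hint=None) -> str:
        raise NotImplementedError  # few-shot prompt; optionally include the gold answer as a hint

    def answer_of(rationale: str) -> str:
        raise NotImplementedError  # parse the final answer out of the rationale

    def fine_tune(model, triplets: list[tuple[str, str, str]]):
        raise NotImplementedError

    def star_iteration(model, dataset: list[tuple[str, str]]):
        corpus = []
        for question, gold in dataset:
            rationale = generate_rationale(model, question)
            if answer_of(rationale) != gold:
                # Rationalization: regenerate with the correct answer given as a hint.
                rationale = generate_rationale(model, question, hint=gold)
                if answer_of(rationale) != gold:
                    continue
            corpus.append((question, rationale, gold))
        return fine_tune(model, corpus)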
Another variation of this approach, which is more about distilling from a large language model, is a paper called Specializing Smaller Language Models towards Multi-Step Reasoning by Fu et al. Here the authors try to distill the reasoning steps as well as the solution from a bigger model like GPT-3, and then they fine-tune smaller models, different T5 versions: 250 million, 760 million, and 3 billion parameters. They tried two different variations: one is fine-tuning only on answers, and the other is fine-tuning on both answers and chain-of-thought steps. They found that fine-tuning on chain of thought plus answers gives better accuracy, as the model also tries to understand the rationale behind the answers, and we can see the improvement here. Similarly, they tried this not only with vanilla T5 but also with Flan-T5, and Flan-T5 shows a bigger improvement compared to the vanilla T5 models.
Another recent approach to distillation is the Distilling Step-by-Step paper from Hsieh et al. Here, for an unlabeled dataset, they use a large language model like PaLM or a GPT-3 model to generate labels. And not only labels: they also ask the model to generate the rationale, or chain of thought, for each particular answer. Then, when they distill this and train a smaller model, they train it with a multi-task objective. They ask the model to predict the label as well as the rationale for it, instead of concatenating the rationale and label as one chunk; they approach it as a multi-task problem. They have shown that this gives a better improvement in the smaller model's reasoning abilities. The loss they use is a weighted sum of the label loss and the rationale generation loss. They show that the fine-tuned model in this distilling step-by-step setup performs even better than a 540 billion parameter model's few-shot generations. You can see the T5 220 million, 770 million, and 11 billion models doing a better job on the mathematical reasoning and CommonsenseQA tasks.
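A small sketch of that weighted multi-task objective, where label_loss and rationale_loss stand in for the usual sequence-to-sequence cross-entropy losses of the two prefixed tasks; the exact weighting is an assumption for illustration:

    # Sketch of a distilling-step-by-step style multi-task loss.
    # label_loss: cross-entropy on the "[label] ..." task,
    # rationale_loss: cross-entropy on the "[rationale] ..." task, same input.
    def multitask_loss(label_loss: float, rationale_loss: float, lam: float = 0.5) -> float:
        # lam trades off label prediction against rationale generation.
        return label_loss + lam * rationale_loss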
Until now we saw how a generation is made in one shot: given a prompt with few-shot or zero-shot examples, the model generates the reasoning as well as the answer for the given problem in a single pass. But just as humans have a better chance of solving a problem accurately if they approach it iteratively, a similar intuition has been tried with these LLMs. We will be seeing those methodologies in the papers in this section.
One paper that implements this is least-to-most prompting. The idea of this paper is to break the approach into two stages. In the first stage, they prompt the model to come up with sub-questions: given a broader question, the model decomposes it into sub-questions. In stage two, they sequentially ask the model to solve the questions one by one. For example, for the question given here, in the first stage the model comes up with sub-questions, and in the second stage the original question is appended with the first sub-question and the model has to answer it. Then the second sub-question is appended to the first one, along with its answer and the overall question, and the model has to answer that, and so on until it comes up with the final answer. The authors have shown that this way the model does better compared to the vanilla chain-of-thought prompting that we saw before. Here is an example of the prompts implemented in this paper: for the decomposition stage, where they ask the model to decompose the question into different sub-questions, these were the few-shot examples that were given, and for a new example following those few-shot examples the model has to come up with sub-questions. And in the second stage, for problem solving, these are the few-shot examples that were given for the model to solve the sub-problems.
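A compact sketch of the two-stage loop; the decompose and call_llm helpers are assumed placeholders for the paper's few-shot prompts:

    # Sketch of least-to-most prompting: decompose, then solve sub-questions in order.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    def decompose(question: str) -> list[str]:
        # Stage 1: few-shot prompt asking the model to list sub-questions.
        raise NotImplementedError

    def least_to_most(question: str) -> str:
        context = question
        answer = ""
        for sub_question in decompose(question):
            # Stage 2: each sub-question is answered with the previous Q/A appended.
            answer = call_llm(f"{context}\nQ: {sub_question}\nA:")
            context += f"\nQ: {sub_question}\nA: {answer}"
        return answer   # the last sub-question's answer is the final answer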
Another paper that implements recursive or iterative prompting is Plan, Eliminate, and Track. They have done this in an interesting setting, where they evaluate an embodied agent on a dataset called ALFWorld, which is about evaluating the ability of an agent to follow a given task in a text-world environment with a visual equivalent of it. They have broken their approach down into different modules. One is a planner module, which is also an LLM: it takes in the instruction and tries to convert it into a plan that the agent needs to follow. Then there is an eliminator: based on the visual input, on what is there in the environment, the eliminator tries to remove from what the agent sees whatever is irrelevant to what it needs to focus on. Then the actor performs the action, and the tracker tracks whether a given task is finished or not; once it's finished, it updates the progress. This overall approach is, in a way, similar to what has been followed in AutoGPT, BabyAGI, and all those applications. If you see here, for the task "heat some apple and put it in the fridge", the LLM first comes up with a plan: take an apple, heat the apple, place the apple in the fridge. Then the eliminator removes the things that are not important for this particular task to be completed, the actor picks the action that is most suitable for that particular step, and the tracker tracks the progress of it.
Another interesting paper that does this iterative prompting is Describe, Explain, Plan and Select (DEPS). In this paper, the authors have tried to use an LLM to play Minecraft, which, as you'd know, is an open-ended game. Here again, similar to what we saw before, they have split the approach into different modules: a planner module, a selector module, an explainer module, and a describer module. For example, for the task of how to mine one diamond from scratch, the planner module comes up with a set of plans for the tasks that need to be done by the agent. From the few-shot examples of ground-truth plans, the planner comes up with an actual plan that needs to be executed. Then the selector, based on its knowledge of the environment, selects which goal needs to be achieved first at that given step, by prioritizing the different tasks involved. That particular task is executed by an executor, and the result of the executor is given as a description by the describer: if it finishes a goal, it says "I finished goal g0", and then the selector goes on to the next task or goal, and so on. This is done recursively, and since it is driven by an LLM, it is prone to failures. When a particular plan has failed, the describer says "I failed on this particular goal" and also gives the details of the environment. Based on that, an explainer explains what could actually have gone wrong and what needs to be done, and that goes back to the planner, which then does the replanning. This process continues until the final objective is met. This is again a very interesting paper; I would urge the audience to go to the GitHub repo and go through it. They have done a wonderful implementation of their approach.
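As a toy sketch of that describe-explain-plan-select loop (the plan, select_goal, execute, describe, and explain helpers are assumed placeholders, not the repo's code):

    # Sketch of a DEPS-style replanning loop.
    def plan(task: str, feedback: str = "") -> list[str]:
        raise NotImplementedError   # LLM planner, optionally conditioned on an explanation

    def select_goal(goals: list[str], state) -> str:
        raise NotImplementedError   # selector: pick the next achievable goal

    def execute(goal: str, state):
        raise NotImplementedError   # controller acting in the environment

    def describe(result) -> tuple[bool, str]:
        raise NotImplementedError   # describer: success flag plus a text description

    def explain(description: str) -> str:
        raise NotImplementedError   # explainer: why did the step fail?

    def run_agent(task: str, state, max_rounds: int = 20):
        goals = plan(task)
        for _ in range(max_rounds):
            if not goals:
                return                          # final objective met
            goal = select_goal(goals, state)
            done, description = describe(execute(goal, state))
            if done:
                goals.remove(goal)
            else:
                goals = plan(task, feedback=explain(description))   # replan on failure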
Until now we have seen how recursive and iterative prompting can be used to elicit reasoning. Recent advancements have also enabled these LLMs to use tools, which comes in handy to make the model even more accurate when it comes to reasoning and planning. We'll see a few examples of how tool usage is implemented in some of the literature.
The first paper here is ReAct: Reason and Act. In this paper, the authors, Yao et al., break a reasoning question down into steps that use a tool, like searching or looking up Wikipedia, and then come up with an answer based on that. If you see here, for this reasoning question, when the model tries to come up with a direct answer, it gets it wrong. Even with chain of thought it gets it wrong, because for this particular question, "Aside from the Apple Remote, what other devices can control the program the Apple Remote was originally designed to interact with?", the model needs external information to rely on. So both of those approaches fail there. They have two variations. One is the act-only approach, where the model comes up with the different actions that need to be taken and then produces an answer. Then there is ReAct, which stands for reason and act, which is the actual paper: the model first comes up with a thought about what action needs to be done, takes the action, and records the observation that results from it. That observation feeds into the next thought, which is used to decide the second action, then the next observation, and so on, until finally it comes up with an answer. So they prompt it iteratively: first, thought one is generated, and action one is generated based on that. Once the model generates, say, "search Apple Remote", that keyword is used to search Wikipedia, and the observation is appended to the generation. Based on that, thought two is produced, and this continues until the model generates "finish" as one of the actions. So that is ReAct, reason and act.
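A bare-bones sketch of that thought/action/observation loop; call_llm, parse_action, and wikipedia_search are hypothetical placeholders rather than the paper's tooling:

    # Sketch of a ReAct-style loop: Thought -> Action -> Observation, repeated.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    def parse_action(text: str) -> tuple[str, str]:
        raise NotImplementedError   # e.g. ("search", "Apple Remote") or ("finish", answer)

    def wikipedia_search(query: str) -> str:
        raise NotImplementedError   # external tool call

    def react(question: str, max_steps: int = 6) -> str:
        trace = f"Question: {question}\n"
        for _ in range(max_steps):
            step = call_llm(trace)                        # model appends Thought + Action
            action, argument = parse_action(step)
            if action == "finish":
                return argument                           # the model decided it has the answer
            observation = wikipedia_search(argument)      # run the external tool
            trace += f"{step}\nObservation: {observation}\n"
        return "no answer found"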
Another paper that uses tools is Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. Here the authors use tool-based reasoning and composition for answering science questions as well as table-based word problems. In the science question answering setting, an image is given and there is a question related to it, for example "What is the direction of this push?". The LLM first decomposes the problem into the set of tools it needs to call. There is a separate set of prompts available for each of these tools, which get executed sequentially to invoke that particular tool, and the answers get appended to the original prompt; the process continues like that to get the final answer. For example, take this question, where we have an image and then a set of options: "Which is the main persuasive appeal used in this ad?" This particular image is of an ad: "Paper plates now carry the Sierra Club seal of approval." And then we have different options: whether this ad conveys pathos, ethos, or logos. What Chameleon does is it first calls the text detector as a tool, and the text detector extracts the text from the image. Then it calls knowledge retrieval: based on the input it has, knowledge retrieval tries to come up with its inference about the overall context of the question and the information that is available at that point; this is a call to an OpenAI API. Then there is a solution generator, which creates a descriptive solution of what needs to be done, and an answer generator, which could be a rule-based approach, to produce the answer from the solution that was generated by the model. In this way, the paper uses different tools, from Hugging Face models to OpenAI GPT models, called iteratively, along with other modules like the text detector, to come up with a final solution.
Another example is the TabMWP dataset, tabular math word problem solving, wherein for a given table and a question asked about it, the model has to come up with the answer. Here the model again uses different tools: a knowledge retriever to retrieve the knowledge it has related to the question that was asked, then a table verbalizer to verbalize what is in the table, and then all the calculation is offloaded to a Python program and interpreter. There is also a program verifier, which verifies whether the program is correct; then the program is executed and the answer is generated from it. Let's see what a prompt looks like for this. Here, as we can see, the instruction, or the prompt, lists the different tools that the model can use, and it also has the context: what the question is, what all the options are for the question, and the metadata of the image. The model has to generate the set of modules it has to call, the steps it has to execute, for example whether it has to execute the text detector first, then knowledge retrieval, the solution generator, and the answer generator. This way the model comes up with the steps, then each of these separate tools is prompted to get its output, and finally all the outputs from the individual tools are chained together sequentially to generate the final answer.
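A toy sketch of that plug-and-play pattern: an LLM planner names the modules to run, and each module's output is appended to a shared context. The plan_modules helper and the module registry are assumptions for illustration, not Chameleon's actual interfaces:

    # Sketch of Chameleon-style module composition.
    def plan_modules(query: str) -> list[str]:
        # LLM planner returns a module sequence, e.g.
        # ["text_detector", "knowledge_retrieval", "solution_generator", "answer_generator"]
        raise NotImplementedError

    def text_detector(ctx: dict) -> str: raise NotImplementedError
    def knowledge_retrieval(ctx: dict) -> str: raise NotImplementedError
    def solution_generator(ctx: dict) -> str: raise NotImplementedError
    def answer_generator(ctx: dict) -> str: raise NotImplementedError

    MODULES = {
        "text_detector": text_detector,
        "knowledge_retrieval": knowledge_retrieval,
        "solution_generator": solution_generator,
        "answer_generator": answer_generator,
    }

    def chameleon_style(query: str, image=None) -> str:
        ctx = {"query": query, "image": image}
        for name in plan_modules(query):
            ctx[name] = MODULES[name](ctx)   # each module sees the accumulated context
        return ctx["answer_generator"]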
So yeah, that's pretty much all I had for today. There is a lot that has come out recently that I might not have had a chance to include here, like Toolformer, HuggingGPT, and so on. I would like to acknowledge the sources for this presentation: the Augmented Language Models survey from Mialon et al., the Towards Reasoning in Large Language Models survey from Huang et al., and these blog posts. I would also urge the audience to go through these papers and blog posts if you'd like to learn more about this topic. Thank you very much for your attention; I'm looking forward to hearing your feedback on this session and to having discussions on this topic today. Thank you.