Awesome deliberative prompting: How to ask LLMs to produce reliable reasoning and make reason-responsive decisions.

logikon-ai/awesome-deliberative-prompting

Awesome Deliberative Prompting

How to ask Large Language Models (LLMs) to produce reliable reasoning and make reason-responsive decisions.

deliberation, n.

The action of thinking carefully about something, esp. in order to reach a decision; careful consideration; an act or instance of this. (OED)

Success Stories

Striking evidence for effectiveness of deliberative prompting.

  • 🎓 The original "chain of thought" (CoT) paper, first to give clear evidence that deliberative prompting works. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." 2022-01-28. [>paper]
  • 🎓 Deliberative prompting improves the ability of Google's LLMs to solve unseen difficult problems, and instruction-finetuned (Flan-) models are much better at it.
    • "Scaling Instruction-Finetuned Language Models." 2022-12-06. [>paper]
    • "PaLM 2 Technical Report." 2023-05-17. [>paper]
  • 🎓 Deliberative prompting is highly effective for OpenAI's models (Text-Davinci-003, ChatGPT, GPT-4), increasing accuracy in many (though not all) reasoning tasks in the AGIEval benchmark. "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models." 2023-04-13. [>paper]
  • 🎓 Deliberative prompting unlocks latent cognitive skills and is more effective for bigger models. "Challenging BIG-Bench tasks and whether chain-of-thought can solve them." 2022-10-17. [>paper]
  • 🎓 Experimentally introducing errors in CoT reasoning traces decreases decision accuracy, which provides indirect evidence for reason-responsiveness of LLMs. "Stress Testing Chain-of-Thought Prompting for Large Language Models." 2023-09-28. [>paper]
  • 🎓 Reasoning (about retrieval candidates) improves RAG. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." 2023-10-17. [>paper]
  • 🎓 Deliberative reading notes improve RAG. "Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models." 2023-11-15. [>paper]
  • 🎓 Good reasoning (CoT) causes good answers (i.e., LLMs are reason-responsive). "Causal Abstraction for Chain-of-Thought Reasoning in Arithmetic Word Problems." 2023-12-07. [>paper]
  • 🎓 Logical interpretation of internal layer-wise processing of reasoning tasks yields further evidence for reason-responsiveness. "Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models." 2023-12-07. [>paper]
  • 🎓 Reasoning about alternative drafts improves text generation. "Self-Evaluation Improves Selective Generation in Large Language Models." 2023-12-14. [>paper]
  • 🎓 CoT with carefully retrieved, diverse reasoning demonstrations boosts multi-modal LLMs. "Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models." 2023-12-04. [>paper]
  • 🎓 Effective multi-hop CoT for visual question answering. "II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering." 2024-02-16. [>paper]
  • 🎓 👩‍💻 DPO on synthetic CoT traces increases reason-responsiveness of small LLMs. "Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning" 2024-02-23. [>paper] [>code]
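
The stress-testing idea listed above (corrupt a reasoning trace, then see whether the decision changes) can be sketched in a few lines. This is a simplified illustration, not the procedure of any cited paper: the `answer` and `corrupt` callables are hypothetical placeholders for a model call and an error-injection routine, and the metric counts changed answers rather than accuracy.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning trace) -> final answer.
Answerer = Callable[[str, str], str]


def flip_rate(
    items: List[Tuple[str, str]],   # (question, original reasoning trace)
    corrupt: Callable[[str], str],  # hypothetical: injects an error into a trace
    answer: Answerer,
) -> float:
    """Share of items whose answer changes once the trace is corrupted."""
    if not items:
        raise ValueError("need at least one item")
    flips = sum(
        answer(question, trace) != answer(question, corrupt(trace))
        for question, trace in items
    )
    return flips / len(items)
```

A flip rate near zero would suggest that the answerer ignores its reasoning trace; a higher rate is at best indirect evidence that the trace matters.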

Prompting Patterns and Strategies

Prompting strategies and patterns to make LLMs deliberate.

Beyond "Let's think step by step"

Instructing LLMs to reason (in a specific way).

  • 🎓 Asking GPT-4 to provide both a correct and a wrong answer boosts accuracy. "Large Language Models are Contrastive Reasoners." 2024-03-13. [>paper]
  • 🔥🎓 Guided dynamic prompting increases GPT-4 CoT performance by up to 30 percentage points. "Structure Guided Prompt: Instructing Large Language Model in Multi-Step Reasoning by Exploring Graph Structure of the Text" 2024-02-20. [>paper]
  • 🎓 Letting LLMs choose and combine reasoning strategies is cost-efficient and improves performance. "SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures." 2024-02-06. [>paper]
  • 🎓 CoA: Produce an abstract reasoning trace first, and fill in the details (using tools) later. "Efficient Tool Use with Chain-of-Abstraction Reasoning." 2024-01-30. [>paper]
  • 🎓 Reason over and over again until a verification test is passed. "Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts." 2023-10-23. [>paper]
  • 🎓 Generate multiple diverse deliberations, then synthesize them into a single reasoning path. "Ask One More Time: Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios." 2023-11-14. [>paper]
  • 🎓 Survey of CoT regarding task types, prompt designs, and reasoning quality metrics. "Towards Better Chain-of-Thought Prompting Strategies: A Survey." 2023-10-08. [>paper]
  • 🎓 Asking an LLM about a problem's broader context leads to better answers. "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." 2023-10-09. [>paper]
  • Weighing Pros and Cons: This universal deliberation paradigm can be implemented with LLMs.
    • 👩‍💻 A {{guidance}} program that does: 1. Identify Options → 2. Generate Pros and Cons → 3. Weigh Reasons → 4. Decide. [>code]
  • 🎓 👩‍💻 Plan-and-Solve Prompting. "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models." 2023-05-06. [>paper] [>code]
  • 🎓 Note-Taking. "Learning to Reason and Memorize with Self-Notes." 2023-05-01. [>paper]
  • 🎓 Deliberate-then-Generate improves text quality. "Deliberate then Generate: Enhanced Prompting Framework for Text Generation." 2023-05-31. [>paper]
  • 🎓 Make the LLM spontaneously interleave reasoning and Q/A. "ReAct: Synergizing Reasoning and Acting in Language Models." 2022-10-06. [>paper]
  • 🎓 'Divide-and-Conquer' instructions substantially outperform standard CoT. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models" 2022-05-21. [>paper]
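
The four-step pros-and-cons pattern listed above (Identify Options → Generate Pros and Cons → Weigh Reasons → Decide) might be chained as follows. This is a minimal sketch in plain Python, not the linked {{guidance}} program: the `llm` callable and the prompt wording are assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def weigh_pros_and_cons(question: str, llm: LLM) -> Dict[str, str]:
    """Run the four deliberation steps in sequence, feeding each result forward."""
    options = llm(f"Question: {question}\nList the available options, one per line.")
    reasons = llm(
        f"Question: {question}\nOptions:\n{options}\n"
        "For each option, give its pros and cons."
    )
    weighing = llm(
        f"Question: {question}\nPros and cons:\n{reasons}\n"
        "Weigh these reasons against each other and say which matter most."
    )
    decision = llm(
        f"Question: {question}\nWeighing:\n{weighing}\n"
        "State the option you choose in a single sentence."
    )
    return {
        "options": options,
        "reasons": reasons,
        "weighing": weighing,
        "decision": decision,
    }
```

Returning every intermediate step, rather than only the decision, keeps the deliberation inspectable after the fact.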

Multi-Agent Deliberation

Let one (or many) LLMs simulate a free controversy.

  • 🎓 👩‍💻 Carefully selected open LLMs that iteratively review and improve their answers outperform GPT-4o. "Mixture-of-Agents Enhances Large Language Model Capabilities." 2024-06-10. [>paper] [>code]
  • 🎓 More elaborate and costly multi-agent-system designs are typically more effective, according to this review: "Are we going MAD? Benchmarking Multi-Agent Debate between Language Models for Medical Q&A." 2023-11-19. [>paper]
  • 🎓 Systematic peer review is even better than multi-agent debate. "Towards Reasoning in Large Language Models via Multi-Agent Peer Review Collaboration." 2023-11-14. [>paper]
  • 🎓 Collective critique and reflection reduce factual hallucinations and toxicity. "N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics." 2023-10-28. [>paper]
  • 🎓 👩‍💻 Delphi-process with diverse LLMs is veristically more valuable than simple debating. "ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs." 2023-09-22. [>paper] [>code]
  • 🎓 Multi-agent debate increases cognitive diversity, which in turn increases performance. "Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate." 2023-05-30. [>paper]
  • 🎓 Leverage wisdom-of-the-crowd effects through debate simulation. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." 2023-05-23. [>paper]
  • 🎓 👩‍💻 Emulate Socratic dialogue to collaboratively solve problems with multiple AI agents. "The Socratic Method for Self-Discovery in Large Language Models." 2023-05-05. [>blog] [>code]
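
As a rough illustration of the debate pattern shared by several entries above, the loop below lets each agent answer, shows it the answers of its peers, and closes with a majority vote. It is a generic sketch and does not reproduce any of the cited systems; the `Agent` signature and the vote-based aggregation are assumptions.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical agent interface: (question, answers of the other agents) -> own answer.
Agent = Callable[[str, Sequence[str]], str]


def debate(question: str, agents: Sequence[Agent], rounds: int = 2) -> str:
    """Let agents answer, revise after seeing their peers, then take a majority vote."""
    # Round 0: every agent answers independently.
    answers: List[str] = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the previous-round answers of all other agents.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Ties are resolved in favour of the answer that appears first.
    return Counter(answers).most_common(1)[0][0]
```

In practice each agent would wrap a (possibly different) LLM and receive its peers' reasoning along with their answers; only the final answers are exchanged here to keep the sketch short.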

Reflection and Meta-Cognition

Higher-order reasoning strategies that may improve first-order deliberation.

  • 🎓 👩‍💻 Keeping track of general insights gained from CoT problem solving improves future accuracy and efficiency. "Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models." 2024-06-06. [>paper] [>code]
  • 🎓 👩‍💻 Processing tasks according to their self-assessed difficulty boosts CoT effectiveness. "Divide and Conquer for Large Language Models Reasoning." 2024-01-10. [>paper] [>code]
  • 🎓 👩‍💻 Reflecting on the task allows an LLM to autogenerate more effective instructions, demonstrations, and reasoning traces. "Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models." 2023-10-11. [>paper] [>code]
  • 🎓 👩‍💻 LLM-based AI Instructor devises effective first-order CoT-instructions (open source models improve by up to 20%). "Agent Instructs Large Language Models to be General Zero-Shot Reasoners." 2023-10-05. [>paper] [>code]
  • 🎓 👩‍💻 Clarify→Judge→Evaluate→Confirm→Qualify Paradigm. "Metacognitive Prompting Improves Understanding in Large Language Models." 2023-08-10. [>paper] [>code]
  • 🎓 👩‍💻 Find-then-simulate-an-expert-for-this-problem Strategy. "Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm." 2021-02-15. [>paper] [>lmql]

Text Generation Techniques

Text generation techniques, which can be combined with prompting patterns and strategies.

  • 🔥🎓 Iterative revision of reasoning in light of previous CoT traces improves accuracy by 10-20%. "RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation". 2024-03-08. [>paper]
  • 🎓 Pipeline for self-generating & choosing effective CoT few-shot demonstrations. "Universal Self-adaptive Prompting". 2023-05-24. [>paper]
  • 🎓 More reasoning (= longer reasoning traces) is better. "The Impact of Reasoning Step Length on Large Language Models". 2024-01-10. [>paper]
  • 🎓 Having (accordingly labeled) correct and erroneous (few-shot) reasoning demonstrations improves CoT. "Contrastive Chain-of-Thought Prompting." 2023-11-17. [>paper]
  • 🎓 Better problem-solving and deliberation through few-shot trial-and-error (in-context RL). "Reflexion: Language Agents with Verbal Reinforcement Learning." 2023-03-20. [>paper]
  • 🎓 External guides that constrain generation of reasoning improve accuracy by up to 35% on selected tasks. "Certified Reasoning with Language Models." 2023-06-06. [>paper]
  • 🎓 👩‍💻 Highly effective beam search for generating complex, multi-step reasoning episodes. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." 2023-05-17. [>paper] [>code]
    • 👩‍💻 A minimalistic implementation of Tree-of-Thoughts as plain prompt. [>code]
    • 👩‍💻 An experimental LMQL implementation of Tree-of-Thoughts. [>code]
  • 🎓 👩‍💻 LLM auto-generates diverse reasoning demonstrations to be used in deliberative prompting. "Automatic Chain of Thought Prompting in Large Language Models." 2022-10-07. [>paper] [>code]
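
The beam-search idea mentioned in the Tree-of-Thoughts entry above can be illustrated with a generic search over partial reasoning paths. This is a sketch under stated assumptions, not the paper's implementation: `propose` and `score` are hypothetical callables that would, in practice, be backed by LLM calls.

```python
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # a partial reasoning path: the thoughts produced so far


def beam_search(
    propose: Callable[[State], List[str]],  # hypothetical: candidate next thoughts
    score: Callable[[State], float],        # hypothetical: higher is more promising
    depth: int,
    beam_width: int = 2,
) -> State:
    """Keep only the `beam_width` best partial paths at every depth."""
    beam: List[State] = [()]
    for _ in range(depth):
        candidates = [state + (thought,) for state in beam for thought in propose(state)]
        if not candidates:
            break  # no path can be extended any further
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)


if __name__ == "__main__":
    # Toy problem: pick two digits from {1, 2, 3} whose sum is as close to 6 as possible.
    best = beam_search(
        propose=lambda state: ["1", "2", "3"],
        score=lambda state: -abs(6 - sum(int(t) for t in state)),
        depth=2,
    )
    print(best)
```

A wider beam explores more alternatives per step at proportionally higher cost, since every surviving path triggers another round of proposals and scores.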

Self-Correction

Let LLMs self-correct their deliberation.

  • 🎓 Consistency between multiple CoT-traces is an indicator of reasoning reliability, which can be exploited for self-check / aggregation. "Can We Verify Step by Step for Incorrect Answer Detection?" 2024-02-16. [>paper]
  • 🎓 Turn LLMs into intrinsic self-checkers by appending self-correction steps to standard CoT traces for finetuning. "Small Language Model Can Self-correct." 2024-01-14. [>paper]
  • 🎓 Reinforced Self-Training improves retrieval-augmented multi-hop Q/A. "ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent." 2023-12-15. [>paper]
  • 🎓 Conditional self-correction depending on whether critical questions have been addressed in the reasoning trace. "The ART of LLM Refinement: Ask, Refine, and Trust." 2023-11-14. [>paper]
  • 🎓 Iteratively refining reasoning given diverse feedback increases accuracy by up to 10% (ChatGPT). "MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models." 2023-10-19. [>paper]
  • 🎓 Instructing a model just to "review" its answer and "find problems" doesn't lead to effective self-correction. "Large Language Models Cannot Self-Correct Reasoning Yet." 2023-09-25. [>paper]
  • 🎓 LLMs can come up with, and address, critical questions to improve their drafts. "Chain-of-Verification Reduces Hallucination in Large Language Models." 2023-09-25. [>paper]
  • 🎓 LogiCoT: Self-check and revision after each CoT step improves performance (for selected tasks and models). "Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic." 2023-09-23. [>paper]
  • 🎓 Excellent review about self-correcting LLMs, with application to unfaithful reasoning. "Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies." 2023-08-06. [>paper]
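
The first entry above treats agreement between several CoT traces as a reliability signal. A simple way to act on that signal is to aggregate the final answers by majority vote and flag low-agreement cases for another correction pass. The sketch below is an illustration only: the normalization and the 0.6 threshold are arbitrary assumptions, and the cited paper's method is not reproduced.

```python
from collections import Counter
from typing import List, Tuple


def aggregate_answers(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the share of traces that agree with it."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


def needs_review(answers: List[str], threshold: float = 0.6) -> bool:
    """Flag a question for self-correction when agreement falls below `threshold`."""
    return aggregate_answers(answers)[1] < threshold
```

The input is the list of final answers extracted from independently sampled reasoning traces for the same question.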

Reasoning Analytics

Methods for analysing LLM deliberation and assessing reasoning quality.

  • 🎓 👩‍💻 Comprehensive LLM-based reasoning analytics that breaks texts down into individual reasons. "DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models." 2024-01-04. [>paper] [>code]
  • 🎓 🤗 Highly performant, open LLM (T5-based) for inference verification. "Minds versus Machines: Rethinking Entailment Verification with Language Models." 2024-02-06. [>paper] [>model]
  • 🎓 👩‍💻 Test dataset for CoT evaluators. "A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains." 2023-11-23. [>paper] [>dataset]
  • 🎓 👩‍💻 Framework for evaluating reasoning chains by viewing them as informal proofs that derive the final answer. "ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness." 2023-11-23. [>paper] [>code]
  • 🎓 GPT-4 is 5x better at predicting whether math reasoning is correct than GPT-3.5. "Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs." 2023-12-28. [>paper]
  • 🎓 Minimalistic GPT-4 prompts for assessing reasoning quality. "SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation." 2023-09-29. [>paper] [>code]
  • 🎓 👩‍💻 Automatic, semantic-similarity based metrics for assessing CoT traces (redundancy, faithfulness, consistency, etc.). "ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning." 2023-09-12. [>paper]

Limitations, Failures, Puzzles

Things that don't work, or are poorly understood.

  • 🎓 Structured generation risks degrading reasoning quality and CoT effectiveness. "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models." 2024-08-05. [>paper]
  • 🎓 Filler tokens can be as effective as sound reasoning traces for eliciting correct answers. "Let's Think Dot by Dot: Hidden Computation in Transformer Language Models." 2024-04-24. [>paper]
  • 🔥🎓 Causal analysis shows that LLMs sometimes ignore CoT traces, but reason-responsiveness increases with model size and is shaped by fine-tuning. "LLMs with Chain-of-Thought Are Non-Causal Reasoners" 2024-02-25. [>paper]
  • 🎓 Bad reasoning may lead to correct conclusions, hence better methods for CoT evaluation are needed. "SCORE: A framework for Self-Contradictory Reasoning Evaluation." 2023-11-16. [>paper]
  • 🎓 LLMs may produce "encoded reasoning" that's unintelligible to humans, which may nullify any XAI gains from deliberative prompting. "Preventing Language Models From Hiding Their Reasoning." 2023-10-27. [>paper]
  • 🎓 LLMs judge and decide depending on the available arguments (reason-responsiveness), but are more strongly influenced by fallacious and deceptive reasons than by sound ones. "How susceptible are LLMs to Logical Fallacies?" 2023-08-18. [>paper]
  • 🎓 Incorrect reasoning improves answer accuracy (nearly) as much as correct reasoning. "Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting." 2023-07-20. [>paper]
  • 🎓 Zero-shot CoT reasoning in sensitive domains increases an LLM's likelihood of producing harmful or undesirable output. "On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning." 2023-06-23. [>paper]
  • 🎓 LLMs may systematically fabricate erroneous CoT rationales for wrong answers, NYU/Anthropic team finds. "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting." 2023-05-07. [>paper]
  • 🎓 LLMs' practical deliberation is not robust, but easily led astray by re-wording scenarios. "Despite 'super-human' performance, current LLMs are unsuited for decisions about ethics and safety" 2022-12-13. [>paper]

Datasets

Datasets containing examples of deliberative prompting, potentially useful for training models / assessing their deliberation skills.

  • Instruction-following datasets augmented with "reasoning traces" generated by LLMs.
    • 🎓 ORCA - Microsoft's original paper. "Orca: Progressive Learning from Complex Explanation Traces of GPT-4." 2023-06-05. [>paper]
    • 👩‍💻 OpenOrca - Open source replication of ORCA datasets. [>dataset]
    • 👩‍💻 Dolphin - Open source replication of ORCA datasets. [>dataset]
    • 🎓 ORCA 2 - Improved Orca by Microsoft, e.g. with meta reasoning. "Orca 2: Teaching Small Language Models How to Reason." 2023-11-18. [>paper]
  • 🎓 👩‍💻 CoT Collection - 1.84 million reasoning traces for 1,060 tasks. "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning." [>paper] [>code]
  • 👩‍💻 OASST1 - contains more than 200 instructions to generate pros and cons (acc. to nomic.ai's map). [>dataset]
  • 🎓 LegalBench - a benchmark for legal reasoning in LLMs. [>paper]
  • 🎓 👩‍💻 ThoughtSource - an open resource for data and tools related to chain-of-thought reasoning in large language models. [>paper] [>code]
  • 🎓 👩‍💻 Review with many pointers to CoT-relevant datasets. "Datasets for Large Language Models: A Comprehensive Survey" [>paper] [>code]
  • 👩‍💻 Maxime Labonne's LLM datasets list [github]

Tools and Frameworks

Tools and Frameworks to implement deliberative prompting.

  • 👩‍💻 LMQL - a programming language for language model interaction. [>site]
    • 👩‍💻 Interactive LMQL Playground [>site]
    • 🎓 "Prompting Is Programming: A Query Language for Large Language Models." 2022-12-12. [>paper]
  • 👩‍💻 {{guidance}} - a language for controlling large language models. [>code]
  • 👩‍💻 outlines - a language for guided text generation. [>code]
  • 👩‍💻 DSPy - a programmatic interface to LLMs. [>code]
  • 👩‍💻 llm-reasoners - a library for advanced large language model reasoning. [>code]
  • 👩‍💻 ThinkGPT - framework and building blocks for chain-of-thought workflows. [>code]
  • 👩‍💻 LangChain - a Python library for building LLM chains and agents. [>code]
  • 👩‍💻 PromptBench - a unified library for evaluating LLMs, inter alia the effectiveness of CoT prompts. [>code]
  • 👩‍💻 SymbolicAI - a library for compositional differentiable programming with LLMs. [>code]

Other Resources

More awesome and useful material.

  • 📚 Survey of Autonomous LLM Agents (continuously updated). [>site]
  • 👩‍💻 LLM Dashboard - explore task-specific reasoning performance of open LLMs [>app]
  • 📚 Prompt Engineering Guide set up by DAIR. [>site]
  • 📚 ATLAS - principles and benchmark for systematic prompting [>code]
  • 📚 Deliberative Prompting Guide set up by Logikon. [>site]
  • 📚 Arguing with Arguments - recent and wonderful piece by H. Siegel discussing what it actually means to evaluate an argument. [>paper]