Original Paper: https://arxiv.org/abs/2409.12183

By: Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett

Abstract:

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs).

But for what kinds of tasks is this extra "thinking" really helpful?

To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models.

Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.

On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning.

Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs.

Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver.

Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs.

Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
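
The equals-sign observation in the abstract suggests a simple way to apply CoT selectively. The following is a minimal sketch of such a routing rule, not the authors' implementation: the function names `needs_cot` and `choose_prompt_style` are hypothetical, and the paper reports the equals-sign pattern as an analysis finding rather than as a deployed router.

```python
def needs_cot(question: str, draft_response: str = "") -> bool:
    """Heuristic (assumption): treat an equals sign as a signal of symbolic work.

    `draft_response` is optional because a model response is only available
    after some generation has already happened.
    """
    return "=" in question or "=" in draft_response


def choose_prompt_style(question: str, draft_response: str = "") -> str:
    """Return "cot" when the heuristic fires, otherwise "direct"."""
    return "cot" if needs_cot(question, draft_response) else "direct"


if __name__ == "__main__":
    print(choose_prompt_style("If 2x + 3 = 11, what is x?"))
    print(choose_prompt_style("Which river flows through Paris?"))
```

In practice only the question is available before generation, so a router like this would mostly rely on the first argument; checking the response is closer to the post-hoc analysis described in the abstract.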


Summary Notes


Figure: Left: meta-analysis of CoT literature; each point is a reported delta of CoT over direct answering for some (LLM, task) pair. Right: average performance of zero-shot CoT vs. direct answer prompts across five general reasoning categories, covering 20 datasets with 14 LLMs evaluated on each. In both sets of results, math and other kinds of symbolic reasoning are the domains that consistently see substantial improvements from CoT (red dotted line indicates the mean improvement from CoT across experiments).

Introduction

In the realm of large language models (LLMs), the Chain-of-Thought (CoT) technique has emerged as a promising method for enhancing reasoning capabilities.

CoT has been lauded for its ability to provide human-readable explanations and improve language models' performance on complex tasks.

But is this technique equally effective across all problem domains?

The paper addresses this question with a meta-analysis of over 100 papers and its own experiments on 20 datasets with 14 different models.

This blog post explores the findings of this research, highlighting where CoT excels, where it falls short, and its implications for future applications.

The Role of CoT in Symbolic Reasoning

Chain-of-Thought prompting involves breaking down a problem into intermediate steps, allowing models to execute a sequence of logical operations.
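
To make the contrast concrete, here is a minimal sketch of how a direct-answer prompt and a zero-shot CoT prompt might be built for the same question. The exact prompt wording used in the paper is not given in this post, so the templates below are assumptions; "Let's think step by step" is the commonly used zero-shot CoT trigger phrase, and the helper names are hypothetical.

```python
ANSWER_MARKER = "Answer:"


def direct_prompt(question: str) -> str:
    """Ask for the final answer only, with no intermediate reasoning."""
    return f"Question: {question}\nGive only the final answer.\n{ANSWER_MARKER}"


def cot_prompt(question: str) -> str:
    """Ask for intermediate steps first, then a clearly marked final answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer "
        f"on a line starting with '{ANSWER_MARKER}'."
    )


def extract_answer(completion: str) -> str:
    """Return the text after the last answer marker, or the whole completion."""
    idx = completion.rfind(ANSWER_MARKER)
    if idx == -1:
        return completion.strip()
    return completion[idx + len(ANSWER_MARKER):].strip()


if __name__ == "__main__":
    print(cot_prompt("If 2x = 8, what is x?"))
    print(extract_answer("2x = 8\nx = 8 / 2\nAnswer: 4"))
```

Using the same answer marker for both prompt styles keeps answer extraction identical, so any accuracy difference comes from the reasoning steps rather than from parsing.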

This approach is particularly helpful in tasks that require mathematical, logical, or algorithmic reasoning.

The study finds that CoT substantially improves performance in these domains: tasks involving symbolic reasoning show an average performance of 56.9% with CoT compared to 45.5% without it, a gap of 11.4 percentage points.

Methodologies and Key Findings