🔗 https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
1. Introduction
Apple researchers study Large Reasoning Models (LRMs), language models that generate detailed chains of thought before answering. They note that despite improved benchmark results, it remains unclear whether these systems genuinely reason or merely mimic reasoning. Standard math and coding benchmarks suffer from data contamination and reveal little about the structure and quality of the internal reasoning traces.
2. Related Works
Traditional benchmarks like GSM8K or code datasets test final-answer accuracy without examining intermediate rationales. The authors emphasize the need for controllable environments to assess both outcome and reasoning trace, avoiding leakage from training data.
3. Math and Puzzle Environments
A new set of synthetic, compositional puzzles (e.g., Tower of Hanoi, River Crossing) is introduced. These maintain consistent logical structures while varying complexity so the authors can analyze how model behavior changes with problem depth.
Puzzle tasks are designed to scale smoothly in difficulty (with increasing numbers of disks or river objects), enabling controlled experiments on performance and reasoning patterns.
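A minimal sketch (in Python, not the authors' code) of how such an environment can scale difficulty with a single parameter: the number of disks sets the optimal Tower of Hanoi solution length to 2^N − 1, and a simple simulator can verify any proposed move sequence independently of how the model produced it.

```python
# Minimal sketch of a Tower of Hanoi environment whose difficulty is controlled
# by one parameter N (number of disks). Optimal solution length grows as 2**N - 1,
# giving a smooth complexity ladder; the validator checks any proposed move list.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence (source peg, target peg) for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # move n-1 disks onto the spare peg
            + [(src, dst)]                       # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst)) # move n-1 disks on top of it

def is_valid_solution(n: int, moves: list[tuple[str, str]]) -> bool:
    """Simulate a proposed move list and check that all disks end on peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # list end = top of peg
    for src, dst in moves:
        if not pegs[src]:
            return False                        # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                        # larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs["C"] == list(range(n, 0, -1))

for n in range(3, 11):
    print(n, "disks -> optimal length", len(hanoi_moves(n)))  # 7, 15, 31, ...
```

Because the rules never change, only the instance size does, accuracy and trace behavior can be compared across complexity levels without any risk of memorized benchmark answers.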
4. Experiments & Results
The study compares LRMs vs. standard LLMs with identical inference budgets, in environments free from training data contamination.
There are three regimes of complexity:
Low Complexity: Standard LLMs sometimes outperform LRMs, which tend to "overthink" simple tasks.
Medium Complexity: LRMs use chain-of-thought effectively and outperform standard LLMs.
High Complexity: Both collapse to near-zero accuracy—a “complete accuracy collapse.”
LRMs ramp up reasoning effort (longer thinking traces) as problems get harder, but only up to a threshold; beyond it their effort declines, suggesting they effectively "give up" even though sufficient token budget remains.
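One way to see this pattern is to measure thinking-trace length per complexity level. The sketch below is hedged: `generate_with_trace` and `count_tokens` are hypothetical stand-ins for whatever inference stack and tokenizer are in use; the paper's finding is that the resulting curve rises with complexity and then drops past a threshold.

```python
# Hedged sketch: track "reasoning effort" as average thinking-token count per
# complexity level. generate_with_trace() and count_tokens() are hypothetical
# placeholders, not real APIs from the paper.

def effort_curve(model, prompts_by_complexity: dict[int, list[str]]) -> dict[int, float]:
    """Average thinking-token count per complexity level (higher = more effort)."""
    curve = {}
    for level, prompts in sorted(prompts_by_complexity.items()):
        lengths = []
        for prompt in prompts:
            trace, _answer = generate_with_trace(model, prompt)  # hypothetical inference call
            lengths.append(count_tokens(trace))                   # hypothetical tokenizer call
        curve[level] = sum(lengths) / len(lengths)
    return curve
```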
Two further findings stand out:
LRMs fail to exploit explicit algorithms: even when the exact solution procedure is given in the prompt (see the sketch after this list), accuracy barely improves.
Reasoning traces are inconsistent across puzzles, revealing limitations in exact computation.
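To make the explicit-algorithm finding concrete, the sketch below shows what an "algorithm provided" prompt condition could look like for Tower of Hanoi. The prompt wording is illustrative, not the paper's exact text; the reported result is that accuracy still collapses at roughly the same complexity even with the procedure spelled out.

```python
# Sketch of an "algorithm provided" prompt condition: the recursive procedure for
# Tower of Hanoi is spelled out, so the model only needs to execute it step by step.
# Prompt text here is illustrative, not the paper's exact wording.

ALGORITHM_HINT = """\
To solve Tower of Hanoi with n disks, use this recursive procedure:
  solve(n, source, auxiliary, target):
    if n == 0: return
    solve(n - 1, source, target, auxiliary)
    move the top disk from source to target
    solve(n - 1, auxiliary, source, target)
List every move in order as (source peg, target peg)."""

def build_prompt(n_disks: int, with_algorithm: bool) -> str:
    """Build the task prompt, optionally prepending the explicit algorithm."""
    task = (f"Solve Tower of Hanoi with {n_disks} disks on pegs A, B, C. "
            "All disks start on peg A and must end on peg C.")
    return f"{ALGORITHM_HINT}\n\n{task}" if with_algorithm else task
```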
5. Scalability Limits & the Collapse of Reasoning Models
Advanced reasoning capabilities offer benefits only up to a point. The collapse in performance suggests inherent scalability limits in current chain-of-thought-style LRMs, prompting reflection on whether simply scaling up models and reasoning tokens is enough.
6. Conclusion
The paper concludes that LRMs display an illusion of thinking: realistic-looking reasoning chains that break down as complexity grows. Future work should explore new paradigms, such as combining neural networks with symbolic systems or dynamic computation strategies, to achieve robust AI reasoning.