Can AI really "think"? Apple's new research suggests we may be mistaking verbosity for intelligence.
Large Language Models (LLMs) have evolved dramatically, from basic autocomplete machines to "reasoning" agents that can work through math problems, write code, and simulate step-by-step logic. Enter Large Reasoning Models (LRMs), variants like OpenAI's o3-mini, Claude 3.7 Sonnet Thinking, and DeepSeek-R1, that "think out loud" using long Chain-of-Thought (CoT) reasoning. These models are trained to produce detailed reasoning traces before arriving at an answer.
Sounds like progress toward Artificial General Intelligence, right?
Apple says: Not so fast.
The Study: Beyond Final Answers
Apple's research team conducted a systematic evaluation titled "The Illusion of Thinking". The core idea? Move away from evaluating only final answers and dive deep into the thinking process: the reasoning traces, internal logic, and consistency.
Instead of relying on traditional math/coding benchmarks (which may be contaminated with training data), they built controllable puzzle environments:
- Tower of Hanoi
- Checker Jumping
- River Crossing
- Blocks World
These puzzles let researchers gradually increase problem complexity and analyze not just whether the model got it right, but how it reasoned.
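To make that concrete, here is a minimal sketch (in Python, not from the paper) of what a controllable puzzle environment can look like: a Tower of Hanoi instance whose difficulty is a single dial, the number of disks, plus a validator that replays a proposed move sequence against the rules instead of only grading the final answer. Names like `HanoiEnv` and `validate_moves` are illustrative, not Apple's code.

```python
# Minimal sketch of a controllable puzzle environment (Tower of Hanoi).
# Complexity is one dial: the number of disks N (optimal solution = 2**N - 1 moves).
# Names like HanoiEnv are illustrative, not from Apple's paper.

class HanoiEnv:
    def __init__(self, n_disks: int):
        self.n = n_disks
        # Pegs 0, 1, 2; disks are ints, larger = bigger; peg 0 starts full.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def is_legal(self, src: int, dst: int) -> bool:
        if not self.pegs[src]:
            return False
        return not self.pegs[dst] or self.pegs[src][-1] < self.pegs[dst][-1]

    def apply(self, src: int, dst: int) -> bool:
        """Apply a move if legal; return False on an illegal move."""
        if not self.is_legal(src, dst):
            return False
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.n


def validate_moves(n_disks: int, moves: list[tuple[int, int]]) -> dict:
    """Check a model's full move sequence, not just its final answer."""
    env = HanoiEnv(n_disks)
    for i, (src, dst) in enumerate(moves):
        if not env.apply(src, dst):
            return {"ok": False, "failed_at_move": i, "optimal": 2**n_disks - 1}
    return {"ok": env.solved(), "failed_at_move": None, "optimal": 2**n_disks - 1}


if __name__ == "__main__":
    # A correct 3-disk solution (7 moves) versus one with an illegal second move.
    good = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
    bad = [(0, 2), (0, 2), (0, 1)]
    print(validate_moves(3, good))  # {'ok': True, ...}
    print(validate_moves(3, bad))   # {'ok': False, 'failed_at_move': 1, ...}
```

Dialing `n_disks` upward is exactly the kind of graded complexity sweep the study relies on: the optimal solution length grows as 2^N - 1, so difficulty rises predictably with no risk of training-data contamination.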
Three Phases of AI "Thinking"
The experiments uncovered three distinct performance regimes for LRMs:
1. Low Complexity: Standard LLMs > LRMs
At basic levels, regular LLMs actually outperform reasoning models. They're more accurate and more efficient (fewer tokens).
2. Medium Complexity: LRMs Gain Edge
When puzzles become moderately complex, LRMs shine. Their ability to reason through multiple steps gives them an advantage.
3. High Complexity: Total Collapse
Surprisingly, both LRMs and LLMs fail entirely beyond a complexity threshold. Even more shocking? Reasoning effort drops: LRMs stop trying even when they have token budget left.
Inside the Mind of LRMs
Apple's team dug into the actual reasoning traces using puzzle simulators. Here's what they found:
- Overthinking at low complexity: models find the right answer early but keep thinking, often wandering into incorrect territory.
- Late corrections at medium complexity: right answers surface only after lots of wrong ideas.
- Zero success at high complexity: no correct solutions, no meaningful exploration.
In short: LRMs "think" inefficiently. And the more complex the problem, the worse it gets.
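A rough idea of how that kind of trace analysis can work, assuming you have already parsed a model's "thinking" text into an ordered list of candidate answers (the parsing step and the checker here are stand-ins, not Apple's code): feed the candidates to a simulator or checker and record where the first correct one appears.

```python
# Sketch of trace analysis: given candidate solutions extracted (in order) from a
# model's reasoning trace, find where the first correct one appears.
# The extraction step and 'check' are assumed; 'check' could be any puzzle simulator.

from typing import Callable, Sequence

def first_correct_position(candidates: Sequence, check: Callable[[object], bool]) -> dict:
    """Return where the first correct candidate sits in the trace (None if absent)."""
    first_ok = next((i for i, c in enumerate(candidates) if check(c)), None)
    return {
        "n_candidates": len(candidates),
        "first_correct_at": first_ok,
        # "Overthinking": effort spent after the answer was already found.
        "wasted_after_correct": None if first_ok is None else len(candidates) - first_ok - 1,
    }

if __name__ == "__main__":
    # Toy example: the "puzzle" is to name the optimal move count for 3-disk Hanoi (7).
    check = lambda ans: ans == 7
    easy_trace = [7, 9, 7, 5]      # found it first, then kept wandering (overthinking)
    medium_trace = [15, 9, 11, 7]  # lots of wrong ideas, correct answer arrives late
    hard_trace = [15, 31, 12]      # never correct

    for name, trace in [("easy", easy_trace), ("medium", medium_trace), ("hard", hard_trace)]:
        print(name, first_correct_position(trace, check))
```

Run on the three toy traces, this reproduces the pattern in miniature: wasted effort after an early correct answer, a late correction, and no correct answer at all.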
Even When Given the Algorithm…
The researchers even handed the models the correct solution algorithm. Did they execute it flawlessly?
Nope.
Even then, LRMs made mistakes, especially on problems like River Crossing, where performance fell apart after just a few moves. This indicates deep limitations in symbolic reasoning and logical execution.
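As a sketch of how such an execution check can be scored, again using Tower of Hanoi and a hard-coded `model_moves` list as a stand-in for whatever the model actually emitted: generate the ground-truth move list from the classic recursive algorithm, then report the first move where the model departs from it.

```python
# Sketch: even when the correct algorithm is known, we can measure how faithfully a
# model *executes* it by comparing its emitted moves to the ground-truth sequence.
# 'model_moves' below is a hard-coded stand-in, not a real model call.

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Ground-truth optimal move list for n disks (classic recursive algorithm)."""
    if n == 0:
        return []
    return hanoi_moves(n - 1, src, dst, aux) + [(src, dst)] + hanoi_moves(n - 1, aux, src, dst)

def first_divergence(expected, actual):
    """Index of the first move where the model departs from the algorithm, else None."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    # Running short of the full sequence also counts as a divergence.
    return None if len(actual) >= len(expected) else len(actual)

if __name__ == "__main__":
    expected = hanoi_moves(4)  # 15 moves for 4 disks
    model_moves = expected[:5] + [(1, 0)] + expected[6:]  # pretend the model slips at move 5
    print("optimal length:", len(expected))
    print("model diverges at move:", first_divergence(expected, model_moves))  # -> 5
```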
A Hard Limit on Reasoning?
Despite all the recent advancements (Chain-of-Thought prompting, reinforcement learning, long context windows), the findings point to a fundamental truth:
"Today's reasoning models simulate thinking. But they don't generalize it."
There's a scaling barrier where more compute doesn't mean better thinking. In fact, Apple observed that LRMs think less when faced with harder tasks, perhaps "realizing" they're outmatched.
Why This Matters
This isn't just an academic insight. If you're building apps that rely on LLMs for reasoning (coding assistants, math solvers, autonomous agents), you need to ask:
- Are the models actually reasoning, or mimicking?
- Are their thinking steps valid, or verbose hallucinations?
- Can they scale to harder tasks, or are they brittle?
This paper urges the community to rethink what reasoning means in LLMs, and whether "thinking tokens" are a path forward or a distraction.
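One pragmatic response to those questions, sketched below under loose assumptions (`ask_model` and `validate_plan` are hypothetical placeholders for your own LLM client and domain checker, not anything from the paper), is to treat the reasoning trace as untrusted output: gate every answer behind an independent verifier, and fall back to a retry, a symbolic solver, or a human when verification fails.

```python
# Hedged sketch of a guard-rail pattern: trust the verifier, not the reasoning trace.
# 'ask_model' and 'validate_plan' are hypothetical placeholders for your own
# LLM client and domain checker; neither comes from Apple's paper.

from typing import Callable, Optional

def ask_model(task: str) -> str:
    """Placeholder for a real LLM call (e.g., your provider's SDK)."""
    return "step 1 ... step 2 ..."  # pretend output

def validate_plan(task: str, plan: str) -> bool:
    """Placeholder for an independent checker: unit tests, a puzzle simulator, a solver."""
    return "step 2" in plan         # trivially weak check, for illustration only

def solve_with_guardrail(task: str, retries: int = 2,
                         validate: Callable[[str, str], bool] = validate_plan) -> Optional[str]:
    """Only return an answer that an external validator accepts; otherwise give up loudly."""
    for _ in range(retries + 1):
        plan = ask_model(task)
        if validate(task, plan):
            return plan
    return None  # escalate to a human or a symbolic solver instead of shipping a guess

if __name__ == "__main__":
    print(solve_with_guardrail("Solve the 8-disk Tower of Hanoi"))
```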
Final Thoughts
Apple's research delivers a well-measured critique of current LLM reasoning claims. By peeling back the layers of these models' "thought processes," it reveals that our progress toward AGI is more superficial than we'd like to admit.
For devs and researchers, it's a wake-up call:
Don't be fooled by the illusion of thinking.
TL;DR
- LRMs are better at some tasks, but not universally.
- They fail at general reasoning beyond a complexity threshold.
- More tokens ≠ more reasoning effort.
- Models "overthink" simple problems and collapse under hard ones.
- Simulating thought ≠ true reasoning.
Read the full paper here: https://machinelearning.apple.com/research/illusion-of-thinking
Let's keep pushing, but keep questioning.