Apple research claims popular AI models fail at hard reasoning: Why does it matter?

HIGHLIGHTS

Apple study found AI reasoning collapses on hard puzzles, revealing brittle LRM performance

AI models "give up" on complex tasks, shortening reasoning chains before reaching solutions

Synthetic benchmarks, though valuable, overstate AI limitations by ignoring real-world tools


Over the weekend, Apple released new research claiming that the most advanced generative AI models from the likes of OpenAI, Google and Anthropic fail to handle tough logical reasoning problems. Apple’s researchers claim to show that most large reasoning models (LRMs) simply “give up” when tasked with hard puzzle-solving, exposing a major pitfall in GenAI’s reasoning capabilities as they exist in most of the LLM-based chatbots we’ve all gotten used to over the past couple of years.

In their recent paper, “The Illusion of Thinking,” Apple’s researchers pull back the curtain on how large reasoning models (LRMs) break down on difficult reasoning tasks. Basing their study on GenAI’s ability to solve certain algorithmic puzzles, Apple’s researchers paint a stark picture – these models start strong on easy and medium-difficulty problems, then simply “give up” once complexity crosses a threshold.


Also read: Humanity’s Last Exam Explained – The ultimate AI benchmark that sets the tone of our AI future

But before we declare AI reasoning officially broken, it’s worth asking – how much of this collapse reflects reality, and how much is an artifact of the puzzles themselves?

What exactly is Apple saying about AI reasoning?

Apple’s critique of benchmarks like GSM8K and MATH starts with a valid point. The paper says that too often models memorize leaked test data, inflating our sense of their reasoning prowess. In order to combat this, Apple devised four classic algorithmic puzzles – Tower of Hanoi, River Crossing, Blocks World, and Checker Jumping – each scalable in precise steps while holding the basic logic constant. This lets them track not just whether a model gets the right answer, but also the length and structure of its “chain-of-thought” token traces.
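
To see why these puzzles make good difficulty dials, consider Tower of Hanoi: the optimal solution for n disks takes exactly 2^n - 1 moves, so difficulty can be raised one precise step at a time while the rules stay identical. Here is a minimal Python sketch (our illustration, not Apple’s evaluation code) that generates the optimal move sequence:

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Recursively generate the optimal move sequence for n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move n-1 disks out of the way
        + [(source, target)]                         # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # stack the n-1 disks back on top
    )

# Difficulty scales exponentially: 3 disks -> 7 moves, 10 disks -> 1023 moves.
for n in (3, 5, 10):
    print(n, "disks:", len(hanoi_moves(n)), "moves")
```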

By testing top-tier LRMs – OpenAI’s o-series, Anthropic’s Claude, Google’s Gemini 2.5, among others – Apple’s researchers saw a consistent pattern. First, on low-complexity puzzles, vanilla next-token models sometimes outdo “reasoning” variants. Second, on medium-complexity puzzles, chain-of-thought prompting gives LRMs an edge. But at high complexity, accuracy crashes to near zero, no matter how many tokens you throw at the problem.

Also read: Mark Zuckerberg says AI will write most of Meta’s AI code by 2026


The most alarming result is what Apple calls the “accuracy cliff.” Once a puzzle’s compositional steps exceed a hidden breakpoint – which is unique to each AI model – success rates plummet instantly. Equally telling is what happens to the models’ token-level reasoning traces. Rather than lengthening their chains to tackle harder steps, LRMs start shortening them – a clear “giving up” heuristic, as Apple frames it.

This echoes earlier stress-test research on SAT solvers and arithmetic models, where sharp performance drops past a certain complexity have been reported, especially on mathematical problems. Independent code-generation studies have reached the same conclusion: when faced with too many lines of logic, GenAI models produce plausible but incomplete code, trimming important details for the sake of concision and becoming less accurate as a result.

Apple’s puzzle approach effectively demonstrates that chain-of-thought prompting has clear scaling limits: performance gains diminish rapidly beyond moderate difficulty. Their use of focused synthetic tasks also preserves benchmark integrity, avoiding the contamination issues that plague popular evaluation suites where models may have memorized training data. Perhaps most significantly, their token-trace analysis shows that these models don’t simply slow down on complex problems – they actively shorten their own reasoning when they sense they’re heading toward a dead end, a more sophisticated but potentially limiting form of self-regulation.
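
That “giving up” signature is, at bottom, a measurement of reasoning-trace length against problem size. A hypothetical harness along those lines (illustrative only – the function names and data below are made up, not Apple’s methodology) could look like this:

```python
def trace_lengths(traces):
    """Rough token counts per problem size; whitespace split is a crude
    stand-in for the model's actual tokenizer."""
    return {size: len(text.split()) for size, text in traces.items()}

def find_giving_up_point(traces):
    """Return the first problem size where the reasoning trace gets shorter
    even though the puzzle got harder -- the 'giving up' pattern the paper
    describes."""
    lengths = trace_lengths(traces)
    sizes = sorted(lengths)
    for prev, curr in zip(sizes, sizes[1:]):
        if lengths[curr] < lengths[prev]:
            return curr
    return None  # traces kept growing with difficulty

# Hypothetical traces keyed by number of Hanoi disks (numbers are illustrative).
traces = {3: "think " * 120, 5: "think " * 480, 7: "think " * 900, 9: "think " * 310}
print(find_giving_up_point(traces))  # -> 9
```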

Are puzzles reflective of real-world reasoning?

Here’s where the Apple paper risks overreach. Whether you like it or not, algorithmic puzzles live in a vacuum, stripped of the rich context, domain heuristics, and external tools (think calculators, retrieval systems, or symbolic engines) that real-world tasks allow.

Few of us solve problems solely by chaining logic in our heads these days – we Google, we scribble, we offload heavy lifting to spreadsheets or math libraries. Hybrid architectures – think retrieval-augmented generation (RAG) – can dynamically fetch facts or compute precisely, shoring up a model’s general reasoning with focused tools and plugging the very gaps Apple’s puzzle evaluation exposes. By focusing narrowly on standalone LRMs, the Apple paper sidesteps these more robust systems, which are quickly becoming the norm.
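
As a rough illustration of that hybrid idea – a toy sketch, not any vendor’s actual API – a retrieval-augmented pipeline fetches supporting context first and only then asks the model to reason over it:

```python
def retrieve(query, documents, k=2):
    """Rank documents by naive word overlap with the query
    (a stand-in for a real vector search)."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(query, documents, llm):
    """Fetch supporting context first, then let the model reason over it,
    instead of relying on the model's parametric memory alone."""
    context = "\n".join(retrieve(query, documents))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer step by step."
    return llm(prompt)

# 'llm' is any callable wrapping a chat model; a stub keeps the sketch runnable.
docs = ["Tower of Hanoi with n disks needs 2**n - 1 moves.",
        "River Crossing puzzles constrain which items may share a bank."]
print(answer("how many moves for a tower of 4 disks", docs,
             llm=lambda prompt: prompt[:80] + "..."))
```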

Also read: Deepseek to Qwen: Top AI models released in 2025

Apple’s experiments reportedly use off-the-shelf models with minimal prompt optimization. But anyone who has used ChatGPT or Gemini knows that careful prompt engineering or targeted follow-up prompts can push output quality considerably higher, and the same holds in benchmarks. In other words, the reasoning collapse Apple is alluding to might shift further up the complexity curve rather than vanish entirely.

Also, interpreting shorter reasoning chains as outright failure can be problematic. Models often prune redundant intermediate steps when they suspect a pattern, aiming for a concise but correct response. In such a scenario, token economy isn’t necessarily a cry of defeat – it can also be a sign of increasing efficiency. We need finer metrics – perhaps measuring whether the pruned steps eliminate critical logical nodes or merely trim fluff – before diagnosing an exhaustion syndrome.
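
One way such a finer metric could be operationalised, purely as a hypothetical sketch: compare the model’s shortened chain against a reference solution and check whether the dropped steps were load-bearing or mere filler. Everything below is illustrative scaffolding, not a published benchmark.

```python
def pruning_report(reference_steps, model_steps, critical):
    """Classify dropped steps as 'critical' (logic the solution depends on)
    or harmless trimming the chain can do without."""
    dropped = [s for s in reference_steps if s not in model_steps]
    critical_dropped = [s for s in dropped if s in critical]
    return {
        "dropped": len(dropped),
        "critical_dropped": len(critical_dropped),
        "efficient_pruning": len(dropped) > 0 and not critical_dropped,
    }

# Hypothetical chains for a small Hanoi instance.
reference = ["move d1 A->C", "restate goal", "move d2 A->B",
             "move d1 C->B", "move d3 A->C"]
model     = ["move d1 A->C", "move d2 A->B", "move d1 C->B", "move d3 A->C"]
print(pruning_report(reference, model,
                     critical={"move d3 A->C", "move d2 A->B"}))
# Only 'restate goal' was dropped, so the shorter chain is efficiency, not failure.
```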

Reality check for the “Illusion of Thinking”

All said and done, Apple’s “The Illusion of Thinking” is a welcome reality check. It reminds us that shiny demos on sanitized benchmarks can lull us into overconfidence. The paper’s controlled puzzles unveil genuine cracks in standalone LLM reasoning, and its trace analyses offer a compelling new window into model behavior.

But it’s also important to note that these puzzles are not the final word on AI’s reasoning future. Real-world intelligence rarely reduces to pure logic chains; retrieval skills, the tools at hand, and human ingenuity all play a part when we’re trying to get something done. If we want AI that truly “thinks,” we must broaden our evaluation horizons, test hybrid systems in pragmatic scenarios, and refine our metrics to capture both the depth and the efficiency of reasoning.

Also read: WWDC 2025 could be Apple’s ‘AI gap year’ as it focuses on branding over breakthroughs

Jayesh Shinde

Executive Editor at Digit. Technology journalist since Jan 2008, with stints at Indiatimes.com and PCWorld.in. Enthusiastic dad, reluctant traveler, weekend gamer, LOTR nerd, pseudo bon vivant.