{"id":1965,"date":"2025-06-14T18:11:34","date_gmt":"2025-06-14T18:11:34","guid":{"rendered":"https:\/\/violethoward.com\/new\/do-reasoning-models-really-think-or-not-apple-research-sparks-lively-debate-response\/"},"modified":"2025-06-14T18:11:34","modified_gmt":"2025-06-14T18:11:34","slug":"do-reasoning-models-really-think-or-not-apple-research-sparks-lively-debate-response","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/do-reasoning-models-really-think-or-not-apple-research-sparks-lively-debate-response\/","title":{"rendered":"Do reasoning models really think or not? Apple research sparks lively debate, response"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy.\u00a0Learn more<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n<\/div><p>Apple\u2019s machine-learning group set off a rhetorical firestorm earlier this month with its release of \u201cThe Illusion of Thinking,\u201d a 53-page research paper arguing that so-called large reasoning models (LRMs) or reasoning large language models (reasoning LLMs) such as OpenAI\u2019s \u201co\u201d series and Google\u2019s Gemini-2.5 Pro and Flash Thinking don\u2019t actually engage in independent \u201cthinking\u201d or \u201creasoning\u201d from generalized first principles learned from their training data.<\/p>\n\n\n\n<p>Instead, the authors contend, these reasoning LLMs are actually performing a kind of \u201cpattern matching\u201d and their apparent reasoning ability seems to fall apart once a task becomes too complex, suggesting that their architecture and performance is not a viable path to improving generative AI to the point that it is artificial generalized intelligence (AGI), which OpenAI defines as a model that 
outperforms humans at most economically valuable work, or superintelligence, AI whose capabilities exceed what human beings can comprehend.<\/p>\n\n\n\n<p>Unsurprisingly, the paper immediately circulated widely among the machine learning community on X, and many readers\u2019 initial reactions were to declare that Apple had effectively disproven much of the hype around this class of AI: \u201cApple just proved AI \u2018reasoning\u2019\u00a0models like Claude, DeepSeek-R1, and o3-mini don\u2019t actually reason at all,\u201d declared Ruben Hassid, creator of EasyGen, an LLM-driven tool for auto-writing LinkedIn posts. \u201cThey just memorize patterns really well.\u201d<\/p>\n\n\n\n<p>Now a new paper has emerged: the cheekily titled \u201cThe Illusion of The Illusion of Thinking.\u201d Notably, it was co-authored by a reasoning LLM itself, Claude Opus 4, alongside Alex Lawsen, a human independent AI researcher and technical writer. It gathers many of the broader ML community\u2019s criticisms of the Apple paper and argues that the methodologies and experimental designs the Apple research team used in its initial work are fundamentally flawed.<\/p>\n\n\n\n<p>While we at VentureBeat are not ML researchers and are not prepared to say Apple\u2019s researchers are wrong, the debate has certainly been lively, and the question of how the capabilities of LRMs compare to human thinking seems far from settled.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-the-apple-research-study-was-designed-and-what-it-found\">How the Apple Research study was designed \u2014 and what it 
found<\/h2>\n\n\n\n<p>Using four classic planning problems \u2014 Tower of Hanoi, Blocks World, River Crossing and Checkers Jumping \u2014 Apple\u2019s researchers designed a battery of tasks that forced reasoning models to plan multiple moves ahead and generate complete solutions. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" height=\"482\" width=\"800\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?w=800\" alt=\"\" class=\"wp-image-3012072\" srcset=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg 1458w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?resize=300,181 300w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?resize=768,462 768w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?resize=800,482 800w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?resize=400,241 400w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?resize=750,452 750w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?resize=578,348 578w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/06\/Gs7EI0SXsAAVpXk.jpg?resize=930,560 930w\" sizes=\"(max-width: 800px) 100vw, 800px\"\/><\/figure>\n\n\n\n<p>These games were chosen for their long history in cognitive science and AI research and their ability to scale in complexity as more steps or constraints are added. Each puzzle required the models to not just produce a correct final answer, but to explain their thinking along the way using chain-of-thought prompting.<\/p>\n\n\n\n<p>As the puzzles increased in difficulty, the researchers observed a consistent drop in accuracy across multiple leading reasoning models. In the most complex tasks, performance plunged to zero. 
Notably, the length of the models\u2019 internal reasoning traces\u2014measured by the number of tokens spent thinking through the problem\u2014also began to shrink. Apple\u2019s researchers interpreted this as a sign that the models were abandoning problem-solving altogether once the tasks became too hard, essentially \u201cgiving up.\u201d<\/p>\n\n\n\n<p>The timing of the paper\u2019s release, just ahead of Apple\u2019s annual Worldwide Developers Conference (WWDC), added to the impact. It quickly went viral across X, where many interpreted the findings as a high-profile admission that current-generation LLMs are still glorified autocomplete engines, not general-purpose thinkers. This framing, while controversial, drove much of the initial discussion and debate that followed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-critics-take-aim-on-x\">Critics take aim on X<\/h2>\n\n\n\n<p>Among the most vocal critics of the Apple paper was ML researcher and X user @scaling01 (aka \u201cLisan al Gaib\u201d), who posted multiple threads dissecting the methodology. 
<\/p>\n\n\n\n<p>In one widely shared post, Lisan argued that the Apple team conflated token-budget failures with reasoning failures, noting that \u201call models will have 0 accuracy with more than 13 disks simply because they cannot output that much!\u201d <\/p>\n\n\n\n<p>For puzzles like Tower of Hanoi, he emphasized, the output size grows exponentially while LLM context windows remain fixed, writing: \u201cjust because Tower of Hanoi requires exponentially more steps than the other ones, that only require quadratically or linearly more steps, doesn\u2019t mean Tower of Hanoi is more difficult.\u201d He also convincingly showed that models like Claude 3 Sonnet and DeepSeek-R1 often produced algorithmically correct strategies in plain text or code\u2014yet were still marked wrong.<\/p>\n\n\n\n<p>Another post highlighted that even breaking the task down into smaller, decomposed steps worsened model performance\u2014not because the models failed to understand, but because they lacked memory of previous moves and strategy. <\/p>\n\n\n\n<p>\u201cThe LLM needs the history and a grand strategy,\u201d he wrote, suggesting the real problem was context-window size rather than reasoning.<\/p>\n\n\n\n<p>I raised another important caveat myself on X: Apple never benchmarked model performance against human performance on the same tasks. \u201cAm I missing it, or did you not compare LRMs to human perf[ormance] on [the] same tasks?? If not, how do you know this same drop-off in perf doesn\u2019t happen to people, too?\u201d I asked the researchers directly in a thread tagging the paper\u2019s authors. I also emailed them about this and many other questions, but they have yet to respond.<\/p>\n\n\n\n<p>Others echoed that sentiment, noting that human problem solvers also falter on long, multistep logic puzzles, especially without pen-and-paper tools or memory aids. 
Without that baseline, Apple\u2019s claim of a fundamental \u201creasoning collapse\u201d feels ungrounded.<\/p>\n\n\n\n<p>Several researchers also questioned the binary framing of the paper\u2019s title and thesis, which draws a hard line between \u201cpattern matching\u201d and \u201creasoning.\u201d <\/p>\n\n\n\n<p>Alexander Doria, aka Pierre-Carl Langlais, an LLM trainer at the energy-efficient French AI startup Pleias, said the framing <em>misses the nuance<\/em>, arguing that models might be learning partial heuristics rather than simply matching patterns. <\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Ok I guess I have to go through that Apple paper. <\/p><p>My main issue is the framing which is super binary: &#8220;Are these models capable of generalizable reasoning, or are they leveraging different forms of pattern matching?&#8221; Or what if they only caught genuine yet partial heuristics. 
<a href=\"https:\/\/t.co\/GZE3eG7WlM\">pic.twitter.com\/GZE3eG7WlM<\/a><\/p>\u2014 Alexander Doria (@Dorialexander) <a href=\"https:\/\/twitter.com\/Dorialexander\/status\/1931624658387833263?ref_src=twsrc%5Etfw\">June 8, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<p>Ethan Mollick, the AI focused professor at University of Pennsylvania\u2019s Wharton School of Business,  called the idea that LLMs are \u201chitting a wall\u201d premature, likening it to similar claims about \u201cmodel collapse\u201d that didn\u2019t pan out.<\/p>\n\n\n\n<p>Meanwhile, critics like @arithmoquine were more cynical, suggesting that Apple\u2014behind the curve on LLMs compared to rivals like OpenAI and Google\u2014might be trying to lower expectations,\u201d coming up with research on \u201chow it\u2019s all fake and gay and doesn\u2019t matter anyway\u201d they quipped, pointing out Apple\u2019s reputation with now poorly performing AI products like Siri.<\/p>\n\n\n\n<p>In short, while Apple\u2019s study triggered a meaningful conversation about evaluation rigor, it also exposed a deep rift over how much trust to place in metrics when the test itself might be flawed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-a-measurement-artifact-or-a-ceiling\">A measurement artifact, or a ceiling?<\/h2>\n\n\n\n<p>In other words, the models may have understood the puzzles but ran out of \u201cpaper\u201d to write the full solution.<\/p>\n\n\n\n<p>\u201cToken limits, not logic, froze the models,\u201d wrote Carnegie Mellon researcher Rohan Paul in a widely shared thread summarizing the follow-up tests.<\/p>\n\n\n\n<p>Yet not everyone is ready to clear LRMs of the charge. 
Some observers point out that Apple\u2019s study still revealed three performance regimes \u2014 simple tasks where added reasoning hurts, mid-range puzzles where it helps, and high-complexity cases where both standard and \u201cthinking\u201d models crater.<\/p>\n\n\n\n<p>Others view the debate as corporate positioning, noting that Apple\u2019s own on-device \u201cApple Intelligence\u201d models trail rivals on many public leaderboards.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-rebuttal-the-illusion-of-the-illusion-of-thinking\">The rebuttal: \u201cThe Illusion of the Illusion of Thinking\u201d<\/h2>\n\n\n\n<p>In response to Apple\u2019s claims, a new paper titled \u201cThe Illusion of the Illusion of Thinking\u201d was released on arXiv by independent researcher and technical writer Alex Lawsen of the nonprofit Open Philanthropy, in collaboration with Anthropic\u2019s Claude Opus 4. <\/p>\n\n\n\n<p>The paper directly challenges the original study\u2019s conclusion that LLMs fail due to an inherent inability to reason at scale. Instead, the rebuttal presents evidence that the observed performance collapse was largely a by-product of the test setup\u2014not a true limit of reasoning capability.<\/p>\n\n\n\n<p>Lawsen and Claude demonstrate that many of the failures in the Apple study stem from token limitations. For example, in tasks like Tower of Hanoi, the models must print exponentially many steps \u2014 2^15 - 1 = 32,767 moves for just 15 disks \u2014 leading them to hit output ceilings. <\/p>\n\n\n\n<p>The rebuttal points out that Apple\u2019s evaluation script penalized these token-overflow outputs as incorrect, even when the models followed a correct solution strategy internally.<\/p>\n\n\n\n<p>The authors also highlight several questionable task constructions in the Apple benchmarks. Some of the River Crossing puzzles, they note, are mathematically unsolvable as posed, and yet model outputs for these cases were still scored. 
This further calls into question the conclusion that accuracy failures represent cognitive limits rather than structural flaws in the experiments.<\/p>\n\n\n\n<p>To test their theory, Lawsen and Claude ran new experiments allowing models to give compressed, programmatic answers. When asked to output a Lua function that could generate the Tower of Hanoi solution\u2014rather than writing every step line-by-line\u2014models suddenly succeeded on far more complex problems. This shift in format eliminated the collapse entirely, suggesting that the models didn\u2019t fail to reason. They simply failed to conform to an artificial and overly strict rubric.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-why-it-matters-for-enterprise-decision-makers\">Why it matters for enterprise decision-makers<\/h2>\n\n\n\n<p>The back-and-forth underscores a growing consensus: evaluation design is now as important as model design.<\/p>\n\n\n\n<p>Requiring LRMs to enumerate every step may test their printers more than their planners, while compressed formats, programmatic answers or external scratchpads give a cleaner read on actual reasoning ability.<\/p>\n\n\n\n<p>The episode also highlights practical limits developers face as they ship agentic systems\u2014context windows, output budgets and task formulation can make or break user-visible performance.<\/p>\n\n\n\n<p>For enterprise technical decision makers building applications atop reasoning LLMs, this debate is more than academic. It raises critical questions about where, when, and how to trust these models in production workflows\u2014especially when tasks involve long planning chains or require precise step-by-step output.<\/p>\n\n\n\n<p>If a model appears to \u201cfail\u201d on a complex prompt, the problem may not lie in its reasoning ability, but in how the task is framed, how much output is required, or how much memory the model has access to. 
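<\/p>

<p>The compressed-answer fix is easy to make concrete. The rebuttal had models emit a Lua function; the minimal Python sketch below is an illustrative translation of that idea, not the authors\u2019 actual code. It generates the full Tower of Hanoi solution as data, so a few hundred tokens of code stand in for tens of thousands of tokens of enumerated moves:<\/p>

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for an n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest remaining disk to the target
    hanoi(n - 1, aux, src, dst, moves)   # re-stack the n-1 disks on top of it
    return moves

# The function is a few hundred tokens; the enumerated solution is not:
moves = hanoi(15)
print(len(moves))  # 32767, i.e. 2**15 - 1 moves
```

<p>At even a handful of tokens per printed move, those 32,767 moves would blow past typical model output budgets, which is why scoring only verbatim move lists, the rebuttal argues, measures output capacity more than reasoning.<\/p>

<p>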
This is particularly relevant for industries building tools like copilots, autonomous agents, or decision-support systems, where both interpretability and task complexity can be high.<\/p>\n\n\n\n<p>Understanding the constraints of context windows, token budgets, and the scoring rubrics used in evaluation is essential for reliable system design. Developers may need to consider hybrid solutions that externalize memory, chunk reasoning steps, or use compressed outputs like functions or code instead of full verbal explanations.<\/p>\n\n\n\n<p>Most importantly, the paper\u2019s controversy is a reminder that benchmarking and real-world application are not the same. Enterprise teams should be cautious of over-relying on synthetic benchmarks that don\u2019t reflect practical use cases\u2014or that inadvertently constrain the model\u2019s ability to demonstrate what it knows.<\/p>\n\n\n\n<p>Ultimately, the big takeaway for ML researchers is that before proclaiming an AI milestone\u2014or obituary\u2014make sure the test itself isn\u2019t putting the system in a box too small to think inside.<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/do-reasoning-models-really-think-or-not-apple-research-sparks-lively-debate-response\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy.\u00a0Learn more Apple\u2019s machine-learning group set off a rhetorical firestorm earlier this month with its release of \u201cThe Illusion of Thinking,\u201d a 53-page research paper arguing that so-called large reasoning models (LRMs) or reasoning 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1966,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-1965","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/06\/cfr0z3n_pen_drawing_black_ink_on_white_background_technical_sch_27601b38-b730-4322-98fb-371d1318a22f.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1965","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=1965"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1965\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/1966"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=1965"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=1965"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=1965"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}