# When AI reasoning goes wrong: Microsoft Research shows more tokens can mean more problems

Large language models (LLMs) are increasingly capable of complex reasoning through "inference-time scaling," a set of techniques that allocate more computational resources during inference to generate answers. However, a new study from Microsoft Research reveals that the effectiveness of these scaling methods isn't universal. Performance boosts vary significantly across different models, tasks and problem complexities.

The core finding is that simply throwing more compute at a problem during inference doesn't guarantee better or more efficient results.
The findings can help enterprises better understand cost volatility and model reliability as they look to integrate advanced AI reasoning into their applications.

## Putting scaling methods to the test

The Microsoft Research team conducted an extensive empirical analysis across nine state-of-the-art foundation models. These included "conventional" models such as GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Pro and Llama 3.1 405B, as well as models specifically fine-tuned for enhanced reasoning through inference-time scaling: OpenAI's o1 and o3-mini, Anthropic's Claude 3.7 Sonnet, Google's Gemini 2 Flash Thinking, and DeepSeek R1.

They evaluated these models using three distinct inference-time scaling approaches:

1. **Standard chain-of-thought (CoT):** The basic method, in which the model is prompted to answer step by step.
2. **Parallel scaling:** The model generates multiple independent answers to the same question, and an aggregator (such as majority vote or selecting the best-scoring answer) produces the final result.
3. **Sequential scaling:** The model iteratively generates an answer and uses feedback from a critic (potentially the model itself) to refine the answer in subsequent attempts.

These approaches were tested on eight challenging benchmark datasets covering a wide range of tasks that benefit from step-by-step problem-solving: math and STEM reasoning (AIME, Omni-MATH, GPQA), calendar planning (BA-Calendar), NP-hard problems (3SAT, TSP), navigation (Maze) and spatial reasoning (SpatialMap).

Several benchmarks included problems with varying difficulty levels, allowing for a more nuanced understanding of how scaling behaves as problems become harder.

"The availability of difficulty tags for Omni-MATH, TSP, 3SAT, and BA-Calendar enables us to analyze how accuracy and token usage scale with difficulty in inference-time scaling, which is a perspective that is still underexplored," the researchers wrote in the paper detailing their findings.

The researchers evaluated the Pareto frontier of LLM reasoning by analyzing both accuracy and computational cost (i.e., the number of tokens generated).
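The paper does not ship reference code for this analysis, but the idea is straightforward: given per-model (accuracy, tokens) measurements, keep only the models no other model beats on both axes. A minimal sketch, with hypothetical numbers:

```python
def pareto_frontier(runs):
    """Return the runs not dominated by any other run, where a run
    (accuracy, tokens) dominates another if it is at least as accurate
    while using no more tokens."""
    frontier = []
    for a in runs:
        dominated = any(
            b[0] >= a[0] and b[1] <= a[1] and b != a
            for b in runs
        )
        if not dominated:
            frontier.append(a)
    return frontier

# Hypothetical (accuracy, avg tokens) measurements for four models
runs = [(0.60, 2000), (0.75, 8000), (0.75, 12000), (0.50, 9000)]
frontier = pareto_frontier(runs)  # -> [(0.6, 2000), (0.75, 8000)]
```

Here the third model is dropped because another reaches the same accuracy with fewer tokens, and the fourth because it is both less accurate and more expensive.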
This helps identify how efficiently models achieve their results.

*Inference-time scaling Pareto frontier. Credit: arXiv*

They also introduced the "conventional-to-reasoning gap" measure, which compares the best possible performance of a conventional model (using an ideal "best-of-N" selection) against the average performance of a reasoning model, estimating the potential gains achievable through better training or verification techniques.

## More compute isn't always the answer

The study provided several crucial insights that challenge common assumptions about inference-time scaling:

**Benefits vary significantly:** While models tuned for reasoning generally outperform conventional ones on these tasks, the degree of improvement varies greatly depending on the specific domain and task, and gains often diminish as problem complexity increases. For instance, performance improvements seen on math problems didn't always translate equally to scientific reasoning or planning tasks.

**Token inefficiency is rife:** The researchers observed high variability in token consumption, even between models achieving similar accuracy. For example, on the AIME 2025 math benchmark, DeepSeek-R1 used over five times more tokens than Claude 3.7 Sonnet for roughly comparable average accuracy.

**More tokens do not lead to higher accuracy:** Contrary to the intuitive idea that longer reasoning chains mean better reasoning, the study found this isn't always true. "Surprisingly, we also observe that longer generations relative to the same model can sometimes be an indicator of models struggling, rather than improved reflection," the paper states. "Similarly, when comparing different reasoning models, higher token usage is not always associated with better accuracy. These findings motivate the need for more purposeful and cost-effective scaling approaches."

**Cost nondeterminism:** Perhaps most concerning for enterprise users, repeated queries to the same model for the same problem can result in highly variable token usage.
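This kind of volatility is easy to quantify in practice. As a minimal sketch (the token counts below are hypothetical, not the paper's data), rerunning one prompt several times and summarizing the spread of billed tokens gives a direct cost-predictability signal:

```python
import statistics

def cost_volatility(token_counts):
    """Summarize token usage across repeated runs of the same prompt:
    mean, standard deviation, and coefficient of variation (stdev/mean),
    a rough gauge of how unpredictable the query's cost is."""
    mean = statistics.mean(token_counts)
    stdev = statistics.stdev(token_counts)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}

# Hypothetical token counts from five runs of the same prompt
counts = [3100, 2900, 7400, 3000, 3600]
summary = cost_volatility(counts)  # mean 4000, stdev ~ 1920
```

A single outlier run (7,400 tokens here) is enough to push the standard deviation to nearly half the mean, which is exactly the budgeting problem the researchers describe.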
This means the cost of running a query can fluctuate significantly, even when the model consistently provides the correct answer.

*Variance in response length (spikes show smaller variance). Credit: arXiv*

**The potential of verification mechanisms:** Scaling performance consistently improved across all models and benchmarks when simulated with a "perfect verifier" (using the best-of-N results).

**Conventional models sometimes match reasoning models:** By significantly increasing inference calls (up to 50x more in some experiments), conventional models like GPT-4o could sometimes approach the performance levels of dedicated reasoning models, particularly on less complex tasks.
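The perfect-verifier simulation can be sketched alongside plain majority voting. This is not the paper's code; `is_correct` stands in for an oracle that the study only simulates:

```python
from collections import Counter

def majority_vote(answers):
    """Parallel-scaling aggregator: the most common answer wins."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, is_correct):
    """'Perfect verifier' upper bound: the attempt counts as solved if
    any one of the N sampled answers passes the verifier."""
    for answer in answers:
        if is_correct(answer):
            return answer
    return None

samples = ["42", "41", "42", "43", "42"]
# Majority vote picks "42"; a perfect verifier rescues a minority
# answer ("43") whenever it happens to be the correct one.
```

The gap between the two aggregators on the same samples is precisely the headroom that better real-world verifiers could unlock.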
However, the conventional models' gains diminished rapidly in highly complex settings, indicating that brute-force scaling has its limits.

*On some tasks, the accuracy of GPT-4o continues to improve with parallel and sequential scaling. Credit: arXiv*

## Implications for the enterprise

These findings carry significant weight for developers and enterprise adopters of LLMs. The issue of cost nondeterminism is particularly stark and makes budgeting difficult.
As the researchers point out, "Ideally, developers and users would prefer models for which the standard deviation on token usage per instance is low for cost predictability."

"The profiling we do in [the study] could be useful for developers as a tool to pick which models are less volatile for the same prompt or for different prompts," Besmira Nushi, senior principal research manager at Microsoft Research, told VentureBeat. "Ideally, one would want to pick a model that has low standard deviation for correct inputs."

*Models that peak blue to the left consistently generate the same number of tokens at the given task. Credit: arXiv*

The study also provides useful insight into the correlation between a model's accuracy and its response length.
For example, the following diagram shows that math generations longer than roughly 11,000 tokens have a very slim chance of being correct, and that such generations should either be stopped at that point or restarted with some sequential feedback. However, Nushi points out that models amenable to these post hoc mitigations also tend to show a cleaner separation between correct and incorrect samples.

*Accuracy versus response length for math queries*

"Ultimately, it is also the responsibility of model builders to think about reducing accuracy and cost nondeterminism, and we expect a lot of this to happen as the methods get more mature," Nushi said. "Alongside cost nondeterminism, accuracy nondeterminism also applies."

Another important finding is the consistent performance boost from perfect verifiers, which highlights a critical area for future work: building robust and broadly applicable verification mechanisms.

"The availability of stronger verifiers can have different types of impact," Nushi said, such as improving foundational training methods for reasoning. "If used efficiently, these can also shorten the reasoning traces."

Strong verifiers can also become a central part of enterprise agentic AI solutions.
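As an illustration of what repurposing an existing checker as a verifier might look like, the sketch below wraps a toy calendar-constraint validator (all names and numbers are hypothetical, echoing the article's meeting-invite example) around candidate slots a model might propose:

```python
def calendar_verifier(busy, slot):
    """Deterministic check: a proposed (start, end) meeting slot is
    valid if it does not overlap any busy interval."""
    start, end = slot
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)

def first_verified(candidates, verify):
    """Return the first model-proposed candidate the verifier accepts,
    or None if none passes."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None

busy = [(9, 10), (13, 14)]                 # busy hours, toy granularity
proposals = [(9, 11), (10, 11), (13, 15)]  # hypothetical model outputs
slot = first_verified(proposals, lambda s: calendar_verifier(busy, s))
# slot == (10, 11): the only proposal that avoids both busy intervals
```

The model handles the natural-language side (understanding the request, generating candidates), while the deterministic checker supplies the formal guarantee, which is the division of labor Nushi describes.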
Many enterprise stakeholders already have such verifiers in place, such as SAT solvers and logistics validity checkers, which may need to be repurposed for more agentic solutions.

"The questions for the future are how such existing techniques can be combined with AI-driven interfaces and what is the language that connects the two," Nushi said. "The necessity of connecting the two comes from the fact that users will not always formulate their queries in a formal way; they will want to use a natural language interface and expect the solutions in a similar format or in a final action (e.g., propose a meeting invite)."
[Source link](https://venturebeat.com/ai/when-ai-reasoning-goes-wrong-microsoft-research-shows-more-tokens-can-mean-more-problems/)