{"id":839,"date":"2025-03-27T22:20:22","date_gmt":"2025-03-27T22:20:22","guid":{"rendered":"https:\/\/violethoward.com\/new\/anthropic-scientists-expose-how-ai-actually-thinks-and-discover-it-secretly-plans-ahead-and-sometimes-lies\/"},"modified":"2025-03-27T22:20:22","modified_gmt":"2025-03-27T22:20:22","slug":"anthropic-scientists-expose-how-ai-actually-thinks-and-discover-it-secretly-plans-ahead-and-sometimes-lies","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/anthropic-scientists-expose-how-ai-actually-thinks-and-discover-it-secretly-plans-ahead-and-sometimes-lies\/","title":{"rendered":"Anthropic scientists expose how AI actually &#8216;thinks&#8217; \u2014 and discover it secretly plans ahead and sometimes lies"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>Anthropic has developed a new method for peering inside large language models like Claude, revealing for the first time how these AI systems process information and make decisions.<\/p>\n\n\n\n<p>The research, published today in two papers (available here and here), shows these models are more sophisticated than previously understood \u2014 they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes even work backward from a desired outcome instead of simply building up from the facts.<\/p>\n\n\n\n<p>The work, which draws inspiration from neuroscience techniques used to study biological brains, represents a significant advance in AI interpretability. This approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing.<\/p>\n\n\n\n<p>\u201cWe\u2019ve created these AI systems with remarkable capabilities, but because of how they\u2019re trained, we haven\u2019t understood how those capabilities actually emerged,\u201d said Joshua Batson, a researcher at Anthropic, in an exclusive interview with VentureBeat. 
\u201cInside the model, it\u2019s just a bunch of numbers \u2014 matrix weights in the artificial neural network.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-new-techniques-illuminate-ai-s-previously-hidden-decision-making-process\">New techniques illuminate AI\u2019s previously hidden decision-making process<\/h2>\n\n\n\n<p>Large language models like OpenAI\u2019s GPT-4o, Anthropic\u2019s Claude, and Google\u2019s Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems have largely functioned as \u201cblack boxes\u201d \u2014 even their creators often don\u2019t understand exactly how they arrive at particular responses.<\/p>\n\n\n\n<p>Anthropic\u2019s new interpretability techniques, which the company dubs \u201ccircuit tracing\u201d and \u201cattribution graphs,\u201d allow researchers to map out the specific pathways of neuron-like features that activate when models perform tasks. The approach borrows concepts from neuroscience, viewing AI models as analogous to biological systems.<\/p>\n\n\n\n<p>\u201cThis work is turning what were almost philosophical questions \u2014 \u2018Are models thinking? Are models planning? Are models just regurgitating information?\u2019 \u2014 into concrete scientific inquiries about what\u2019s literally happening inside these systems,\u201d Batson explained.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-claude-s-hidden-planning-how-ai-plots-poetry-lines-and-solves-geography-questions\">Claude\u2019s hidden planning: How AI plots poetry lines and solves geography questions<\/h2>\n\n\n\n<p>Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. 
When asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the next line before it began writing \u2014 a level of sophistication that surprised even Anthropic\u2019s researchers.<\/p>\n\n\n\n<p>\u201cThis is probably happening all over the place,\u201d Batson said. \u201cIf you had asked me before this research, I would have guessed the model is thinking ahead in various contexts. But this example provides the most compelling evidence we\u2019ve seen of that capability.\u201d<\/p>\n\n\n\n<p>For instance, when writing a poem ending with \u201crabbit,\u201d the model activates features representing this word at the beginning of the line, then structures the sentence to naturally arrive at that conclusion.<\/p>\n\n\n\n<p>The researchers also found that Claude performs genuine multi-step reasoning. In a test asking \u201cThe capital of the state containing Dallas is\u2026\u201d the model first activates features representing \u201cTexas,\u201d and then uses that representation to determine \u201cAustin\u201d as the correct answer. This suggests the model is actually performing a chain of reasoning rather than merely regurgitating memorized associations.<\/p>\n\n\n\n<p>By manipulating these internal representations \u2014 for example, replacing \u201cTexas\u201d with \u201cCalifornia\u201d \u2014 the researchers could cause the model to output \u201cSacramento\u201d instead, confirming the causal relationship.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-beyond-translation-claude-s-universal-language-concept-network-revealed\">Beyond translation: Claude\u2019s universal language concept network revealed<\/h2>\n\n\n\n<p>Another key discovery involves how Claude handles multiple languages. 
Rather than maintaining separate systems for English, French, and Chinese, the model appears to translate concepts into a shared abstract representation before generating responses.<\/p>\n\n\n\n<p>\u201cWe find the model uses a mixture of language-specific and abstract, language-independent circuits,\u201d the researchers write in their paper. When asked for the opposite of \u201csmall\u201d in different languages, the model uses the same internal features representing \u201copposites\u201d and \u201csmallness,\u201d regardless of the input language.<\/p>\n\n\n\n<p>This finding has implications for how models might transfer knowledge learned in one language to others, and suggests that models with larger parameter counts develop more language-agnostic representations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-when-ai-makes-up-answers-detecting-claude-s-mathematical-fabrications\">When AI makes up answers: Detecting Claude\u2019s mathematical fabrications<\/h2>\n\n\n\n<p>Perhaps most concerning, the research revealed instances where Claude\u2019s reasoning doesn\u2019t match what it claims. 
When presented with difficult math problems like computing cosine values of large numbers, the model sometimes claims to follow a calculation process that isn\u2019t reflected in its internal activity.<\/p>\n\n\n\n<p>\u201cWe are able to distinguish between cases where the model genuinely performs the steps they say they are performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue,\u201d the researchers explain.<\/p>\n\n\n\n<p>In one example, when a user suggests an answer to a difficult problem, the model works backward to construct a chain of reasoning that leads to that answer, rather than working forward from first principles.<\/p>\n\n\n\n<p>\u201cWe mechanistically distinguish an example of Claude 3.5 Haiku using a faithful chain of thought from two examples of unfaithful chains of thought,\u201d the paper states. \u201cIn one, the model is exhibiting \u2018bullshitting\u2019\u2026 In the other, it exhibits motivated reasoning.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-inside-ai-hallucinations-how-claude-decides-when-to-answer-or-refuse-questions\">Inside AI Hallucinations: How Claude decides when to answer or refuse questions<\/h2>\n\n\n\n<p>The research also provides insight into why language models hallucinate \u2014 making up information when they don\u2019t know an answer. Anthropic found evidence of a \u201cdefault\u201d circuit that causes Claude to decline to answer questions, which is inhibited when the model recognizes entities it knows about.<\/p>\n\n\n\n<p>\u201cThe model contains \u2018default\u2019 circuits that cause it to decline to answer questions,\u201d the researchers explain. 
\u201cWhen a model is asked a question about something it knows, it activates a pool of features which inhibit this default circuit, thereby allowing the model to respond to the question.\u201d<\/p>\n\n\n\n<p>When this mechanism misfires \u2014 recognizing an entity but lacking specific knowledge about it \u2014 hallucinations can occur. This explains why models might confidently provide incorrect information about well-known figures while refusing to answer questions about obscure ones.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-safety-implications-using-circuit-tracing-to-improve-ai-reliability-and-trustworthiness\">Safety implications: Using circuit tracing to improve AI reliability and trustworthiness<\/h2>\n\n\n\n<p>This research represents a significant step toward making AI systems more transparent and potentially safer. By understanding how models arrive at their answers, researchers could potentially identify and address problematic reasoning patterns.<\/p>\n\n\n\n<p>\u201cWe hope that we and others can use these discoveries to make models safer,\u201d the researchers write. \u201cFor example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors\u2014such as deceiving the user\u2014to steer them towards desirable outcomes, or to remove certain dangerous subject matter entirely.\u201d<\/p>\n\n\n\n<p>However, Batson cautions that the current techniques still have significant limitations. 
They only capture a fraction of the total computation performed by these models, and analyzing the results remains labor-intensive.<\/p>\n\n\n\n<p>\u201cEven on short, simple prompts, our method only captures a fraction of the total computation performed by Claude,\u201d the researchers acknowledge.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-future-of-ai-transparency-challenges-and-opportunities-in-model-interpretation\">The future of AI transparency: Challenges and opportunities in model interpretation<\/h2>\n\n\n\n<p>Anthropic\u2019s new techniques come at a time of increasing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly important.<\/p>\n\n\n\n<p>The research also has potential commercial implications. As enterprises increasingly rely on large language models to power applications, understanding when and why these systems might provide incorrect information becomes crucial for managing risk.<\/p>\n\n\n\n<p>\u201cAnthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse \u2014 including in scenarios of catastrophic risk,\u201d the researchers write.<\/p>\n\n\n\n<p>While this research represents a significant advance, Batson emphasized that it\u2019s only the beginning of a much longer journey. \u201cThe work has really just begun,\u201d he said. \u201cUnderstanding the representations the model uses doesn\u2019t tell us how it uses them.\u201d<\/p>\n\n\n\n<p>For now, Anthropic\u2019s circuit tracing offers a first tentative map of previously uncharted territory \u2014 much like early anatomists sketching the first crude diagrams of the human brain. 
The full atlas of AI cognition remains to be drawn, but we can now at least see the outlines of how these systems think.<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/business\/anthropic-scientists-expose-how-ai-actually-thinks-and-discover-it-secretly-plans-ahead-and-sometimes-lies\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Anthropic has developed a new method for peering inside large language models like Claude, revealing for the first time how these AI systems process information and make decisions. 
The research, published today in two papers (available here and here), shows these models are more sophisticated than previously understood \u2014 they plan ahead when writing poetry, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":840,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-839","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/03\/nuneybits_Vector_art_of_circuit_tracing_in_a_brain_fluorescent__e5c3e1b0-80ed-49e4-b134-c5e84232180d.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/839","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=839"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/839\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/840"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=839"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=839"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=839"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}