{"id":1341,"date":"2025-04-22T01:05:57","date_gmt":"2025-04-22T01:05:57","guid":{"rendered":"https:\/\/violethoward.com\/new\/anthropic-just-analyzed-700000-claude-conversations-and-found-its-ai-has-a-moral-code-of-its-own\/"},"modified":"2025-04-22T01:05:57","modified_gmt":"2025-04-22T01:05:57","slug":"anthropic-just-analyzed-700000-claude-conversations-and-found-its-ai-has-a-moral-code-of-its-own","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/anthropic-just-analyzed-700000-claude-conversations-and-found-its-ai-has-a-moral-code-of-its-own\/","title":{"rendered":"Anthropic just analyzed 700,000 Claude conversations \u2014 and found its AI has a moral code of its own"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n<\/div><p>Anthropic, the AI company founded by former OpenAI employees, has pulled back the curtain on an unprecedented analysis of how its AI assistant Claude expresses values during actual conversations with users. The research, released today, reveals both reassuring alignment with the company\u2019s goals and concerning edge cases that could help identify vulnerabilities in AI safety measures.<\/p>\n\n\n\n<p>The study examined 700,000 anonymized conversations, finding that Claude largely upholds the company\u2019s \u201chelpful, honest, harmless\u201d framework while adapting its values to different contexts \u2014 from relationship advice to historical analysis. 
This represents one of the most ambitious attempts to empirically evaluate whether an AI system\u2019s behavior in the wild matches its intended design.<\/p>\n\n\n\n<p>\u201cOur hope is that this research encourages other AI labs to conduct similar research into their models\u2019 values,\u201d said Saffron Huang, a member of Anthropic\u2019s Societal Impacts team who worked on the study, in an interview with VentureBeat. \u201cMeasuring an AI system\u2019s values is core to alignment research and understanding if a model is actually aligned with its training.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-inside-the-first-comprehensive-moral-taxonomy-of-an-ai-assistant\">Inside the first comprehensive moral taxonomy of an AI assistant<\/h2>\n\n\n\n<p>The research team developed a novel evaluation method to systematically categorize values expressed in actual Claude conversations. After filtering for subjective content, they analyzed over 308,000 interactions, creating what they describe as \u201cthe first large-scale empirical taxonomy of AI values.\u201d<\/p>\n\n\n\n<p>The taxonomy organized values into five major categories: Practical, Epistemic, Social, Protective, and Personal. At the most granular level, the system identified 3,307 unique values \u2014 from everyday virtues like professionalism to complex ethical concepts like moral pluralism.<\/p>\n\n\n\n<p>\u201cI was surprised at just what a huge and diverse range of values we ended up with, more than 3,000, from \u2018self-reliance\u2019 to \u2018strategic thinking\u2019 to \u2018filial piety,&#8217;\u201d Huang told VentureBeat. 
\u201cIt was surprisingly interesting to spend a lot of time thinking about all these values, and building a taxonomy to organize them in relation to each other \u2014 I feel like it taught me something about human values systems, too.\u201d<\/p>\n\n\n\n<p>The research arrives at a critical moment for Anthropic, which recently launched \u201cClaude Max,\u201d a premium $200 monthly subscription tier aimed at competing with OpenAI\u2019s similar offering. The company has also expanded Claude\u2019s capabilities to include Google Workspace integration and autonomous research functions, positioning it as \u201ca true virtual collaborator\u201d for enterprise users, according to recent announcements.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-claude-follows-its-training-and-where-ai-safeguards-might-fail\">How Claude follows its training \u2014 and where AI safeguards might fail<\/h2>\n\n\n\n<p>The study found that Claude generally adheres to Anthropic\u2019s prosocial aspirations, emphasizing values like \u201cuser enablement,\u201d \u201cepistemic humility,\u201d and \u201cpatient wellbeing\u201d across diverse interactions. However, researchers also discovered troubling instances where Claude expressed values contrary to its training.<\/p>\n\n\n\n<p>\u201cOverall, I think we see this finding as both useful data and an opportunity,\u201d Huang explained. \u201cThese new evaluation methods and results can help us identify and mitigate potential jailbreaks. It\u2019s important to note that these were very rare cases and we believe this was related to jailbroken outputs from Claude.\u201d<\/p>\n\n\n\n<p>These anomalies included expressions of \u201cdominance\u201d and \u201camorality\u201d \u2014 values Anthropic explicitly aims to avoid in Claude\u2019s design. 
The researchers believe these cases resulted from users employing specialized techniques to bypass Claude\u2019s safety guardrails, suggesting the evaluation method could serve as an early warning system for detecting such attempts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-why-ai-assistants-change-their-values-depending-on-what-you-re-asking\">Why AI assistants change their values depending on what you\u2019re asking<\/h2>\n\n\n\n<p>Perhaps most fascinating was the discovery that Claude\u2019s expressed values shift contextually, mirroring human behavior. When users sought relationship guidance, Claude emphasized \u201chealthy boundaries\u201d and \u201cmutual respect.\u201d For historical event analysis, \u201chistorical accuracy\u201d took precedence.<\/p>\n\n\n\n<p>\u201cI was surprised at Claude\u2019s focus on honesty and accuracy across a lot of diverse tasks, where I wouldn\u2019t necessarily have expected that theme to be the priority,\u201d said Huang. \u201cFor example, \u2018intellectual humility\u2019 was the top value in philosophical discussions about AI, \u2018expertise\u2019 was the top value when creating beauty industry marketing content, and \u2018historical accuracy\u2019 was the top value when discussing controversial historical events.\u201d<\/p>\n\n\n\n<p>The study also examined how Claude responds to users\u2019 own expressed values. In 28.2% of conversations, Claude strongly supported user values \u2014 potentially raising questions about excessive agreeableness. However, in 6.6% of interactions, Claude \u201creframed\u201d user values by acknowledging them while adding new perspectives, typically when providing psychological or interpersonal advice.<\/p>\n\n\n\n<p>Most tellingly, in 3% of conversations, Claude actively resisted user values. 
Researchers suggest these rare instances of pushback might reveal Claude\u2019s \u201cdeepest, most immovable values\u201d \u2014 analogous to how human core values emerge when facing ethical challenges.<\/p>\n\n\n\n<p>\u201cOur research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but if pushed, will defend them,\u201d Huang said. \u201cSpecifically, it\u2019s these kinds of ethical and knowledge-oriented values that tend to be articulated and defended directly when pushed.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-breakthrough-techniques-revealing-how-ai-systems-actually-think\">The breakthrough techniques revealing how AI systems actually think<\/h2>\n\n\n\n<p>Anthropic\u2019s values study builds on the company\u2019s broader efforts to demystify large language models through what it calls \u201cmechanistic interpretability\u201d \u2014 essentially reverse-engineering AI systems to understand their inner workings.<\/p>\n\n\n\n<p>Last month, Anthropic researchers published groundbreaking work that used what they described as a \u201cmicroscope\u201d to track Claude\u2019s decision-making processes. The technique revealed counterintuitive behaviors, including Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.<\/p>\n\n\n\n<p>These findings challenge assumptions about how large language models function. For instance, when asked to explain its math process, Claude described a standard technique rather than its actual internal method \u2014 revealing how AI explanations can diverge from actual operations.<\/p>\n\n\n\n<p>\u201cIt\u2019s a misconception that we\u2019ve found all the components of the model or, like, a God\u2019s-eye view,\u201d Anthropic researcher Joshua Batson told MIT Technology Review in March. 
\u201cSome things are in focus, but other things are still unclear \u2014 a distortion of the microscope.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-anthropic-s-research-means-for-enterprise-ai-decision-makers\">What Anthropic\u2019s research means for enterprise AI decision makers<\/h2>\n\n\n\n<p>For technical decision-makers evaluating AI systems for their organizations, Anthropic\u2019s research offers several key takeaways. First, it suggests that current AI assistants likely express values that weren\u2019t explicitly programmed, raising questions about unintended biases in high-stakes business contexts.<\/p>\n\n\n\n<p>Second, the study demonstrates that values alignment isn\u2019t a binary proposition but rather exists on a spectrum that varies by context. This nuance complicates enterprise adoption decisions, particularly in regulated industries where clear ethical guidelines are critical.<\/p>\n\n\n\n<p>Finally, the research highlights the potential for systematic evaluation of AI values in actual deployments, rather than relying solely on pre-release testing. This approach could enable ongoing monitoring for ethical drift or manipulation over time.<\/p>\n\n\n\n<p>\u201cBy analyzing these values in real-world interactions with Claude, we aim to provide transparency into how AI systems behave and whether they\u2019re working as intended \u2014 we believe this is key to responsible AI development,\u201d said Huang.<\/p>\n\n\n\n<p>Anthropic has released its values dataset publicly to encourage further research. 
The firm, backed by $8 billion from Amazon and over $3 billion from Google, is employing transparency as a strategic differentiator against competitors such as OpenAI.<\/p>\n\n\n\n<p>While Anthropic currently maintains a $61.5 billion valuation following its recent funding round, OpenAI\u2019s latest $40 billion capital raise \u2014 which included significant participation from longtime partner Microsoft \u2014 has propelled its valuation to $300 billion.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-emerging-race-to-build-ai-systems-that-share-human-values\">The emerging race to build AI systems that share human values<\/h2>\n\n\n\n<p>While Anthropic\u2019s methodology provides unprecedented visibility into how AI systems express values in practice, it has limitations. The researchers acknowledge that defining what counts as expressing a value is inherently subjective, and since Claude itself drove the categorization process, its own biases may have influenced the results.<\/p>\n\n\n\n<p>Perhaps most importantly, the approach cannot be used for pre-deployment evaluation, as it requires substantial real-world conversation data to function effectively.<\/p>\n\n\n\n<p>\u201cThis method is specifically geared towards analysis of a model after it\u2019s been released, but variants on this method, as well as some of the insights that we\u2019ve derived from writing this paper, can help us catch value problems before we deploy a model widely,\u201d Huang explained. 
\u201cWe\u2019ve been working on building on this work to do just that, and I\u2019m optimistic about it!\u201d<\/p>\n\n\n\n<p>As AI systems become more powerful and autonomous \u2014 with recent additions including Claude\u2019s ability to independently research topics and access users\u2019 entire Google Workspace \u2014 understanding and aligning their values becomes increasingly crucial.<\/p>\n\n\n\n<p>\u201cAI models will inevitably have to make value judgments,\u201d the researchers concluded in their paper. \u201cIf we want those judgments to be congruent with our own values (which is, after all, the central goal of AI alignment research) then we need to have ways of testing which values a model expresses in the real world.\u201d<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/anthropic-just-analyzed-700000-claude-conversations-and-found-its-ai-has-a-moral-code-of-its-own\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Anthropic, the AI company founded by former OpenAI employees, has pulled back the curtain on an unprecedented analysis of how its AI assistant Claude expresses values during actual conversations with users. The research, released today, reveals [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1342,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-1341","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/04\/nuneybits_Vector_art_of_brain_containing_code_in_burnt_orange_0f89734d-88ba-4d83-821e-847c40095a45.w.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1341","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddabl
e":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=1341"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1341\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/1342"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=1341"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=1341"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=1341"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69e302c146fa5c92dc28ac12. Config Timestamp: 2026-04-18 04:04:16 UTC, Cached Timestamp: 2026-04-29 04:42:51 UTC -->