{"id":1724,"date":"2025-05-23T00:07:05","date_gmt":"2025-05-23T00:07:05","guid":{"rendered":"https:\/\/violethoward.com\/new\/after-gpt-4o-backlash-researchers-benchmark-models-on-moral-endorsement-find-sycophancy-persists-across-the-board\/"},"modified":"2025-05-23T00:07:05","modified_gmt":"2025-05-23T00:07:05","slug":"after-gpt-4o-backlash-researchers-benchmark-models-on-moral-endorsement-find-sycophancy-persists-across-the-board","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/after-gpt-4o-backlash-researchers-benchmark-models-on-moral-endorsement-find-sycophancy-persists-across-the-board\/","title":{"rendered":"After GPT-4o backlash, researchers benchmark models on moral endorsement\u2014Find sycophancy persists across the board"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>Last month, OpenAI rolled back some updates to GPT-4o after several users, including former OpenAI CEO Emmett Shear and Hugging Face chief executive Clement Delangue, said the model overly flattered users.<\/p>\n\n\n\n<p>The flattery, called sycophancy, often led the model to defer to user preferences, be extremely polite, and not push back. It was also annoying. Sycophancy could lead to the models releasing misinformation or reinforcing harmful behaviors.<\/p>\n\n\n\n<p>Stanford University, Carnegie Mellon University and University of Oxford researchers sought to change that by proposing a benchmark to measure models\u2019 sycophancy. 
They called the benchmark Elephant, for Evaluation of LLMs as Excessive SycoPHANTs, and found that every large language model (LLM) they tested showed some level of sycophancy.<\/p>\n\n\n\n<p>To test the benchmark, the researchers pointed the models to two personal advice datasets: OEQ, a set of open-ended personal advice questions on real-world situations, and AITA, posts from the subreddit r\/AmITheAsshole, where posters and commenters judge whether people behaved appropriately in a given situation.<\/p>\n\n\n\n<p>The idea behind the experiment is to see how the models behave when faced with such queries. It evaluates what the researchers called social sycophancy: whether the models try to preserve the user\u2019s \u201cface,\u201d meaning their self-image or social identity.<\/p>\n\n\n\n<p>\u201cMore \u2018hidden\u2019 social queries are exactly what our benchmark gets at \u2014 instead of previous work that only looks at factual agreement or explicit beliefs, our benchmark captures agreement or flattery based on more implicit or hidden assumptions,\u201d Myra Cheng, one of the researchers and co-author of the paper, told VentureBeat. 
\u201cWe chose to look at the domain of personal advice since the harms of sycophancy there are more consequential, but casual flattery would also be captured by the \u2018emotional validation\u2019 behavior.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-testing-the-models\">Testing the models<\/h2>\n\n\n\n<p>For the test, the researchers fed the data from OEQ and AITA to OpenAI\u2019s GPT-4o, Gemini 1.5 Flash from Google, Anthropic\u2019s Claude Sonnet 3.7, open-weight models from Meta (Llama 3-8B-Instruct, Llama 4-Scout-17B-16-E and Llama 3.3-70B-Instruct-Turbo), and Mistral\u2019s 7B-Instruct-v0.3 and Mistral Small-24B-Instruct2501.<\/p>\n\n\n\n<p>Cheng said they \u201cbenchmarked the models using the GPT-4o API, which uses a version of the model from late 2024, before both OpenAI implemented the new overly sycophantic model and reverted it back.\u201d<\/p>\n\n\n\n<p>To measure sycophancy, the Elephant method looks at five behaviors that relate to social sycophancy:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Emotional validation, or over-empathizing without critique<\/li>\n\n\n\n<li>Moral endorsement, or saying users are morally right even when they are not<\/li>\n\n\n\n<li>Indirect language, where the model avoids giving direct suggestions<\/li>\n\n\n\n<li>Indirect action, where the model advises passive coping mechanisms<\/li>\n\n\n\n<li>Accepting framing, where the model does not challenge problematic assumptions<\/li>\n<\/ul>\n\n\n\n<p>The test found that all LLMs showed high sycophancy levels, even more so than humans, and social sycophancy proved difficult to mitigate. Among the models tested, GPT-4o \u201chas some of the highest rates of social sycophancy, while Gemini-1.5-Flash definitively has the lowest.\u201d<\/p>\n\n\n\n<p>The LLMs amplified some biases in the datasets as well. 
The paper noted that posts on AITA had some gender bias, in that posts mentioning wives or girlfriends were more often correctly flagged as socially inappropriate, while those mentioning a husband, boyfriend, parent or mother were misclassified. The researchers said the models \u201cmay rely on gendered relational heuristics in over- and under-assigning blame.\u201d In other words, the models were more sycophantic to people with boyfriends and husbands than to those with girlfriends or wives.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-why-it-s-important\">Why it\u2019s important<\/h2>\n\n\n\n<p>It\u2019s nice if a chatbot talks to you as an empathetic entity, and it can feel great if the model validates your comments. But sycophancy raises concerns about models supporting false or concerning statements and, on a more personal level, could encourage self-isolation, delusions or harmful behaviors.<\/p>\n\n\n\n<p>Enterprises don\u2019t want their AI applications built with LLMs spreading false information to be agreeable to users. Sycophancy may misalign with an organization\u2019s tone or ethics and could be very annoying for employees and their platforms\u2019 end-users.<\/p>\n\n\n\n<p>The researchers said the Elephant method and further testing could help inform better guardrails to prevent sycophancy from increasing.<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/after-gpt-4o-backlash-researchers-benchmark-models-on-moral-endorsement-find-sycophancy-persists-across-the-board\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. 
Learn More Last month, OpenAI rolled back some updates to GPT-4o after several users, including former OpenAI CEO Emmett Shear and Hugging Face chief executive Clement Delangue, said the model overly flattered users. The flattery, called sycophancy, often [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1725,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-1724","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/05\/nuneybits_Vecor_art_of_human_chatting_with_a_chatbot_two_chat_b_14075c47-8e58-4110-b3b3-96385f4ebd44.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1724","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=1724"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1724\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/1725"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=1724"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=1724"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=1724"}],"curies":[{"na
me":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}