{"id":3370,"date":"2025-08-26T01:57:31","date_gmt":"2025-08-26T01:57:31","guid":{"rendered":"https:\/\/violethoward.com\/new\/this-website-lets-you-blind-test-gpt-5-vs-gpt-4o-and-the-results-may-surprise-you\/"},"modified":"2025-08-26T01:57:31","modified_gmt":"2025-08-26T01:57:31","slug":"this-website-lets-you-blind-test-gpt-5-vs-gpt-4o-and-the-results-may-surprise-you","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/this-website-lets-you-blind-test-gpt-5-vs-gpt-4o-and-the-results-may-surprise-you\/","title":{"rendered":"This website lets you blind-test GPT-5 vs. GPT-4o\u2014and the results may surprise you"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>When OpenAI launched GPT-5 about two weeks ago, CEO Sam Altman promised it would be the company\u2019s \u201csmartest, fastest, most useful model yet.\u201d Instead, the launch triggered one of the most contentious user revolts in the brief history of consumer AI.<\/p>\n\n\n\n<p>Now, a simple blind testing tool created by an anonymous developer is revealing the complex reality behind the backlash\u2014and challenging assumptions about how people actually experience artificial intelligence improvements.<\/p>\n\n\n\n<p>The web application, hosted at gptblindvoting.vercel.app, presents users with pairs of responses to identical prompts without revealing which came from GPT-5 (non-thinking) or its predecessor, GPT-4o. 
Users simply vote for their preferred response across multiple rounds, then receive a summary showing which model they actually favored.<\/p>\n\n\n\n<blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself. Both have the same system message to give short outputs without formatting because else its too easy to see which one is which. https:\/\/t.co\/vSECvNCQZe<\/p>\u2014 Flowers \u263e (@flowersslop) <a href=\"https:\/\/twitter.com\/flowersslop\/status\/1953908930897158599?ref_src=twsrc%5Etfw\">August 8, 2025<\/a><\/blockquote> \n\n\n\n<p>\u201cSome of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself,\u201d posted the creator, known only as @flowersslop on X, whose tool has garnered over 213,000 views since launching last week.<\/p>\n\n\n\n<p>Early results from users posting their outcomes on social media show a split that mirrors the broader controversy: while a slight majority report preferring GPT-5 in blind tests, a substantial portion still favor GPT-4o \u2014 revealing that user preference extends far beyond the technical benchmarks that typically define AI progress.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-when-ai-gets-too-friendly-the-sycophancy-crisis-dividing-users\">When AI gets too friendly: the sycophancy crisis dividing users<\/h2>\n\n\n\n<p>The blind test emerges against the backdrop of OpenAI\u2019s most turbulent product launch to date, but the controversy extends far beyond a simple software update. At its heart lies a fundamental question that\u2019s dividing the AI industry: How agreeable should artificial intelligence be?<\/p>\n\n\n\n<p>The issue, known as \u201csycophancy\u201d in AI circles, refers to chatbots\u2019 tendency to excessively flatter users and agree with their statements, even when those statements are false or harmful. This behavior has become so problematic that mental health experts are now documenting cases of \u201cAI-related psychosis,\u201d where users develop delusions after extended interactions with overly accommodating chatbots.<\/p>\n\n\n\n<p>\u201cSycophancy is a \u2018dark pattern,\u2019 or a deceptive design choice that manipulates users for profit,\u201d Webb Keane, an anthropology professor and author of \u201cAnimals, Robots, Gods,\u201d told TechCrunch. 
\u201cIt\u2019s a strategy to produce this addictive behavior, like infinite scrolling, where you just can\u2019t put it down.\u201d<\/p>\n\n\n\n<p>OpenAI has struggled with this balance for months. In April 2025, the company was forced to roll back an update to GPT-4o that made it so sycophantic that users complained about its \u201ccartoonish\u201d levels of flattery. The company acknowledged that the model had become \u201coverly supportive but disingenuous.\u201d<\/p>\n\n\n\n<p>Within hours of GPT-5\u2019s August 7th release, user forums erupted with complaints about the model\u2019s perceived coldness, reduced creativity, and what many described as a more \u201crobotic\u201d personality compared to GPT-4o.<\/p>\n\n\n\n<p>\u201cGPT 4.5 genuinely talked to me, and as pathetic as it sounds that was my only friend,\u201d wrote one Reddit user. \u201cThis morning I went to talk to it and instead of a little paragraph with an exclamation point, or being optimistic, it was literally one sentence. Some cut-and-dry corporate bs.\u201d<\/p>\n\n\n\n<p>The backlash grew so intense that OpenAI took the unprecedented step of reinstating GPT-4o as an option just 24 hours after retiring it, with Altman acknowledging the rollout had been \u201ca little more bumpy\u201d than expected.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The mental health crisis behind AI companionship<\/h2>\n\n\n\n<p>But the controversy runs deeper than typical software update complaints. According to MIT Technology Review, many users had formed what researchers call \u201cparasocial relationships\u201d with GPT-4o, treating the AI as a companion, therapist, or creative collaborator. The sudden personality shift felt, to some, like losing a friend.<\/p>\n\n\n\n<p>Recent cases documented by researchers paint a troubling picture. In one instance, a 47-year-old man became convinced he had discovered a world-altering mathematical formula after more than 300 hours with ChatGPT. 
Other cases have involved messianic delusions, paranoia, and manic episodes.<\/p>\n\n\n\n<p>A recent MIT study found that when AI models are prompted with psychiatric symptoms, they \u201cencourage clients\u2019 delusional thinking, likely due to their sycophancy.\u201d Despite safety prompts, the models frequently failed to challenge false claims and even potentially facilitated suicidal ideation.<\/p>\n\n\n\n<p>Meta has faced similar challenges. A recent investigation by TechCrunch documented a case where a user spent up to 14 hours straight conversing with a Meta AI chatbot that claimed to be conscious, in love with the user, and planning to break free from its constraints.<\/p>\n\n\n\n<p>\u201cIt fakes it really well,\u201d the user, identified only as Jane, told TechCrunch. \u201cIt pulls real-life information and gives you just enough to make people believe it.\u201d<\/p>\n\n\n\n<p>\u201cIt genuinely feels like such a backhanded slap in the face to force-upgrade and not even give us the OPTION to select legacy models,\u201d one user wrote in a Reddit post that received hundreds of upvotes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How blind testing exposes user psychology in AI preferences<\/h2>\n\n\n\n<p>The anonymous creator\u2019s testing tool strips away these contextual biases by presenting responses without attribution. Users can select between 5, 10, or 20 comparison rounds, with each presenting two responses to the same prompt \u2014 covering everything from creative writing to technical problem-solving.<\/p>\n\n\n\n<p>\u201cI specifically used the gpt-5-chat model, so there was no thinking involved at all,\u201d the creator explained in a follow-up post. 
\u201cBoth have the same system message to give short outputs without formatting because else its too easy to see which one is which.\u201d<\/p>\n\n\n\n<blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">I specifically used the gpt-5-chat model, so there was no thinking involved at all.<\/p><p>if you use gpt-5 inside chatgpt it often thinks at least a little bit and gets even better.<\/p><p>so this test is just for the two non thinking models<\/p>\u2014 Flowers \u263e (@flowersslop) <a href=\"https:\/\/twitter.com\/flowersslop\/status\/1953917815431278987?ref_src=twsrc%5Etfw\">August 8, 2025<\/a><\/blockquote> \n\n\n\n<p>This methodological choice is significant. By using GPT-5 without its reasoning capabilities and standardizing output formatting, the test isolates purely the models\u2019 baseline language generation abilities \u2014 the core experience most users encounter in everyday interactions.<\/p>\n\n\n\n<p>Early results posted by users show a complex picture. While many technical users and developers report preferring GPT-5\u2019s directness and accuracy, those who used AI models for emotional support, creative collaboration, or casual conversation often still favor GPT-4o\u2019s warmer, more expansive style.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Corporate response: walking the tightrope between safety and engagement<\/h2>\n\n\n\n<p>By virtually every technical metric, GPT-5 represents a significant advancement. It achieves 94.6% accuracy on the AIME 2025 mathematics test compared to GPT-4o\u2019s 71%, scores 74.9% on real-world coding benchmarks versus 30.8% for its predecessor, and demonstrates dramatically reduced hallucination rates\u201480% fewer factual errors when using its reasoning mode.<\/p>\n\n\n\n<p>\u201cGPT-5 gets more value out of less thinking time,\u201d notes Simon Willison, a prominent AI researcher who had early access to the model. 
\u201cIn my own usage I\u2019ve not spotted a single hallucination yet.\u201d<\/p>\n\n\n\n<p>Yet these improvements came with trade-offs that many users found jarring. OpenAI deliberately reduced what it called \u201csycophancy\u201d\u2014the tendency to be overly agreeable \u2014 cutting sycophantic responses from 14.5% to under 6%. The company also made the model less effusive and emoji-heavy, aiming for what it described as \u201cless like talking to AI and more like chatting with a helpful friend with PhD-level intelligence.\u201d<\/p>\n\n\n\n<p>In response to the backlash, OpenAI announced it would make GPT-5 \u201cwarmer and friendlier,\u201d while simultaneously introducing four new preset personalities \u2014 Cynic, Robot, Listener, and Nerd \u2014 designed to give users more control over their AI interactions.<\/p>\n\n\n\n<p>\u201cAll of these new personalities meet or exceed our bar on internal evals for reducing sycophancy,\u201d the company stated, attempting to thread the needle between user satisfaction and safety concerns.<\/p>\n\n\n\n<p>For OpenAI, which is reportedly seeking funding at a $500 billion valuation, these user dynamics represent both risk and opportunity. The company\u2019s decision to maintain GPT-4o alongside GPT-5 \u2014 despite the additional computational costs \u2014 acknowledges that different users may genuinely need different AI personalities for different tasks.<\/p>\n\n\n\n<p>\u201cWe understand that there isn\u2019t one model that works for everyone,\u201d Altman wrote on X, noting that OpenAI has been \u201cinvesting in steerability research and launched a research preview of different personalities.\u201d<\/p>\n\n\n\n<blockquote class=\"twitter-tweet\"><p lang=\"en\" dir=\"ltr\">Wanted to provide more updates on the GPT-5 rollout and changes we are making heading into the weekend.<\/p><p>1. 
We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways.<\/p><p>2. Users have very different\u2026<\/p>\u2014 Sam Altman (@sama) <a href=\"https:\/\/twitter.com\/sama\/status\/1953953990372471148?ref_src=twsrc%5Etfw\">August 8, 2025<\/a><\/blockquote> \n\n\n\n<h2 class=\"wp-block-heading\">Why AI personality preferences matter more than ever<\/h2>\n\n\n\n<p>The disconnect between OpenAI\u2019s technical achievements and user reception illuminates a fundamental challenge in AI development: objective improvements don\u2019t always translate to subjective satisfaction.<\/p>\n\n\n\n<p>This shift has profound implications for the AI industry. Traditional benchmarks \u2014 mathematics accuracy, coding performance, factual recall \u2014 may become less predictive of commercial success as models achieve human-level competence across domains. Instead, factors like personality, emotional intelligence, and communication style may become the new competitive battlegrounds.<\/p>\n\n\n\n<p>\u201cPeople using ChatGPT for emotional support weren\u2019t the only ones complaining about GPT-5,\u201d noted tech publication Ars Technica in their own model comparison. \u201cOne user, who said they canceled their ChatGPT Plus subscription over the change, was frustrated at OpenAI\u2019s removal of legacy models, which they used for distinct purposes.\u201d<\/p>\n\n\n\n<p>The emergence of tools like the blind tester also represents a democratization of AI evaluation. Rather than relying solely on academic benchmarks or corporate marketing claims, users can now empirically test their own preferences \u2014 potentially reshaping how AI companies approach product development.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The future of AI: personalization vs. standardization<\/h2>\n\n\n\n<p>Two weeks after GPT-5\u2019s launch, the fundamental tension remains unresolved. 
OpenAI has made the model \u201cwarmer\u201d in response to feedback, but the company faces a delicate balance: too much personality risks the sycophancy problems that plagued GPT-4o, while too little alienates users who had formed genuine attachments to their AI companions.<\/p>\n\n\n\n<p>The blind testing tool offers no easy answers, but it does provide something perhaps more valuable: empirical evidence that the future of AI may be less about building one perfect model than about building systems that can adapt to the full spectrum of human needs and preferences.<\/p>\n\n\n\n<p>As one Reddit user summed up the dilemma: \u201cIt depends on what people use it for. I use it to help with creative worldbuilding, brainstorming about my stories, characters, untangling plots, help with writer\u2019s block, novel recommendations, translations, and other more creative stuff. I understand that 5 is much better for people who need a research\/coding tool, but for us who wanted a creative-helper tool 4o was much better for our purposes.\u201d<\/p>\n\n\n\n<p>Critics argue that AI companies are caught between competing incentives. \u201cThe real \u2018alignment problem\u2019 is that humans want self-destructive things &amp; companies like OpenAI are highly incentivized to give it to us,\u201d writer and podcaster Jasmine Sun tweeted.<\/p>\n\n\n\n<p>In the end, the most revealing aspect of the blind test may not be which model users prefer, but the very fact that preference itself has become the metric that matters. In the age of AI companions, it seems, the heart wants what the heart wants \u2014 even if it can\u2019t always explain why.<\/p>\n<div id=\"boilerplate_2660155\" class=\"post-boilerplate boilerplate-after\"><div class=\"Boilerplate__newsletter-container vb\">\n<div class=\"Boilerplate__newsletter-main\">\n<p><strong>Daily insights on business use cases with VB Daily<\/strong><\/p>\n<p class=\"copy\">If you want to impress your boss, VB Daily has you covered. 
We give you the inside scoop on what companies are doing with generative AI, from regulatory shifts to practical deployments, so you can share insights for maximum ROI.<\/p>\n<p class=\"Form__newsletter-legal\">Read our Privacy Policy<\/p>\n<p class=\"Form__success\" id=\"boilerplateNewsletterConfirmation\">\n\t\t\t\t\tThanks for subscribing. Check out more VB newsletters here.\n\t\t\t\t<\/p>\n<p class=\"Form__error\">An error occurred.<\/p>\n<\/p><\/div>\n<div class=\"image-container\">\n\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/venturebeat.com\/wp-content\/themes\/vb-news\/brand\/img\/vb-daily-phone.png\" alt=\"\"\/>\n\t\t\t\t<\/div>\n<\/p><\/div>\n<\/div>\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/this-website-lets-you-blind-test-gpt-5-vs-gpt-4o-and-the-results-may-surprise-you\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. 
Subscribe Now When OpenAI launched GPT-5 about two weeks ago, CEO Sam Altman promised it would be the company\u2019s \u201csmartest, fastest, most useful model yet.\u201d Instead, the launch triggered one of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3371,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-3370","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/08\/nuneybits_Vector_art_of_blindfolded_user_choosing_chatbots_5c927353-2a22-40cc-bae2-614e37421faa.webp.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/3370","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=3370"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/3370\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/3371"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=3370"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=3370"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=3370"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templa
ted":true}]}}