{"id":3022,"date":"2025-08-07T04:39:28","date_gmt":"2025-08-07T04:39:28","guid":{"rendered":"https:\/\/violethoward.com\/new\/the-initial-reactions-to-openais-landmark-open-source-gpt-oss-models-are-highly-varied-and-mixed\/"},"modified":"2025-08-07T04:39:28","modified_gmt":"2025-08-07T04:39:28","slug":"the-initial-reactions-to-openais-landmark-open-source-gpt-oss-models-are-highly-varied-and-mixed","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/the-initial-reactions-to-openais-landmark-open-source-gpt-oss-models-are-highly-varied-and-mixed\/","title":{"rendered":"The initial reactions to OpenAI\u2019s landmark open source gpt-oss models are highly varied and mixed"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>OpenAI\u2019s long-awaited return to the \u201copen\u201d of its namesake occurred yesterday with the release of two new large language models (LLMs): <strong>gpt-oss-120B and gpt-oss-20B.<\/strong> <\/p>\n\n\n\n<p>But despite achieving technical benchmarks on par with OpenAI\u2019s other powerful proprietary AI model offerings, the broader AI developer and user community\u2019s initial <strong>response has so far been all over the map.<\/strong> If this release were a movie premiering and being graded on Rotten Tomatoes, we\u2019d be looking at a near 50% split, based on my observations. 
<\/p>\n\n\n\n<p>First, some background: OpenAI has released these two new text-only language models (no image generation or analysis) <strong>both under the permissive open source Apache 2.0 license <\/strong>\u2014<strong> the first time since 2019 (before ChatGPT) <\/strong>that the company has done so with a cutting-edge language model. <\/p>\n\n\n\n<p>The <strong>entire ChatGPT era of the last 2.7 years has so far been powered by proprietary or closed-source models<\/strong>, ones that OpenAI controlled and that users had to pay to access (or use a free tier subject to limits), with limited customizability and no way to run them offline or on private computing hardware.<\/p>\n\n\n\n<p>But that all changed with yesterday\u2019s release of the pair of gpt-oss models: a larger, more powerful one for use on a single Nvidia H100 GPU at, say, a small or medium-sized enterprise\u2019s data center or server farm, and a smaller one that runs on a single consumer laptop or desktop PC like the kind in your home office.<\/p>\n\n\n\n<p>Of course, with the models being so new, it has taken the AI power user community several hours to independently run and test them on their own individual 
benchmarks (measurements) and tasks. <\/p>\n\n\n\n<p>Now <strong>a wave of feedback is arriving, ranging from optimistic enthusiasm<\/strong> about the potential of these powerful, free, and efficient new models <strong>to dissatisfaction and dismay at what some users see as significant problems and limitations<\/strong>, especially compared to the similarly Apache 2.0-licensed wave of <strong>powerful open source, multimodal LLMs from Chinese startups<\/strong>, which can likewise be taken, customized, and run locally for free by companies in the U.S. or anywhere else in the world. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-high-benchmarks-but-still-behind-chinese-open-source-leaders\">High benchmarks, but still behind Chinese open source leaders<\/h2>\n\n\n\n<p>Intelligence benchmarks place the gpt-oss models ahead of most American open-source offerings. According to independent third-party AI benchmarking firm Artificial Analysis, gpt-oss-120B is \u201cthe most intelligent American open weights model,\u201d though it <strong>still falls short of Chinese heavyweights like DeepSeek R1 and Qwen3 235B.<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" height=\"366\" width=\"800\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?w=800\" alt=\"\" class=\"wp-image-3015234\" srcset=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg 3916w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=300,137 300w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=768,351 768w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=800,366 800w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=1536,703 1536w, 
https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=2048,937 2048w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=400,183 400w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=750,343 750w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=578,264 578w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/08\/GxoMDVga8AAJWo3.jpg?resize=930,426 930w\" sizes=\"(max-width: 800px) 100vw, 800px\"\/><\/figure>\n\n\n\n<p>\u201cOn reflection, that\u2019s all they did. Mogged on benchmarks,\u201d wrote self-proclaimed DeepSeek \u201cstan\u201d @teortaxesTex. \u201cNo good derivative models will be trained\u2026 No new usecases created\u2026 Barren claim to bragging rights.\u201d<\/p>\n\n\n\n<p>That skepticism is echoed by pseudonymous open source AI researcher Teknium (@Teknium1), co-founder of rival open source AI model provider Nous Research, who called the release \u201ca legitimate nothing burger,\u201d on X, and predicted a Chinese model will soon eclipse it. \u201cOverall very disappointed and I legitimately came open minded to this,\u201d they wrote.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-bench-maxxing-on-math-and-coding-at-the-expense-of-writing\">Bench-maxxing on math and coding at the expense of writing?<\/h2>\n\n\n\n<p>Other criticism focused on the <strong>gpt-oss models\u2019 apparent narrow usefulness. <\/strong><\/p>\n\n\n\n<p>AI influencer \u201cLisan al Gaib (@scaling01)\u201d noted that the models excel at math and coding but \u201ccompletely lack taste and common sense.\u201d He added, \u201cSo it\u2019s just a math model?\u201d<\/p>\n\n\n\n<p>In creative writing tests, some users found the model injecting equations into poetic outputs. 
\u201cThis is what happens when you benchmarkmax,\u201d Teknium remarked, sharing a screenshot where the model added an integral formula mid-poem.<\/p>\n\n\n\n<p>And @kalomaze, a researcher at decentralized AI model training company Prime Intellect, wrote that \u201cgpt-oss-120b knows less about the world than what a good 32b does. probably wanted to avoid copyright issues so they likely pretrained on majority synth. pretty devastating stuff\u201d<\/p>\n\n\n\n<p>Former Googler and independent AI developer Kyle Corbitt agreed that the gpt-oss pair of models seemed to have been trained primarily on synthetic data \u2014 that is, data generated by an AI model specifically for the purpose of training a new one \u2014 making them \u201cextremely spiky.\u201d<\/p>\n\n\n\n<p>It\u2019s \u201cgreat at the tasks it\u2019s trained on, really bad at everything else,\u201d Corbitt wrote, i.e., <strong>great on coding and math problems, and bad at more linguistic tasks like creative writing or report generation<\/strong>.<\/p>\n\n\n\n<p>In other words, the charge is that OpenAI deliberately trained the models on synthetic data rather than real-world facts and figures in order to avoid using copyrighted data scraped from websites and other repositories it doesn\u2019t own or have a license to use, a practice it and many other leading gen AI companies have been accused of in the past and now face ongoing lawsuits over. 
<\/p>\n\n\n\n<p>Others speculated OpenAI may have trained the model primarily on synthetic data to avoid safety and security issues, resulting in worse quality than if it had been trained on more real-world (and presumably copyrighted) data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-concerning-third-party-benchmark-results\">Concerning third-party benchmark results<\/h2>\n\n\n\n<p>Moreover, evaluating the models on third-party benchmarks has turned up concerning metrics in some users\u2019 eyes.<\/p>\n\n\n\n<p>SpeechMap \u2014 which measures how willing LLMs are to comply with user prompts for disallowed, biased, or politically sensitive outputs \u2014 showed compliance scores for gpt-oss-120B hovering under 40%, <strong>near the bottom of peer open models,<\/strong> indicating a tendency to refuse user requests and default to guardrails, potentially at the expense of providing accurate information. <\/p>\n\n\n\n<p>In Aider\u2019s Polyglot evaluation, <strong>gpt-oss-120B scored just 41.8% on multi-language coding tasks, far below competitors like Kimi-K2 (59.1%) and DeepSeek-R1 (56.9%).<\/strong><\/p>\n\n\n\n<p>Some users also said their tests indicated the model is oddly resistant to generating criticism of China or Russia, in contrast to its treatment of the US and EU, raising questions about bias and training data filtering.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-other-experts-have-applauded-the-release-and-what-it-signals-for-u-s-open-source-ai\">Other experts have applauded the release and what it signals for U.S. open source AI<\/h2>\n\n\n\n<p>To be fair, not all the commentary is negative. 
Software engineer and close AI watcher Simon Willison called the release \u201creally impressive\u201d on X, elaborating in a blog post on <strong>the models\u2019 efficiency and ability to achieve parity with OpenAI\u2019s proprietary o3-mini and o4-mini models.<\/strong><\/p>\n\n\n\n<p>He praised their strong performance on reasoning and STEM-heavy benchmarks, and hailed the new \u201cHarmony\u201d prompt template format \u2014 which offers developers more structured terms for guiding model responses \u2014 and support for third-party tool use as meaningful contributions.<\/p>\n\n\n\n<p>In a lengthy X post, Clem Delangue, CEO and co-founder of AI code sharing and open source community Hugging Face, encouraged users not to rush to judgment, pointing out that inference for these models is complex, and that early issues could be due to infrastructure instability and insufficient optimization among hosting providers. <\/p>\n\n\n\n<p><strong>\u201cThe power of open-source is that there\u2019s no cheating,\u201d Delangue wrote.<\/strong> \u201cWe\u2019ll uncover all the strengths and limitations\u2026 progressively.\u201d<\/p>\n\n\n\n<p>Even more cautious was Ethan Mollick, a professor at the University of Pennsylvania\u2019s Wharton School, who wrote on X that \u201cThe US now likely has the leading open weights models (or close to it),\u201d but questioned whether this is a one-off by OpenAI. 
<strong>\u201cThe lead will evaporate quickly as others catch up,\u201d<\/strong> he noted, adding that it\u2019s unclear what incentives OpenAI has to keep the models updated.<\/p>\n\n\n\n<p>Nathan Lambert, a leading AI researcher and commentator at the rival open source lab Allen Institute for AI (Ai2), praised the symbolic significance of the release on his blog Interconnects, calling it <strong>\u201ca phenomenal step for the open ecosystem, especially for the West and its allies, <\/strong>that the most known brand in the AI space has returned to openly releasing models.\u201d <\/p>\n\n\n\n<p>But he cautioned on X that gpt-oss is <strong>\u201cunlikely to meaningfully slow down [Chinese e-commerce giant Alibaba\u2019s AI team] Qwen,\u201d <\/strong>citing its usability, performance, and variety. <\/p>\n\n\n\n<p>He argued the release marks an important shift in the U.S. toward open models, but that OpenAI still has a \u201clong path back\u201d to catch up in practice.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-a-split-verdict\">A split verdict<\/h2>\n\n\n\n<p>The verdict, for now, is split. <\/p>\n\n\n\n<p>OpenAI\u2019s gpt-oss models are a landmark in terms of licensing and accessibility. <\/p>\n\n\n\n<p>But while the benchmarks look solid, the real-world \u201cvibes\u201d \u2014 as many users put it \u2014 are proving less compelling. <\/p>\n\n\n\n<p>Whether developers can build strong applications and derivatives on top of gpt-oss will determine whether the release is remembered as a breakthrough or a blip.<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/the-initial-reactions-to-openais-landmark-open-source-gpt-oss-models-are-highly-varied-and-mixed\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now OpenAI\u2019s long-awaited return to the \u201copen\u201d of its namesake occurred yesterday with the release of two new large language models (LLMs): gpt-oss-120B and gpt-oss-20B. 
But despite achieving technical benchmarks [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":3023,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-3022","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/08\/ChatGPT-Image-Aug-6-2025-02_39_56-PM.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/3022","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=3022"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/3022\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/3023"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=3022"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=3022"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=3022"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}