{"id":1957,"date":"2025-06-10T09:44:07","date_gmt":"2025-06-10T09:44:07","guid":{"rendered":"https:\/\/violethoward.com\/new\/your-ai-models-are-failing-in-production-heres-how-to-fix-model-selection\/"},"modified":"2025-06-10T09:44:07","modified_gmt":"2025-06-10T09:44:07","slug":"your-ai-models-are-failing-in-production-heres-how-to-fix-model-selection","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/your-ai-models-are-failing-in-production-heres-how-to-fix-model-selection\/","title":{"rendered":"Your AI models are failing in production\u2014Here&#8217;s how to fix model selection"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy.\u00a0Learn more<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n<\/div><p>Enterprises need to know if the models that power their applications and agents work in real-life scenarios. This type of evaluation can sometimes be complex because it is hard to predict specific scenarios. A revamped version of the RewardBench benchmark looks to give organizations a better idea of a model\u2019s real-life performance.\u00a0<\/p>\n\n\n\n<p>The Allen Institute for AI (Ai2) launched RewardBench 2, an updated version of its reward model benchmark, RewardBench, which they claim provides a more holistic view of model performance and assesses how models align with an enterprise\u2019s goals and standards.\u00a0<\/p>\n\n\n\n<p>Ai2 built RewardBench with classification tasks that measure correlations through inference-time compute and downstream training. RewardBench mainly deals with reward models (RM), which can act as judges and evaluate LLM outputs. RMs assign a score or a \u201creward\u201d that guides reinforcement learning with human feedback (RHLF).<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"twitter-tweet\" data-width=\"500\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling. <a href=\"https:\/\/t.co\/NGetvNrOQV\">pic.twitter.com\/NGetvNrOQV<\/a><\/p>\u2014 Ai2 (@allen_ai) <a href=\"https:\/\/twitter.com\/allen_ai\/status\/1929576050352111909?ref_src=twsrc%5Etfw\">June 2, 2025<\/a><\/blockquote>\n<\/div><\/figure>\n\n\n\n<p>Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it was launched. 
> RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.
>
> (Ai2, @allen_ai, June 2, 2025)

Nathan Lambert, a senior research scientist at Ai2, told VentureBeat that the first RewardBench worked as intended when it launched. Still, the model landscape evolved rapidly, and its benchmarks should evolve with it.

"As reward models became more advanced and use cases more nuanced, we quickly recognized with the community that the first version didn't fully capture the complexity of real-world human preferences," he said.

Lambert added that with RewardBench 2, "we set out to improve both the breadth and depth of evaluation—incorporating more diverse, challenging prompts and refining the methodology to reflect better how humans actually judge AI outputs in practice." He said the second version uses unseen human prompts, a more challenging scoring setup and new domains.

## Using evaluations for models that evaluate

While reward models test how well models work, it is also important that RMs themselves align with company values; otherwise, the fine-tuning and reinforcement learning process can reinforce bad behaviors such as hallucination, reduce generalization, and score harmful responses too highly.

RewardBench 2 covers six domains: factuality, precise instruction following, math, safety, focus and ties.

"Enterprises should use RewardBench 2 in two different ways depending on their application. If they're performing RLHF themselves, they should adopt the best practices and datasets from leading models in their own pipelines, because reward models need on-policy training recipes (i.e., reward models that mirror the model they're trying to train with RL). For inference-time scaling or data filtering, RewardBench 2 has shown that they can select the best model for their domain and see correlated performance," Lambert said.
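Lambert's second use case, inference-time scaling, typically means best-of-N (BoN) sampling: generate several candidate responses and keep the one the reward model scores highest. A minimal sketch, reusing the hypothetical `reward_score` helper above and a stand-in `generate` function for the base LLM:

```python
# Best-of-N sampling: the reward model acts as a filter at inference time.
# `generate` is a hypothetical stand-in for the LLM's sampling call;
# `reward_score` is the scoring helper sketched earlier.

def best_of_n(prompt: str, generate, reward_score, n: int = 8) -> str:
    """Sample n candidates and return the one the reward model prefers."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(reward_score(prompt, c), c) for c in candidates]
    _, best_response = max(scored, key=lambda pair: pair[0])
    return best_response

# Data filtering works the same way: score a pool of (prompt, response)
# pairs and keep only those above a threshold before fine-tuning on them.
```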
Lambert noted that benchmarks like RewardBench give users a way to evaluate the models they are choosing along the "dimensions that matter most to them, rather than relying on a narrow one-size-fits-all score." The notion of performance that many evaluation methods claim to assess is highly subjective, he said, because a good response depends heavily on the user's context and goals, and human preferences are very nuanced.

Ai2 released the first version of RewardBench in March 2024, billing it at the time as the first benchmark and leaderboard for reward models. Since then, several methods for benchmarking and improving RMs have emerged: researchers at Meta's FAIR released reWordBench, and DeepSeek introduced a technique called Self-Principled Critique Tuning for smarter, more scalable RMs.

> Super excited that our second reward model evaluation is out. It's substantially harder, much cleaner, and well correlated with downstream PPO/BoN sampling. Happy hillclimbing! Huge congrats to @saumyamalik44, who led the project with a total commitment to excellence.
>
> (Nathan Lambert, @natolambert, June 2, 2025)

## How models performed

Since RewardBench 2 is an updated version of RewardBench, Ai2 tested both existing and newly trained models to see whether they continue to rank highly. These included versions of Gemini, Claude, GPT-4.1 and Llama-3.1, along with datasets and models such as Qwen, Skywork and Ai2's own Tulu.

The company found that larger reward models perform best on the benchmark because their base models are stronger. Overall, the strongest performers are variants of Llama-3.1 Instruct. For focus and safety, Skywork data "is particularly helpful," and Tulu did well on factuality.

Ai2 said that while it believes RewardBench 2 "is a step forward in broad, multi-domain accuracy-based evaluation" for reward models, it cautioned that model evaluation should mainly serve as a guide for picking the models that best fit an enterprise's needs.
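"Accuracy-based evaluation" here means the benchmark checks, per prompt, whether the RM scores the human-preferred response above the rejected alternatives. Below is a sketch of that metric under the same assumptions as the earlier snippets (the `reward_score` helper and a hypothetical list of labeled examples); RewardBench 2's exact format and scoring rules live in Ai2's released code and data.

```python
# RewardBench-style accuracy: an example counts as correct when the RM scores
# the chosen response above every rejected one. `dataset` is a hypothetical
# list of dicts: {"prompt": str, "chosen": str, "rejected": [str, ...]}.

def pairwise_accuracy(dataset, reward_score) -> float:
    correct = 0
    for ex in dataset:
        chosen = reward_score(ex["prompt"], ex["chosen"])
        rejected = [reward_score(ex["prompt"], r) for r in ex["rejected"]]
        if chosen > max(rejected):
            correct += 1
    return correct / len(dataset)

# Per-domain scores (factuality, safety, math, ...) come from running this on
# each domain's slice, so models can be compared on the slice you care about.
```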
This type of evaluation can sometimes be complex because it is hard to predict specific [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1958,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-1957","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/06\/crimedy7_illustration_of_a_robot_rewarding_another_robot_abstra_11fd0825-4ec3-4e18-80c7-dd371d901a25.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1957","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=1957"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/1957\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/1958"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=1957"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=1957"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=1957"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69e302c146fa5c92dc28ac12. Config Timestamp: 2026-04-18 04:04:16 UTC, Cached Timestamp: 2026-04-29 09:45:09 UTC -->