{"id":3225,"date":"2025-08-20T04:18:57","date_gmt":"2025-08-20T04:18:57","guid":{"rendered":"https:\/\/violethoward.com\/new\/stop-benchmarking-in-the-lab-inclusion-arena-shows-how-llms-perform-in-production\/"},"modified":"2025-08-20T04:18:57","modified_gmt":"2025-08-20T04:18:57","slug":"stop-benchmarking-in-the-lab-inclusion-arena-shows-how-llms-perform-in-production","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/stop-benchmarking-in-the-lab-inclusion-arena-shows-how-llms-perform-in-production\/","title":{"rendered":"Stop benchmarking in the lab: Inclusion Arena shows how LLMs perform in production"},"content":{"rendered":" \r\n
\n\t\t\t\t
\n



\n<\/div>

Benchmarks have become essential for enterprises, helping them choose models whose performance matches their needs. But not all benchmarks are built the same, and many evaluate models against static datasets or fixed testing environments.\u00a0<\/p>\n\n\n\n

Researchers from Inclusion AI, which is affiliated with Alibaba\u2019s Ant Group, proposed a new model leaderboard and benchmark that focuses on a model\u2019s performance in real-life scenarios. They argue that LLMs need a leaderboard that accounts for how people actually use them and how much users prefer their answers, rather than only measuring a model\u2019s static knowledge.\u00a0<\/p>\n\n\n\n

In a paper, the researchers laid out the foundation for Inclusion Arena, which ranks models based on user preferences.\u00a0\u00a0<\/p>\n\n\n\n

\u201cTo address these gaps, we propose Inclusion Arena, a live leaderboard that bridges real-world AI-powered applications with state-of-the-art LLMs and MLLMs. Unlike crowdsourced platforms, our system randomly triggers model battles during multi-turn human-AI dialogues in real-world apps,\u201d the paper said.\u00a0<\/p>\n\n\n\n
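Arena-style leaderboards of this kind typically convert the outcomes of pairwise model battles into ratings with an Elo- or Bradley-Terry-style update. The sketch below is a minimal illustration of that general idea, assuming a simple Elo update; the function names, the K-factor, and the model names are illustrative assumptions, not Inclusion Arena\u2019s actual implementation.<\/p>\n\n\n\n

```python
# Hypothetical sketch of how pairwise battle outcomes can feed a
# leaderboard via Elo-style ratings. This is NOT the paper's method,
# just a common baseline for preference-based ranking.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' updated ratings after one battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at the same rating; a user prefers model A's answer
# in one randomly triggered in-app battle, so A's rating rises.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
```

With equal starting ratings the expected score is 0.5, so a single win moves the winner up by k/2 points and the loser down by the same amount; over many battles the ratings converge toward the models\u2019 relative preference strengths.<\/p>\n\n\n\n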

\n
\n\n\n\n
