{"id":940,"date":"2025-04-03T03:49:43","date_gmt":"2025-04-03T03:49:43","guid":{"rendered":"https:\/\/violethoward.com\/new\/beyond-generic-benchmarks-how-yourbench-lets-enterprises-evaluate-ai-models-against-actual-data\/"},"modified":"2025-04-03T03:49:43","modified_gmt":"2025-04-03T03:49:43","slug":"beyond-generic-benchmarks-how-yourbench-lets-enterprises-evaluate-ai-models-against-actual-data","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/beyond-generic-benchmarks-how-yourbench-lets-enterprises-evaluate-ai-models-against-actual-data\/","title":{"rendered":"Beyond generic benchmarks: How Yourbench lets enterprises evaluate AI models against actual data"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix.\u00a0<\/p>\n\n\n\n<p>However, these benchmarks often test for general capabilities. For organizations that want to use models and large language model-based agents, it\u2019s harder to evaluate how well the agent or the model actually understands their specific needs.\u00a0<\/p>\n\n\n\n<p>Model repository Hugging Face launched Yourbench, an open-source tool where developers and enterprises can create their own benchmarks to test model performance against their internal data.\u00a0<\/p>\n\n\n\n<p>Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced Yourbench on X. The feature offers \u201ccustom benchmarking and synthetic data generation from ANY of your documents. 
It\u2019s a big step towards improving how model evaluations work.\u201d<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcXlLX3ONu_kgW9dXwNON1Nlj8dUQEY4-joW5CI-UkzCik8Kyue-nifY7Si2XnimRGTuEFYC3SP3MjylCgsFbFvdRzliejILz5AgapXlKO090Az0FHiBIcaQbkByw5g_fU7ZcawNg?key=DW9pUHI0MxDh2Yj_VryLSCav\" alt=\"\"\/><\/figure>\n\n\n\n<p>He added that Hugging Face knows \u201cthat for many use cases what really matters is how well a model performs your specific task. Yourbench lets you evaluate models on what matters to you.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-creating-custom-evaluations\">Creating custom evaluations<\/h2>\n\n\n\n<p>Hugging Face said in a paper that it demonstrated Yourbench by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark \u201cusing minimal source text, achieving this for under $15 in total inference cost while perfectly preserving the relative model performance rankings.\u201d\u00a0<\/p>\n\n\n\n<p>Organizations need to pre-process their documents before Yourbench can work. This involves three stages: <\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Document Ingestion<\/strong> to \u201cnormalize\u201d file formats.<\/li>\n\n\n\n<li><strong>Semantic Chunking<\/strong> to break down the documents to meet context window limits and focus the model\u2019s attention.<\/li>\n\n\n\n<li><strong>Document Summarization<\/strong><\/li>\n<\/ul>\n\n\n\n<p>Next comes the question-and-answer generation process, which creates questions from information in the documents. 
This is where the user brings in their chosen LLM to see which one best answers the questions.\u00a0<\/p>\n\n\n\n<p>Hugging Face tested Yourbench with DeepSeek V3 and R1 models, Alibaba\u2019s Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral Small 3.1, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o-mini, and o3 mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.<\/p>\n\n\n\n<p>Shashidhar said Hugging Face also offers cost analysis on the models and found that Qwen and Gemini 2.0 Flash \u201cproduce tremendous value for very very low costs.\u201d<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-compute-limitations\">Compute limitations<\/h2>\n\n\n\n<p>However, creating custom LLM benchmarks based on an organization\u2019s documents comes at a cost. Yourbench requires a lot of compute power to work.\u00a0Shashidhar said on X that the company is \u201cadding capacity\u201d as fast as it can.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdqLU5ydJ_9Cjmo4Z5n0dzcCGIyhuysS45tjOVFBJvjKLagB9hUpgSsAp3hWO8dQwyvY543TDQ1hK16FwrezALnfPmidXG5tYv8wiUrf578_mjbNRpbICqnU0Bz3lezINev9xBSww?key=DW9pUHI0MxDh2Yj_VryLSCav\" alt=\"\"\/><\/figure>\n\n\n\n<p>Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat reached out to Hugging Face about Yourbench\u2019s compute usage.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-benchmarking-is-not-perfect\">Benchmarking is not perfect<\/h2>\n\n\n\n<p>Benchmarks and other evaluation methods give users an idea of how well models perform, but these do not perfectly capture how the models will work daily.<\/p>\n\n\n\n<p>Some have even voiced skepticism that benchmark tests show models\u2019 limitations and can lead to false conclusions about their safety and performance. 
A study also warned that benchmarking agents could be \u201cmisleading.\u201d<\/p>\n\n\n\n<p>However, enterprises cannot avoid evaluating models now that there are many choices in the market, and technology leaders must justify the rising cost of using AI models. This has led to different methods to test model performance and reliability.\u00a0<\/p>\n\n\n\n<p>Google DeepMind introduced FACTS Grounding, which tests a model\u2019s ability to generate factually accurate responses based on information from documents. Some Yale and Tsinghua University researchers developed self-invoking code benchmarks to guide enterprises on which coding LLMs work for them.\u00a0<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/ai\/beyond-generic-benchmarks-how-yourbench-lets-enterprises-evaluate-ai-models-against-actual-data\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Every AI model release inevitably includes charts touting how it outperformed its competitors in this benchmark test or that evaluation matrix.\u00a0 However, these benchmarks often test for general capabilities. For organizations that want to use models [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":941,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-940","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/04\/nuneybits_Vector_art_of_a_smiling_hugging_face_emoji_with_arms__a0538182-2148-4c6e-a5cb-6b49ebd883ea.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/940","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"emb
eddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=940"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/940\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/941"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=940"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=940"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=940"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}