{"id":3207,"date":"2025-08-19T06:41:02","date_gmt":"2025-08-19T06:41:02","guid":{"rendered":"https:\/\/violethoward.com\/new\/hugging-face-5-ways-enterprises-can-slash-ai-costs-without-sacrificing-performance\/"},"modified":"2025-08-19T06:41:02","modified_gmt":"2025-08-19T06:41:02","slug":"hugging-face-5-ways-enterprises-can-slash-ai-costs-without-sacrificing-performance","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/hugging-face-5-ways-enterprises-can-slash-ai-costs-without-sacrificing-performance\/","title":{"rendered":"Hugging Face: 5 ways enterprises can slash AI costs without sacrificing performance\u00a0"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders.<\/em> <em>Subscribe Now<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n<\/div><p>Enterprises seem to accept it as a basic fact: AI models require a significant amount of compute; they simply have to find ways to obtain more of it.\u00a0<\/p>\n\n\n\n<p>But it doesn\u2019t have to be that way, according to Sasha Luccioni, AI and climate lead at Hugging Face. What if there\u2019s a smarter way to use AI? What if, instead of striving for more (often unnecessary) compute and ways to power it, they can focus on improving model performance and accuracy?\u00a0<\/p>\n\n\n\n<p>Ultimately, model makers and enterprises are focusing on the wrong issue: They should be computing <em>smarter<\/em>, not harder or doing more, Luccioni says.\u00a0<\/p>\n\n\n\n<p>\u201cThere are smarter ways of doing things that we\u2019re currently under-exploring, because we\u2019re so blinded by: We need more FLOPS, we need more GPUs, we need more time,\u201d she said.\u00a0<\/p>\n\n\n\n<div id=\"boilerplate_2803147\" class=\"post-boilerplate boilerplate-speedbump\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><strong\/><strong>AI Scaling Hits Its Limits<\/strong><\/p>\n\n\n\n<p>Power caps, rising token costs, and inference delays are reshaping enterprise AI. Join our exclusive salon to discover how top teams are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Turning energy into a strategic advantage<\/li>\n\n\n\n<li>Architecting efficient inference for real throughput gains<\/li>\n\n\n\n<li>Unlocking competitive ROI with sustainable AI systems<\/li>\n<\/ul>\n\n\n\n<p><strong>Secure your spot to stay ahead<\/strong>: https:\/\/bit.ly\/4mwGngO<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<\/div><p>Here are five key learnings from Hugging Face that can help enterprises of all sizes use AI more efficiently.\u00a0<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-1-right-size-the-model-to-the-task-nbsp\">1: Right-size the model to the task\u00a0<\/h2>\n\n\n\n<p>Avoid defaulting to giant, general-purpose models for every use case. Task-specific or distilled models can match, or even <span style=\"box-sizing: border-box; margin: 0px; padding: 0px;\">surpass,\u00a0larger models\u00a0in terms of accuracy for targeted workloads \u2014 at a lower cost and with reduced energy consumption<\/span>.\u00a0<\/p>\n\n\n\n<p>Luccioni, in fact, has found in testing that a task-specific model uses 20 to 30 times less energy than a general-purpose one. 
In general, open-source models help with efficiency, she noted, because they don't need to be trained from scratch. That's a change from just a few years ago, when enterprises wasted resources because they couldn't find the model they needed; now they can start with a base model, then fine-tune and adapt it.

"It provides incremental, shared innovation, as opposed to siloed [efforts where] everyone's training their models on their datasets and essentially wasting compute in the process," said Luccioni.

It's becoming clear that companies are quickly getting disillusioned with gen AI, as costs are not yet proportionate to the benefits. Generic use cases, such as writing emails or transcribing meeting notes, are genuinely helpful. But task-specific models still require "a lot of work," because out-of-the-box models don't cut it and are also more costly, said Luccioni.

This is the next frontier of added value. "A lot of companies do want a specific task done," Luccioni noted. "They don't want AGI, they want specific intelligence. And that's the gap that needs to be bridged."

2. Make efficiency the default

Adopt "nudge theory" in system design: set conservative reasoning budgets, limit always-on generative features and require opt-in for high-cost compute modes.

In behavioral science, "nudge theory" is a change-management approach designed to influence human behavior subtly. The "canonical example," Luccioni noted, is cutlery with takeout: Having people decide whether they want plastic utensils, rather than automatically including them with every order, can significantly reduce waste.

"Just getting people to opt into something versus opting out of something is actually a very powerful mechanism for changing people's behavior," said Luccioni.

Defaulting to high-cost modes is also unnecessary: it drives up use, and therefore costs, because models do more work than they need to. For instance, popular search engines such as Google now place a gen AI summary at the top of results by default. Luccioni also noted that when she recently used OpenAI's GPT-5, the model automatically worked in full reasoning mode on "very simple questions."

"For me, it should be the exception," she said. "Like, 'What's the meaning of life?' Then sure, I want a gen AI summary. But with 'What's the weather like in Montreal?' or 'What are the opening hours of my local pharmacy?' I do not need a generative AI summary, yet it's the default. I think that the default mode should be no reasoning."
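One way to make the opt-in default concrete is a thin routing layer that sends every request down a cheap, no-reasoning path unless the caller explicitly asks for more. The sketch below illustrates the nudge pattern; all of the names (answer, call_small_model, call_reasoning_model, Budget) are hypothetical stand-ins, not any particular vendor's API.

```python
# A sketch of "efficiency as the default": every request takes the cheap,
# no-reasoning path unless the caller explicitly opts in. All names here are
# hypothetical stand-ins, not a real product's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_output_tokens: int
    reasoning: bool


DEFAULT = Budget(max_output_tokens=256, reasoning=False)     # conservative default
DELIBERATE = Budget(max_output_tokens=2048, reasoning=True)  # opt-in only


def call_small_model(query: str, max_tokens: int) -> str:
    # Placeholder for a cheap, task-sized model call.
    return f"[small model, <={max_tokens} tokens] {query}"


def call_reasoning_model(query: str, max_tokens: int) -> str:
    # Placeholder for an expensive, full-reasoning call.
    return f"[reasoning model, <={max_tokens} tokens] {query}"


def answer(query: str, opt_in_reasoning: bool = False) -> str:
    """Route to the lightweight path by default; reasoning is opt-in."""
    budget = DELIBERATE if opt_in_reasoning else DEFAULT
    call = call_reasoning_model if budget.reasoning else call_small_model
    return call(query, max_tokens=budget.max_output_tokens)


# "What's the weather like in Montreal?" stays on the cheap path...
print(answer("What's the weather like in Montreal?"))
# ...and the expensive mode has to be requested explicitly.
print(answer("What's the meaning of life?", opt_in_reasoning=True))
```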
3. Optimize hardware utilization

Use batching, adjust precision and fine-tune batch sizes for each hardware generation to minimize wasted memory and power draw.

For instance, enterprises should ask themselves: Does the model need to be on all the time? Will people be pinging it in real time, 100 requests at once? If so, always-on optimization is necessary, Luccioni noted. In many other cases, it isn't: the model can be run periodically to optimize memory usage, and batching can ensure optimal memory utilization.

"It's kind of like an engineering challenge, but a very specific one, so it's hard to say, 'Just distill all the models,' or 'change the precision on all the models,'" said Luccioni.

In one of her recent studies, she found that the optimal batch size depends on the hardware, even down to the specific type or version. Increasing the batch size by just one can drive up energy use, because the model suddenly needs more memory.

"This is something that people don't really look at. They're just like, 'Oh, I'm gonna maximize the batch size,' but it really comes down to tweaking all these different things, and all of a sudden it's super efficient, but it only works in your specific context," Luccioni explained.
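A simple way to see the batch-size effect Luccioni describes is to sweep batch sizes on your own hardware and measure throughput rather than assuming bigger is always better. The sketch below assumes PyTorch and a toy model; the measurement loop is illustrative, not the methodology of her study, and a fuller audit would also track power draw (for example, with a tool such as CodeCarbon).

```python
# A sketch of hardware-aware tuning: run a model at reduced precision where the
# hardware supports it, then sweep batch sizes and measure throughput on *your*
# device. The toy two-layer model stands in for whatever you actually serve.
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision cuts memory traffic on GPUs; fall back to float32 on CPU.
dtype = torch.float16 if device == "cuda" else torch.float32

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 512)
).to(device, dtype=dtype).eval()

for batch_size in (1, 2, 4, 8, 16, 32, 64):
    x = torch.randn(batch_size, 512, device=device, dtype=dtype)
    with torch.no_grad():
        for _ in range(3):  # warm-up runs, excluded from timing
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(20):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    print(f"batch={batch_size:3d}  {20 * batch_size / elapsed:12.0f} samples/s")
```

On real hardware the samples-per-second curve is rarely monotonic; the knee of the curve, not the maximum batch size, is the efficient operating point for a given device.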
4. Incentivize energy transparency

It always helps when people are incentivized; to this end, Hugging Face earlier this year launched the AI Energy Score. It's a novel way to promote energy efficiency: a 1- to 5-star rating system in which the most efficient models earn five-star status.

It could be considered the "Energy Star for AI," and it was inspired by that potentially soon-to-be-defunct federal program, which set energy-efficiency specifications and branded qualifying appliances with the Energy Star logo.

"For a couple of decades, it was really a positive motivation; people wanted that star rating, right?" said Luccioni. "Something similar with Energy Score would be great."

Hugging Face maintains a leaderboard, which it plans to update with new models (DeepSeek, GPT-oss) in September and to refresh every six months, or sooner, as new models become available. The goal, Luccioni said, is for model builders to treat the rating as a "badge of honor."

5. Rethink the "more compute is better" mindset

Instead of chasing the largest GPU clusters, begin with the question: What is the smartest way to achieve the result? For many workloads, smarter architectures and better-curated data outperform brute-force scaling.

"I think that people probably don't need as many GPUs as they think they do," said Luccioni. Instead of simply going for the biggest clusters, she urged enterprises to rethink what tasks their GPUs will be completing and why they need them, how they performed those kinds of tasks before, and what adding extra GPUs will ultimately get them.

"It's kind of this race to the bottom where we need a bigger cluster," she said. "It's thinking about what you're using AI for, what technique do you need, what does that require?"

Source: https://venturebeat.com/ai/hugging-face-5-ways-enterprises-can-slash-ai-costs-without-sacrificing-performance/