# AI's capacity crunch: Latency risk, escalating costs, and the coming surge-pricing breakpoint

The latest big headline in AI isn't model size or multimodality — it's the capacity crunch. At VentureBeat's latest AI Impact stop in NYC, Val Bercovici, chief AI officer at WEKA, joined VentureBeat CEO Matt Marshall to discuss what it really takes to scale AI amid rising latency, cloud lock-in, and runaway costs.

Those forces, Bercovici argued, are pushing AI toward its own version of surge pricing. Uber famously introduced surge pricing, bringing real-time market rates to ridesharing for the first time. Now AI is headed toward the same economic reckoning — especially for inference — as the focus turns to profitability.

"We don't have real market rates today. We have subsidized rates. That's been necessary to enable a lot of the innovation that's been happening, but sooner or later — considering the trillions of dollars of capex we're talking about right now, and the finite energy opex — real market rates are going to appear; perhaps next year, certainly by 2027," he said. "When they do, it will fundamentally change this industry and drive an even deeper, keener focus on efficiency."

### The economics of the token explosion

"The first rule is that this is an industry where more is more. More tokens equal exponentially more business value," Bercovici said.

But so far, no one has figured out how to make that sustainable. The classic business triad — cost, quality, and speed — translates in AI to latency, cost, and accuracy (especially in output tokens). And accuracy is non-negotiable. That holds not only for consumer interactions with agents like ChatGPT, but also for high-stakes use cases such as drug discovery and for business workflows in heavily regulated industries like financial services and healthcare.

"That's non-negotiable," Bercovici said. "You have to have a high amount of tokens for high inference accuracy, especially when you add security into the mix, guardrail models, and quality models. Then you're trading off latency and cost. That's where you have some flexibility. If you can tolerate high latency, and sometimes you can for consumer use cases, then you can have lower cost, with free tiers and low cost-plus tiers."

Latency, however, is a critical bottleneck for AI agents. "These agents now don't operate in any singular sense. You either have an agent swarm or no agentic activity at all," Bercovici noted.

In a swarm, groups of agents work in parallel to complete a larger objective. An orchestrator agent — the smartest model — sits at the center, determining subtasks and key requirements: architecture choices, cloud vs. on-prem execution, performance constraints, and security considerations. The swarm then executes all subtasks, effectively spinning up numerous concurrent inference users in parallel sessions.
Finally, evaluator models judge whether the overall task was successfully completed.

"These swarms go through what's called multiple turns, hundreds if not thousands of prompts and responses, until the swarm converges on an answer," Bercovici said. "And if you have a compound delay in those thousand turns, it becomes untenable. So latency is really, really important. And that means typically having to pay a high price today that's subsidized, and that's what's going to have to come down over time."

### Reinforcement learning as the new paradigm

Until around May of this year, agents weren't that performant, Bercovici explained. Then context windows became large enough, and GPUs available enough, to support agents that could complete advanced tasks, like writing reliable software. It's now estimated that in some cases, 90% of software is generated by coding agents. Now that agents have essentially come of age, Bercovici noted, reinforcement learning is the new conversation among data scientists at leading labs such as OpenAI, Anthropic, and Google's Gemini team, who view it as a critical path forward in AI innovation.

"The current AI season is reinforcement learning. It blends many of the elements of training and inference into one unified workflow," Bercovici said. "It's the latest and greatest scaling law on the road to this mythical milestone we're all trying to reach called AGI — artificial general intelligence," he added.
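The unified train-plus-infer workflow Bercovici describes can be caricatured in a few lines. This is a deliberately toy sketch: `rollout`, `reward`, and `rl_loop` are invented names, the "policy" is just a weight table, and nothing here resembles production RL at a frontier lab. It only illustrates the infer, evaluate, update cycle repeated over many loops.

```python
import random

def rollout(policy, rng):
    """Inference step: sample an action from the current policy weights."""
    actions = list(policy)
    return rng.choices(actions, weights=[policy[a] for a in actions], k=1)[0]

def reward(action):
    """Evaluation step: a stand-in for the evaluator models that score output."""
    return 1.0 if action == "good" else 0.0

def rl_loop(policy, loops=200, lr=0.1, seed=0):
    """One unified workflow: infer (rollout), evaluate (reward), train (update),
    repeated over many loops."""
    rng = random.Random(seed)
    for _ in range(loops):
        action = rollout(policy, rng)   # inference
        r = reward(action)              # evaluation
        # Training step: reinforce rewarded behavior, decay the rest,
        # keeping weights positive so sampling stays valid.
        policy[action] = max(1e-3, policy[action] + lr * (r - 0.5))
    return policy

policy = rl_loop({"good": 1.0, "bad": 1.0})
# After a few hundred loops, the policy strongly prefers the rewarded action.
```

Every pass through the loop is both an inference event (the rollout) and a training event (the weight update), which is why RL workloads inherit the infrastructure demands of both at once.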
"What's fascinating to me is that you have to apply all the best practices of how you train models, plus all the best practices of how you infer models, to be able to iterate these thousands of reinforcement learning loops and advance the whole field."

### The path to AI profitability

There's no one answer when it comes to building an infrastructure foundation that makes AI profitable, Bercovici said, since it's still an emerging field; there's no cookie-cutter approach. Going all on-prem may be the right choice for some — especially frontier model builders — while being cloud-native or running in a hybrid environment may be a better path for organizations looking to innovate agilely and responsively. Whichever path they choose initially, organizations will need to adapt their AI infrastructure strategy as their business needs evolve.

"Unit economics are what fundamentally matter here," said Bercovici. "We are definitely in a boom, or even in a bubble, you could say, in some cases, since the underlying AI economics are being subsidized. But that doesn't mean that if tokens get more expensive, you'll stop using them. You'll just get very fine-grained in terms of how you use them."

Leaders should focus less on individual token pricing and more on transaction-level economics, where efficiency and impact become visible, Bercovici concluded.
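That transaction-level framing can be made concrete with some back-of-the-envelope arithmetic. All figures below (turn counts, tokens per turn, per-token rates, and the surge multiplier) are illustrative assumptions, not real or quoted prices:

```python
def cost_per_transaction(turns, tokens_per_turn, price_per_m_tokens, surge=1.0):
    """Transaction-level unit economics: total token spend for one complete
    multi-turn agent workflow, optionally under a surge-pricing multiplier."""
    total_tokens = turns * tokens_per_turn
    return total_tokens / 1_000_000 * price_per_m_tokens * surge

# Illustrative assumptions: a 1,000-turn swarm, 2,000 tokens per turn,
# a hypothetical $5 per million tokens, and a 3x "market rate" multiplier.
subsidized = cost_per_transaction(1000, 2000, 5.0)           # → 10.0 (dollars)
market = cost_per_transaction(1000, 2000, 5.0, surge=3.0)    # → 30.0 (dollars)
```

The point of the exercise: a shift from subsidized to market rates multiplies the cost of an entire multi-turn transaction, not a single prompt, which is why fine-grained usage decisions end up being made at the transaction level.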
The pivotal question enterprises and AI companies should be asking, Bercovici said, is "What is the real cost for my unit economics?"

Viewed through that lens, the path forward isn't about doing less with AI — it's about doing it smarter and more efficiently at scale.

[Source link](https://venturebeat.com/ai/ais-capacity-crunch-latency-risk-escalating-costs-and-the-coming-surge)