{"id":2247,"date":"2025-07-03T00:07:52","date_gmt":"2025-07-03T00:07:52","guid":{"rendered":"https:\/\/violethoward.com\/new\/confidence-in-agentic-ai-why-eval-infrastructure-must-come-first\/"},"modified":"2025-07-03T00:07:52","modified_gmt":"2025-07-03T00:07:52","slug":"confidence-in-agentic-ai-why-eval-infrastructure-must-come-first","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/confidence-in-agentic-ai-why-eval-infrastructure-must-come-first\/","title":{"rendered":"Confidence in agentic AI: Why eval infrastructure must come first"},"content":{"rendered":"<div>\n<p>As AI agents enter real-world deployment, organizations are under pressure to define where they belong, how to build them effectively, and how to operationalize them at scale. At VentureBeat\u2019s Transform 2025, tech leaders gathered to talk about how they\u2019re transforming their business with agents: Joanne Chen, general partner at Foundation Capital; Shailesh Nalawadi, VP of product management at Sendbird; Thys Waanders, SVP of AI transformation at Cognigy; and Shawn Malhotra, CTO, Rocket Companies.<\/p>\n<figure class=\"wp-block-embed aligncenter is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\">\n<p>\n<iframe loading=\"lazy\" title=\"Engineer Autonomous AI Agents \u2013 Infrastructure for Next-Level Customer Experience VB Transform 2025\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/DChzGCf1pOo?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/p>\n<\/figure>\n<h2 class=\"wp-block-heading\" id=\"h-a-few-top-agentic-ai-use-cases\"><strong>A few top agentic AI use cases<\/strong><\/h2>\n<p>\u201cThe initial attraction of any of these deployments for AI agents tends to be around saving human capital 
\u2014 the math is pretty straightforward,\u201d Nalawadi said. \u201cHowever, that undersells the transformational capability you get with AI agents.\u201d<\/p>\n<p>At Rocket, AI agents have proven to be powerful tools in increasing website conversion.<\/p>\n<p>\u201cWe\u2019ve found that with our agent-based experience, the conversational experience on the website, clients are three times more likely to convert when they come through that channel,\u201d Malhotra said.<\/p>\n<p>But that\u2019s just scratching the surface. For instance, a Rocket engineer built an agent in just two days to automate a highly specialized task: calculating transfer taxes during mortgage underwriting.<\/p>\n<p>\u201cThat two days of effort saved us a million dollars a year in expense,\u201d Malhotra said. \u201cIn 2024, we saved more than a million team member hours, mostly off the back of our AI solutions. That\u2019s not just saving expense. It\u2019s also allowing our team members to focus their time on people making what is often the largest financial transaction of their life.\u201d<\/p>\n<p>Agents are essentially supercharging individual team members. That million hours saved isn\u2019t the entirety of someone\u2019s job replicated many times; it\u2019s made up of the fractions of each job that employees don\u2019t enjoy doing, or that weren\u2019t adding value to the client. And those saved hours give Rocket the capacity to handle more business.<\/p>\n<p>\u201cSome of our team members were able to handle 50% more clients last year than they were the year before,\u201d Malhotra added. 
\u201cIt means we can have higher throughput, drive more business, and again, we see higher conversion rates because they\u2019re spending the time understanding the client\u2019s needs versus doing a lot of the more rote work that the AI can do now.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-tackling-agent-complexity\"><strong>Tackling agent complexity<\/strong><\/h2>\n<p>\u201cPart of the journey for our engineering teams is moving from the mindset of software engineering \u2013 write once and test it and it runs and gives the same answer 1,000 times \u2013 to the more probabilistic approach, where you ask the same thing of an LLM and it gives different answers through some probability,\u201d Nalawadi said. \u201cA lot of it has been bringing people along. Not just software engineers, but product managers and UX designers.\u201d<\/p>\n<p>What\u2019s helped is that LLMs have come a long way, Waanders said. If they built something 18 months or two years ago, they really had to pick the right model, or the agent would not perform as expected. Now, he says, we\u2019re at a stage where most of the mainstream models behave very well. They\u2019re more predictable. But today the challenge is combining models, ensuring responsiveness, orchestrating the right models in the right sequence and weaving in the right data.<\/p>\n<p>\u201cWe have customers that push tens of millions of conversations per year,\u201d Waanders said. \u201cIf you automate, say, 30 million conversations in a year, how does that scale in the LLM world? That\u2019s all stuff that we had to discover, simple stuff, from even getting the model availability with the cloud providers. Having enough quota with a ChatGPT model, for example. Those are all learnings that we had to go through, and our customers as well. It\u2019s a brand-new world.\u201d<\/p>\n<p>A layer above orchestrating the LLM is orchestrating a network of agents, Malhotra said. 
A conversational experience has a network of agents under the hood, and the orchestrator is deciding which agent to farm the request out to from those available.<\/p>\n<p>\u201cIf you play that forward and think about having hundreds or thousands of agents who are capable of different things, you get some really interesting technical problems,\u201d he said. \u201cIt\u2019s becoming a bigger problem, because latency and time matter. That agent routing is going to be a very interesting problem to solve over the coming years.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-tapping-into-vendor-relationships\"><strong>Tapping into vendor relationships<\/strong><\/h2>\n<p>Up to this point, the first step for most companies launching agentic AI has been building in-house, because specialized tools didn\u2019t yet exist. But you can\u2019t differentiate and create value by building generic LLM or AI infrastructure, and going beyond the initial build takes specialized expertise: to debug, iterate, and improve on what\u2019s been built, and to maintain the infrastructure.<\/p>\n<p>\u201cOften we find the most successful conversations we have with prospective customers tend to be someone who\u2019s already built something in-house,\u201d Nalawadi said. \u201cThey quickly realize that getting to a 1.0 is okay, but as the world evolves and as the infrastructure evolves and as they need to swap out technology for something new, they don\u2019t have the ability to orchestrate all these things.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-preparing-for-agentic-ai-complexity\"><strong>Preparing for agentic AI complexity<\/strong><\/h2>\n<p>Theoretically, agentic AI will only grow in complexity \u2014 the number of agents in an organization will rise, they\u2019ll start learning from each other, and the number of use cases will explode. 
How can organizations prepare for the challenge?<\/p>\n<p>\u201cIt means that the checks and balances in your system will get stressed more,\u201d Malhotra said. \u201cFor something that has a regulatory process, you have a human in the loop to make sure that someone is signing off on this. For critical internal processes or data access, do you have observability? Do you have the right alerting and monitoring so that if something goes wrong, you know it\u2019s going wrong? It\u2019s doubling down on your detection, understanding where you need a human in the loop, and then trusting that those processes are going to catch if something does go wrong. But because of the power it unlocks, you have to do it.\u201d<\/p>\n<p>So how can you have confidence that an AI agent will behave reliably as it evolves?<\/p>\n<p>\u201cThat part is really difficult if you haven\u2019t thought about it at the beginning,\u201d Nalawadi said. \u201cThe short answer is, before you even start building it, you should have an eval infrastructure in place. Make sure you have a rigorous environment in which you know what good looks like, from an AI agent, and that you have this test set. Keep referring back to it as you make improvements. A very simplistic way of thinking about eval is that it\u2019s the unit tests for your agentic system.\u201d<\/p>\n<p>The problem is, it\u2019s non-deterministic, Waanders added. 
Unit testing is critical, but the biggest challenge is you don\u2019t know what you don\u2019t know \u2014 what incorrect behaviors an agent could possibly display, how it might react in any given situation.<\/p>\n<p>\u201cYou can only find that out by simulating conversations at scale, by pushing it under thousands of different scenarios, and then analyzing how it holds up and how it reacts,\u201d Waanders said.<\/p>\n<\/div>\n<p><a href=\"https:\/\/venturebeat.com\/ai\/confidence-in-agentic-ai-why-eval-infrastructure-must-come-first\/\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As AI agents enter real-world deployment, organizations are under pressure to define where they belong, how to build them effectively, and how to operationalize them at scale. At VentureBeat\u2019s Transform 2025, tech leaders gathered to talk about how they\u2019re transforming their business with agents: Joanne Chen, general partner at Foundation Capital; Shailesh Nalawadi, VP of 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2248,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-2247","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/07\/VBTRANSFORM25-0666-X3.jpg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/2247","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=2247"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/2247\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/2248"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=2247"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=2247"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=2247"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}