{"id":2160,"date":"2025-06-28T23:20:10","date_gmt":"2025-06-28T23:20:10","guid":{"rendered":"https:\/\/violethoward.com\/new\/from-hallucinations-to-hardware-lessons-from-a-real-world-computer-vision-project-gone-sideways\/"},"modified":"2025-06-28T23:20:10","modified_gmt":"2025-06-28T23:20:10","slug":"from-hallucinations-to-hardware-lessons-from-a-real-world-computer-vision-project-gone-sideways","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/from-hallucinations-to-hardware-lessons-from-a-real-world-computer-vision-project-gone-sideways\/","title":{"rendered":"From hallucinations to hardware: Lessons from a real-world computer vision project gone sideways"},"content":{"rendered":" \r\n
Join the event trusted by enterprise leaders for nearly two decades. VB Transform brings together the people building real enterprise AI strategy.\u00a0Learn more<\/em><\/p>\n\n\n\n Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: Build a model that could look at a photo of a laptop and identify any physical damage \u2014 things like cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.<\/p>\n\n\n\n Along the way, we ran into issues with hallucinations, unreliable outputs and images that were not even laptops. To solve these, we ended up applying an agentic framework in an atypical way \u2014 not for task automation, but to improve the model\u2019s performance.<\/p>\n\n\n\n In this post, we will walk through what we tried, what didn\u2019t work and how a combination of approaches eventually helped us build something reliable.<\/p>\n\n\n\n Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image into an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.<\/p>\n\n\n\n We ran into three major issues early on:<\/p>\n\n\n\n This was the point when it became clear we would need to iterate.<\/p>\n\n\n\n One thing we noticed was how much image quality affected the model\u2019s output. Users uploaded all kinds of images ranging from sharp and high-resolution to blurry. This led us to refer to research highlighting how image resolution impacts deep learning models.<\/p>\n\n\n\n We trained and tested the model using a mix of high-and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core issues of hallucination and junk image handling persisted.<\/p>\n\n\n\n Encouraged by recent experiments in combining image captioning with text-only LLMs \u2014 like the technique covered in The Batch<\/em>, where captions are generated from images and then interpreted by a language model, we decided to give it a try.<\/p>\n\n\n\n Here\u2019s how it works:<\/p>\n\n\n\n While clever in theory, this approach introduced new problems for our use case:<\/p>\n\n\n\n It was an interesting experiment, but ultimately not a solution.<\/p>\n\n\n\n This was the turning point. While agentic frameworks are usually used for orchestrating task flows (think agents coordinating calendar invites or customer service actions), we wondered if breaking down the image interpretation task into smaller, specialized agents might help.<\/p>\n\n\n\n We built an agentic framework structured like this:<\/p>\n\n\n\n This modular, task-driven approach produced much more precise and explainable results. Hallucinations dropped dramatically, junk images were reliably flagged and each agent\u2019s task was simple and focused enough to control quality well.<\/p>\n\n\n\n As effective as this was, it was not perfect. Two main limitations showed up:<\/p>\n\n\n\n We needed a way to balance precision with coverage.<\/p>\n\n\n\n To bridge the gaps, we created a hybrid system:<\/p>\n\n\n\n This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting and the confidence boost of targeted fine-tuning.<\/p>\n\n\n\n A few things became clear by the time we wrapped up this project:<\/p>\n\n\n\n What started as a simple idea, using an LLM prompt to detect physical damage in laptop images,\u00a0quickly turned into a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones not originally designed for this type of work.<\/p>\n\n\n\n Agentic frameworks, often seen as workflow utilities, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a bit of creativity, they helped us build a system that was not just more accurate, but easier to understand and manage in practice.<\/p>\n\n\n\n Shruti Tiwari is an AI product manager at Dell Technologies.<\/em><\/p>\n\n\n\n Vadiraj Kulkarni is a data scientist at Dell Technologies.<\/em><\/p>\n
\n<\/div>Where we started: Monolithic prompting<\/h2>\n\n\n\n
\n
First fix: Mixing image resolutions<\/h2>\n\n\n\n
The multimodal detour: Text-only LLM goes multimodal<\/h2>\n\n\n\n
\n
\n
A creative use of agentic frameworks<\/h2>\n\n\n\n
\n
The blind spots: Trade-offs of an agentic approach<\/h2>\n\n\n\n
\n
The hybrid solution: Combining agentic and monolithic approaches<\/h2>\n\n\n\n
\n
What we learned<\/h2>\n\n\n\n
\n
Final thoughts<\/h2>\n\n\n\n