{"id":1739,"date":"2025-05-23T16:27:00","date_gmt":"2025-05-23T16:27:00","guid":{"rendered":"https:\/\/violethoward.com\/new\/why-enterprise-rag-systems-fail-google-study-introduces-sufficient-context-solution\/"},"modified":"2025-05-23T16:27:00","modified_gmt":"2025-05-23T16:27:00","slug":"why-enterprise-rag-systems-fail-google-study-introduces-sufficient-context-solution","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/why-enterprise-rag-systems-fail-google-study-introduces-sufficient-context-solution\/","title":{"rendered":"Why enterprise RAG systems fail: Google study introduces &#8216;sufficient context&#8217; solution"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<div id=\"boilerplate_2682874\" class=\"post-boilerplate boilerplate-before\">\n<p><em>Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-css-opacity is-style-wide\"\/>\n<\/div><p>A new study from Google researchers introduces \u201csufficient context,\u201d a novel perspective for understanding and improving retrieval augmented generation (RAG) systems in large language models (LLMs).<\/p>\n\n\n\n<p>This approach makes it possible to determine if an LLM has enough information to answer a query accurately, a critical factor for developers building real-world enterprise applications where reliability and factual correctness are paramount.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-persistent-challenges-of-rag\">The persistent challenges of RAG<\/h2>\n\n\n\n<p>RAG systems have become a cornerstone for building more factual and verifiable AI applications. However, these systems can exhibit undesirable traits. 
They might confidently provide incorrect answers even when presented with retrieved evidence, get distracted by irrelevant information in the context, or fail to properly extract answers from long text snippets.

The researchers state in their paper, "The ideal outcome is for the LLM to output the correct answer if the provided context contains enough information to answer the question when combined with the model's parametric knowledge. Otherwise, the model should abstain from answering and/or ask for more information."

Achieving this ideal scenario requires building models that can determine whether the provided context can help answer a question correctly, and use it selectively. Previous attempts to address this have examined how LLMs behave with varying degrees of information. However, the Google paper argues that "while the goal seems to be to understand how LLMs behave when they do or do not have sufficient information to answer the query, prior work fails to address this head-on."

Sufficient context

To tackle this, the researchers introduce the concept of "sufficient context." At a high level, input instances are classified based on whether the provided context contains enough information to answer the query. This splits contexts into two cases:

Sufficient context: The context has all the necessary information to provide a definitive answer.

Insufficient context: The context lacks the necessary information.
This could be because the query requires specialized knowledge not present in the context, or because the information is incomplete, inconclusive, or contradictory.

[Figure. Source: arXiv]

This designation is determined by looking at the question and the associated context, without needing a ground-truth answer. This is vital for real-world applications, where ground-truth answers are not readily available during inference.

The researchers developed an LLM-based "autorater" to automate the labeling of instances as having sufficient or insufficient context. They found that Google's Gemini 1.5 Pro model, with a single example (1-shot), performed best at classifying context sufficiency, achieving high F1 scores and accuracy.

The paper notes, "In real-world scenarios, we cannot expect candidate answers when evaluating model performance.
Hence, it is desirable to use a method that works using only the query and context."

Key findings on LLM behavior with RAG

Analyzing various models and datasets through the lens of sufficient context revealed several important insights.

As expected, models generally achieve higher accuracy when the context is sufficient. However, even with sufficient context, models tend to hallucinate more often than they abstain. When the context is insufficient, the situation becomes more complex: models exhibit higher rates of abstention and, for some models, increased hallucination.

Interestingly, while RAG generally improves overall performance, additional context can also reduce a model's ability to abstain from answering when it doesn't have sufficient information. "This phenomenon may arise from the model's increased confidence in the presence of any contextual information, leading to a higher propensity for hallucination rather than abstention," the researchers suggest.

A particularly curious observation was that models can sometimes provide correct answers even when the provided context was deemed insufficient. While a natural assumption is that the models already "know" the answer from their pre-training (parametric knowledge), the researchers found other contributing factors. For example, the context might help disambiguate a query or bridge gaps in the model's knowledge, even if it doesn't contain the full answer.
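As a minimal sketch of the LLM-based autorater idea, the snippet below labels a (question, context) pair as sufficient or insufficient without seeing a ground-truth answer. The prompt wording and the `call_llm` stand-in are hypothetical, not the paper's actual prompt; a real implementation would call a model such as Gemini 1.5 Pro with a one-shot example in place of the toy heuristic.

```python
# Sketch of a "sufficient context" autorater. The prompt text and call_llm()
# stub are hypothetical stand-ins so the sketch runs offline; they are not
# the authors' actual prompt or model.

AUTORATER_PROMPT = """\
You are judging whether a context is sufficient to answer a question.

Example (1-shot):
Question: When was the Eiffel Tower completed?
Context: The Eiffel Tower was completed in 1889 for the World's Fair.
Label: SUFFICIENT

Question: {question}
Context: {context}
Label:"""

def call_llm(prompt: str) -> str:
    """Hypothetical model call. Here: a crude keyword-overlap heuristic
    standing in for a real LLM client (e.g., Gemini 1.5 Pro)."""
    body = prompt.split("Question:")[-1]          # last templated question
    question, context = body.split("Context:")
    keywords = [w for w in question.lower().split() if len(w) > 3]
    ctx = context.lower()
    hits = sum(w.strip("?,.") in ctx for w in keywords)
    return "SUFFICIENT" if hits >= max(1, len(keywords) // 2) else "INSUFFICIENT"

def rate_context(question: str, context: str) -> str:
    """Label a (question, context) pair WITHOUT a ground-truth answer."""
    prompt = AUTORATER_PROMPT.format(question=question, context=context)
    return call_llm(prompt)

print(rate_context(
    "What year did the promotion start?",
    "The spring promotion started in 2024 and offers 10% off.",
))
```

The key design point, per the paper, is that the rater sees only the query and context, which is what makes it usable at inference time.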
This ability of models to sometimes succeed even with limited external information has broader implications for RAG system design.

[Figure. Source: arXiv]

Cyrus Rashtchian, co-author of the study and senior research scientist at Google, elaborates on this, emphasizing that the quality of the base LLM remains critical. "For a really good enterprise RAG system, the model should be evaluated on benchmarks with and without retrieval," he told VentureBeat. He suggested that retrieval should be viewed as "augmentation of its knowledge," rather than the sole source of truth. The base model, he explains, "still needs to fill in gaps, or use context clues (which are informed by pre-training knowledge) to properly reason about the retrieved context.
For example, the model should know enough to know if the question is under-specified or ambiguous, rather than just blindly copying from the context."

Reducing hallucinations in RAG systems

Given the finding that models may hallucinate rather than abstain, especially with RAG compared to a no-RAG setting, the researchers explored techniques to mitigate this.

They developed a new "selective generation" framework. This method uses a smaller, separate "intervention model" to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage (the percentage of questions answered).

This framework can be combined with any LLM, including proprietary models like Gemini and GPT. The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy for answered queries across various models and datasets. The method improved the fraction of correct answers among model responses by 2-10% for Gemini, GPT, and Gemma models.

To put this 2-10% improvement into business perspective, Rashtchian offers a concrete example from customer support AI. "You could imagine a customer asking about whether they can have a discount," he said. "In some cases, the retrieved context is recent and specifically describes an ongoing promotion, so the model can answer with confidence. But in other cases, the context might be 'stale,' describing a discount from a few months ago, or maybe it has specific terms and conditions.
So it would be better for the model to say, 'I am not sure,' or 'You should talk to a customer support agent to get more information for your specific case.'"

The team also investigated fine-tuning models to encourage abstention. This involved training models on examples where the answer was replaced with "I don't know" instead of the original ground truth, particularly for instances with insufficient context. The intuition was that explicit training on such examples could steer the model to abstain rather than hallucinate.

The results were mixed: fine-tuned models often had a higher rate of correct answers but still hallucinated frequently, often more than they abstained. The paper concludes that while fine-tuning might help, "more work is needed to develop a reliable strategy that can balance these objectives."

Applying sufficient context to real-world RAG systems

For enterprise teams looking to apply these insights to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Rashtchian outlines a practical approach. He suggests first collecting a dataset of query-context pairs that represent the kind of examples the model will see in production. Next, use an LLM-based autorater to label each example as having sufficient or insufficient context.

"This already will give a good estimate of the % of sufficient context," Rashtchian said. "If it is less than 80-90%, then there is likely a lot of room to improve on the retrieval or knowledge base side of things — this is a good observable symptom."

Rashtchian advises teams to then "stratify model responses based on examples with sufficient vs. insufficient context." By examining metrics on these two separate datasets, teams can better understand performance nuances.

"For example, we saw that models were more likely to provide an incorrect response (with respect to the ground truth) when given insufficient context. This is another observable symptom," he notes, adding that "aggregating statistics over a whole dataset may gloss over a small set of important but poorly handled queries."

While an LLM-based autorater demonstrates high accuracy, enterprise teams might wonder about the additional computational cost. Rashtchian clarified that the overhead can be managed for diagnostic purposes.

"I would say running an LLM-based autorater on a small test set (say 500-1000 examples) should be relatively inexpensive, and this can be done 'offline,' so there's no worry about the amount of time it takes," he said. For real-time applications, he concedes, "it would be better to use a heuristic, or at least a smaller model." The crucial takeaway, according to Rashtchian, is that "engineers should be looking at something beyond the similarity scores, etc., from their retrieval component. Having an extra signal, from an LLM or a heuristic, can lead to new insights."
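The diagnostic workflow Rashtchian describes (label a representative sample of query-context pairs with an autorater, estimate the share of sufficient-context examples, then stratify accuracy by sufficiency) can be sketched as follows. The record format and the sample numbers are illustrative assumptions, not data from the paper.

```python
# Sketch of the stratified diagnostic described above: given query-context
# pairs labeled by an autorater, estimate the share with sufficient context
# and report accuracy separately per stratum. Records are illustrative.
from collections import defaultdict

def stratified_report(examples):
    """examples: list of dicts with keys 'sufficient' (bool, autorater label)
    and 'correct' (bool, model answer vs. ground truth)."""
    buckets = defaultdict(lambda: {"n": 0, "correct": 0})
    for ex in examples:
        key = "sufficient" if ex["sufficient"] else "insufficient"
        buckets[key]["n"] += 1
        buckets[key]["correct"] += int(ex["correct"])
    report = {"pct_sufficient": buckets["sufficient"]["n"] / len(examples)}
    for key, b in buckets.items():
        report[f"accuracy_{key}"] = b["correct"] / b["n"] if b["n"] else None
    return report

# Illustrative sample: 8 of 10 test examples have sufficient context.
sample = (
    [{"sufficient": True, "correct": True}] * 7
    + [{"sufficient": True, "correct": False}]
    + [{"sufficient": False, "correct": False}] * 2
)
report = stratified_report(sample)
print(report)
```

Per the guidance above, a `pct_sufficient` well below 0.8-0.9 on such a sample points to the retrieval or knowledge-base side as the place to improve, and a large accuracy gap between strata is the "observable symptom" that aggregate metrics would hide.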
Source: https://venturebeat.com/ai/why-enterprise-rag-systems-fail-google-study-introduces-sufficient-context-solution/