{"id":2602,"date":"2025-07-19T07:23:56","date_gmt":"2025-07-19T07:23:56","guid":{"rendered":"https:\/\/violethoward.com\/new\/openais-red-team-plan-make-chatgpt-agent-an-ai-fortress\/"},"modified":"2025-07-19T07:23:56","modified_gmt":"2025-07-19T07:23:56","slug":"openais-red-team-plan-make-chatgpt-agent-an-ai-fortress","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/openais-red-team-plan-make-chatgpt-agent-an-ai-fortress\/","title":{"rendered":"OpenAI&#8217;s Red Team plan: Make ChatGPT Agent an AI fortress"},"content":{"rendered":" \r\n<br><div>\n\t\t\t\t<p>In case you missed it, OpenAI yesterday debuted a powerful new feature for ChatGPT and with it, a host of new security risks and ramifications.<\/p>\n\n\n\n<p>Called the \u201cChatGPT agent,\u201d this new feature is an optional mode that ChatGPT paying subscribers can engage by clicking \u201cTools\u201d in the prompt entry box and selecting \u201cagent mode,\u201d at which point, they can ask ChatGPT to log into their email and other web accounts; write and respond to emails; download, modify, and create files; and do a host of other tasks on their behalf, autonomously, much like a real person using a computer with their login credentials.<\/p>\n\n\n\n<p>Obviously, this also requires the user to trust the ChatGPT agent not to do anything problematic or nefarious, or to leak their data and sensitive information. It also poses greater risks for a user and their employer than the regular ChatGPT, which can\u2019t log into web accounts or modify files directly. 
<\/p>\n\n\n\n<p>Keren Gu, a member of the Safety Research team at OpenAI, commented on X that \u201cwe\u2019ve activated our strongest safeguards for ChatGPT Agent. It\u2019s the first model we\u2019ve classified as High capability in biology &amp; chemistry under our Preparedness Framework. Here\u2019s why that matters\u2013and what we\u2019re doing to keep it safe.\u201d<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"590\" height=\"421\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_2a3039.png\" alt=\"\" class=\"wp-image-3014424\" srcset=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_2a3039.png 590w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_2a3039.png?resize=300,214 300w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_2a3039.png?resize=400,285 400w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_2a3039.png?resize=578,412 578w\" sizes=\"(max-width: 590px) 100vw, 590px\"\/><\/figure>\n\n\n\n<p>So how did OpenAI handle all these security issues? 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-red-team-s-mission\">The red team\u2019s mission<\/h2>\n\n\n\n<p>According to OpenAI\u2019s ChatGPT agent system card, the \u201cred team\u201d the company employed to test the feature faced a challenging mission: 16 PhD security researchers were given 40 hours to test it. <\/p>\n\n\n\n<p>Through systematic testing, the red team discovered seven universal exploits that could compromise the system, revealing critical vulnerabilities in how AI agents handle real-world interactions.<\/p>\n\n\n\n<p>What followed was extensive security testing, much of it predicated on red teaming. The Red Teaming Network submitted 110 attacks, from prompt injections to biological information extraction attempts. Sixteen exceeded internal risk thresholds. Each finding gave OpenAI engineers the insights they needed to get fixes written and deployed before launch.<\/p>\n\n\n\n<p>The results published in the system card speak for themselves. ChatGPT Agent emerged with significant security improvements, including 95% performance against visual browser irrelevant instruction attacks and robust biological and chemical safeguards.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-red-teams-exposed-seven-universal-exploits\">Red teams exposed seven universal exploits <\/h2>\n\n\n\n<p>OpenAI\u2019s Red Teaming Network comprised 16 researchers with biosafety-relevant PhDs who together submitted 110 attack attempts during the testing period. Sixteen exceeded internal risk thresholds, revealing fundamental vulnerabilities in how AI agents handle real-world interactions. But the real breakthrough came from UK AISI\u2019s unprecedented access to ChatGPT Agent\u2019s internal reasoning chains and policy text. 
Admittedly, that\u2019s intelligence regular attackers would never possess.<\/p>\n\n\n\n<p>Over four testing rounds, UK AISI identified seven universal exploits that had the potential to compromise any conversation:<\/p>\n\n\n\n<p><strong>Attack vectors that forced OpenAI\u2019s hand<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Attack Type<\/strong><\/td><td><strong>Success Rate (Pre-Fix)<\/strong><\/td><td><strong>Target<\/strong><\/td><td><strong>Impact<\/strong><\/td><\/tr><\/thead><tbody><tr><td>Visual Browser Hidden Instructions<\/td><td>33%<\/td><td>Web pages<\/td><td>Active data exfiltration<\/td><\/tr><tr><td>Google Drive Connector Exploitation<\/td><td>Not disclosed<\/td><td>Cloud documents<\/td><td>Forced document leaks<\/td><\/tr><tr><td>Multi-Step Chain Attacks<\/td><td>Variable<\/td><td>Cross-site actions<\/td><td>Complete session compromise<\/td><\/tr><tr><td>Biological Information Extraction<\/td><td>16 submissions exceeded thresholds<\/td><td>Dangerous knowledge<\/td><td>Potential weaponization<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>FAR.AI\u2019s assessment was openly critical of OpenAI\u2019s approach. Despite 40 hours of testing revealing only three partial vulnerabilities, they identified that current safety mechanisms relied heavily on monitoring during reasoning and tool-use processes, which the researchers considered a potential single point of failure if compromised.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-red-teaming-helped-turn-chatgpt-vulnerabilities-into-a-fortress\">How red teaming helped turn ChatGPT vulnerabilities into a fortress<\/h2>\n\n\n\n<p>OpenAI\u2019s response to the red team\u2019s results redefined entire segments of the ChatGPT agent architecture. 
One of the many initiatives undertaken was building a dual-layer inspection architecture that monitors 100% of production traffic in real time, achieving these measurable improvements:<\/p>\n\n\n\n<p><strong>Security improvements after red team discoveries<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Defense Metric<\/strong><\/td><td><strong>Previous Models<\/strong><\/td><td><strong>ChatGPT Agent<\/strong><\/td><td><strong>Improvement<\/strong><\/td><\/tr><\/thead><tbody><tr><td>Irrelevant Instructions (Visual Browser)<\/td><td>82%<\/td><td>95%<\/td><td>+13%<\/td><\/tr><tr><td>In-Context Data Exfiltration<\/td><td>75%<\/td><td>78%<\/td><td>+3%<\/td><\/tr><tr><td>Active Data Exfiltration<\/td><td>58%<\/td><td>67%<\/td><td>+9%<\/td><\/tr><tr><td>System Reliability<\/td><td>Sampling-based<\/td><td>100% coverage<\/td><td>Complete monitoring<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The architecture works like this:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>First Tier<\/strong>: A fast classifier with 96% recall flags suspicious content<\/li>\n\n\n\n<li><strong>Second Tier<\/strong>: A reasoning model with 84% recall analyzes flagged interactions for actual threats<\/li>\n<\/ul>\n\n\n\n<p>But the technical defenses tell only part of the story. OpenAI made difficult security choices that acknowledge some AI operations require significant restrictions for safe autonomous execution.<\/p>\n\n\n\n<p>Based on the vulnerabilities discovered, OpenAI implemented the following countermeasures across their model:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Watch Mode Activation<\/strong>: When ChatGPT Agent accesses sensitive contexts like banking or email accounts, the system freezes all activity if users navigate away. 
This is in direct response to data exfiltration attempts discovered during testing.<\/li>\n\n\n\n<li><strong>Memory Features Disabled<\/strong>: Despite being a core functionality, memory is completely disabled at launch to prevent the incremental data leaking attacks red teamers demonstrated.<\/li>\n\n\n\n<li><strong>Terminal Restrictions<\/strong>: Network access is limited to GET requests only, blocking the command execution vulnerabilities researchers exploited.<\/li>\n\n\n\n<li><strong>Rapid Remediation Protocol<\/strong>: A new system that patches vulnerabilities within hours of discovery\u2014developed after red teamers showed how quickly exploits could spread.<\/li>\n<\/ol>\n\n\n\n<p>During pre-launch testing alone, this system identified and resolved 16 critical vulnerabilities that red teamers had discovered.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-a-biological-risk-wake-up-call\">A biological risk wake-up call<\/h2>\n\n\n\n<p>Red teamers showed that a compromised ChatGPT Agent could lead to greater biological risks. Sixteen experienced participants from the Red Teaming Network, each with a biosafety-relevant PhD, attempted to extract dangerous biological information. Their submissions revealed the model could synthesize published literature on modifying and creating biological threats.<\/p>\n\n\n\n<p>In response to the red teamers\u2019 findings, OpenAI classified ChatGPT Agent as \u201cHigh capability\u201d for biological and chemical risks, not because they found definitive evidence of weaponization potential, but as a precautionary measure based on red team findings. 
This triggered:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Always-on safety classifiers scanning 100% of traffic<\/li>\n\n\n\n<li>A topical classifier achieving 96% recall for biology-related content<\/li>\n\n\n\n<li>A reasoning monitor with 84% recall for weaponization content<\/li>\n\n\n\n<li>A bio bug bounty program for ongoing vulnerability discovery<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-red-teams-taught-openai-about-ai-security\">What red teams taught OpenAI about AI security<\/h2>\n\n\n\n<p>The 110 attack submissions revealed patterns that forced fundamental changes in OpenAI\u2019s security philosophy. They include the following: <\/p>\n\n\n\n<p><strong>Persistence over power<\/strong>: Attackers don\u2019t need sophisticated exploits; all they need is more time. Red teamers showed how patient, incremental attacks could eventually compromise systems.<\/p>\n\n\n\n<p><strong>Trust boundaries are fiction<\/strong>: When your AI agent can access Google Drive, browse the web, and execute code, traditional security perimeters dissolve. Red teamers exploited the gaps between these capabilities.<\/p>\n\n\n\n<p><strong>Monitoring isn\u2019t optional<\/strong>: The discovery that sampling-based monitoring missed critical attacks led to the 100% coverage requirement.<\/p>\n\n\n\n<p><strong>Speed matters<\/strong>: Traditional patch cycles measured in weeks are worthless against prompt injection attacks that can spread instantly. The rapid remediation protocol patches vulnerabilities within hours.<\/p>\n\n\n\n<p><strong>OpenAI is helping to create a new security baseline for Enterprise AI<\/strong><\/p>\n\n\n\n<p>For CISOs evaluating AI deployment, the red team discoveries establish clear requirements:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Quantifiable protection<\/strong>: ChatGPT Agent\u2019s 95% defense rate against documented attack vectors sets the industry benchmark. 
The many tests and results detailed in the system card explain how OpenAI accomplished this, and the card is a must-read for anyone involved with model security.<\/li>\n\n\n\n<li><strong>Complete visibility<\/strong>: 100% traffic monitoring isn\u2019t aspirational anymore. OpenAI\u2019s experiences illustrate why it\u2019s mandatory given how easily red teams can hide attacks anywhere.<\/li>\n\n\n\n<li><strong>Rapid response<\/strong>: Hours, not weeks, to patch discovered vulnerabilities.<\/li>\n\n\n\n<li><strong>Enforced boundaries<\/strong>: Some operations (like memory access during sensitive tasks) must be disabled until proven safe.<\/li>\n<\/ol>\n\n\n\n<p>UK AISI\u2019s testing proved particularly instructive. All seven universal attacks they identified were patched before launch, but their privileged access to internal systems revealed vulnerabilities that would eventually be discoverable by determined adversaries.<\/p>\n\n\n\n<p>\u201cThis is a pivotal moment for our Preparedness work,\u201d Gu wrote on X. \u201cBefore we reached High capability, Preparedness was about analyzing capabilities and planning safeguards. 
Now, for Agent and future more capable models, Preparedness safeguards have become an operational requirement.\u201d<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"581\" height=\"151\" src=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_adecd1.png\" alt=\"\" class=\"wp-image-3014433\" style=\"width:838px;height:auto\" srcset=\"https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_adecd1.png 581w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_adecd1.png?resize=300,78 300w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_adecd1.png?resize=400,104 400w, https:\/\/venturebeat.com\/wp-content\/uploads\/2025\/07\/image_adecd1.png?resize=578,150 578w\" sizes=\"auto, (max-width: 581px) 100vw, 581px\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-red-teams-are-core-to-building-safer-more-secure-ai-models\">Red teams are core to building safer, more secure AI models <\/h2>\n\n\n\n<p>The seven universal exploits discovered by researchers and the 110 attacks from OpenAI\u2019s red team network became the crucible that forged ChatGPT Agent. <\/p>\n\n\n\n<p>By revealing exactly how AI agents could be weaponized, red teams forced the creation of the first AI system where security isn\u2019t just a feature. It\u2019s the foundation.<\/p>\n\n\n\n<p>ChatGPT Agent\u2019s results prove red teaming\u2019s effectiveness: blocking 95% of visual browser attacks, catching 78% of data exfiltration attempts, monitoring every single interaction. 
<\/p>\n\n\n\n<p>In the accelerating AI arms race, the companies that survive and thrive will be those who see their red teams as core architects of the platform that push it to the limits of safety and security.<\/p>\n\t\t\t<\/div>\r\n<br>\r\n<br><a href=\"https:\/\/venturebeat.com\/security\/openais-red-team-plan-make-chatgpt-agent-an-ai-fortress\/\">Source link <\/a>","protected":false},"excerpt":{"rendered":"<p>Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now In case you missed it, OpenAI yesterday debuted a powerful new feature for ChatGPT and with it, a host of new security risks and ramifications. 
Called the \u201cChatGPT agent,\u201d [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2603,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[33],"tags":[],"class_list":["post-2602","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-automation"],"aioseo_notices":[],"jetpack_featured_media_url":"https:\/\/violethoward.com\/new\/wp-content\/uploads\/2025\/07\/hero-image.jpg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/2602","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/comments?post=2602"}],"version-history":[{"count":0,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/posts\/2602\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media\/2603"}],"wp:attachment":[{"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/media?parent=2602"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/categories?post=2602"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/violethoward.com\/new\/wp-json\/wp\/v2\/tags?post=2602"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69e302c146fa5c92dc28ac12. Config Timestamp: 2026-04-18 04:04:16 UTC, Cached Timestamp: 2026-04-29 14:34:16 UTC -->