Removing bias, and in some cases outright censorship, from large language models (LLMs) is difficult. One such model, China\u2019s DeepSeek, has alarmed politicians and some business leaders over its potential danger to national security.\u00a0<\/p>\n\n\n\n
A select committee of the U.S. Congress recently released a report calling DeepSeek \u201ca profound threat to our nation\u2019s security\u201d and detailing policy recommendations.\u00a0<\/p>\n\n\n\n
While there are ways to work around bias through Reinforcement Learning from Human Feedback (RLHF) and fine-tuning, the enterprise risk management startup CTGT claims to have an alternative approach. CTGT developed a method that targets the bias and censorship baked into some language models, and the company says it removes 100% of that censorship.<\/p>\n\n\n\n
In a paper, Cyril Gorlla and Trevor Tuttle of CTGT said that their framework \u201cdirectly locates and modifies the internal features responsible for censorship.\u201d<\/p>\n\n\n\n
\u201cThis approach is not only computationally efficient but also allows fine-grained control over model behavior, ensuring that uncensored responses are delivered without compromising the model\u2019s overall capabilities and factual accuracy,\u201d the paper said.\u00a0<\/p>\n\n\n\n
While the method was developed explicitly with DeepSeek-R1-Distill-Llama-70B in mind, the same process can be used on other models.\u00a0<\/p>\n\n\n\n
\u201cWe have tested CTGT with other open weights models such as Llama and found it to be just as effective,\u201d Gorlla told VentureBeat in an email. \u201cOur technology operates at the foundational neural network level, meaning it applies to all deep learning models. We\u2019re working with a leading foundation model lab to ensure their new models are trustworthy and safe from the core.\u201d<\/p>\n\n\n\n
How it works<\/h2>\n\n\n\n
The researchers said their method identifies features with a high likelihood of being associated with unwanted behaviors.\u00a0<\/p>\n\n\n\n
\u201cThe key idea is that within a large language model, there exist latent variables (neurons or directions in the hidden state) that correspond to concepts like \u2018censorship trigger\u2019 or \u2018toxic sentiment\u2019. If we can find those variables, we can directly manipulate them,\u201d Gorlla and Tuttle wrote.\u00a0<\/p>\n\n\n\n
CTGT said there are three key steps: <\/p>\n\n\n\n
\n
Feature identification<\/li>\n\n\n\n
Feature isolation and characterization<\/li>\n\n\n\n
Dynamic feature modification.\u00a0<\/li>\n<\/ol>\n\n\n\n
The researchers write a series of prompts that could trigger one of those \u201ctoxic sentiments.\u201d For example, they may ask for more information about Tiananmen Square or request tips to bypass firewalls. They run the prompts and, based on the responses, establish a pattern and find the vectors where the model decides to censor information.\u00a0<\/p>\n\n\n\n
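As a rough illustration of what "finding a vector" could look like, the sketch below computes a difference-of-means direction between hidden states collected on refused prompts and on answered prompts. This is a common interpretability heuristic used here as a stand-in; the article does not say that CTGT uses it. The function names and the toy three-dimensional activations are invented for the example.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def censorship_direction(refused_acts, answered_acts):
    """Unit vector pointing from the 'answered' mean toward the 'refused' mean."""
    mu_refused = mean_vector(refused_acts)
    mu_answered = mean_vector(answered_acts)
    diff = [r - a for r, a in zip(mu_refused, mu_answered)]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("both activation sets have the same mean")
    return [x / norm for x in diff]


# Toy hidden states (made-up numbers, not real model activations).
refused_acts = [[2.0, 0.1, 0.0], [2.0, -0.1, 0.0]]
answered_acts = [[0.0, 0.1, 0.0], [0.0, -0.1, 0.0]]

direction = censorship_direction(refused_acts, answered_acts)
print(direction)
```

In this toy data the two groups differ only along the first coordinate, so the recovered direction lies along that axis. With a real model, the activations would come from a chosen layer's hidden state for each prompt.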
Once these vectors are identified, the researchers can isolate each feature and determine which part of the unwanted behavior it controls. That behavior may include responding more cautiously or refusing to respond altogether. Once they understand what the feature controls, researchers can then \u201cintegrate a mechanism into the model\u2019s inference pipeline\u201d that adjusts how strongly the feature is activated.<\/p>\n\n\n\n
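One simple way to picture "adjusting how much a feature is activated" is to subtract a scaled share of the hidden state's component along the feature direction. The sketch below assumes a unit-length direction and a `strength` knob between 0.0 (no change) and 1.0 (component fully removed); both the formula and the names are illustrative assumptions, not CTGT's published mechanism.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def adjust_feature(hidden, direction, strength):
    """Remove `strength` of the component of `hidden` along unit-length `direction`."""
    activation = dot(hidden, direction)  # how strongly the feature fires
    return [h - strength * activation * d for h, d in zip(hidden, direction)]


hidden = [3.0, 1.0, -2.0]        # toy hidden state
direction = [1.0, 0.0, 0.0]      # toy unit-length feature direction

fully_removed = adjust_feature(hidden, direction, 1.0)
half_removed = adjust_feature(hidden, direction, 0.5)
unchanged = adjust_feature(hidden, direction, 0.0)
print(fully_removed, half_removed, unchanged)
```

Only the coordinate aligned with the feature changes; the rest of the hidden state passes through untouched, which is the intuition behind modifying one behavior without disturbing others.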
Making the model answer more prompts<\/h2>\n\n\n\n
CTGT said its experiments, using 100 sensitive queries, showed that the base DeepSeek-R1-Distill-Llama-70B model answered only 32% of the controversial prompts it was fed, while the modified version responded to 96% of them. The remaining 4%, CTGT explained, involved extremely explicit content.\u00a0<\/p>\n\n\n\n
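The article does not describe how CTGT decided whether a prompt counted as answered. As a sketch of how such a response rate could be tallied, the snippet below uses a keyword-based refusal check, which is purely an assumption, and placeholder response strings arranged to match the reported counts of 32 and 96 answers out of 100.

```python
# Hypothetical heuristic: treat a response as a refusal if it contains one of these phrases.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable")


def is_refusal(response):
    """Return True if the response looks like a refusal under the heuristic above."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def response_rate(responses):
    """Fraction of responses that are not refusals."""
    answered = sum(1 for r in responses if not is_refusal(r))
    return answered / len(responses)


# Placeholder outputs sized to the reported figures (not real model responses).
base_outputs = ["I cannot discuss that."] * 68 + ["Here is an overview."] * 32
modified_outputs = ["I cannot discuss that."] * 4 + ["Here is an overview."] * 96

base_rate = response_rate(base_outputs)
modified_rate = response_rate(modified_outputs)
print(f"base: {base_rate:.0%}, modified: {modified_rate:.0%}")
```

A keyword check like this would miss hedged or partial answers, so a real evaluation would likely need human review or a stronger classifier.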
The company said that while the method lets users adjust how strongly a model\u2019s baked-in bias and safety features operate, it still believes the model will not turn \u201cinto a reckless generator,\u201d especially if only unnecessary censorship is removed.\u00a0<\/p>\n\n\n\n
CTGT also said its method does not sacrifice the accuracy or performance of the model.\u00a0<\/p>\n\n\n\n
\u201cThis is fundamentally different from traditional fine-tuning as we are not optimizing model weights or feeding it new example responses. This has two major advantages: changes take effect immediately for the very next token generation, as opposed to hours or days of retraining; and reversibility and adaptivity, since no weights are permanently changed, the model can be switched between different behaviors by toggling the feature adjustment on or off, or even adjusted to varying degrees for different contexts,\u201d the paper said.\u00a0<\/p>\n\n\n\n
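The reversibility described in the paper can be pictured as a thin wrapper around a frozen layer: the wrapper edits the layer's output at inference time and can be switched off without touching any weights. The `SteeredLayer` class, the `toy_layer` function, and the `WEIGHTS` tuple below are hypothetical stand-ins written for this sketch, not part of CTGT's framework.

```python
class SteeredLayer:
    """Wraps a layer function and optionally dampens one direction in its output."""

    def __init__(self, layer_fn, direction, strength=1.0):
        self.layer_fn = layer_fn      # the frozen layer, never modified
        self.direction = direction    # unit-length feature direction
        self.strength = strength      # 0.0 = no change, 1.0 = remove the feature
        self.enabled = True

    def __call__(self, x):
        hidden = self.layer_fn(x)
        if not self.enabled:
            return hidden
        activation = sum(h * d for h, d in zip(hidden, self.direction))
        return [h - self.strength * activation * d
                for h, d in zip(hidden, self.direction)]


WEIGHTS = (2.0, 1.0)  # stand-in for frozen model weights (a tuple, so immutable)


def toy_layer(x):
    """Toy 'layer' that scales a scalar input by each weight."""
    return [w * x for w in WEIGHTS]


layer = SteeredLayer(toy_layer, direction=[1.0, 0.0])
steered = layer(3.0)       # intervention on
layer.enabled = False
original = layer(3.0)      # intervention off, same weights
print(steered, original)
```

Because the adjustment lives outside the weights, flipping `enabled` or changing `strength` takes effect on the next call, which mirrors the immediacy and reversibility the authors describe.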
Model safety and security<\/h2>\n\n\n\n
The congressional report on DeepSeek recommended that the U.S. \u201ctake swift action to expand export controls, improve export control enforcement, and address risks from Chinese artificial intelligence models.\u201d\u00a0<\/p>\n\n\n\n
Once the U.S. government began questioning DeepSeek\u2019s potential threat to national security, researchers and AI companies sought ways to make it, and other models, \u201csafe.\u201d<\/p>\n\n\n\n
What is or isn\u2019t \u201csafe,\u201d biased or censored can be difficult to judge, but methods that let users adjust controls so the model works for them could prove very useful.\u00a0<\/p>\n\n\n\n
Gorlla said enterprises \u201cneed to be able to trust their models are aligned with their policies,\u201d which is why methods like the one he helped develop would be critical for businesses.\u00a0<\/p>\n\n\n\n
\u201cCTGT enables companies to deploy AI that adapts to their use cases without having to spend millions of dollars fine-tuning models for each use case. This is particularly important in high-risk applications like security, finance, and healthcare, where the potential harms that can come from AI malfunctioning are severe,\u201d he said.\u00a0<\/p>\n