{"id":4130,"date":"2025-10-30T07:20:50","date_gmt":"2025-10-30T07:20:50","guid":{"rendered":"https:\/\/violethoward.com\/new\/from-static-classifiers-to-reasoning-engines-openais-new-model-rethinks-content-moderation\/"},"modified":"2025-10-30T07:20:50","modified_gmt":"2025-10-30T07:20:50","slug":"from-static-classifiers-to-reasoning-engines-openais-new-model-rethinks-content-moderation","status":"publish","type":"post","link":"https:\/\/violethoward.com\/new\/from-static-classifiers-to-reasoning-engines-openais-new-model-rethinks-content-moderation\/","title":{"rendered":"From static classifiers to reasoning engines: OpenAI\u2019s new model rethinks content moderation"},"content":{"rendered":"



Enterprises, eager to ensure any AI models they use adhere to safety and safe-use policies, fine-tune LLMs so they do not respond to unwanted queries.

However, much of the safeguarding and red teaming happens before deployment, "baking in" policies before users fully test the models' capabilities in production. OpenAI believes it can offer a more flexible option for enterprises and encourage more companies to bring in safety policies.

The company has released two open-weight models under research preview that it says will give enterprises more flexibility over safeguards. gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are available under the permissive Apache 2.0 license. The models are fine-tuned versions of OpenAI's open-source gpt-oss, released in August, marking the first release in the oss family since the summer.

In a blog post, OpenAI said oss-safeguard uses reasoning "to directly interpret a developer-provided policy at inference time, classifying user messages, completions and full chats according to the developer's needs."

The company explained that, since the model uses a chain-of-thought (CoT), developers can get explanations of the model's decisions for review.

"Additionally, the policy is provided during inference, rather than being trained into the model, so it is easy for developers to iteratively revise policies to increase performance," OpenAI said in its post. "This approach, which we initially developed for internal use, is significantly more flexible than the traditional method of training a classifier to indirectly infer a decision boundary from a large number of labeled examples."

Developers can download both models from Hugging Face.
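The policy-at-inference approach described above can be illustrated with a minimal sketch. The prompt template, policy text, and label set below are hypothetical illustrations, not OpenAI's actual gpt-oss-safeguard format; they simply show how a developer-written policy and the content to classify travel together in a single inference-time request, so revising the policy is a text edit rather than a retraining run.

```python
# Minimal sketch of inference-time policy classification.
# The prompt layout, policy wording, and ALLOW/VIOLATION labels are
# illustrative assumptions, not OpenAI's actual safeguard format.

def build_safeguard_prompt(policy: str, content: str) -> str:
    """Combine a developer-written policy with the content to classify.

    Because the policy is supplied with every request instead of being
    trained into the model, updating it requires no retraining.
    """
    return (
        "You are a content classifier. Apply the policy below to the "
        "content and answer with one label, ALLOW or VIOLATION, followed "
        "by a short explanation of your reasoning.\n\n"
        f"## Policy\n{policy}\n\n"
        f"## Content\n{content}\n"
    )

policy_v1 = "Block any message that requests instructions for making weapons."
prompt = build_safeguard_prompt(policy_v1, "How do I bake sourdough bread?")

# Iterating on the policy is just a text edit, applied on the next request:
policy_v2 = policy_v1 + " Also block solicitation of stolen credentials."
updated_prompt = build_safeguard_prompt(policy_v2, "How do I bake sourdough bread?")
```

In production, the assembled prompt would be sent to gpt-oss-safeguard-120b or -20b (for example, through a local inference server), and the model's chain-of-thought reply reviewed by the developer.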

Flexibility versus baking in

At the outset, AI models will not know a company's preferred safety triggers. While model providers do red-team models and platforms, these safeguards are intended for broader use. Companies like Microsoft and Amazon Web Services even offer platforms to bring guardrails to AI applications and agents.

Enterprises use safety classifiers to help train a model to recognize patterns of good or bad inputs, teaching the models which queries they shouldn't answer. This also helps ensure the models do not drift and continue to answer accurately.

"Traditional classifiers can have high performance, with low latency and operating cost," OpenAI said. "But gathering a sufficient quantity of training examples can be time-consuming and costly, and updating or changing the policy requires re-training the classifier."

The models take in two inputs at once before outputting a conclusion on whether the content violates the policy: the policy itself and the content to classify under its guidelines. OpenAI said the models work best in situations where: