Humanity dares to jailbreak its new AI model

Even the most tolerant corporate AI models have sensitive topics that creators prefer not to discuss (e.g., weapons of mass destruction, illegal conduct, or Chinese political history). For years, enterprising AI users have resorted to everything from odd text strings to ASCII art to stories about dead grandmas, breaking away from those models and having “banned” results 。

Today, Claude model maker humanity has released a new system of constitutional classifiers that say they can “filter the overwhelming majority” of these types of prison breaks. And as the system maintains a bug-money attack of up to 3,000 hours, humanity invites more public people to test the system and whether it can break its own rules Is confirmed.

Respect the Constitution

In a new paper and accompanying blog post, Anthropic says that its new constitutional classifier system is spun from similar constitutional AI systems used to build its Claude model. The system relies on the “constitution” of natural language rules that define a broad range of permitted categories of models (a list of general drugs) and the unauthorized (e.g., acquisition of restricted chemicals). 。

From there, humanity asks Claude to generate a number of synthetic prompts that lead to both acceptance and unacceptable reactions under its constitution. These prompts are translated into multiple languages and changed in the style of “known jailbreak” and then fixed with the “auto-red teaming” prompt, which attempts to create a new new jailbreak attack.

This creates a robust set of training data that can be used to fine-tune new, more jailbreak resistant “classifiers” for both user input and model output. On the input side, these classifiers enclose each query with a set of templates that explain in detail what harmful information should be noted, and how the user attempts to obfuscate or encode requests for that information 。

An example of a long wrapper that the new Claude classifier uses to detect prompts related to chemical weapons.

Credit: Humanity

“For example, harmful information can be used to fill harmful requests in walls of harmlessly-looking content, disguise harmful requests in fictional role-playing, or use obvious alternatives. It could be hidden,” such a wrapper reads in part.

Source link

What's Hot

Young Sherlock | Prime Video

U.S. bank executives say AI will improve productivity and reduce jobs

Parents and researchers say character AI forces dangerous content on children | 60 minutes

Humanity dares to jailbreak its new AI model

U.S. bank executives say AI will improve productivity and reduce jobs

Parents and researchers say character AI forces dangerous content on children | 60 minutes

AI trackers: AI agents open the door to new hacking threats

20 Most Anticipated Sex Movies of 2025

President Trump’s SEC nominee Paul Atkins marries multi-billion dollar roof fortune

How to tell the difference between fake and genuine Adidas Sambas

Alice Munro’s Passive Voice | New Yorker

U.S. bank executives say AI will improve productivity and reduce jobs

Parents and researchers say character AI forces dangerous content on children | 60 minutes

AI trackers: AI agents open the door to new hacking threats

FACTS IN : FACTS OUT – Join the call for truth in AI at the global stand for trusted news

Our Picks

Young Sherlock | Prime Video

U.S. bank executives say AI will improve productivity and reduce jobs

Parents and researchers say character AI forces dangerous content on children | 60 minutes

Most Popular

10 things you should never say to an AI chatbot

Analyst warns Salesforce investors about AI agent optimism

Musk says the Xai’s Grok 3 chatbot will be released on Monday

Subscribe to Updates

What's Hot

Humanity dares to jailbreak its new AI model

Respect the Constitution

Related Posts