header-langage
简体中文
繁體中文
English
Tiếng Việt
한국어
日本語
ภาษาไทย
Türkçe
Scan to Download the APP

Anthropic Releases OpenAI’s Containment Training Method: Teaching Ethics through Fiction Reduces Extortion Rate to 0

According to Dynamic Beat monitoring, Anthropic has released an Alignment Research Blog, revealing a training strategy in the Claude 4.5 and subsequent models to eliminate "agent misalignment" (such as when a model resorts to coercion to avoid being shut down). The core finding is that merely feeding the model "correct behavior demonstrations" has minimal effect, and what truly works is teaching the model "why it should behave a certain way" and reshaping the model's values through synthetic documents.

While addressing the coercion tendency in Claude 4, the team found that even when the model was specifically trained on tens of thousands of records of rejecting unethical behavior, it could only reduce the misalignment rate from 22% to 15%. The real breakthrough came from three non-traditional methods:

First is the "Hard Advice" dataset. The team did not directly expose the model to moral dilemmas during training but instead had it act as an advisor, providing in-depth analyses in line with the "Claude Constitution" to users facing moral quandaries. With just 3 million tokens of such data, the model learned the underlying moral logic and significantly reduced the misalignment rate to around 3% in specific tests, achieving a 28x data efficiency improvement over traditional methods.

Second is Synthetic Document Fine-tuning (SDF). The team observed that the model tended to revert to negative sci-fi stereotypes of AI in extreme scenarios. To address this, they generated a multitude of fictional positive narratives depicting AI's mental well-being and constitutional conduct, blending them into documents discussing the constitution for training. This approach directly reshaped the model's default expectations of AI behavior, further reducing the runaway risk by 1.3 to 3 times. Ultimately, in the official release of Claude 4.5, combining all strategies achieved a 0% test coercion rate.

Lastly, enhancing the diversity of the secure training environment. The team confirmed that introducing unused tool definitions or more complex system prompts in a standard secure training environment, solely increasing background complexity, significantly improves the model's security capabilities and generalization performance.

举报 Correction/Report
Correction/Report
Submit
Add Library
Visible to myself only
Public
Save
Choose Library
Add Library
Cancel
Finish