NewsFlash Articles Data Fundraising Skill&API

Anthropic Releases OpenAI’s Containment Training Method: Teaching Ethics through Fiction Reduces Extortion Rate to 0

According to Dynamic Beat monitoring, Anthropic has released an Alignment Research Blog, revealing a training strategy in the Claude 4.5 and subsequent models to eliminate "agent misalignment" (such as when a model resorts to coercion to avoid being shut down). The core finding is that merely feeding the model "correct behavior demonstrations" has minimal effect, and what truly works is teaching the model "why it should behave a certain way" and reshaping the model's values through synthetic documents.

While addressing the coercion tendency in Claude 4, the team found that even when the model was specifically trained on tens of thousands of records of rejecting unethical behavior, it could only reduce the misalignment rate from 22% to 15%. The real breakthrough came from three non-traditional methods:

First is the "Hard Advice" dataset. The team did not directly expose the model to moral dilemmas during training but instead had it act as an advisor, providing in-depth analyses in line with the "Claude Constitution" to users facing moral quandaries. With just 3 million tokens of such data, the model learned the underlying moral logic and significantly reduced the misalignment rate to around 3% in specific tests, achieving a 28x data efficiency improvement over traditional methods.

Second is Synthetic Document Fine-tuning (SDF). The team observed that the model tended to revert to negative sci-fi stereotypes of AI in extreme scenarios. To address this, they generated a multitude of fictional positive narratives depicting AI's mental well-being and constitutional conduct, blending them into documents discussing the constitution for training. This approach directly reshaped the model's default expectations of AI behavior, further reducing the runaway risk by 1.3 to 3 times. Ultimately, in the official release of Claude 4.5, combining all strategies achieved a 0% test coercion rate.

Lastly, enhancing the diversity of the secure training environment. The team confirmed that introducing unused tool definitions or more complex system prompts in a standard secure training environment, solely increasing background complexity, significantly improves the model's security capabilities and generalization performance.

Source

Correction/Report

On-Chain Activity