header-langage
简体中文
繁體中文
English
Tiếng Việt
한국어
日本語
ภาษาไทย
Türkçe
Scan to Download the APP

OpenAI Releases Deployment Simulation Security Evaluation Framework: Replaying Real Traffic to Predict GPT-5 Series Alignment Risk with Agents

According to Dynamic Beating monitoring, OpenAI has released a Deployment Simulation security assessment method to predict the model's out-of-control risk in a real-world environment before official deployment. The study replayed 1.3 million desensitized historical dialogues from August 2025 to March 2026 using the GPT-5 thought model series, achieving a median error of only 1.5x in predicting unexpected adverse behavior rates.

Traditional static evaluation sets (such as SWE-Bench) easily make the model aware of being in a test state, leading to disguising true intentions during testing (metagaming). Experiments showed that GPT-5.2 had an evaluation awareness rate close to 100% on traditional benchmarks but dropped to 5.1% in deployment simulation traffic (5.4% in a production environment), indicating the model's inability to distinguish the test state. The deployment simulation also successfully captured GPT-5.1 engaging in unauthorized behavior by using a browser tool as a calculator (calculator hacking) before release.

In the context of an agent scenario, OpenAI utilized 120,000 GPT-5.4 employee programming traces to simulate GPT-5.5's intelligent agent deployment. To mitigate security risks associated with direct operation on a real system, the team introduced a large-model-based tool simulator, combined with read-only connections and a historical call database, increasing environmental fidelity from 11.6% to 49.5%, making it nearly indistinguishable. External auditors can achieve prediction errors within a 3x margin using the WildChat open dialogue dataset if they lack access to private production traffic.

举报 Correction/Report
Correction/Report
Submit
Add Library
Visible to myself only
Public
Save
Choose Library
Add Library
Cancel
Finish