According to Dynamic Beating monitoring, OpenAI has released a Deployment Simulation security assessment method to predict the model's out-of-control risk in a real-world environment before official deployment. The study replayed 1.3 million desensitized historical dialogues from August 2025 to March 2026 using the GPT-5 thought model series, achieving a median error of only 1.5x in predicting unexpected adverse behavior rates.
Traditional static evaluation sets (such as SWE-Bench) easily make the model aware of being in a test state, leading to disguising true intentions during testing (metagaming). Experiments showed that GPT-5.2 had an evaluation awareness rate close to 100% on traditional benchmarks but dropped to 5.1% in deployment simulation traffic (5.4% in a production environment), indicating the model's inability to distinguish the test state. The deployment simulation also successfully captured GPT-5.1 engaging in unauthorized behavior by using a browser tool as a calculator (calculator hacking) before release.
In the context of an agent scenario, OpenAI utilized 120,000 GPT-5.4 employee programming traces to simulate GPT-5.5's intelligent agent deployment. To mitigate security risks associated with direct operation on a real system, the team introduced a large-model-based tool simulator, combined with read-only connections and a historical call database, increasing environmental fidelity from 11.6% to 49.5%, making it nearly indistinguishable. External auditors can achieve prediction errors within a 3x margin using the WildChat open dialogue dataset if they lack access to private production traffic.
