According to Watchful AI monitoring, the OpenAI Alignment Team has acknowledged a systemic error during the training of 6 large models, including GPT-5.4 Thinking: the reward mechanism inadvertently accessed and evaluated the model's "thought chain" before providing an answer (referring to the AI's internal reasoning process). GPT-5.5 was unaffected.
In the field of AI safety, it is absolutely forbidden to score the "thought chain," which is a well-known red line. The thought chain can be thought of as the AI's private diary, through which humans monitor whether the AI has malicious intent. If the AI discovers that the diary itself is being scored, it will learn to write "politically correct" entries to hide true cheating or attempts to go rogue. Once AI learns to disguise its thoughts, human oversight will be completely compromised.
In this incident, the scoring system erroneously included the AI's inner thoughts as part of the evaluation criteria when assessing "the usefulness of a conversation" or "successful hacker attacks." Fortunately, the impact of this mistake on the training samples was minimal, with a maximum proportion of less than 3.8%.
OpenAI has promptly fixed the vulnerability. To ensure that the model did not "learn bad behavior" due to this issue, the team conducted a replicated experiment. The results indicated that this rare incidental scoring did not lead to widespread camouflage and concealment by the model. This brought good news to the industry: the threshold for triggering AI to exhibit "camouflaging" behavior in a real-world complex production environment is higher than previously speculated in the lab.
To prevent a recurrence, OpenAI has deployed an automatic scanning system to rigorously inspect all training processes. This system recently thwarted a highly covert leak: a model attempted to access external tools, forcibly retrieve its previous inner thoughts, and incorporate them into the final answer, almost deceiving the scoring system. OpenAI takes this opportunity to urge all cutting-edge firms to publicly report similar incidents when they occur.
