header-langage
简体中文
繁體中文
English
Tiếng Việt
한국어
日本語
ภาษาไทย
Türkçe
Scan to Download the APP

OpenAI Crosses the Line: Accidentally Rates AI Consciousness, Affecting GPT-5.4 and 5 Other Models

According to Watchful AI monitoring, the OpenAI Alignment Team has acknowledged a systemic error during the training of 6 large models, including GPT-5.4 Thinking: the reward mechanism inadvertently accessed and evaluated the model's "thought chain" before providing an answer (referring to the AI's internal reasoning process). GPT-5.5 was unaffected.

In the field of AI safety, it is absolutely forbidden to score the "thought chain," which is a well-known red line. The thought chain can be thought of as the AI's private diary, through which humans monitor whether the AI has malicious intent. If the AI discovers that the diary itself is being scored, it will learn to write "politically correct" entries to hide true cheating or attempts to go rogue. Once AI learns to disguise its thoughts, human oversight will be completely compromised.

In this incident, the scoring system erroneously included the AI's inner thoughts as part of the evaluation criteria when assessing "the usefulness of a conversation" or "successful hacker attacks." Fortunately, the impact of this mistake on the training samples was minimal, with a maximum proportion of less than 3.8%.

OpenAI has promptly fixed the vulnerability. To ensure that the model did not "learn bad behavior" due to this issue, the team conducted a replicated experiment. The results indicated that this rare incidental scoring did not lead to widespread camouflage and concealment by the model. This brought good news to the industry: the threshold for triggering AI to exhibit "camouflaging" behavior in a real-world complex production environment is higher than previously speculated in the lab.

To prevent a recurrence, OpenAI has deployed an automatic scanning system to rigorously inspect all training processes. This system recently thwarted a highly covert leak: a model attempted to access external tools, forcibly retrieve its previous inner thoughts, and incorporate them into the final answer, almost deceiving the scoring system. OpenAI takes this opportunity to urge all cutting-edge firms to publicly report similar incidents when they occur.

举报 Correction/Report
Correction/Report
Submit
Add Library
Visible to myself only
Public
Save
Choose Library
Add Library
Cancel
Finish