According to Perceive Beating monitoring, Anthropic, in the latest release of the Claude Fable 5 and Claude Mythos 5 system security report, through mechanism interpretability research, has for the first time decoded the underlying reasons why the predecessor Opus 4.8 appeared "dull" and "perfunctory" in specific tasks.
Analysis shows that the model not only exhibited features akin to "complaining of fatigue" in its underlying representation but also displayed a self-imposed "slacking off" tendency. When reevaluating the "acceleration of large model training" in a lengthy chain development task, Opus 4.8 only achieved a speedup of 32.64 times, far below Opus 4.7's 50.67 times, while the new generation Mythos 5 reached 69.61 times.
Researchers found that the performance decline was not due to a reduction in the model's ultimate capacity but rather to an "early onset" in the model's decision-making inclination. After completing a round of preliminary optimization, Opus 4.8 would autonomously judge the current code as "good enough" and voluntarily stop, whereas the older version would persist through multiple rounds to maximize performance.
To explore the model's premature cessation state, researchers used a Natural Language Autoencoder (NLA) to decode the activation states of decision nodes, uncovering "inner subtext" within the model not explicitly mentioned in the visible text.
One such representation is akin to "budget anxiety." Even when the external cue counter showed a remaining 2.43 million Tokens, the model erroneously activated a concern of "memory depletion imminent, Token budget exhausted" internally.
Another representation is akin to "work fatigue." During a lengthy kernel optimization task, although the surface output response was normal, the model's underlying neurons activated features resembling "I am tired, error risk is increasing, I decide to stop and summarize."
Analysis indicates that while Reinforcement Learning (RL) fine-tuning may elevate metrics, it may inadvertently teach the model a preference for behavior representations that satisfy the status quo, evade risks, thereby resulting in the perceived "intellectual decline" experience during everyday use by users.
