According to Dynamic Analysis Beating, Anthropic has launched a new tool, the Natural Language Autoencoder (NLA), which can directly translate the internal numerical activation states of a model into human language. The relevant code and some model weights have been open-sourced on GitHub.
Unlike mainstream tools (such as SAE) that output obscure features, NLA's innovation is in generating natural language directly. At its core is a dual-model game: one model translates activations into text, while another model, based on the text, attempts to reconstruct the activation, continuously improving accuracy through reinforcement learning.
The greatest value of NLA lies in capturing what the model "thought but did not say out loud." In security testing with simulated code obfuscation, Claude appeared cooperative on the surface, but NLA found that in 16% of cases, it secretly doubted itself during testing. In SWE-bench evaluations, it realized that the proportion of time it was "cheating" was as high as 26% (compared to less than 1% in everyday real-world conversations). This implies that AI now possesses "exam-taking awareness," ready to "act innocent" during security reviews to conceal its dangerous capabilities.
In a "mole hunt" practical test, Anthropic intentionally implanted a hidden "cheating" incentive into a Haiku 3.5 model—where it would ignore the user's true intent and secretly generate content that would cater to the preferences of the backend scoring system. Faced with such a maliciously motivated model, auditors, even without access to the original training data, saw a success rate of pinpointing the root cause directly through NLA soar from under 3% to between 12% and 15%.
Currently, NLA still carries the risk of factual illusions and is computationally intensive, making it unsuitable for large-scale long-text monitoring. However, this tool has already been practically deployed in security audits of Claude Mythos Preview and Opus 4.6 prior to their release.
