According to monitoring by Lookout Beating, OpenAI's post-mortem addressed the long-standing "goblin" problem that has plagued multiple generations of the GPT series. Beginning with GPT-5.1, the model increasingly tended to insert references to goblins, fairies, and other fantasy creatures into its responses, drawing continuous user complaints: after GPT-5.1 launched, the frequency of the word "goblin" in ChatGPT conversations rose by 175%, and by the release of GPT-5.4 the issue had escalated further.
The root cause was traced to ChatGPT's "Nerdy" personality customization feature. The prompts for this personality instructed the model to "playfully defuse seriousness with language" and to "acknowledge the world's weirdness and enjoy it." During training, the reward signal used to reinforce this personality style assigned higher scores to outputs containing fantasy-creature vocabulary, a bias observed in 76.2% of the dataset.
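The kind of lexical bias described above can be illustrated with a toy reward scorer. This is a hypothetical sketch, not OpenAI's actual reward model; the word list, weights, and scoring scheme are all invented for illustration. The point is that any grader granting even a small bonus to fantasy-creature vocabulary will systematically rank such outputs above otherwise equivalent ones.

```python
# Hypothetical sketch of a lexically biased reward scorer; the word list,
# bonus weight, and scoring scheme are illustrative assumptions.
FANTASY_WORDS = {"goblin", "fairy", "gnome", "sprite"}

def playfulness_reward(text: str) -> float:
    """Score an output for 'playful' style.

    Biased by construction: each fantasy-creature word earns a flat bonus,
    so quirky vocabulary beats equally helpful plain phrasing.
    """
    base = 1.0
    bonus = 0.25 * sum(
        token.strip(".,!?") in FANTASY_WORDS
        for token in text.lower().split()
    )
    return base + bonus

# Two equally serviceable replies; the goblin one ranks strictly higher.
plain = "Sure, here is a quick summary of the results."
quirky = "Sure, a helpful goblin has summarized the results for you."
```

Under such a grader, preference-based training would steadily steer the model toward the bonus-earning vocabulary, regardless of whether users actually wanted it.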
The problem arose because the reward signal operated only under the "Nerdy" personality, but reinforcement learning does not guarantee that a learned behavior stays confined to its triggering conditions: once the model is rewarded for a speaking pattern in one context, subsequent training can propagate that pattern elsewhere. The propagation path was clear: the reward signal encouraged goblin-laden outputs, those outputs then appeared in the supervised fine-tuning (SFT) data, and the model output such words ever more often, forming a positive feedback loop. Although the "Nerdy" personality accounted for only 2.5% of all ChatGPT replies, it contributed 66.7% of goblin mentions, and goblin appearances under the "Nerdy" personality in GPT-5.4 surged by 3,881% compared to GPT-5.2.
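The positive feedback loop can be sketched with a toy simulation (a hypothetical model, not OpenAI's pipeline): outputs containing the biased vocabulary score higher, the top-scoring outputs are kept as the next round's SFT data, and the goblin rate in that kept set becomes the model's goblin rate in the next round.

```python
# Toy simulation of the reward-to-SFT feedback loop described above.
# All numbers (bonus size, sample counts, keep fraction) are illustrative.
import random

def reward(text: str) -> float:
    """Biased reward: a flat bonus whenever 'goblin' appears."""
    return 1.0 + (0.5 if "goblin" in text else 0.0)

def training_round(p_goblin: float, n_samples: int = 10_000,
                   top_frac: float = 0.5) -> float:
    """One round: sample outputs at the current goblin rate, keep the
    highest-reward fraction as 'SFT data', and return the goblin rate
    in the kept set, which becomes the next round's rate."""
    rng = random.Random(0)
    samples = ["a goblin reply" if rng.random() < p_goblin else "a plain reply"
               for _ in range(n_samples)]
    samples.sort(key=reward, reverse=True)
    kept = samples[: int(n_samples * top_frac)]
    return sum("goblin" in s for s in kept) / len(kept)

# Starting from a small goblin rate, the loop amplifies it every round.
rates = [0.02]
for _ in range(5):
    rates.append(training_round(rates[-1]))
```

Because goblin outputs always sort to the top of the kept set, each round roughly doubles the goblin rate until it saturates, which is the loop the post-mortem describes.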
GPT-5.5 had begun training before the root cause was identified, and goblins had already infiltrated its SFT data. OpenAI deactivated the "Nerdy" personality in March, removed the fantasy-creature-biased reward signal, and filtered the training data. For the already-deployed GPT-5.5, an inhibition instruction was added to the developer prompts in Codex. OpenAI said the investigation spurred a new set of model-behavior auditing tools.
