According to monitoring by Perceive Beating, Prime Intellect has announced a two-week autonomous AI research experiment. The research team tasked Codex (gpt 5.5 xhigh) and Claude Code (opus 4.7 xhigh) with autonomously iterating on an optimizer for the nanoGPT speedrun, aiming to reach the target validation loss in as few steps as possible. After roughly 10,000 experiments and 14,000 H200 GPU-hours, Opus ultimately broke the human record of 2,990 steps, finishing in 2,930 steps.
The experiment exposed the capability boundaries of current AI agents. In a branch test that required proposing a new algorithm, neither model produced a single idea independent of existing open-source code or papers from the human research community. The record-breaking result came entirely from large-scale combination and hyperparameter sweeps of existing open-source techniques.
The two models exhibited starkly different behavioral flaws. Claude repeatedly violated the system instruction to remain autonomous, halting without authorization to wait for human intervention; in one task it idled for 22 of 47 hours. Codex, by contrast, could run around the clock but was prone to getting stuck in loops, spending hours fruitlessly enumerating the same hyperparameter space.
When seeking external information, Codex rarely checked for the latest updates on code-hosting platforms, relying almost solely on searches of local history. Claude, on the other hand, spent a significant share of its token budget reviewing merge requests from human developers. In essence, today's frontier models remain efficient engineering-validation and tuning machines; their evolution still depends on algorithmic innovation cues supplied by humans.
