According to Perceive Beating, University of Illinois computer science PhD student Dylan Zhang ran a series of agent-memory experiments and reached an unusual conclusion: having a model repeatedly summarize its own experience may actually make its memory worse.
The most striking results came from ARC-AGI: the researchers selected 19 questions that GPT-5.4 could already answer correctly without any memory, then fed the model the true solutions to those questions while having it "summarize" the experience. In theory this is like open-book revision; in practice, after multiple rounds of memory compression, the same model's accuracy dropped from 100% to 54%. The original trajectories were not wrong; the damage occurred when the model transformed correct trajectories into generic experience.
Worse, this memory degradation is not an isolated case. In the WebShop online-shopping task, the AWM memory method scored 0.64 when fed 8 expert trajectories, but the score plummeted to 0.20 when the trajectories increased to 128, falling back to the no-memory baseline. In other words, the more memory was piled on, the more the benefit was cancelled out.
The problem is not "too little experience" but "summarizing too often." The experience a large model records is not an objective log; every summary is a regeneration. Over time, specific premises are dropped, rules from different tasks blur together, and details that could guide actions collapse into plausible-sounding but practically useless clichés like "prioritize the most direct action" or "use the correct tool." In one extreme example from the original post, 50 structured memories were merged into one, the differences between tasks were compressed into a single generic procedure, and 6 to 13 successful samples were simply discarded in the next evaluation round.
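That failure mode can be sketched in a few lines: each compression step keeps only what the entries have in common, so the task-specific premises that actually guide actions are the first thing lost. The memory entries and the `naive_compress` helper below are invented for illustration; real systems perform this merge with an LLM summarization call, with the same effect.

```python
# Hypothetical memory entries, each carrying a task-specific premise.
memories = [
    {"task": "webshop", "rule": "filter by size before price when the query names a size"},
    {"task": "webshop", "rule": "open the product page before clicking add-to-cart"},
    {"task": "arc-agi", "rule": "count connected components of the minority color"},
]

def naive_compress(mems):
    """Lossy merge: collapses N entries into one generic rule.

    Premises that don't generalize across tasks are dropped, leaving
    exactly the kind of cliché the experiments describe.
    """
    return [{"task": "generic", "rule": "use the correct tool and take the most direct action"}]

compressed = naive_compress(memories)
print(len(memories), "->", len(compressed))  # 3 -> 1
# The compressed rule no longer says *when* to filter by size or *which*
# color to count, so it can no longer guide a concrete action.
```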
The author's advice is restrained: don't rush to have the agent write a "mistake book" every round. A more stable approach is to retain filtered raw operation trajectories and abstract them only when strictly necessary. In the experiments, simply preserving raw episodes and disabling abstract summarization matched or surpassed the compressed-memory methods across multiple agent benchmarks. For developers, the takeaway is straightforward: showing the model what was actually done usually helps more than making it memorize a pile of abstract rules.
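A minimal sketch of that design, under stated assumptions: store filtered raw episodes verbatim and retrieve them as-is, with no summarization pass. All class, field, and task names here are invented for illustration, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    trajectory: list[tuple[str, str]]  # raw (observation, action) steps, kept verbatim
    success: bool

@dataclass
class EpisodicMemory:
    """Keeps filtered raw episodes; no abstract-summarization step."""
    episodes: list[Episode] = field(default_factory=list)

    def add(self, ep: Episode) -> None:
        # Filter (keep only successes) instead of rewriting.
        if ep.success:
            self.episodes.append(ep)

    def retrieve(self, task: str, k: int = 3) -> list[Episode]:
        # Naive relevance: same task, most recent first.
        return [e for e in reversed(self.episodes) if e.task == task][:k]

mem = EpisodicMemory()
mem.add(Episode("webshop", [("query: red shoes size 9", "search[red shoes size 9]")], True))
mem.add(Episode("webshop", [("results page", "click[item-42]")], False))  # filtered out
print(len(mem.retrieve("webshop")))  # 1
```

The design choice is the point: `add` drops failures but never rewrites a trajectory, so the specific observations and actions that made an episode succeed stay available to be shown to the model later.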
