Skills Self-Improved by a 9B Small Model Match Those of Claude's Flagship Model

According to monitoring by Dynamics Beating, large AI models often self-evolve by updating their external "harness" (prompts, memories, skills, and tools). A recent paper, "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents," released by institutions including Pennsylvania State University, UCSC, and Amazon, decouples this evolutionary process into two dimensions: "Harness Updating" on the evolver side and "Harness Benefit" on the execution side. Cross-testing shows that the ability to update the harness is largely flat with respect to a model's base capability. The benefit delivered by harnesses updated by different models differs by at most 3.1%, and the skills updated by the 9B-parameter Qwen3.5-9B are highly equivalent in program structure to those updated by the flagship Claude Opus 4.6. This suggests that a self-evolving system does not need heavy investment in the evolver role.
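The cross-testing design can be pictured as a grid in which every evolver's updated harness is run by every executor. The following is a minimal sketch under assumptions, not the paper's code: the function name, the model labels, and all scores are hypothetical placeholders, used only to show how the spread in harness benefit across evolvers could be computed for each executor.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (evolver, executor) -> task success rate in percent.
Scores = Dict[Tuple[str, str], float]


def benefit_spread_by_executor(scores: Scores) -> Dict[str, float]:
    """For each executor, return the max-minus-min score across evolvers."""
    by_executor: Dict[str, List[float]] = {}
    for (_evolver, executor), score in scores.items():
        by_executor.setdefault(executor, []).append(score)
    return {ex: max(vals) - min(vals) for ex, vals in by_executor.items()}


# Illustrative placeholder numbers only; these are NOT results from the paper.
example_scores: Scores = {
    ("small-evolver", "executor-a"): 60.0,
    ("large-evolver", "executor-a"): 62.0,
    ("small-evolver", "executor-b"): 40.0,
    ("large-evolver", "executor-b"): 41.5,
}

spread = benefit_spread_by_executor(example_scores)
```

A small spread for every executor would correspond to the "flat" updating ability described above: whichever model writes the harness, the executor ends up with roughly the same benefit.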

By contrast, an agent's ability to benefit from the harness is non-monotonic in model strength. Top models have reached a performance plateau, while weaker models (such as Qwen3-32B) have the most room for improvement yet benefit the least. The study identifies two major failure modes in weak models. The first is "Harness Activation Failure": on the SkillsBench benchmark, weak models load skills in only 25.1% of cases, compared with around 96% for strong models. The second is "Harness Compliance Failure": as a long-horizon execution trajectory unfolds, the instruction compliance of weak models drops from 0.52 at the initial loading stage to 0.13.
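The two failure modes can be read as two simple measurements over agent trajectories. The sketch below is an assumption about how such measurements might be computed, not the paper's evaluation code: the `Trajectory` record, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical record of one agent run; field names are assumptions.
    loaded_skill: bool        # did the agent load a harness skill at all?
    compliance: List[float]   # per-stage instruction-compliance scores in [0, 1]


def activation_rate(runs: List[Trajectory]) -> float:
    """Fraction of runs in which the agent loaded a skill."""
    if not runs:
        return 0.0
    return sum(r.loaded_skill for r in runs) / len(runs)


def compliance_drop(runs: List[Trajectory]) -> float:
    """Mean first-stage minus mean last-stage compliance, over runs that loaded a skill."""
    loaded = [r for r in runs if r.loaded_skill and r.compliance]
    if not loaded:
        return 0.0
    first = sum(r.compliance[0] for r in loaded) / len(loaded)
    last = sum(r.compliance[-1] for r in loaded) / len(loaded)
    return first - last


# Illustrative placeholder runs only; these are NOT data from the paper.
sample_runs = [
    Trajectory(True, [0.5, 0.25]),
    Trajectory(False, []),
    Trajectory(True, [0.75, 0.25]),
    Trajectory(False, []),
]
```

A low `activation_rate` corresponds to Harness Activation Failure, and a large positive `compliance_drop` corresponds to Harness Compliance Failure.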

These findings resonated with AI researcher Elvis Saravia (@omarsar0), who notes that he has observed the same phenomenon in his own experiments with coding agents and long-horizon tasks: a more powerful model does not always make a better AI agent. The paper offers guidance for agent system architecture, indicating that compute budgets should be weighted toward the executing agent and that agent training should emphasize autonomous harness activation and long-horizon instruction compliance.
