According to Sentinel Beat monitoring, Japanese AI startup Sakana AI has developed a multi-agent collaborative system called Fugu Ultra, which the company claims outperformed Anthropic's flagship model, Fable 5, on benchmarks covering scientific reasoning, programming, and other areas. The reported scores, however, have been widely questioned by the community.
Critics point out that comparing self-reported results obtained in non-uniform test environments is not objective. Benchmark scores depend heavily on the scaffold (harness) that runs the model, and different scaffolds can shift a score by as much as 10 to 20 points. This means the claimed lead is largely a product of systems engineering optimization rather than a true generational leap in underlying model capability.
Independent evaluation data shows that the agent scaffold built around a large model has a significant impact on the final score. With the same Claude Opus 4.5 model, switching among three open-source scaffolds moved the fix rate on the SWE-bench Pro benchmark between 50.2% and 55.4%. Analysis by third-party evaluator Scale AI further suggests that operational choices such as prompt templates, attempt limits, context retention management, and tool-calling integration are enough to shift scores by 10 to 20 points for the same set of model weights.
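To make the arithmetic concrete, the short Python sketch below computes the spread implied by the fix-rate range quoted above and checks whether a score gap between two systems exceeds it. It is illustrative only: the function names are hypothetical, only the two endpoints of the quoted range are used because the third scaffold's score is not given, and the 3.0-point gap is an assumed example value, not a reported result.

```python
def harness_spread(scores):
    """Max-minus-min spread, in percentage points, across harness runs."""
    return round(max(scores) - min(scores), 1)


def gap_exceeds_spread(reported_gap, spread):
    """True only if a score gap is larger than the harness-induced spread."""
    return abs(reported_gap) > spread


# Lowest and highest SWE-bench Pro fix rates quoted for the same model.
quoted_range = [50.2, 55.4]
spread = harness_spread(quoted_range)
print(f"harness-induced spread: {spread} points")

# Hypothetical gap between two vendors' self-reported scores (assumed value).
example_gap = 3.0
print("gap exceeds spread:", gap_exceeds_spread(example_gap, spread))
```

Under these assumptions, a gap smaller than the spread cannot be separated from the effect of the scaffold alone, which is the critics' central objection.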
Because the figures released by Sakana AI and Anthropic each come from closed-source vendor scaffolds optimized for their own systems, and have not been tested uniformly in a standardized, independent third-party environment (such as Scale SEAL), they do not accurately reflect the relative underlying capabilities of the two models.
