
Sakana Fugu vs. Fable 5 Benchmark Comparison Challenged: Test Scaffold Differences Could Cause 10-20 Point Deviation

According to Sentinel Beat monitoring, Japanese AI startup Sakana AI has developed a multi-agent collaborative system called Fugu Ultra, which it claims outperforms Anthropic's flagship model, Fable 5, on several benchmarks, including scientific reasoning and programming. However, the reported scores have been widely questioned by the community.

Critics point out that comparing self-reported results obtained in non-uniform test environments is not objective. Benchmark scores depend heavily on the scaffold (harness) the model runs in, and different scaffolds can shift scores by as much as 10 to 20 points. This means the claimed lead is largely a product of system engineering optimization rather than a true generational leap in underlying model capability.

Independent evaluation data shows that the agent scaffold built around a large model has a significant impact on the final score. With the same Claude Opus 4.5 model, switching among three open-source scaffolds moves the fix rate on the SWE-bench Pro benchmark between 50.2% and 55.4%. Analysis by third-party evaluator Scale AI further suggests that operational choices such as prompt templates, attempt limits, context retention management, and tool-call integration are enough to cause a 10 to 20 point score deviation for the same set of model weights.
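The arithmetic behind this argument can be sketched in a few lines. The snippet below is only an illustration: the two fix rates are the lowest and highest figures cited above, while the function names and the example gap values are hypothetical and do not come from either vendor's results.

```python
def scaffold_spread(scores):
    """Return the max-minus-min spread of scores, in percentage points."""
    return round(max(scores) - min(scores), 1)


def gap_is_conclusive(reported_gap, scaffold_band):
    """A gap between two systems is treated as conclusive only if it is
    larger than the deviation that scaffold choice alone could explain."""
    return abs(reported_gap) > scaffold_band


# Lowest and highest SWE-bench Pro fix rates cited for the same
# Claude Opus 4.5 model across three open-source scaffolds.
opus_fix_rates = [50.2, 55.4]
spread = scaffold_spread(opus_fix_rates)
print(f"Spread from scaffold choice alone: {spread} points")

# Hypothetical gap between two systems (an assumption for illustration),
# checked against the lower end of the 10 to 20 point band cited above.
hypothetical_gap = 8.0
print(gap_is_conclusive(hypothetical_gap, scaffold_band=10.0))
```

Under these assumptions, any lead smaller than the scaffold-induced band cannot be attributed to the underlying model with confidence.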

Because the figures released by Sakana AI and Anthropic each come from closed-source vendor scaffolds optimized for their own systems, and have not been tested uniformly in a standardized, independent third-party environment (such as Scale SEAL), they do not accurately reflect the relative strength of the two models' underlying capabilities.
