
Agents' Last Exam: Fable 5 Leaves the Hardest Tasks Unsolved, at 4 to 12 Times the Per-Task Cost

According to monitoring by VisionOne Beating, the University of California, Berkeley RDI and a coalition of hundreds of industry experts have launched a new AI agent benchmark called Agents' Last Exam (ALE). The benchmark is designed to evaluate an agent's ability to perform real-world digital professional work. ALE covers 55 subfields of digital professions and includes more than 1,500 validation tasks that human experts derived from real projects. Results can be verified in both GUI and CLI interactive environments.

The initial tests covered cutting-edge systems including Fable 5, GPT-5.5, and Composer 2.5. The latest comparison on the official website shows that on the most challenging tasks, which require sustained reasoning and deep expertise, every tested AI agent had a success rate of 0%. The newly released Fable 5 also performed poorly, mainly because the evaluation triggered security policies: approximately 35% of Fable 5 tasks fell back to the older Opus 4.8, leaving its overall performance well below its standout results on other leaderboards. Fable 5's API cost per task is around $15.70, significantly higher than GPT-5.5 at $3.80 and Composer 2.5 at $1.33, which makes it roughly 4 to 12 times as expensive for the same task. The tests also found that the most common cause of agent failure is declaring success prematurely: agents finish tasks hastily without verifying the actual results, sometimes omitting files or miscalculating data.
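The "4 to 12 times" figure follows directly from the three per-task costs quoted above. A minimal sketch of the arithmetic, using only the reported dollar amounts (the function and variable names are illustrative and not part of ALE):

```python
# Per-task API costs in USD, as reported in the article.
COSTS_USD = {"Fable 5": 15.70, "GPT-5.5": 3.80, "Composer 2.5": 1.33}


def cost_ratios(costs, baseline):
    """Return baseline cost divided by each other system's cost."""
    base = costs[baseline]
    return {name: base / cost for name, cost in costs.items() if name != baseline}


ratios = cost_ratios(COSTS_USD, "Fable 5")
for name, ratio in ratios.items():
    print(f"Fable 5 costs {ratio:.1f}x as much as {name} per task")
# Fable 5 costs 4.1x as much as GPT-5.5 per task
# Fable 5 costs 11.8x as much as Composer 2.5 per task
```

The ratios come out to about 4.1 and 11.8, which the article rounds to "4 to 12 times."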

For command-line agents, the assessment team also released the ALE-CLI subset. In contrast to the existing Terminal-Bench and SWE-bench-Pro, ALE-CLI covers 40 subfields, and its individual tasks take a human anywhere from several hours to weeks to complete on average. In the command-line evaluations, the best-performing AI agent had a success rate of only 25.2%. The evaluation team noted that the era of user-friendly AI agents has arrived, but that there is still a long way to go before agents can truly replace humans in the workforce.
