
Artificial Analysis's New AI Benchmark Shows Claude Is 44 Times More Expensive Than DeepSeek

According to DynaInsight Beating monitoring, the evaluation agency Artificial Analysis has revised the assessment criteria of its AI Intelligence Index. The new evaluation no longer relies only on multiple-choice questions; it now tests whether an AI can autonomously plan, use tools, and solve complex tasks. It drops the old test of understanding simple instructions and adds high-difficulty scenarios such as simulated real-world bank customer service conversations. For the first time, the money and time needed to complete a task are treated as core assessment metrics.

In the latest results, Claude Fable 5, which has been taken offline by the U.S. government, achieved the highest score at 60 points. Among the models currently available on the market, Claude Opus 4.8, the most expensive, took the top spot with 56 points, narrowly ahead of GPT-5.5 at 55 points. Chinese models also performed well: the open-source DeepSeek V4 Pro and MiniMax M3 both scored 44 points, followed closely by Kimi K2.6 at 43 points.

The models differ sharply in cost. For the same task, the state-of-the-art Claude Opus 4.8 costs $1.78, while the Chinese open-source DeepSeek V4 Pro costs only $0.04. This means Claude's cost per task is about 44 times that of DeepSeek. Completion time also varies widely: the fastest model, xAI Grok 4.3, takes only 1.5 minutes per task, while the slowest, Claude Sonnet 4.6, takes 13.5 minutes.
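
The headline multiple can be checked directly from the per-task figures quoted above. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative assumptions, not part of the Artificial Analysis methodology, and the only inputs are the four numbers reported in this article.

```python
# Per-task figures as quoted in the article (costs in USD, times in minutes).
COST_PER_TASK_USD = {
    "Claude Opus 4.8": 1.78,
    "DeepSeek V4 Pro": 0.04,
}
TIME_PER_TASK_MIN = {
    "xAI Grok 4.3": 1.5,
    "Claude Sonnet 4.6": 13.5,
}


def ratio(table: dict, larger: str, smaller: str) -> float:
    """Return how many times bigger the `larger` entry is than the `smaller` one."""
    return table[larger] / table[smaller]


# Hypothetical helper names; the ratios themselves follow from the article's numbers.
cost_multiple = ratio(COST_PER_TASK_USD, "Claude Opus 4.8", "DeepSeek V4 Pro")
time_multiple = ratio(TIME_PER_TASK_MIN, "Claude Sonnet 4.6", "xAI Grok 4.3")

print(f"Cost multiple: {cost_multiple:.1f}x")
print(f"Time multiple: {time_multiple:.1f}x")
```

Dividing $1.78 by $0.04 gives roughly 44.5, so the reported "44 times" appears to be that figure rounded down. The same calculation on the completion times (13.5 minutes against 1.5 minutes) gives a ninefold gap between the slowest and fastest models; that multiple is derived here and is not stated in the original report.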

GDPval-AA, a test of real-world knowledge work, has been upgraded to version 2 and is the highest-weighted single test in this redesign, accounting for 20% of the evaluation. The new version sets the human benchmark score at 1000, introduces multiple cutting-edge models as rotating judges, and raises the limit on rounds in a single conversation to 250.
