According to monitoring by Dongcha Beating, the AI evaluation firm Vals AI has released the second generation of its financial agent benchmark, Finance Agent v2: an end-to-end test that simulates a junior financial analyst's workflow across 927 expert-reviewed questions. Difficulty has risen sharply in the new version: GPT 5.5 took the top spot with just 51.76% accuracy, narrowly ahead of Claude Opus 4.7 (51.51%) and Claude Sonnet 4.6 (51.03%).
Unlike single-turn Q&A, the test requires the model to autonomously locate relevant passages in hundreds of pages of 10-K and 10-Q filings, handle cross-year financial statement adjustments, and complete multi-step calculations whose intermediate figures must be exact. Vals AI revealed that under a strict all-or-nothing scoring standard, where a question counts only if every part is correct, accuracy for all frontier models plummets below 40%; in the hardest categories, "Financial Modeling" and "Precedent Analysis", the top score is just 23%.
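The gap between lenient and strict scoring is easy to reproduce in miniature. A minimal sketch with made-up pass/fail data (not Vals AI's actual rubric or results) shows how all-or-nothing scoring on multi-step questions drags accuracy well below the per-step average:

```python
def per_part_accuracy(results):
    """Average correctness over every individual sub-step."""
    parts = [ok for question in results for ok in question]
    return sum(parts) / len(parts)

def strict_accuracy(results):
    """A question scores only if ALL of its sub-steps are correct."""
    return sum(all(question) for question in results) / len(results)

# Hypothetical data: each inner list is one multi-step question,
# with pass/fail per intermediate calculation.
results = [
    [True, True, True],    # fully correct
    [True, True, False],   # one wrong intermediate number sinks it
    [True, False, True],
    [True, True, True],
]

print(f"per-part: {per_part_accuracy(results):.2%}")  # 83.33%
print(f"strict:   {strict_accuracy(results):.2%}")    # 50.00%
```

With 10 of 12 sub-steps correct, the lenient average is 83%, but only 2 of 4 questions are fully correct, so strict accuracy is 50% — the same mechanism by which frontier models' benchmark scores fall below 40% under Vals AI's strict standard.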
Among the other models, Kimi K2.6 ranks fifth at 44.87%, the highest score among Chinese models, followed closely by GLM 5.1 (44.79%) and DeepSeek V4 (44.08%). Claude Opus 4.7 earned the "Fastest" label (360 seconds per run), while GLM 5.1 claimed "Most Cost-Efficient" ($0.62 per run).
The across-the-board drop in scores (Opus 4.7 scored 64.4% on the previous-generation test) underscores one point: current AI models can handle simple retrieval, but in complex financial work that demands adherence to industry-specific conventions and exact numerical precision, they are still far from replacing human analysts.
