According to 1M AI News monitoring, when an AI programming agent tackles a single task, running it multiple times often yields different solutions, some correct and some incorrect. If the best solution can be selected automatically, the overall success rate can exceed that of a single run. The question is how to choose: employing another model as a judge to score candidates (LLM-as-a-Judge) is the current mainstream approach, but its scoring granularity is too coarse; it often assigns the same score to different solutions and so cannot distinguish their quality.
Stanford AI Lab and Berkeley Sky Computing Lab, in collaboration with NVIDIA, propose LLM-as-a-Verifier to improve this selection process. Instead of taking only the judge's final score, it reads the model's probability distribution over the rating levels and computes a continuous reward value. It also has the judge reassess each solution multiple times to average out random bias, and it decomposes the overall assessment into three independently verified dimensions: task compliance, correct output format, and presence of error signals. In the experiments, Gemini 2.5 Flash served as the verifier, achieving a single-run verification accuracy of 74.7%, versus 57.0% for the traditional judge; after 16 repetitions, the Verifier reached 77.4%, while the Judge reached 70.2%. The traditional Judge produced ties 26.5% of the time, whereas the Verifier's draw rate was 0% in all configurations.
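The core idea of the continuous reward can be sketched in a few lines: instead of keeping only the single rating the judge emits, take the expected value of the rating under the judge's token probabilities. This is a minimal illustration, not the released implementation; the log-probability dictionaries below are made-up stand-ins for what an API with per-token logprobs would return.

```python
import math

def continuous_reward(rating_logprobs, ratings=(1, 2, 3, 4, 5)):
    # Expected rating under the judge's probability distribution over
    # the rating tokens, rather than just the argmax rating.
    probs = {r: math.exp(rating_logprobs[str(r)]) for r in ratings}
    total = sum(probs.values())  # renormalize over the rating tokens only
    return sum(r * p for r, p in probs.items()) / total

# Two hypothetical solutions: a discrete judge would rate both "4",
# but their underlying distributions (and hence continuous rewards) differ.
a = {"1": -9.0, "2": -5.0, "3": -2.0, "4": -0.3, "5": -3.0}
b = {"1": -9.0, "2": -6.0, "3": -3.0, "4": -0.2, "5": -6.0}
print(continuous_reward(a))  # ~3.89
print(continuous_reward(b))  # ~3.94
```

In a full verifier, this expectation would be computed per dimension (task compliance, output format, error signals), repeated several times, and averaged, so the final reward is a fine-grained real number rather than one of a handful of discrete grades.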
Real-world impact: on Terminal-Bench 2, running GPT-5.4 five times on the same task gave a success rate of 81.8% when one run was chosen at random, rising to 86.4% when the Verifier made the selection. On SWE-Bench Verified, selecting among one solution each from Claude Opus 4.5, Claude Opus 4.6, and Gemini 3 Flash (three solutions in total) raised the success rate from 76.1% to 77.8%. As of the April 9th release date, these results led both benchmarks. The framework has been open-sourced.
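The selection step itself is just best-of-n over verifier scores. A toy sketch, with hard-coded scores standing in for real averaged verifier calls:

```python
def select_best(solutions, verifier_score):
    # Score every candidate with the verifier and keep the argmax.
    # Because the rewards are continuous, exact ties are vanishingly
    # unlikely (the reported draw rate was 0%), so no tiebreak is needed.
    return max(solutions, key=verifier_score)

# Hypothetical candidates and scores for illustration only.
candidates = ["patch_a", "patch_b", "patch_c"]
scores = {"patch_a": 3.2, "patch_b": 4.1, "patch_c": 3.9}
print(select_best(candidates, scores.get))  # patch_b
```

With a coarse discrete judge, patch_b and patch_c might both score "4" and the choice between them would be arbitrary; continuous rewards make the argmax well defined.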
