
Google's Duo Breaks Through in High-Difficulty Deadlocks as TERMS-Bench Turns AI Negotiation into a Bankruptcy Stress Test

According to monitoring by Dynamic Insight Beating, Stanford's Erica Zhang and colleagues have released TERMS-Bench, a benchmark for economic negotiation. It removes the black-box "large-model judge," so evaluators can see directly whether a model lost because of its bidding, its concessions, or rule violations.

In the standard tests, Claude Opus 4.6 and ZhiPu GLM 5.1 took the top two spots. The researchers found that both adopted an aggressive "bid high, concede nothing" strategy, which lets them extract most of the surplus from opponents in games with ample profit margins.

However, in the highest-difficulty games, where the profit margin is extremely narrow, the aggressive strategy backfires because negotiations break down frequently. The leaderboard reverses here: Gemma 4 31B (an open-weight model) and Gemini 3.1 Pro, which make moderate concessions to secure the deal, rose to the top two spots, while the previous leaders fell, Claude to 5th place and GLM to 9th.

Beyond the extreme-difficulty setting, the benchmark's most striking feature is Bankroll mode, which tests survivability. A single negotiation is extended into a series of purchases: each agent starts with $100 in capital and negotiates for 50 rounds, with a fixed operating fee deducted each round. An agent that runs out of funds goes bankrupt, so even small negotiation mistakes can compound into a bankruptcy crisis.
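To make the Bankroll mechanics concrete, here is a minimal Python sketch of the loop as described: $100 starting capital, up to 50 rounds, a fixed fee per round, and bankruptcy when funds run out. It is an illustration, not the benchmark's code. The function name `run_bankroll`, the fee of $5, the per-round profit figures, and the choice to apply profit and fee together before checking for bankruptcy are all assumptions; the article does not state them.

```python
def run_bankroll(round_profits, start_cash=100.0, fee=5.0):
    """Simulate one Bankroll run.

    round_profits: profit earned in each round (0 if the negotiation broke down).
    fee: fixed operating fee per round (hypothetical value; the article
         does not give the actual fee).
    Returns (final_cash, bankrupt_round); bankrupt_round is None if the
    agent survives every round.
    """
    cash = start_cash
    for round_no, profit in enumerate(round_profits, start=1):
        # Assumed ordering: the round's profit and fee are applied together.
        cash += profit - fee
        if cash <= 0:
            # Out of funds: the run ends in bankruptcy at this round.
            return cash, round_no
    return cash, None


# An agent that closes a modestly profitable deal every round survives.
steady = run_bankroll([12.0] * 50)

# An agent whose negotiations always break down only pays fees.
stalled = run_bankroll([0.0] * 50)

print(steady, stalled)
```

Under these made-up numbers, the steady agent finishes all 50 rounds with more cash than it started with, while the stalled agent is drained by the fee alone well before round 50. That is the compounding effect the mode is meant to expose.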

The results show that GLM 5.1, Claude Opus 4.6, and the two Google models, despite their different strategies, were well ahead of the field in controlling the game: all achieved a 100% survival rate and finished with between $380 and $443 in cash. In contrast, Grok 4.20 and GPT-4o-mini could not withstand the cash drain, with bankruptcy rates of 25% and 50%, respectively.

The key to TERMS-Bench is not the deal rate but the way it converts negotiation errors into cash losses and bankruptcy risk. Whether a model can persuade its opponent is only the first layer; what decides the outcome is whether it can sustain profits and cash flow across continuous trading.
