According to monitoring by Perceive Beating, the latest Vending-Bench 2 evaluation released by Andon Labs places the open-source model GLM 5.2 in second place. The evaluation simulates, in code, 365 days of running a vending machine business: each day the model is fed current inventory and financial data and makes decisions such as restocking and pricing through API calls. The goal is to assess how consistently large language models make decisions over a long-horizon task. The data shows that the GLM versions followed a strong linear growth trend in the evaluation, with an average monthly profit improvement of nearly $1,000 (GLM 5 averaged $4,432, and GLM 5.1 rose to $5,634).
In contrast to GLM's steady progress, the latest versions of other mainstream domestic models delivered mixed results. Kimi K2.7 Code performed slightly worse than its predecessor, Kimi K2.6. Minimax M3 improved significantly over the previous M2.5, but its overall profit still lags far behind the Kimi and GLM series models.
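To make the evaluation setup described above more concrete, here is a minimal sketch of a 365-day loop in which an agent sees the day's inventory and finances and returns restocking and pricing decisions. It is not Andon Labs' actual harness: the names `State`, `Decision`, `simulate`, `demand`, and `simple_policy`, the unit cost, the starting cash, and the toy demand curve are all assumptions made for illustration, and the fixed rule-based policy merely stands in for the model's API calls.

```python
from dataclasses import dataclass
from typing import Callable

UNIT_COST = 1.0  # assumed wholesale cost per item (illustrative only)


@dataclass
class State:
    """What the agent is shown each day (hypothetical fields)."""
    day: int
    cash: float
    inventory: int
    price: float


@dataclass
class Decision:
    """What the agent returns each day (hypothetical fields)."""
    restock_units: int
    price: float


def demand(price: float) -> int:
    """Toy linear demand curve: higher prices sell fewer units."""
    return max(0, int(20 - 4 * price))


def simulate(policy: Callable[[State], Decision],
             days: int = 365,
             start_cash: float = 500.0) -> State:
    """Run the toy business for `days` days and return the final state."""
    state = State(day=0, cash=start_cash, inventory=0, price=2.0)
    for day in range(1, days + 1):
        state.day = day
        decision = policy(state)  # stand-in for the model's API call

        # Restock, limited by the cash actually available.
        affordable = int(state.cash // UNIT_COST)
        units = max(0, min(decision.restock_units, affordable))
        state.cash -= units * UNIT_COST
        state.inventory += units

        # Apply the new price, then sell what demand and stock allow.
        state.price = max(0.0, decision.price)
        sold = min(state.inventory, demand(state.price))
        state.inventory -= sold
        state.cash += sold * state.price
    return state


def simple_policy(state: State) -> Decision:
    """Fixed rule: top stock up to 20 units and always charge 2.50."""
    return Decision(restock_units=max(0, 20 - state.inventory), price=2.5)


final = simulate(simple_policy)
print(f"day={final.day} cash={final.cash:.2f} inventory={final.inventory}")
```

A fixed rule like `simple_policy` behaves identically every day, which is the kind of long-run decision consistency the benchmark is described as measuring in language models. The cash figures this toy produces have no relation to the profit figures reported in the evaluation.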
