According to Perceive Beating monitoring, the open-source large model GLM 5.2 proved highly cost-effective in a test of academic paper reproduction. The team behind the alphaXiv research platform used automated agents to test how well large models can reproduce cutting-edge papers. In reproducing the self-distillation for policy optimization (SDPO) paper, GLM 5.2 cost only about one-eighth as much to run as the closed-source flagship model Claude Opus 4.8 Max.
The experiment required each model to read the paper autonomously, troubleshoot complex environment errors in the open-source VeRL library, and complete the ablation study. GLM 5.2 reproduced the results after 14 failed runs, consuming 2.65 million tokens at a total cost of $6.21. Claude Opus 4.8 Max succeeded after 9 failed runs, consuming 4.53 million tokens at a total cost of $46.35.
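
The reported figures can be cross-checked with a little arithmetic. The Python sketch below uses only the token counts and costs quoted above. The "blended rate" is an assumption made here for illustration: it is simply total cost divided by total tokens, not an official price from either vendor, and it ignores any split between input and output tokens.

```python
# Figures as reported in the text above.
runs = {
    "GLM 5.2": {"tokens_millions": 2.65, "cost_usd": 6.21, "failed_runs": 14},
    "Claude Opus 4.8 Max": {"tokens_millions": 4.53, "cost_usd": 46.35, "failed_runs": 9},
}


def blended_rate(run):
    """Effective USD per million tokens: total cost / total tokens.

    This is a derived figure for comparison only, not a published price.
    """
    return run["cost_usd"] / run["tokens_millions"]


glm = runs["GLM 5.2"]
opus = runs["Claude Opus 4.8 Max"]

# How the two runs compare on total cost and on total tokens consumed.
cost_ratio = glm["cost_usd"] / opus["cost_usd"]
token_ratio = glm["tokens_millions"] / opus["tokens_millions"]

for name, run in runs.items():
    print(f"{name}: ${blended_rate(run):.2f} per million tokens (blended)")
print(f"GLM cost as a fraction of Opus cost: {cost_ratio:.3f}")
print(f"GLM tokens as a fraction of Opus tokens: {token_ratio:.3f}")
```

On these numbers the GLM 5.2 run cost about 13% of the Claude Opus 4.8 Max run, in line with the "about one-eighth" figure, and it used about 58% as many tokens despite having more failed runs. The derived blended rates are roughly $2.34 and $10.23 per million tokens, so most of the saving comes from the lower per-token cost and the rest from the lower token count.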
