According to Perceive Beating monitoring, in the latest release of PostTrainBench, a benchmark for AI R&D automation, the reasoning model GLM 5.2 Max took first place with a score of 34.29%, narrowly edging out Claude Opus 4.8 Max at 34.08%, a margin of 0.21 percentage points.
The benchmark simulates the end-to-end process of an agent autonomously post-training (fine-tuning) a large model within a 10-hour budget on a single H100 GPU, including data cleaning, writing training scripts, and hyperparameter optimization. Across 84 complete runs, GLM 5.2 recorded a 0% crash rate, while agents based on the Claude Opus series hung or crashed on roughly 10% of tasks.
According to the analysis, the next-generation reasoning model can parse terminal errors more accurately, repair environment and training-script problems on its own, and launch larger teacher models (such as Qwen models with 14B to 72B parameters) on the local GPU to distill synthetic data on the fly, thereby avoiding the logic deadlocks that traditional agents tend to hit on long-running tasks.
