According to Sentinel Beating monitoring, SmartSpectrum AI's open-source model GLM-5.2 has been added to DeepSWE, a benchmark for long-horizon software engineering. In its maximum thinking mode, the model reaches a 44% one-shot success rate on complex development tasks, ranking first among open-source models and 13 percentage points ahead of the previously listed Kimi K2.7 Code.
GLM-5.2 costs an average of $3.92 per task, higher than Kimi K2.7 Code's $2.82, yet it outperforms several mainstream closed-source models at specific thinking configurations, including Claude Sonnet 4.6 [high] (30%), Gemini 3.5 Flash [medium] (37%), and Claude Opus 4.8 [low] (41%).
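One way to weigh the cost and success figures together is to estimate the spend per solved task. The sketch below is illustrative only: it derives Kimi K2.7 Code's success rate as 44% minus 13 points (31%), and it assumes that the $2.82 figure applies to that same run and that every attempt, solved or not, costs the quoted average. The function name `cost_per_solved_task` is hypothetical and not part of any DeepSWE tooling.

```python
# Figures quoted in the report; Kimi's 31% is derived (44% minus 13 points),
# not stated directly, and is assumed to match the run priced at $2.82.
results = {
    "GLM-5.2 [max]": {"success_rate": 0.44, "cost_per_task": 3.92},
    "Kimi K2.7 Code": {"success_rate": 0.31, "cost_per_task": 2.82},
}


def cost_per_solved_task(success_rate: float, cost_per_task: float) -> float:
    """Average spend per solved task, assuming each attempt costs the average."""
    return cost_per_task / success_rate


for name, figures in results.items():
    print(f"{name}: ${cost_per_solved_task(**figures):.2f} per solved task")
```

Under these assumptions the two models come out close, at roughly $8.91 per solved task for GLM-5.2 and roughly $9.10 for Kimi K2.7 Code, so the higher per-task price is largely offset by the higher success rate.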
DeepSWE, designed by Datacurve, specifically evaluates AI agents' ability to handle long-horizon tasks. The benchmark consists of 113 real-world coding problems covering 5 languages. Unlike traditional tests that require modifying only a single line of code, DeepSWE requires the AI to make coordinated edits across multiple files, with the average fix exceeding 600 lines of code. Evaluations run in isolated containers with strictly limited CPU and memory resources.
