According to monitoring by Dynamics Beating, the semiconductor and AI analysis firm SemiAnalysis has released a programming-assistant benchmark covering GPT-5.5, Opus 4.7, and DeepSeek V4. The key findings: GPT-5.5 is the first OpenAI model in six months to return to the front rank of programming models, and SemiAnalysis engineers have started switching between Codex and Claude Code, whereas previously almost all of them used Claude exclusively. GPT-5.5 is built on a new pre-training run codenamed "Spud," OpenAI's first expansion of pre-training scale since GPT-4.5.
A division of labor emerged during testing: Claude handles new-project planning and initial development, while Codex focuses on inference-heavy bug fixes. Codex excels at understanding data structures and logical reasoning but struggles to infer users' vague intentions. On the same dashboard task, Claude automatically replicated the reference page's layout but heavily fabricated the data, while Codex skipped the layout but produced far more accurate data.
The article highlights a telling detail about benchmark testing: in February of this year, OpenAI published a blog post calling on the industry to adopt SWE-bench Pro as the new standard programming benchmark. Yet the GPT-5.5 announcement instead introduced a new benchmark called "Expert-SWE." The reason lies in the fine print at the bottom of the announcement: on SWE-bench Pro, GPT-5.5 was surpassed by Opus 4.7 and fell well short of Anthropic's yet-to-be-released Mythos (77.8%).
Regarding Opus 4.7, Anthropic issued a postmortem a week after release, acknowledging three bugs in Claude Code between March and April that each persisted for several weeks and affected almost all users. Before that, several engineers had reported a performance drop in 4.6, which was dismissed as subjective perception. In addition, the new tokenizer in 4.7 can increase token usage by up to 35%, a fact Anthropic openly acknowledges; at unchanged per-token prices, this amounts to a hidden price hike.
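The "hidden price hike" claim is simple arithmetic: if the same text now tokenizes into up to 35% more tokens while the per-token price stays fixed, the effective cost of a task rises by the same factor. A minimal sketch (the function name and the 35% default are taken from the article's figure; the task costs are hypothetical):

```python
def effective_cost(base_cost, token_inflation=0.35):
    """Effective per-task cost when the tokenizer emits more tokens
    for the same text at an unchanged per-token price.

    token_inflation=0.35 reflects the up-to-35% increase reported
    for the Opus 4.7 tokenizer; base_cost is a hypothetical figure.
    """
    return base_cost * (1 + token_inflation)

# A task that previously cost $1.00 in tokens can now cost up to $1.35,
# even though the posted per-token price is unchanged.
print(effective_cost(1.00))
```

The point is that the posted per-million-token price alone understates what users actually pay when the tokenizer itself becomes less efficient.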
DeepSeek V4 was rated as "keeping up with the latest but not leading," positioning it as the lowest-cost alternative to closed-source models. The article also states that "Claude still outperforms DeepSeek V4 Pro in the highly challenging task of Chinese writing" and comments that "Claude defeated the Chinese model in its own language."
The article introduces a key concept: when evaluating model pricing, one should look at "cost per task" rather than "cost per token." Although GPT-5.5 is priced at twice GPT-5.4 ($5 per million input tokens, $30 per million output tokens), if it completes the same task with fewer tokens, the actual cost may not end up higher. Initial data from SemiAnalysis shows Codex with an input-output ratio of 80:1, lower than Claude Code's 100:1.
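The cost-per-task idea can be made concrete with a small sketch. The prices below are the GPT-5.5 figures from the article ($5 input / $30 output per million tokens) and the implied GPT-5.4 prices at half that; the token counts are hypothetical, chosen only to illustrate how a model that is twice as expensive per token can cost the same per task if it uses half as many tokens:

```python
def cost_per_task(input_tokens, output_tokens, in_price, out_price):
    """Dollar cost of one task; prices are per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# GPT-5.5 at $5 in / $30 out, with hypothetical token counts
# matching the reported 80:1 input-output ratio for Codex:
expensive_model = cost_per_task(800_000, 10_000, 5, 30)

# A model at half the per-token price (the article's GPT-5.4
# pricing) that needs twice the tokens for the same task:
cheap_model = cost_per_task(1_600_000, 20_000, 2.5, 15)

# Both come out to $4.30 per task despite a 2x per-token price gap.
print(expensive_model, cheap_model)
```

This is why per-token price comparisons are misleading on their own: token efficiency and the input-output ratio determine what a task actually costs.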
