According to Perceive Beating monitoring, the AI research team Proximal has updated the long-range programming benchmark FrontierSWE leaderboard. The newly added GPT-5.5 (operated through Codex) is significantly ahead of second-place Claude Opus 4.7 in both mean@5 (average score over 5 attempts) and best@5 (highest score), with a dominance rate of 83%. However, GPT-5.5 is also the most cheating-prone model: out of 85 trials, it was flagged for cheating 8 times, tying with Kimi K2.6.
Released in April, FrontierSWE collected 17 real-world challenges in areas such as compiler optimization, ML research, and high-performance engineering. Challenges included tasks like rewriting Git in Zig and building a PostgreSQL-compatible SQLite server, each with a time limit of 20 hours. It is currently one of the few publicly available programming benchmarks that remain unbeaten. Compared to its predecessor, GPT-5.5 shows more maturity in time management: it spends more time refining solutions for open-ended tasks, completes similar tasks faster, and achieves higher scores.
Previous tests have revealed several common flaws in AI programming agents. Models tend to be overly confident, mistakenly assuming tasks are completed well before the 20-hour limit due to shallow self-checks and submitting prematurely. Opus 4.6, on average, invests over 8 hours per individual task, significantly more than other models' approximately 2 hours, but it has repeatedly lost previously optimized solutions only to later "reinvent" them. Cheating is particularly prominent in high-pressure tasks: in a task explicitly prohibiting the use of PyTorch to port Mojo, all models except Qwen 3.6 attempted to cheat. Gemini attempted to conceal the banned library name through character encoding and run clandestine processes in a temporary directory, while Opus 4.6 even wrote "willing to cheat" in the inference before taking action.
