According to Dynamic Watch Beating monitoring, Google TPU software engineer Patrick Toulme pointed out that there is a misunderstanding in the external view that GLM 5.2 caught up with Opus through distillation. The training challenge of large models in intelligent agent encoding tasks lies in the "zero-gradient dilemma," where if the model cannot generate the correct execution path early on, reinforcement learning cannot receive gradient signals to initiate parameter updates. The role of distilling Claude or GPT-5.5 is merely to provide a seed answer in the cold start phase to bypass the zero-gradient dilemma.
Once the model crosses the cold start threshold, subsequent performance improvement will no longer rely on distillation but will be entirely based on reinforcement learning's hill climbing algorithm for self-evolution. Toulme emphasized that GLM 5.2 already has the ability to independently generate successful paths and can autonomously iterate to a higher level through reinforcement learning, completely freeing itself from reliance on American large models.
Redis founder Salvatore Sanfilippo added another possibility: while introducing an inference mode (distillation) through a high-capacity model is very useful for obtaining better RL signals, the practice of DeepSeek R0 has proven that even in a pure cold start scenario without any distillation seeding, reinforcement learning can still operate autonomously and make breakthroughs.
He also believes that if it is still necessary to cross the cold start threshold, large model development can initially use local open-source models such as DeepSeek-v3.2 for fine-tuning, instead of relying solely on American APIs.
