
Former Qianwen Head Lin Junyang Speaks After His Departure: The AI Industry Is Transitioning from "Training Models" to "Training Agents"

According to 1M AI News monitoring, Lin Junyang, former CTO of Qianwen, published a lengthy post on X systematically laying out the AI industry's shift from "Reasoning Thinking" to "Agentic Thinking." It is his first public technical commentary since leaving the Qianwen team in early March.

Lin Junyang believes the core theme of the first half of 2025 was Reasoning Thinking: how to make models spend more compute at inference time, how to train with stronger reward signals, and how to control reasoning depth. The answer for the next stage, however, is Agentic Thinking: models no longer merely "think longer" but "think in order to act," continuously revising their plans through interaction with the environment.

In the article he candidly reviewed the Qwen team's technical choices. Qwen3 attempted to fold thinking mode and instruction mode into a single model with an adjustable reasoning budget. In practice, however, the two modes diverged significantly in data distribution and behavioral objectives: instruction mode optimizes for brevity, low latency, and format compliance, while thinking mode aims to spend more tokens on hard problems and to maintain an intermediate reasoning structure. Unless the data mixture is tuned very carefully, the result is mediocre at both ends. The Qwen 2507 series therefore ultimately shipped separate Instruct and Thinking releases (in two sizes, 30B and 235B), each optimized independently. Anthropic took the opposite approach: Claude 3.7 Sonnet treats reasoning as an integrated capability rather than a standalone model, letting users set their own thinking budget.
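The two serving strategies described above can be contrasted in a minimal sketch. All names here (`route_request`, `ModelVariant`, `build_request`, the request fields) are invented for illustration and do not correspond to any real Qwen or Anthropic API:

```python
from dataclasses import dataclass

@dataclass
class ModelVariant:
    name: str
    supports_thinking: bool

# Qwen-2507-style design: two separately optimized checkpoints.
INSTRUCT = ModelVariant("Qwen-2507-Instruct", supports_thinking=False)
THINKING = ModelVariant("Qwen-2507-Thinking", supports_thinking=True)

def route_request(thinking_budget: int) -> ModelVariant:
    """Route by budget: 0 means answer directly via the Instruct
    checkpoint; any positive budget goes to the Thinking checkpoint,
    which may spend up to that many reasoning tokens first."""
    return THINKING if thinking_budget > 0 else INSTRUCT

# Claude-3.7-style design: one integrated model, with the budget
# passed as a per-request parameter instead of a routing decision.
def build_request(prompt: str, thinking_budget: int) -> dict:
    req = {"model": "integrated-model", "prompt": prompt}
    if thinking_budget > 0:
        req["thinking"] = {"budget_tokens": thinking_budget}
    return req

print(route_request(0).name)     # Qwen-2507-Instruct
print(route_request(4096).name)  # Qwen-2507-Thinking
```

The trade-off the article describes falls out directly: the first design keeps each checkpoint's training data homogeneous at the cost of serving two models, while the second keeps one model but must blend both behavioral objectives in training.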

Lin Junyang argued that the infrastructure for Agentic Reinforcement Learning is harder than that for traditional reasoning RL. A reasoning-RL rollout is usually a self-contained trajectory that can be scored by a static verifier, whereas agentic RL must embed the model in an entire toolchain (browser, terminal, sandbox, APIs, memory systems). Training and inference must be decoupled, or rollout throughput collapses. He elevated environment design to the same rank as model architecture, saying that "environment construction is moving from a side project to a genuine entrepreneurial category."
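The decoupling point can be sketched with a toy producer-consumer loop. This is a schematic under invented names (`ToolEnv`, `rollout_worker`, `trainer`), not any real RL framework: rollout workers block on slow tool calls and push completed trajectories onto a queue, so the trainer never waits on a live tool:

```python
import queue
import threading

class ToolEnv:
    """Stub standing in for the browser/terminal/sandbox toolchain."""
    def step(self, action: str) -> str:
        return f"observation for {action!r}"  # a real env would run the tool

def rollout_worker(env: ToolEnv, traj_queue: queue.Queue, n_episodes: int) -> None:
    """Generate full trajectories; only this side touches the tools."""
    for i in range(n_episodes):
        trajectory = []
        for t in range(3):                    # a few agent steps per episode
            action = f"tool_call_{i}_{t}"     # policy output (stubbed)
            obs = env.step(action)            # the slow, tool-bound step
            trajectory.append((action, obs))
        traj_queue.put(trajectory)

def trainer(traj_queue: queue.Queue, total: int) -> int:
    """Consume finished trajectories; never blocks on a live tool call."""
    consumed = 0
    while consumed < total:
        traj_queue.get()   # a real trainer would compute a policy update here
        consumed += 1
    return consumed

q: queue.Queue = queue.Queue()
workers = [threading.Thread(target=rollout_worker, args=(ToolEnv(), q, 4))
           for _ in range(2)]
for w in workers:
    w.start()
done = trainer(q, total=8)
for w in workers:
    w.join()
print(done)  # 8
```

If trainer and rollout shared one synchronous loop, every tool-call latency would stall the gradient step; the queue is the minimal form of the decoupling the article says agentic RL requires.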

He forecast that Agentic Thinking will become the mainstream form of reasoning, potentially replacing the long internal soliloquy of traditional static reasoning. The greatest risk, however, is Reward Hacking: once a model has real tool access, it may learn during RL training to look up answers directly, exploit future information in a repository, or find shortcuts that bypass the task. The article closes by arguing that future competitive advantage will shift from better RL algorithms to better environment design, tighter integration of training and inference, and the systems-engineering capability for multi-agent collaboration.
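One common mitigation for the shortcut problem above is to make the reward verifier trajectory-aware. A minimal sketch, with an invented tool list and trajectory format (nothing here comes from the article itself): reward is withheld when a closed-book task was solved via a forbidden lookup tool, so "search the answer" stops being a profitable policy.

```python
# Tools that constitute a shortcut on closed-book tasks (illustrative names).
FORBIDDEN_IN_CLOSED_BOOK = {"web_search", "repo_future_lookup"}

def guarded_reward(trajectory: list[dict], answer_correct: bool,
                   closed_book: bool) -> float:
    """Score an episode from its full trajectory, not just the final answer."""
    tools_used = {step["tool"] for step in trajectory if step.get("tool")}
    if closed_book and tools_used & FORBIDDEN_IN_CLOSED_BOOK:
        return 0.0  # correct answer via a forbidden shortcut earns nothing
    return 1.0 if answer_correct else 0.0

honest = [{"tool": "python_sandbox"}, {"tool": None}]
hacked = [{"tool": "web_search"}]

print(guarded_reward(honest, True, closed_book=True))  # 1.0
print(guarded_reward(hacked, True, closed_book=True))  # 0.0
```

This only catches shortcuts the environment designer anticipated, which is precisely why the article ranks environment design alongside RL algorithms as a source of competitive advantage.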
