DeepMind Researcher Speculates on Delay of DeepSeek V4: Doubling of Training Data to 33T Causing Severe Instability

According to monitoring by Perceiver Beating, the DeepSeek V4 Technical Report revealed that V4-Flash and V4-Pro were pretrained on 32T and 33T tokens respectively, roughly double V3's approximately 15T tokens. The report acknowledged "significant instability challenges" during training: loss spikes recurred and were attributed to outliers in the MoE layers, which the routing mechanism itself amplified, so a simple checkpoint rollback was insufficient to resolve the issue.
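The report does not describe how loss spikes were detected, so the following is only an illustrative sketch: a common heuristic flags a spike when the current loss exceeds a rolling mean by a few standard deviations. The window size and threshold here are assumptions, not values from the report.

```python
from collections import deque

def make_spike_detector(window=100, threshold=3.0):
    """Return a callable that flags a loss spike.

    Hypothetical heuristic: a spike is declared when the incoming loss
    exceeds the rolling mean of recent losses by `threshold` standard
    deviations. Nothing here is taken from DeepSeek's actual pipeline.
    """
    history = deque(maxlen=window)

    def step(loss):
        spike = False
        # Wait for a minimal history before judging, to avoid noisy flags.
        if len(history) >= 10:
            mean = sum(history) / len(history)
            var = sum((v - mean) ** 2 for v in history) / len(history)
            spike = loss > mean + threshold * (var ** 0.5 + 1e-8)
        history.append(loss)
        return spike

    return step
```

In such a setup, a detector like this would be what gates the "triggered only on a loss spike" behavior described below, rather than running the mitigation at every step.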

DeepSeek identified two mitigations that were deployed in the actual training run: Anticipatory Routing, which decouples the routing-index computation from the main network updates and is triggered only when a loss spike is detected, at an extra overhead of roughly 20%; and SwiGLU Clamping, which clamps activation values to a fixed range to directly suppress outliers. The report stated that both were effective but conceded that the "underlying principles are not yet fully understood."
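The report only says that activations are clamped to a fixed range; the placement of the clamp and the threshold value below are assumptions for illustration. A minimal NumPy sketch of a SwiGLU feed-forward block with such a clamp might look like this:

```python
import numpy as np

def swish(x):
    # Swish/SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_clamped(x, W_gate, W_up, W_down, clamp=7.0):
    """SwiGLU block with activation clamping (illustrative sketch).

    The clamp value and its position before the down-projection are
    hypothetical; the report does not specify either.
    """
    gate = swish(x @ W_gate)     # gating branch
    up = x @ W_up                # linear branch
    hidden = gate * up           # SwiGLU: elementwise product
    # Clamp the hidden activations to suppress outliers before projecting back.
    hidden = np.clip(hidden, -clamp, clamp)
    return hidden @ W_down
```

The intuition matching the report: the elementwise product in SwiGLU can produce extreme values when both branches are large, and a hard clamp bounds them regardless of input scale.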

Google DeepMind researcher Susan Zhang (formerly of Meta AI and OpenAI) commented that the instability caused by the doubling of training data "explained the delay" and described these two solutions as "band-aids," while also acknowledging DeepSeek's technical transparency.
