According to monitoring by Perceiver Beating, the DeepSeek V4 Technical Report revealed that V4-Flash and V4-Pro were pretrained on 32T and 33T tokens respectively, roughly doubling V3's approximately 15T tokens. The report acknowledged encountering "significant instability challenges" during training, with loss spikes occurring repeatedly, which it attributed to outliers in the MoE layers. The routing mechanism itself amplified these outliers, and a simple checkpoint rollback was inadequate to resolve the issue.
DeepSeek identified two solutions that were implemented in production training: Anticipatory Routing, which decouples the routing-index calculation from the main network updates and is triggered only when a loss spike is detected, at an additional overhead of about 20%; and SwiGLU Clamping, which clamps activation values to a fixed range to suppress outliers directly. The report stated that both solutions were effective but admitted that the "underlying principles are not yet fully understood."
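To make the two mitigations concrete, here is a minimal sketch in NumPy. The report specifies neither the spike-detection rule nor the clamp range, so the threshold-versus-running-average check, the function names, and the `clamp=7.0` bound are all illustrative assumptions, not DeepSeek's actual implementation:

```python
import numpy as np

def spike_detected(loss, running_avg, threshold=1.5):
    """Hypothetical loss-spike check: flag when the current loss exceeds a
    running average by a fixed factor. In the reported scheme, such a trigger
    would kick off the decoupled routing-index recomputation; the actual
    detection rule is not disclosed."""
    return loss > threshold * running_avg

def swiglu_clamped(x, w_gate, w_up, clamp=7.0):
    """SwiGLU activation with hard clamping to a fixed range [-clamp, clamp],
    suppressing outlier activations before they propagate. The clamp bound
    here is a placeholder value."""
    gate = x @ w_gate
    swish = gate / (1.0 + np.exp(-gate))  # SiLU(gate) = gate * sigmoid(gate)
    act = swish * (x @ w_up)              # gated (SwiGLU) activation
    return np.clip(act, -clamp, clamp)    # hard clamp suppresses outliers
```

Note the asymmetry between the two fixes: clamping runs unconditionally on every forward pass, while the routing-side intervention activates only when a spike is detected, which is why its roughly 20% overhead is paid only intermittently.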
Google DeepMind researcher Susan Zhang (formerly of Meta AI and OpenAI) commented that the instability caused by doubling the training data "explained the delay" and described the two solutions as "band-aids," while also acknowledging DeepSeek's technical transparency.
