According to monitoring by Vision One Beating, Princeton Ph.D. student Yifan Zhang has posted technical details of DeepSeek V4 on X. On April 19 he teased "V4 next week" and listed the names of three architecture components; tonight he revealed a complete parameter table and disclosed for the first time the existence of a lightweight version, V4-Lite, with 285B parameters.
V4 has 1.6T total parameters. The attention mechanism is DSA2, which combines two sparse-attention schemes DeepSeek has used before: DSA (DeepSeek Sparse Attention), introduced in V3.2, and NSA (Native Sparse Attention), proposed in a paper earlier this year. The head dimension is 512, complemented by Sparse MQA and SWA (Sliding Window Attention). Each MoE layer consists of 384 experts, with 6 active per token, executed via the Fused MoE Mega-Kernel. Residual connections use Hyper-Connections.
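As a rough illustration only (not DeepSeek's actual kernel, whose details are not public), top-k expert routing for an MoE layer with 384 experts and 6 active per token can be sketched as:

```python
import numpy as np

def topk_moe_route(hidden, gate_weights, k=6):
    """Pick the top-k experts for one token and weight them.

    hidden: (d,) token activation; gate_weights: (num_experts, d) router matrix.
    This is a generic top-k MoE routing sketch, not DeepSeek's implementation.
    """
    logits = gate_weights @ hidden
    topk = np.argpartition(logits, -k)[-k:]        # indices of the k largest logits
    w = np.exp(logits[topk] - logits[topk].max())  # softmax over the selected experts only
    w /= w.sum()
    return topk, w

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)
gate = rng.standard_normal((384, 64))  # 384 experts, as in the rumored spec
experts, weights = topk_moe_route(hidden, gate, k=6)
```

The token's output would then be the weighted sum of the six selected experts' outputs; a fused mega-kernel combines this routing with the expert matmuls in one GPU launch to avoid dispatch overhead.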
Details revealed for the first time on the training side include: the optimizer is Muon (a matrix-level optimizer that applies Newton-Schulz orthogonalization to momentum updates); the pre-training context length is 32K, later expanded to a final 1M; and the reinforcement-learning phase uses GRPO with a KL-divergence correction. The model is text-only.
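The Newton-Schulz orthogonalization step at the heart of Muon can be sketched as follows. The quintic coefficients below come from the public open-source Muon implementation and are an assumption here; the rumor only names the optimizer, not its settings:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a momentum matrix, as in Muon.

    Iterates X <- aX + (b*A + c*A@A)@X with A = X@X^T, which pushes all
    singular values of X toward 1 while keeping its singular vectors.
    Coefficients are from the public Muon code (an assumption here).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # work with the smaller Gram matrix X @ X^T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

The resulting matrix replaces the raw momentum in the weight update, so every direction in the update gets roughly equal magnitude regardless of how skewed the gradient's spectrum is.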
Zhang is not affiliated with DeepSeek, and DeepSeek has not officially responded to any of the above information.
