
Yifan Zhang Reveals Full Technical Specs of DeepSeek V4: 1.6T Parameters, 384 Experts with 6 Active

According to Vision One Beating monitoring, Princeton Ph.D. student Yifan Zhang has posted further technical details of DeepSeek V4 on X. After teasing "V4 next week" on April 19 and naming three architecture components, he has now published a complete parameter table and disclosed for the first time the existence of a lightweight 285B-parameter version, V4-Lite.

V4 has 1.6T total parameters. The attention mechanism is DSA2, which combines two sparse attention schemes DeepSeek has used previously: DSA (DeepSeek Sparse Attention) from V3.2 and NSA (Native Sparse Attention), proposed in a paper earlier this year. The head dimension is 512, complemented by sparse MQA and SWA (Sliding Window Attention). The MoE layer consists of 384 experts, with 6 active per token, executed with a Fused MoE Mega-Kernel. Residual connections use Hyper-Connections.
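
For context, the "384 experts, 6 active" figure describes standard top-k MoE routing: a router scores all experts and only the 6 highest-scoring ones process each token. The sketch below is a minimal, hypothetical illustration of such routing using the reported numbers; the names, shapes, and gating details are assumptions, not DeepSeek's implementation.

```python
import torch

# Illustrative only: top-6-of-384 expert routing with the figures reported for V4.
NUM_EXPERTS = 384
TOP_K = 6

def route_tokens(hidden: torch.Tensor, router_weight: torch.Tensor):
    """Pick the top-6 of 384 experts per token and return normalized gate weights."""
    logits = hidden @ router_weight                  # [tokens, 384] router scores
    gates, expert_ids = logits.topk(TOP_K, dim=-1)   # keep the 6 best experts per token
    gates = torch.softmax(gates, dim=-1)             # renormalize over the chosen experts
    return gates, expert_ids

tokens = torch.randn(4, 2048)                        # 4 tokens; hidden size 2048 is an assumption
router = torch.randn(2048, NUM_EXPERTS)
gates, expert_ids = route_tokens(tokens, router)
print(expert_ids.shape)                              # torch.Size([4, 6]): 6 active experts per token
```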

Training details revealed for the first time include: the optimizer is Muon (a matrix-level optimizer that applies Newton-Schulz orthogonalization to momentum updates); the pre-training context length is 32K, later extended to 1M; and the reinforcement learning phase uses GRPO with a KL-divergence correction. The model is text-only.
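
Muon itself is a published optimizer: its core step approximately orthogonalizes each 2-D momentum matrix with a short Newton-Schulz iteration before using it as the weight update. The sketch below follows the public Muon reference code (including its coefficients) and is purely illustrative of that step, not DeepSeek's training code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D momentum matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315        # coefficients from the public Muon reference implementation
    X = G / (G.norm() + eps)                 # scale so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                              # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

momentum = torch.randn(1024, 4096)                   # e.g. a weight-matrix-shaped momentum buffer
update = newton_schulz_orthogonalize(momentum)       # Muon uses this as the update direction
```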

Zhang is not affiliated with DeepSeek, and DeepSeek officials have not responded to the above information.
