According to monitoring by Perception Beating, Princeton PhD student Yifan Zhang claimed that DeepSeek, the Chinese AI company, will release its next-generation flagship model V4 next week. In the same post he listed three architecture components: Sparse Multi-Query Attention (Sparse MQA), a Fused MoE Mega Kernel, and Hyper-Connections. Zhang did his undergraduate degree at Peking University's Yuanpei College and his master's at Tsinghua University's Yao Class, and is currently a Princeton AI Lab Fellow; he previously worked as a research intern on ByteDance's Seed Base Model Team. He is not employed at DeepSeek, and the DeepSeek team has not officially confirmed the release schedule.
Each of the three components targets an independent direction in LLM optimization, and each is sketched below. Sparse MQA introduces sparsity on top of multi-query attention to further cut inference compute and memory usage in long-context scenarios. A Fused MoE Mega Kernel merges the MoE routing decision and the expert matrix multiplications into a single GPU kernel, eliminating much of the kernel-launch and memory-transfer overhead at inference time. Hyper-Connections generalize residual connections, replacing the single residual addition with multiple learnable weighted paths.
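To make the first idea concrete, here is a minimal PyTorch sketch of what "Sparse MQA" could look like: standard multi-query attention (all query heads sharing one key/value head) combined with a per-query top-k mask as the sparsity pattern. The top-k scheme, the function name `sparse_mqa`, and the tensor shapes are assumptions for illustration only; DeepSeek has not published the actual design.

```python
import torch
import torch.nn.functional as F

def sparse_mqa(q, k, v, top_k=64):
    """Multi-query attention with per-query top-k sparsity (illustrative).

    q:    (batch, n_heads, seq_q,  d)  -- many query heads
    k, v: (batch, 1,       seq_kv, d)  -- one shared key/value head (MQA)
    Only the top_k highest-scoring keys contribute to each query, which is
    one simple way to realize "sparse" attention.
    """
    # Broadcast the single K head across all query heads.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, H, Sq, Skv)
    if top_k < scores.shape[-1]:
        kth = scores.topk(top_k, dim=-1).values[..., -1:]  # k-th largest score
        scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                   # (B, H, Sq, d)

# Example: 8 query heads attend over 128 keys, keeping the top 16 per query.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 1, 128, 64)
v = torch.randn(1, 1, 128, 64)
out = sparse_mqa(q, k, v, top_k=16)  # (1, 8, 128, 64)
```

MQA already shrinks the KV cache by a factor of the head count; the sparsity mask then caps how many cached entries each query actually reads, which is where the long-context savings come from.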
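For the fused MoE idea, the contrast below shows what fusion removes: a per-expert loop in which every expert GEMM is its own kernel launch with gather/scatter traffic in between, versus a single batched call where the routing result feeds directly into one grouped matmul. This is a host-side PyTorch analogy with hypothetical names (`moe_unfused`, `moe_single_launch`); an actual mega kernel is hand-written GPU code that keeps routing and the expert GEMMs on-chip in one launch, without materializing per-token weights as the second function does here.

```python
import torch

def moe_unfused(x, w_router, w_experts):
    """Baseline top-1 MoE: routing, then one matmul kernel per expert.

    x:         (tokens, d_in)
    w_router:  (d_in, n_experts)
    w_experts: (n_experts, d_in, d_out)
    """
    expert_id = (x @ w_router).argmax(dim=-1)       # (tokens,) routing decision
    out = x.new_empty(x.shape[0], w_experts.shape[2])
    for e in range(w_experts.shape[0]):             # one kernel launch per expert
        sel = expert_id == e
        out[sel] = x[sel] @ w_experts[e]            # gather -> GEMM -> scatter
    return out

def moe_single_launch(x, w_router, w_experts):
    """Single batched op standing in for a fused kernel: the routing result
    flows straight into one grouped matmul, with no per-expert loop and no
    intermediate round-trips through global memory. (Gathering per-token
    weight matrices is memory-wasteful and purely illustrative.)"""
    expert_id = (x @ w_router).argmax(dim=-1)       # (tokens,)
    w = w_experts[expert_id]                        # (tokens, d_in, d_out)
    return torch.bmm(x.unsqueeze(1), w).squeeze(1)  # one launch for all tokens
```

The overhead being eliminated is not the math itself but everything around it: per-expert launch latency and the writes and reads of intermediate token buffers between routing and the expert GEMMs.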
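Hyper-Connections, unlike the other two items, correspond to a published ByteDance paper (arXiv:2409.19606), so the sketch below follows that paper's static variant: n parallel hidden streams with learnable read, write, and stream-mixing weights around each layer. The class and parameter names are mine, and whether V4 uses this exact formulation is unconfirmed.

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Static hyper-connection wrapped around one layer (sketch).

    Instead of a single residual stream, keep n parallel streams. Each step:
    `beta` mixes the streams into the layer input, a learnable `width`
    matrix mixes the streams with each other, and the layer output is
    written back into every stream with learnable `depth` weights.
    """
    def __init__(self, layer, n_streams=4):
        super().__init__()
        self.layer = layer
        self.beta = nn.Parameter(torch.ones(n_streams) / n_streams)  # read weights
        self.depth = nn.Parameter(torch.ones(n_streams))             # write weights
        self.width = nn.Parameter(torch.eye(n_streams))              # stream mixing

    def forward(self, streams):                    # (n_streams, batch, seq, d)
        h = torch.einsum("n,nbsd->bsd", self.beta, streams)   # streams -> input
        y = self.layer(h)                                      # wrapped layer
        streams = torch.einsum("nm,mbsd->nbsd", self.width, streams)
        return streams + self.depth.view(-1, 1, 1, 1) * y      # depth connections
```

In the paper, the streams are initialized by replicating the embedding n times and are collapsed back into a single hidden state before the output head. Setting `n_streams=1` with all weights fixed at 1 recovers a plain residual block, which is the sense in which this "generalizes" residual connections.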
