DeepSeek open-sources the inference acceleration framework DSpark, boosting V4 model inference speed by up to 85%

According to ThinkTank Beating monitoring, DeepSeek, in collaboration with Peking University, has released a technical report on DSpark, a speculative sampling acceleration framework, and open-sourced the full-stack codebase DeepSpec. DSpark is already deployed in the DeepSeek-V4 online service. With output kept lossless, it raises single-user generation speed by 60% to 85% on the Flash version and by 57% to 78% on the Pro version. It outperforms the original single-token prediction (MTP-1) baseline and significantly increases overall system throughput under strict latency constraints.

Previously, multi-token speculative sampling was hard to deploy in online production. Autoregressive draft models generate too slowly, while parallel draft models predict each position independently, so the acceptance rate for the latter half of a long draft is very low. Under high concurrency, blindly verifying multi-token drafts makes the large model waste substantial compute on erroneous tokens that are bound to be rejected, severely degrading overall system throughput. As a result, the industry has mostly been limited to single-token prediction (MTP-1) online.
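
The tail problem can be illustrated with a simplified model. This is an assumption made for illustration, not a result from the technical report: verification keeps a draft only up to its first rejected token, so position k survives only if every earlier position was accepted. The per-position acceptance rates below are hypothetical numbers, and the function names are ours.

```python
from itertools import accumulate
from operator import mul


def survival_probs(accept_probs):
    """Probability that each draft position is accepted, assuming
    verification stops at the first rejected token."""
    return list(accumulate(accept_probs, mul))


def expected_accepted(accept_probs):
    """Expected number of accepted draft tokens per verification step."""
    return sum(survival_probs(accept_probs))


# Hypothetical acceptance rates for a parallel draft model whose quality
# degrades with distance from the last verified token (illustrative only).
parallel_draft = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

surv = survival_probs(parallel_draft)
# Positions the large model verifies on average but then discards.
wasted = len(parallel_draft) - expected_accepted(parallel_draft)

if __name__ == "__main__":
    for pos, p in enumerate(surv, start=1):
        print(f"position {pos}: survives with probability {p:.4f}")
    print(f"expected accepted: {expected_accepted(parallel_draft):.2f}")
    print(f"expected wasted:   {wasted:.2f}")
```

Under these assumed rates, roughly 2.66 of the 8 drafted tokens are accepted on average, so most of the verification work on the tail is discarded. That wasted work is the cost that grows under high concurrency.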

DSpark removes the throughput-degradation bottleneck under high concurrency. It first uses DFlash as the backbone to generate the draft in parallel, then adds an extremely lightweight Markov head. Using a table lookup and a single matrix multiplication, the Markov head serially injects the correlation between adjacent tokens at very low cost. The system also integrates a confidence prediction head with a post-hoc calibration algorithm. To support zero-overhead scheduling in production and prevent leakage of future information, the scheduler works asynchronously: it uses the predictions from two steps earlier to set the draft truncation length dynamically, which keeps the large model from verifying high-risk tail tokens under heavy load.
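
One way to picture a Markov head of this kind is sketched below. It is a toy reading of "table lookup plus a single matrix multiplication", not DSpark's actual implementation: the function `markov_refine`, the low-rank `table` and `proj` parameters, the greedy token choice, and all numbers are hypothetical. The backbone logits for every position are assumed to come from one parallel pass. The serial loop only looks up a row for the previous token, multiplies it by one matrix, and adds the result as a bias.

```python
def matvec(vec, mat):
    """Row vector (length r) times matrix (r x V), returned as a length-V list."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def markov_refine(backbone_logits, first_prev_token, table, proj):
    """Greedy draft decoding with a first-order (Markov) correction.

    backbone_logits:  K rows of V logits from one parallel pass (assumed given).
    first_prev_token: id of the last already-verified token.
    table:            V rows of r floats, indexed by the previous token id.
    proj:             r rows of V floats (the single matrix multiplication).
    """
    draft = []
    prev = first_prev_token
    for logits in backbone_logits:           # serial, but each step is tiny
        bias = matvec(table[prev], proj)     # table lookup + one matmul
        scores = [l + b for l, b in zip(logits, bias)]
        prev = max(range(len(scores)), key=scores.__getitem__)
        draft.append(prev)
    return draft


# Toy vocabulary of 3 tokens with rank r = 2 (all values hypothetical).
backbone_logits = [
    [2.0, 0.0, 0.0],   # position 0: token 0 is clearly preferred
    [0.0, 1.0, 1.1],   # position 1: token 2 narrowly wins in isolation
]
table = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
proj = [[0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]    # after token 0, favor token 1

if __name__ == "__main__":
    no_markov = [[0.0] * 3 for _ in range(2)]
    print(markov_refine(backbone_logits, 2, table, no_markov))  # independent choice
    print(markov_refine(backbone_logits, 2, table, proj))       # with Markov bias
```

In this toy setup the independent prediction picks token 2 at the second position, while the Markov bias conditioned on the first drafted token flips the choice to token 1. The extra cost per position is one row lookup and one small matrix product, which is why a serial pass of this kind can stay cheap.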

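The delayed truncation rule described above can be sketched as a small scheduler. This is a minimal illustration under stated assumptions, not the scheduler DeepSeek ships: the class `DelayedTruncationScheduler`, the `min_survival` threshold, the fallback length, and the rule "keep positions while the cumulative confidence product stays above the threshold" are all hypothetical. Confidences are assumed to be already calibrated probabilities. The only property taken from the text is that the length for the current step is decided from predictions made two steps earlier, so the decision never waits on results that are still in flight.

```python
from collections import deque


class DelayedTruncationScheduler:
    """Chooses the draft length from confidences that are `delay` steps old."""

    def __init__(self, max_len, min_survival, delay=2, default_len=1):
        self.max_len = max_len
        self.min_survival = min_survival   # hypothetical load-dependent threshold
        self.delay = delay
        self.default_len = default_len
        self._history = deque(maxlen=delay)

    def next_length(self):
        """Draft length for the upcoming step.

        Call this before record() for the same step, so the oldest entry in
        the history is exactly `delay` steps old.
        """
        if len(self._history) < self.delay:
            return self.default_len        # not enough history yet
        stale = self._history[0]
        survival, keep = 1.0, 0
        for conf in stale[: self.max_len]:
            survival *= conf               # chance the whole prefix is accepted
            if survival < self.min_survival:
                break
            keep += 1
        return max(keep, 1)

    def record(self, confidences):
        """Store the calibrated per-position confidences of the finished step."""
        self._history.append(list(confidences))


if __name__ == "__main__":
    sched = DelayedTruncationScheduler(max_len=4, min_survival=0.5)
    steps = [[0.9, 0.9, 0.9, 0.9], [0.9, 0.8, 0.5, 0.5], [0.6, 0.6, 0.6, 0.6]]
    for t, confs in enumerate(steps):
        print(f"step {t}: draft length {sched.next_length()}")
        sched.record(confs)
    print(f"step {len(steps)}: draft length {sched.next_length()}")
```

Raising `min_survival` as load increases would shorten drafts exactly where the tail is least likely to be accepted, which is the behavior the article attributes to the scheduler. How the real system maps load to a threshold is not stated in the source.
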
In addition to DSpark, the open-sourced DeepSpec codebase has built-in support for open-source large models such as Qwen3 and Gemma. It provides a complete Python toolchain covering prompt download, reconstruction of the large model's cache, draft model training, and benchmark evaluation. Developers can use the open-source scripts directly to customize and deploy dedicated acceleration modules locally for various open-source large models.
