
Enabling "Compute-Optimized Training" for Large Models: NVIDIA's TwoTower Architecture Pairs Two 30B Models for a 2.4x Speedup with Almost No Loss in Quality

According to Dochat Beating monitoring, NVIDIA has open-sourced Nemotron-Labs-TwoTower, a discrete text diffusion architecture that targets the generation-speed bottleneck of large models, which can only emit "one word at a time." Earlier text diffusion models, in pursuit of parallel output, forced a single network to handle both unidirectional context understanding and bidirectional parallel error correction, which significantly degraded the model's capabilities.

TwoTower decouples these two roles into two towers. One tower is a pre-trained autoregressive large model that is completely frozen and serves as a "read-only context tower," preserving its full reasoning and common-sense capabilities. The other is a separately trained "denoising writing tower," which reads the context tower's information through layer-by-layer cross-attention.
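The two-tower data flow can be illustrated with a toy sketch. This is not NVIDIA's implementation: the function names (`context_tower`, `writer_tower`, `cross_attend`), the tiny vector sizes, and the use of `tanh` as a stand-in layer are all assumptions made for illustration. It only shows the idea that a frozen tower exposes per-layer states and each writer layer reads the matching layer through cross-attention.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query vector attends over the context vectors (used as keys and values)."""
    dim = len(context[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)])
    return out

def context_tower(context_states, num_layers):
    """Hypothetical stand-in for the frozen autoregressive tower.
    It returns one list of hidden states per layer and never updates anything."""
    per_layer = []
    h = context_states
    for layer in range(num_layers):
        h = [[math.tanh(x + 0.1 * (layer + 1)) for x in vec] for vec in h]
        per_layer.append(h)
    return per_layer

def writer_tower(block_states, context_per_layer):
    """Hypothetical stand-in for the denoising writing tower.
    Writer layer i reads context layer i via cross-attention (with a residual add)."""
    h = block_states
    for ctx in context_per_layer:
        attended = cross_attend(h, ctx)
        h = [[x + a for x, a in zip(vec, att)] for vec, att in zip(h, attended)]
    return h

# Toy usage: 3 context positions, a block of 2 positions to write, 4-dim states, 2 layers.
context = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5], [0.3, 0.3, -0.2, 0.1]]
block = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]]
layers = context_tower(context, num_layers=2)
written = writer_tower(block, layers)
```

In this sketch the context tower is computed once and only read afterwards, which mirrors the "read-only" role described above; a real system would use trained attention projections rather than raw dot products.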

The writing tower uses a "confidence masking" mechanism: when predicting a block, it first writes the high-confidence words and then gradually fills in the remaining blanks, so that parallel writing proceeds from easy to hard. On a 30B-class hybrid-architecture (Mamba-Transformer MoE) model, this design needs only 1/12 of the baseline model's pretraining data volume (2.1T tokens) for adaptation, retains 98.7% of the quality, raises actual generation speed by 2.42x, and adds no extra cache overhead in VRAM. Because both towers must stay resident in memory, the model's static VRAM usage is slightly higher, and minor accuracy degradation is still observed on extremely complex code and mathematical reasoning.
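The easy-to-hard filling order can be sketched as a small decoding loop. This is a minimal illustration under stated assumptions, not the released algorithm: `confidence_decode`, `toy_predict`, the `threshold` value, and the fallback of committing the single most confident position when nothing clears the threshold are all hypothetical choices, and the toy predictor simply returns fixed confidences that rise as more of the block is filled in.

```python
MASK = None  # marker for a position that has not been written yet

def confidence_decode(block_len, predict, threshold=0.9):
    """Fill a block in parallel rounds, committing high-confidence positions first.

    `predict(block)` must return {position: (token, confidence)} for every masked position.
    """
    block = [MASK] * block_len
    rounds = 0
    while any(tok is MASK for tok in block):
        proposals = predict(block)
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit every masked position whose confidence clears the threshold.
        accept = [i for i in masked if proposals[i][1] >= threshold]
        if not accept:
            # Assumed fallback: always commit at least the most confident position.
            accept = [max(masked, key=lambda i: proposals[i][1])]
        for i in accept:
            block[i] = proposals[i][0]
        rounds += 1
    return block, rounds

def toy_predict(block):
    """Hypothetical predictor: fixed targets, confidence grows as context fills in."""
    target = ["the", "cat", "sat", "on", "the", "mat"]
    base = [0.95, 0.65, 0.75, 0.92, 0.95, 0.5]
    filled = sum(tok is not MASK for tok in block)
    return {
        i: (target[i], min(1.0, base[i] + 0.1 * filled))
        for i, tok in enumerate(block)
        if tok is MASK
    }

tokens, rounds = confidence_decode(6, toy_predict)
```

With this toy predictor the six-token block is completed in three rounds instead of six one-token steps, which is the kind of saving the parallel writing is meant to deliver; the actual speedup depends on the real model's confidence distribution.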
