
Enabling "Compute-Optimized Training" for Large Models: NVIDIA's TwoTower Architecture Runs Two 30B Models Side by Side, Achieving a 2.4x Speedup with Near-Lossless Quality

According to monitoring by Dochat Beating, NVIDIA has open-sourced Nemotron-Labs-TwoTower, a discrete text diffusion architecture that targets the generation-speed bottleneck of large models that can only produce "one word at a time." Earlier text diffusion models, in pursuit of parallel output, forced a single network to handle both unidirectional context understanding and bidirectional parallel error correction, which significantly degraded the model's capabilities.

TwoTower uses a decoupled two-tower design. One tower is a pre-trained autoregressive large model, kept completely frozen as a read-only "context tower" to preserve its full reasoning and common-sense capabilities. The other is a separately trained "denoising writing tower" that reads context information from the first tower through layer-wise cross-attention.
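The layer-wise read path described above can be illustrated with a minimal sketch. It is not NVIDIA's implementation: the function names (`cross_attend`, `two_tower_forward`), the one-to-one pairing of layers, the residual addition, and the absence of learned projections are all assumptions made to keep the example small. It only shows the idea that the writing tower queries the frozen tower's hidden states without modifying them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(writer_states, context_states):
    """Single-head cross-attention with no learned projections (a simplification).

    Each writing-tower vector acts as a query over the frozen
    context-tower vectors, which serve as both keys and values.
    """
    dim = len(context_states[0])
    out = []
    for q in writer_states:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in context_states
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, context_states))
            for d in range(dim)
        ]
        out.append(mixed)
    return out


def two_tower_forward(context_layers, writer_states):
    """Hypothetical forward pass of the writing tower.

    context_layers holds per-layer hidden states from the frozen tower and is
    only read, never written. At each layer the writer adds the attended
    context to its own states (residual connection, assumed here).
    """
    for layer_states in context_layers:
        attended = cross_attend(writer_states, layer_states)
        writer_states = [
            [h + a for h, a in zip(hs, at)]
            for hs, at in zip(writer_states, attended)
        ]
    return writer_states
```

Because the context tower's states are only read, they can be computed once by the unmodified autoregressive model and reused by the writing tower.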

The writing tower uses a "confidence masking" mechanism: when predicting a block, it first writes the words it is most confident about, then gradually fills in the remaining blanks, so that parallel writing proceeds from easy to hard. On a 30B-scale hybrid-architecture (Mamba-Transformer MoE) model, this design requires only 1/12 of the baseline model's pretraining data volume (2.1T tokens) for adaptation, retains 98.7% of the baseline's quality, raises actual generation speed by 2.42x, and adds no extra VRAM cache overhead. Because both towers must stay resident in memory, the model's static VRAM usage increases slightly, and minor accuracy degradation is still observed on extremely complex code and mathematical reasoning tasks.
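The easy-to-hard fill order can be sketched as a small decoding loop. This is an illustration under stated assumptions, not the released decoder: `toy_denoiser` is a hypothetical stand-in that returns random guesses and confidences where a real writing tower would produce them from its logits, and the even per-pass quota is one simple schedule among many.

```python
import math
import random

MASK = None  # marks a position that has not been written yet


def toy_denoiser(block, vocab, rng):
    """Hypothetical stand-in for the writing tower.

    Returns {position: (token, confidence)} for every masked position.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def confidence_unmask(block_len, vocab, steps, seed=0):
    """Fill a block in at most `steps` parallel passes, most confident first."""
    rng = random.Random(seed)
    block = [MASK] * block_len
    history = []
    for step in range(steps):
        guesses = toy_denoiser(block, vocab, rng)
        if not guesses:
            break  # block already complete
        # Commit an even share of the remaining blanks, rounded up, so the
        # final pass always finishes the block.
        quota = math.ceil(len(guesses) / (steps - step))
        ranked = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:quota]:
            block[pos] = tok  # committed tokens are never revisited
        history.append(list(block))
    return block, history


if __name__ == "__main__":
    final, passes = confidence_unmask(8, ["the", "cat", "sat", "down"], steps=4)
    for i, snapshot in enumerate(passes, 1):
        print(i, ["_" if t is MASK else t for t in snapshot])
```

With 8 positions and 4 passes, each pass commits two tokens, so the block is finished in 4 model calls instead of the 8 that one-at-a-time decoding would need.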
