According to a report from Dynamic Beating, NVIDIA and an MIT research team have released a new large language model training framework called Lightning OPD (Offline Policy Distillation). By precomputing the teacher model's log-probabilities offline, the framework eliminates the real-time teacher inference that traditional distillation requires during training, yielding a 4x boost in training efficiency.
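The core idea can be illustrated with a short sketch. The snippet below is a minimal, hypothetical rendering of offline distillation in PyTorch, assuming the teacher's top-k log-probabilities have been cached to disk ahead of time; the function name, tensor layout, and the choice of a top-k forward-KL loss are illustrative assumptions, not the published Lightning OPD API.

```python
import torch
import torch.nn.functional as F

def offline_distill_loss(student_logits, teacher_topk_logprobs, teacher_topk_ids):
    """Distillation loss against cached teacher log-probs (hypothetical sketch).

    student_logits:        (batch, seq, vocab) from the live student forward pass
    teacher_topk_logprobs: (batch, seq, k) teacher log-probs precomputed offline
    teacher_topk_ids:      (batch, seq, k) vocabulary ids of those top-k tokens
    """
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # Look up the student's log-probs at the teacher's cached token ids.
    student_at_topk = student_logprobs.gather(-1, teacher_topk_ids)
    # Forward KL restricted to the cached top-k support (the truncated tail
    # of the teacher distribution is ignored in this simplified version).
    teacher_probs = teacher_topk_logprobs.exp()
    kl = (teacher_probs * (teacher_topk_logprobs - student_at_topk)).sum(-1)
    return kl.mean()

# Toy usage: batch=2, seq=4, vocab=100, k=8; in practice the cached tensors
# would be loaded from disk rather than computed here.
logits = torch.randn(2, 4, 100, requires_grad=True)
topk_lp, topk_ids = F.log_softmax(torch.randn(2, 4, 100), dim=-1).topk(8, dim=-1)
loss = offline_distill_loss(logits, topk_lp, topk_ids)
loss.backward()  # gradients flow only through the student; no teacher is loaded
```

Because the teacher never runs during training, the loss function touches only the student's parameters, which is what frees up the GPU memory the teacher would otherwise occupy.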
Previously, standard On-Policy Distillation (OPD) required running the student and teacher models simultaneously on the same hardware; as models grew larger, this approach frequently triggered out-of-memory (OOM) errors. Lightning OPD instead dedicates all GPU capacity to the student model. In a single-node test with 8 H100 GPUs, Lightning OPD successfully distilled the Qwen3-30B-A3B-Base model (an MoE model with 30 billion total parameters and 3 billion active parameters), reaching a score of 71.0 on the AIME 2024 benchmark; standard OPD on the same hardware configuration simply ran out of memory. At a smaller scale, with the Qwen3-8B model, the framework reached a score of 69.9 in just 30 GPU hours.
The research team also highlighted a hidden prerequisite for offline distillation in their paper: teacher consistency. The student must be trained against the same teacher model during both supervised fine-tuning (SFT) and the subsequent distillation stage. Violating this principle introduces gradient discrepancies between the two stages that ultimately degrade the final model's performance.
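In practice, a training pipeline can enforce this rule with a simple guard. The sketch below is a hypothetical configuration check, assuming teacher checkpoints are identified by name strings; it is not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DistillPipelineConfig:
    """Hypothetical pipeline config illustrating the teacher-consistency rule."""
    sft_teacher: str      # teacher that generated the SFT targets
    distill_teacher: str  # teacher whose log-probs are cached for distillation

    def __post_init__(self):
        # Enforce teacher consistency: a mismatched teacher shifts the target
        # distribution between stages, producing the gradient discrepancies
        # the paper warns about.
        if self.sft_teacher != self.distill_teacher:
            raise ValueError(
                "Teacher consistency violated: SFT and distillation must use "
                f"the same teacher ({self.sft_teacher!r} != {self.distill_teacher!r})"
            )

# Valid: both stages reference the same (hypothetical) checkpoint name.
cfg = DistillPipelineConfig(
    sft_teacher="teacher-ckpt-v1",
    distill_teacher="teacher-ckpt-v1",
)
```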
