DeepSeek V4 Post-Training Overhaul: OPD Replacing Hybrid RL, Over a Dozen Expert Models Distilled into One

According to monitoring by Insightful Beating, DeepSeek V4's training methodology has changed significantly: the hybrid RL phase used in V3.2 has been entirely replaced by On-Policy Distillation (OPD).

The new process has two steps. First, domain expert models are trained separately on the V3.2 pipeline for domains such as mathematics, code, agents, and instruction following, with each expert fine-tuned and then trained via reinforcement learning with GRPO. Second, multi-teacher OPD distills the abilities of more than a dozen experts into one unified model: on its own self-generated trajectories, the student performs full-vocabulary logit distillation against each teacher under a reverse KL divergence objective. Aligning at the logit level consolidates the experts' capabilities into a single parameter space, avoiding both traditional weight merging and the capability conflicts common in hybrid RL.
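The article does not give the exact loss formulation, so the following is a minimal NumPy sketch of the distillation objective it describes: per-token reverse KL divergence, KL(student ‖ teacher), computed over the full vocabulary from raw logits on the student's own trajectory, then averaged across teachers. All function names and array shapes here are illustrative assumptions.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def reverse_kl(student_logits, teacher_logits):
    # KL(p_student || p_teacher): mode-seeking direction used in on-policy
    # distillation, summed over the vocab, averaged over token positions.
    log_p = log_softmax(student_logits)   # shape (seq_len, vocab)
    log_q = log_softmax(teacher_logits)   # shape (seq_len, vocab)
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=-1).mean()

def multi_teacher_opd_loss(student_logits, teacher_logits_list):
    # Hypothetical multi-teacher aggregation: a plain average over teachers.
    # How DeepSeek actually weights or routes teachers is not specified.
    return float(np.mean([reverse_kl(student_logits, t)
                          for t in teacher_logits_list]))
```

In practice the student would backpropagate through this loss; reverse KL (unlike the forward KL of classic distillation) penalizes the student for placing probability mass where the teacher places little, which is why it pairs naturally with student-generated (on-policy) trajectories.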

The report also introduces a Generative Reward Model (GRM): for tasks that are hard to verify with rules, instead of training a traditional scalar reward model, a rubric-guided GRM is trained on RL data. This lets the actor network both generate and assess, generalizing to complex tasks from a small amount of diverse human annotation.
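The report does not describe the GRM's output format, so here is one hypothetical sketch of the final step: the GRM generates a free-text assessment containing one PASS/FAIL verdict per rubric item, which is parsed into the scalar reward the RL loop consumes. The verdict format and the `rubric_reward` helper are assumptions for illustration only.

```python
def rubric_reward(judge_output: str, rubric: list[str]) -> float:
    """Turn a GRM's generated per-item verdicts into a scalar reward.

    Assumes (hypothetically) one "item_name: PASS" or "item_name: FAIL"
    line per rubric item; reward is the fraction of items passed.
    """
    verdicts = [line.split(":")[1].strip()
                for line in judge_output.strip().splitlines()]
    assert len(verdicts) == len(rubric), "one verdict per rubric item"
    return sum(v == "PASS" for v in verdicts) / len(rubric)
```

For example, a two-item rubric where the generated assessment passes one item and fails the other would yield a reward of 0.5; unlike a scalar reward model, the GRM's generated rationale can also be logged for auditing.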
