According to monitoring by Insightful Beating, DeepSeek V4 marks a significant change in training methodology: the mixed RL phase used for V3.2 has been entirely replaced by On-Policy Distillation (OPD).
The new process has two steps. First, domain expert models are trained separately on the V3.2 pipeline for domains such as mathematics, code, agents, and instruction following, with each expert undergoing supervised fine-tuning followed by reinforcement learning with GRPO (Group Relative Policy Optimization). Second, multi-teacher OPD distills the abilities of more than a dozen experts into a single unified model: on its own self-generated trajectories, the student performs full-vocabulary logit distillation against each teacher using the reverse KL divergence, aligning at the logit level. This consolidates the experts' capabilities into one parameter space and avoids both traditional weight merging and the capability conflicts common in mixed RL. Sketches of both steps follow below.
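The report only names GRPO for the expert-training stage, so as a reference point, here is a minimal sketch of GRPO's defining piece: each prompt gets G sampled responses, and each response's reward is normalized against its own group instead of a learned critic. All names, shapes, and hyperparameters below are illustrative, not details from the report.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each reward against its own
    group of G sampled responses (no value/critic network is trained).

    rewards: (num_prompts, G) scalar rewards for G completions per prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate on per-token log-probs, with each
    response's group-relative advantage broadcast over its tokens.

    logp_new, logp_old: (batch, seq_len) token log-probs under the current
    and sampling policies; advantages: (batch,) one scalar per response.
    (Padding masks and the KL penalty to a reference model are omitted.)
    """
    ratio = torch.exp(logp_new - logp_old)             # importance ratio
    adv = advantages.unsqueeze(-1)                     # broadcast to tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```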
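The second step is the core change. Below is a minimal sketch of what "reverse KL, full-vocabulary logit distillation on self-generated trajectories" could look like, assuming PyTorch-style models and a simple route-by-domain mapping from prompts to frozen expert teachers; the routing scheme, `sample_fn`, and all function names are assumptions for illustration, not details from the report.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits):
    """Per-token reverse KL, KL(student || teacher), over the full vocabulary.

    Both logit tensors are (batch, seq_len, vocab), computed on the SAME
    student-generated trajectory: the student samples it, then both models
    score every token position.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    return (p_s * (log_p_s - log_p_t)).sum(dim=-1).mean()

def multi_teacher_opd_step(student, teachers, prompt_batches, sample_fn):
    """One hypothetical multi-teacher OPD step: the student generates its
    own trajectories, and each frozen domain expert supervises the prompts
    routed to its domain. `teachers` maps a domain tag to an expert model.
    """
    total = 0.0
    for domain, batch in prompt_batches.items():       # e.g. "math", "code"
        trajectories = sample_fn(student, batch)       # on-policy generation
        s_logits = student(trajectories)               # gradient flows here
        with torch.no_grad():
            t_logits = teachers[domain](trajectories)  # frozen expert teacher
        total = total + reverse_kl_distill_loss(s_logits, t_logits)
    return total
```

A note on the design choice: reverse KL, KL(student || teacher), is mode-seeking, so it pushes the student to commit to the teacher's high-probability behavior on the states the student actually visits, which is why it pairs naturally with on-policy sampling rather than teacher-generated data.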
The report also introduces the Generative Reward Model (GRM): for tasks that are difficult to validate with rules, instead of training a traditional scalar reward model, a GRM trained on rubric-guided RL data is used. This lets the actor network both generate and evaluate, generalizing to complex tasks with only a small amount of diverse human annotation.
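As an illustration of the rubric-guided GRM idea (this is not DeepSeek's actual prompt or interface; `generate_fn`, the rubric template, and the SCORE line format are all hypothetical), the reward is parsed out of a generated critique rather than read off a scalar value head:

```python
import re

RUBRIC_PROMPT = """You are grading a response against a rubric.
Rubric:
{rubric}

Question:
{question}

Response:
{response}

Write a short critique, then output a final line: SCORE: <0-10>."""

def grm_reward(generate_fn, rubric, question, response):
    """Hypothetical rubric-guided generative reward: the model writes a
    critique and a score as text. `generate_fn` is any text-generation
    callable (an assumed interface, not a real API).
    """
    critique = generate_fn(RUBRIC_PROMPT.format(
        rubric=rubric, question=question, response=response))
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", critique)
    if match is None:
        return 0.0                      # fall back when no score line found
    return float(match.group(1)) / 10.0 # normalize to [0, 1] for RL
```

Because the critique is generated text, the same model can serve as actor and judge, and new task types only require writing a rubric instead of collecting large volumes of preference labels.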
