According to monitoring by Insightful Beating, DeepSeek V4 marks a significant change in training methodology: the mixed RL phase used for V3.2 has been entirely replaced by On-Policy Distillation (OPD).
The new process has two steps. First, domain expert models are trained separately on the V3.2 pipeline for domains such as mathematics, code, agents, and instruction following, with each expert undergoing supervised fine-tuning followed by reinforcement learning with GRPO (Group Relative Policy Optimization). Second, multi-teacher OPD distills the abilities of more than a dozen experts into a single unified model: on its own self-generated trajectories, the student performs full-vocabulary logit distillation against each teacher using the reverse KL divergence, aligning at the logit level. This consolidates the experts' capabilities into one parameter space and avoids both traditional weight merging and the capability conflicts common in mixed RL. Sketches of both steps follow below.
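The report only names GRPO for the expert-training stage, so as a reference point, here is a minimal sketch of GRPO's defining piece: each prompt gets G sampled responses, and each response's reward is normalized against its own group instead of a learned critic. All names, shapes, and hyperparameters below are illustrative, not details from the report.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each reward against its own
    group of G sampled responses (no value/critic network is trained).

    rewards: (num_prompts, G) scalar rewards for G completions per prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate on per-token log-probs, with each
    response's group-relative advantage broadcast over its tokens.

    logp_new, logp_old: (batch, seq_len) token log-probs under the current
    and sampling policies; advantages: (batch,) one scalar per response.
    (Padding masks and the KL penalty to a reference model are omitted.)
    """
    ratio = torch.exp(logp_new - logp_old)             # importance ratio
    adv = advantages.unsqueeze(-1)                     # broadcast to tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```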
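The second step is the core change. Below is a minimal sketch of what "reverse KL, full-vocabulary logit distillation on self-generated trajectories" could look like, assuming PyTorch-style models and a simple route-by-domain mapping from prompts to frozen expert teachers; the routing scheme, `sample_fn`, and all function names are assumptions for illustration, not details from the report.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits):
    """Per-token reverse KL, KL(student || teacher), over the full vocabulary.

    Both logit tensors are (batch, seq_len, vocab), computed on the SAME
    student-generated trajectory: the student samples it, then both models
    score every token position.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    return (p_s * (log_p_s - log_p_t)).sum(dim=-1).mean()

def multi_teacher_opd_step(student, teachers, prompt_batches, sample_fn):
    """One hypothetical multi-teacher OPD step: the student generates its
    own trajectories, and each frozen domain expert supervises the prompts
    routed to its domain. `teachers` maps a domain tag to an expert model.
    """
    total = 0.0
    for domain, batch in prompt_batches.items():       # e.g. "math", "code"
        trajectories = sample_fn(student, batch)       # on-policy generation
        s_logits = student(trajectories)               # gradient flows here
        with torch.no_grad():
            t_logits = teachers[domain](trajectories)  # frozen expert teacher
        total = total + reverse_kl_distill_loss(s_logits, t_logits)
    return total
```

A note on the design choice: reverse KL, KL(student || teacher), is mode-seeking, so it pushes the student to commit to the teacher's high-probability behavior on the states the student actually visits, which is why it pairs naturally with on-policy sampling rather than teacher-generated data.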
The report also introduces the Generative Reward Model (GRM): for tasks that are difficult to validate with rules, instead of training a traditional scalar reward model, a GRM trained on rubric-guided RL data is used. This lets the actor network both generate and evaluate, generalizing to complex tasks with only a small amount of diverse human annotation.
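As an illustration of the rubric-guided GRM idea (this is not DeepSeek's actual prompt or interface; `generate_fn`, the rubric template, and the SCORE line format are all hypothetical), the reward is parsed out of a generated critique rather than read off a scalar value head:

```python
import re

RUBRIC_PROMPT = """You are grading a response against a rubric.
Rubric:
{rubric}

Question:
{question}

Response:
{response}

Write a short critique, then output a final line: SCORE: <0-10>."""

def grm_reward(generate_fn, rubric, question, response):
    """Hypothetical rubric-guided generative reward: the model writes a
    critique and a score as text. `generate_fn` is any text-generation
    callable (an assumed interface, not a real API).
    """
    critique = generate_fn(RUBRIC_PROMPT.format(
        rubric=rubric, question=question, response=response))
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", critique)
    if match is None:
        return 0.0                      # fall back when no score line found
    return float(match.group(1)) / 10.0 # normalize to [0, 1] for RL
```

Because the critique is generated text, the same model can serve as actor and judge, and new task types only require writing a rubric instead of collecting large volumes of preference labels.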
