
Large Model Post-Training Finding: "Same-Track Training" on Self-Generated Data Is the Key to Students Surpassing Teachers Without Degradation

According to DynaAware Beating monitoring, "on-policy sampling" during large model fine-tuning (that is, training the model on data it generates itself in real time) is a key strategy for preventing model degradation and improving problem-solving ability. On-Policy Distillation (OPD) and Reinforcement Learning (RL) outperform traditional Supervised Fine-Tuning (SFT) because they optimize the model on the steps it generates itself rather than having it memorize external reference answers by rote.
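
The difference in where the training data comes from can be sketched with a toy example. This is a minimal illustration under stated assumptions, not the setup used in the reported experiments: the three-token vocabulary, the context-free `student` distribution, and the helper names (`sample_sequence`, `on_policy_batch`, `support_coverage`) are all hypothetical.

```python
import random

# Hypothetical toy vocabulary and policy; none of these names come from the article.
VOCAB = ["a", "b", "c"]

def sample_sequence(policy, length, rng):
    """Draw `length` tokens from the policy's own token distribution.
    Simplification: the toy policy ignores context (no autoregression)."""
    weights = [policy[tok] for tok in VOCAB]
    return rng.choices(VOCAB, weights=weights, k=length)

def on_policy_batch(policy, n_sequences, length, rng):
    """On-policy data: sequences freshly sampled from the current model."""
    return [sample_sequence(policy, length, rng) for _ in range(n_sequences)]

def support_coverage(batch, policy):
    """Fraction of tokens in `batch` that the policy gives nonzero probability."""
    tokens = [tok for seq in batch for tok in seq]
    return sum(policy[tok] > 0 for tok in tokens) / len(tokens)

rng = random.Random(0)
student = {"a": 0.75, "b": 0.25, "c": 0.0}  # this student never produces "c"

# Off-policy (SFT-style) data: fixed reference answers written by someone else.
reference_batch = [["a", "c", "c"], ["b", "c", "a"]]
# On-policy data: whatever the student itself generates right now.
sampled_batch = on_policy_batch(student, n_sequences=4, length=3, rng=rng)

print(support_coverage(reference_batch, student))  # 0.5: half the tokens lie outside the student's support
print(support_coverage(sampled_batch, student))    # 1.0: sampled tokens always lie inside it
```

In the fixed reference batch, half of the tokens are ones the student would never write, whereas every token in the sampled batch is by construction something the student can already produce.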

SFT forces reference answers onto the model and spreads the update pressure evenly across every token, which easily disrupts the model's existing knowledge structure and leads to forgetting. In contrast, RL and OPD let the model search its own self-generated drafts and reinforce the best steps. This avoids the error accumulation in which one wrong token at the start sends the whole output off course, and it confines updates to regions the model already knows, preserving its original capabilities as far as possible.
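
A rough numerical sketch of this contrast at a single token position follows. The article does not specify the loss used for OPD; the reverse KL divergence KL(student || teacher) used here is one common formulation of on-policy distillation and is an assumption, as are the two toy distributions and the function names.

```python
import math

def sft_token_loss(student, reference_token):
    """Cross-entropy on an externally supplied token (SFT-style)."""
    return -math.log(student[reference_token])

def reverse_kl_terms(student, teacher):
    """Per-token contributions to KL(student || teacher) at one position."""
    return {
        tok: p * (math.log(p) - math.log(teacher[tok]))
        for tok, p in student.items()
        if p > 0
    }

# Hypothetical next-token distributions at a single position.
student = {"x": 0.9, "y": 0.099, "z": 0.001}
teacher = {"x": 0.6, "y": 0.39, "z": 0.01}

# SFT: the reference answer insists on "z", which the student finds very unlikely,
# so the loss (and the resulting update) at this token is large.
sft_loss = sft_token_loss(student, "z")

# Reverse-KL distillation: each token is weighted by how often the student itself
# produces it, so tokens the student almost never writes barely move the objective.
terms = reverse_kl_terms(student, teacher)
opd_loss = sum(terms.values())

print(f"SFT loss on reference token 'z': {sft_loss:.3f}")
print(f"reverse KL over student's own distribution: {opd_loss:.3f}")
print(f"contribution of 'z' to the reverse KL: {terms['z']:.4f}")
```

Under these assumed numbers, the SFT loss is dominated by a token far outside the student's habits, while the reverse-KL objective concentrates on the tokens the student actually tends to generate.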

In the "Minimum Code Edit" experiment, student models trained by on-policy distillation from an SFT teacher and from an RL teacher reached one-shot success rates for writing correct code (Pass@1) of 80.0% and 78.7%, respectively, surpassing their teacher models. Even when an SFT teacher was significantly "dumbed down" by excessive fine-tuning (its LiveCodeBench coding score fell from 0.320 to 0.286), the student distilled from it still scored 0.297, largely unaffected by the teacher's defects. This indicates that on-policy training can effectively filter out a teacher's bad habits.
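
To put the reported figures side by side, the short calculation below derives the relative changes from the numbers quoted above. It is plain arithmetic on the published scores; the variable names are illustrative, and no figure is used that is not in the text.

```python
def relative_change(before, after):
    """Change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before

# LiveCodeBench scores quoted in the article.
teacher_before = 0.320   # SFT teacher before excessive fine-tuning
teacher_after = 0.286    # SFT teacher after excessive fine-tuning
student_score = 0.297    # student distilled on-policy from the degraded teacher

# Pass@1 (in percent) of students distilled from the SFT and RL teachers.
pass1_student_from_sft = 80.0
pass1_student_from_rl = 78.7

teacher_drop = relative_change(teacher_before, teacher_after)
student_margin = student_score - teacher_after
student_gap = relative_change(teacher_before, student_score)
pass1_gap = pass1_student_from_sft - pass1_student_from_rl

print(f"teacher relative drop: {teacher_drop:.1%}")                      # -10.6%
print(f"student margin over degraded teacher: {student_margin:+.3f}")   # +0.011
print(f"student vs. original teacher: {student_gap:.1%}")               # -7.2%
print(f"Pass@1 gap between the two students: {pass1_gap:.1f} points")   # 1.3 points
```

The degraded teacher lost about a tenth of its score, while the student it trained ended up above the degraded teacher and about 7% below the teacher's original score.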

DeepSeek-V4 and GLM-5 have already introduced on-policy distillation to absorb the capabilities of expert models. When training experts, domains with clear right and wrong answers, such as code and mathematics, are better suited to RL, while creative and knowledge-based subjective tasks are better suited to on-policy distillation. The ultimate fine-tuning algorithm of the future will have to work within an on-policy training framework and find a new mechanism that combines the efficiency of distillation (high information density) with the objectivity of RL (unbiased updates).
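
The "information density" contrast can be illustrated with a small sketch. It assumes an exact-match check as the verifiable RL reward and a per-token log-probability gap as the distillation signal; both choices, the function names, and the log-probability values are hypothetical and not taken from DeepSeek-V4, GLM-5, or the experiments above.

```python
def exact_match_reward(model_answer, reference_answer):
    """Verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def per_token_teacher_signal(student_logprobs, teacher_logprobs):
    """Distillation signal: teacher-minus-student log-probability for each token."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# One self-generated sample (hypothetical tokens and log-probabilities).
tokens = ["x", "=", "4", "2"]
student_logprobs = [-0.1, -0.2, -1.5, -0.9]
teacher_logprobs = [-0.1, -0.1, -0.3, -0.4]

# RL with a verifiable reward: a single scalar for the whole sequence.
rl_feedback = [exact_match_reward("".join(tokens[2:]), "42")]

# On-policy distillation: one value per generated token.
opd_feedback = per_token_teacher_signal(student_logprobs, teacher_logprobs)

print(len(rl_feedback), "signal from RL for this sample")
print(len(opd_feedback), "signals from distillation for this sample")
```

The RL reward needs only a checker that can tell right from wrong, which is why it fits code and mathematics, while the distillation signal is denser but inherits whatever preferences the teacher has.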


On-Chain Activity

3h ago

The largest ETH short seller, known as “pension-usdt.eth,” has initiated a liquidation event, giving back $5.9 million in unrealized gains due to a price rebound.

3h ago

Limitless Labs has completed a $20 million Series A funding round to expand its "Physical AI" core model and precision manufacturing platform.

SpaceX's total market capitalization has surpassed Amazon's, making it the world's fifth-largest company.

SpaceX IPO Surges, Extending Gain to 9%

Insiders: Binance's Greece license application rejected, facing risk of losing EU service eligibility


HIP-3 US Stock Gainers: SPCX Leads Gains, Storage Semiconductor Sector Shows Strength

From Retail Investor to $7.7 Million Stock Market Whale, 'Stock Trading King' Turns $30,000 into Hundredfold Profit

Today's BTC largest long position reached $12.06 million, with the whale liquidation price at $61,900.