According to DynaAware Beating monitoring, "on-policy sampling" during large-model fine-tuning (i.e., training the model on data it generates itself during training) is a key strategy for preventing model degradation and improving problem-solving ability. On-Policy Distillation (OPD) and Reinforcement Learning (RL) outperform traditional Supervised Fine-Tuning (SFT) because they optimize the model on the steps it generates itself, rather than having it memorize external reference answers by rote.
SFT forces reference answers onto the model and spreads the update pressure evenly across every token, which can easily disrupt the model's existing knowledge structure and cause forgetting. In contrast, RL and OPD let the model search for and reinforce the best steps within its own generated drafts. This avoids the compounding errors of "one wrong token at the start, off course the rest of the way," and it also keeps updates within the region the model already knows, preserving its original capabilities as far as possible.
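The contrast between the two training signals can be made concrete with a toy example. The sketch below is illustrative only: the bigram tables `STUDENT` and `TEACHER`, the three-token vocabulary, and the function names are all hypothetical, and the on-policy signal is assumed to be the per-token log-probability gap between student and teacher (a single-sample estimate of the reverse KL divergence), which is one common way to set up on-policy distillation, not necessarily the one used in the work described here.

```python
import math
import random

VOCAB = ["a", "b", "c"]

# Hypothetical toy "models": next-token distributions keyed by the previous token.
STUDENT = {
    "<s>": {"a": 0.7, "b": 0.2, "c": 0.1},
    "a": {"a": 0.1, "b": 0.8, "c": 0.1},
    "b": {"a": 0.2, "b": 0.2, "c": 0.6},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}
# The teacher agrees with the student everywhere except after token "b".
TEACHER = {
    "<s>": {"a": 0.7, "b": 0.2, "c": 0.1},
    "a": {"a": 0.1, "b": 0.8, "c": 0.1},
    "b": {"a": 0.6, "b": 0.2, "c": 0.2},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}


def sample_sequence(model, length, rng):
    """Draw a sequence from the model itself (on-policy sampling)."""
    prev, seq = "<s>", []
    for _ in range(length):
        dist = model[prev]
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        seq.append(tok)
        prev = tok
    return seq


def sft_token_losses(student, reference):
    """SFT: cross-entropy on an external reference; every token contributes."""
    prev, losses = "<s>", []
    for tok in reference:
        losses.append(-math.log(student[prev][tok]))
        prev = tok
    return losses


def opd_token_signals(student, teacher, sampled):
    """On-policy distillation: log p_student - log p_teacher on the student's own tokens."""
    prev, signals = "<s>", []
    for tok in sampled:
        signals.append(math.log(student[prev][tok]) - math.log(teacher[prev][tok]))
        prev = tok
    return signals


if __name__ == "__main__":
    rng = random.Random(0)
    draft = sample_sequence(STUDENT, 3, rng)
    print("student draft:", draft)
    print("OPD signals:  ", opd_token_signals(STUDENT, TEACHER, draft))
    print("SFT losses:   ", sft_token_losses(STUDENT, ["c", "c", "a"]))
```

For the student draft `["a", "b", "c"]`, the first two signals are exactly zero because student and teacher agree there, and only the third token (where they disagree) receives a nonzero signal of log 3, roughly 1.10. The SFT losses on an external reference such as `["c", "c", "a"]` are all positive, so every position pushes on the model, including positions the student would rarely reach on its own.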
In the "Minimum Code Edit" experiment, student models trained by on-policy distillation from an SFT teacher and from an RL teacher reached one-shot success rates for writing correct code (Pass@1) of 80.0% and 78.7%, respectively, surpassing their teacher models. Even when an SFT teacher had degraded noticeably from excessive fine-tuning (its LiveCodeBench coding score dropped from 0.320 to 0.286), the student distilled from it still scored 0.297, higher than the degraded teacher. This suggests that on-policy training can filter out much of a teacher's bad habits.
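To put the LiveCodeBench figures in proportion, the short calculation below works only from the three scores quoted above. It assumes the scores are directly comparable with one another, which the text does not state explicitly; the variable names are illustrative.

```python
# Scores quoted in the text (LiveCodeBench).
before_overtuning = 0.320   # SFT model before excessive fine-tuning
after_overtuning = 0.286    # the same model after excessive fine-tuning
student_score = 0.297       # student distilled on-policy from the degraded model

# Size of the degradation, absolute and relative.
drop = before_overtuning - after_overtuning
relative_drop = drop / before_overtuning

# How far the student sits above the degraded model, and what share of the drop that covers.
student_margin = student_score - after_overtuning
share_recovered = student_margin / drop

# Remaining gap between the student and the pre-degradation score.
remaining_gap = (before_overtuning - student_score) / before_overtuning

print(f"relative drop from over-tuning: {relative_drop:.1%}")
print(f"student margin over degraded model: {student_margin:.3f}")
print(f"share of the drop recovered: {share_recovered:.1%}")
print(f"student gap to pre-degradation score: {remaining_gap:.1%}")
```

Under that assumption, the over-tuning costs about 10.6% of the original score, and the student makes up roughly a third of that loss while still sitting about 7.2% below the pre-degradation figure.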
DeepSeek-V4 and GLM-5 have already adopted on-policy distillation to merge in the capabilities of expert models. When training the experts, domains with clear right and wrong answers, such as code and mathematics, are better suited to RL, while creative and knowledge-based subjective tasks are better suited to on-policy distillation. The ultimate fine-tuning algorithm of the future will almost certainly need to operate within an on-policy training framework and find a new mechanism that combines the efficiency of distillation (high information density) with the objectivity of RL (unbiased updates).
