
Online Policy Distillation with Dreaming Simulation for Scalable End-to-End Learning of Novel Solutions

According to Dynam.AI Beating monitoring, large language models commonly struggle to absorb new knowledge durably after deployment. Current optimization techniques focus mainly on expanding the context window and speeding up retrieval, which only lets the model look information up temporarily within a single conversation. Once the dialogue ends, that knowledge is lost. The real bottleneck for continual learning lies not in retrieval speed but in how to write the experience gained in dialogue into the model's underlying weights.

Online Policy Self-Distillation (OPSD) offers a new path for updating weights. When the model faces a task, its "teacher state", which has access to the complete long context, generates a high-quality answer. The system then computes a dense supervision signal in the cloud: it measures the token-level probability difference between the base state (the student) and the teacher state and backpropagates it, so that the base model moves toward the higher-scoring teacher state.
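The token-level signal described above can be illustrated with a minimal sketch. It assumes the "probability difference" is measured as a forward KL divergence from teacher to student, which the article does not specify, and it updates a toy vector of student logits directly in place of real model weights. The function names (`token_kl`, `distillation_loss`, `sgd_step`) and the learning rate are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed divergence)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a sequence: one supervision value per token."""
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token

def sgd_step(teacher_logits, student_logits, lr=0.5):
    """One gradient step on the student logits.

    For KL(p || softmax(z)) the gradient with respect to z is softmax(z) - p,
    so the step pulls the student distribution toward the teacher's.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return [z - lr * (qi - pi) for z, qi, pi in zip(student_logits, q, p)]

if __name__ == "__main__":
    # Toy 3-token vocabulary, 2 positions; values are illustrative only.
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
    student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    before, _ = distillation_loss(teacher, student)
    student = [sgd_step(t, s) for t, s in zip(teacher, student)]
    after, _ = distillation_loss(teacher, student)
    print(f"loss before: {before:.4f}, after one step: {after:.4f}")
```

Every token position contributes its own loss term, which is what makes the signal dense compared with a single score for the whole answer.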

Compared with supervised fine-tuning (SFT), which forces the model to memorize entire dialogue texts, self-distillation extracts only the decision-making experience needed to maintain performance. This extremely sparse parameter update can prevent catastrophic forgetting, so the model's original common sense is not overwritten.

Another, more forward-looking learning path is Dreaming Simulation. When facing complex tasks, the model spends substantial inference-time compute on self-play, rehearsing scenarios internally. Based on patterns observed in daily use, the model automatically constructs a virtual simulator and runs tens of thousands of task rehearsals inside it. When a rehearsal succeeds, the system records the successful trajectory as training material and uses it to update the base model's underlying weights. Unlike lightweight compression, which only produces short summaries, Dreaming Simulation spends massive cloud compute on repeated rehearsal and represents a fourth dimension of scaling for large models.
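The rehearse, filter, and distill loop described above can be sketched on a toy problem. Everything here is an illustrative assumption: the "simulator" is a 1-D line on which the agent must reach a target position, the policy is uniformly random, and the "weight update" is replaced by a table of action preferences estimated from successful trajectories. The names `rehearse`, `dream`, and `distill_action_preferences` are hypothetical.

```python
import random

def rehearse(rng, target=3, max_steps=6):
    """Run one imagined episode; return (success, trajectory).

    The trajectory is a list of (state, action) pairs, with action -1 or +1.
    """
    position, trajectory = 0, []
    for _ in range(max_steps):
        action = rng.choice([-1, 1])
        trajectory.append((position, action))
        position += action
        if position == target:
            return True, trajectory
    return False, trajectory

def dream(n_rehearsals=10_000, seed=0):
    """Run many rehearsals and keep only the successful trajectories."""
    rng = random.Random(seed)
    successes = []
    for _ in range(n_rehearsals):
        ok, trajectory = rehearse(rng)
        if ok:
            successes.append(trajectory)
    return successes

def distill_action_preferences(successes):
    """Stand-in for a weight update: estimate P(action = +1 | state)
    from the successful trajectories only."""
    counts = {}
    for trajectory in successes:
        for state, action in trajectory:
            right, total = counts.get(state, (0, 0))
            counts[state] = (right + (action == 1), total + 1)
    return {state: right / total for state, (right, total) in counts.items()}

if __name__ == "__main__":
    successes = dream()
    prefs = distill_action_preferences(successes)
    print(f"{len(successes)} successful rehearsals")
    for state in sorted(prefs):
        print(f"state {state}: P(move right) = {prefs[state]:.2f}")
```

Because failed rehearsals are discarded, the distilled preferences lean toward actions that led to the goal, even though the rehearsing policy itself was random. A real system would replace the preference table with gradient updates to the base model.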

It is projected that between 2027 and 2028, AI agents will undergo a work evaluation after collaborating with humans for one week. Once an agent passes, the system can distill that week's accumulated practical experience into the model's underlying weights in the cloud, via Online Policy Self-Distillation (OPSD) or Dreaming Simulation. This extends the model's capabilities online after deployment, so that it gets smarter the more it is used.
