Planning at Inference brings AlphaGo-style search to video generation, producing longer and more coherent videos

According to Watchful AI monitoring, researchers from the University of Waterloo, Brown University, and other institutions have proposed an inference-time scaling framework called Planning at Inference in a paper submitted to ICLR 2026. For the first time, they applied Monte Carlo Tree Search (MCTS), the search algorithm popularized by AlphaGo, to long video generation, modeling the task as a sequential decision problem. At inference time, the system uses look-ahead rollouts and reward backpropagation to evaluate multiple candidate video continuations, targeting the semantic drift and error accumulation that are common in traditional chunked or one-shot generation.
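The search loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: video chunks are replaced by random numbers, and `propose_chunks`, `reward`, `rollout`, and `generate` are hypothetical stand-ins for sampling continuations from a video model and scoring them. Only the four MCTS steps (selection, expansion, look-ahead rollout, reward backpropagation) are meant to carry over.

```python
import math
import random


class Node:
    def __init__(self, state, parent=None):
        self.state = state          # tuple of chunks generated so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        # Unvisited nodes are explored first.
        if self.visits == 0:
            return float("inf")
        mean = self.value_sum / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def propose_chunks(state, rng, k=3):
    """Hypothetical stand-in for sampling k continuations from a video model."""
    return [rng.uniform(-1.0, 1.0) for _ in range(k)]


def reward(state):
    """Toy reward (assumption): highest when accumulated 'drift' stays near zero."""
    return 1.0 / (1.0 + abs(sum(state)))


def rollout(state, horizon, rng):
    """Look-ahead: cheaply extend the sequence to the horizon, then score it."""
    state = list(state)
    while len(state) < horizon:
        state.append(rng.uniform(-1.0, 1.0))
    return reward(state)


def mcts_next_chunk(prefix, horizon, iterations, rng):
    if len(prefix) >= horizon:
        raise ValueError("prefix already reaches the horizon")
    root = Node(tuple(prefix))
    for _ in range(iterations):
        node = root
        # Selection: follow the highest-UCB child down to a leaf.
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: attach candidate continuations unless at the horizon.
        if len(node.state) < horizon:
            for chunk in propose_chunks(node.state, rng):
                node.children.append(Node(node.state + (chunk,), parent=node))
            node = rng.choice(node.children)
        # Simulation: look-ahead rollout from the chosen node.
        value = rollout(node.state, horizon, rng)
        # Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Commit to the most-visited first continuation.
    best = max(root.children, key=lambda n: n.visits)
    return best.state[-1]


def generate(horizon, iterations=200, seed=0):
    """Build a sequence chunk by chunk, re-planning before each commit."""
    rng = random.Random(seed)
    video = []
    while len(video) < horizon:
        video.append(mcts_next_chunk(video, horizon, iterations, rng))
    return video
```

Calling `generate(5)` returns five toy chunks. In a real system each chunk would be a video segment and the reward would come from a learned or rule-based scorer, neither of which this article specifies.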

To explore the continuous video generation space efficiently, the research team designed a Multi-Tree MCTS variant. Compared with a single search tree under the same fixed computational budget, the multi-tree architecture covers the continuous state space more broadly while keeping pruning and branching factors reasonable, significantly improving exploration efficiency. Planning at Inference is also highly modular and works as a plug-and-play inference-time optimization: developers can deploy it on top of existing video generation frameworks without retraining or fine-tuning the underlying large models.
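The article does not spell out how the trees are built or how the budget is shared, so the following is only one possible reading of the multi-tree idea: a fixed rollout budget is split evenly across several small, independently sampled trees with a narrow branching factor, and the best result across trees wins. Every name here (`search_tree`, `multi_tree_search`, the toy `reward`) is hypothetical, and each tree is kept one level deep for brevity.

```python
import math
import random


def reward(chunks):
    # Toy score (assumption): highest when accumulated drift stays near zero.
    return 1.0 / (1.0 + abs(sum(chunks)))


def search_tree(prefix, branching, budget, horizon, rng):
    """One small tree: `branching` sampled candidates, chosen among by UCB1."""
    children = [rng.uniform(-1.0, 1.0) for _ in range(branching)]
    visits = [0] * branching
    totals = [0.0] * branching
    for t in range(1, budget + 1):
        def ucb(i):
            if visits[i] == 0:
                return float("inf")   # try every candidate once first
            return totals[i] / visits[i] + math.sqrt(2 * math.log(t) / visits[i])

        i = max(range(branching), key=ucb)
        # Random look-ahead to the horizon after the candidate chunk.
        tail = [rng.uniform(-1.0, 1.0) for _ in range(horizon - len(prefix) - 1)]
        totals[i] += reward(list(prefix) + [children[i]] + tail)
        visits[i] += 1
    best = max(range(branching), key=lambda i: visits[i])
    return children[best], totals[best] / max(visits[best], 1)


def multi_tree_search(prefix, n_trees, branching, total_budget, horizon, seed=0):
    """Split a fixed rollout budget evenly across independent small trees."""
    rng = random.Random(seed)
    per_tree = total_budget // n_trees
    results = [
        search_tree(prefix, branching, per_tree, horizon, rng)
        for _ in range(n_trees)
    ]
    # Keep the candidate chunk with the highest mean rollout reward.
    return max(results, key=lambda r: r[1])
```

With `multi_tree_search([], n_trees=4, branching=3, total_budget=120, horizon=5)`, twelve distinct candidates are sampled in total but each tree only has to discriminate among three, which is the intuition behind keeping the branching factor manageable in a continuous space.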

In experiments on Cosmos-Predict2, NVIDIA's open-source video prediction model, Planning at Inference generated high-quality, coherent videos longer than 20 seconds. Test data indicate that on core metrics such as object persistence, temporal coherence, and text-video alignment, MCTS-based generation significantly outperforms baseline methods such as Greedy Search, Beam Search, and Best-of-N. Compared with leading closed-source models, its videos are 18% longer than Sora's and 47% longer than Kling's, with comparable image sharpness and visual fidelity.

The visual coherence comes at a price: multi-tree search at inference time incurs high computational costs. The researchers acknowledge that Planning at Inference is noticeably slower than traditional direct autoregressive generation, which limits real-time deployment for now. Still, as video generation frameworks become more efficient and hardware continues to improve, inference-time scaling, which trades compute for visual quality, is expected to become a key path toward practical long video generation once underlying model capabilities pass a certain threshold.
