A team from Microsoft Research and Zhejiang University has proposed World-R1, which uses reinforcement learning to teach a video model 3D geometric consistency without modifying the model architecture or relying on 3D datasets. The core idea: after a video is generated, the pre-trained 3D foundation model Depth Anything 3 reconstructs the scene as 3D Gaussians (3DGS), which are rendered from a novel viewpoint and compared against the original video. The reconstruction error, trajectory deviation, and novel-view semantic plausibility (rated by Qwen3-VL) are combined into a reward signal, which is fed back to the video model through Flow-GRPO (a reinforcement learning algorithm adapted for flow-matching models).
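As a rough illustration of how the three signals could combine into a single scalar reward, here is a minimal sketch. The function name, weights, and sign conventions are illustrative assumptions, not the paper's actual formulation; only the three ingredients (reconstruction error, trajectory deviation, semantic rating) come from the text.

```python
def world_r1_reward(recon_error: float,
                    traj_deviation: float,
                    semantic_score: float,
                    w_recon: float = 1.0,
                    w_traj: float = 0.5,
                    w_sem: float = 0.5) -> float:
    """Combine the three signals described in the text into one reward.

    Lower reconstruction error and trajectory deviation are better;
    a higher semantic plausibility score (e.g. a 0-1 VLM rating) is
    better. Weights here are arbitrary placeholders.
    """
    return (-w_recon * recon_error
            - w_traj * traj_deviation
            + w_sem * semantic_score)
```

A policy-gradient method like Flow-GRPO would then use this scalar to weight updates to the flow-matching video model.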
The base model is the open-source Wan 2.1 (1.3B and 14B), from which World-R1-Small and World-R1-Large are trained. The training data consists of only about 3,000 text-only prompts generated by Gemini, with no 3D assets. During training, a round of "dynamic fine-tuning" is inserted every 100 steps, temporarily turning off the 3D reward and keeping only the image-quality reward, to prevent the model from suppressing non-rigid dynamics such as human motion in pursuit of geometric rigidity.
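The alternating schedule can be sketched as a simple gating function. The 100-step interval and the two reward types come from the text; the function shape and names are illustrative assumptions.

```python
def active_rewards(step: int, interval: int = 100) -> dict:
    """Return which reward terms are active at a given training step.

    Every `interval` steps a "dynamic fine-tuning" round runs, during
    which the 3D-consistency reward is paused and only the image-quality
    reward drives the update (a sketch of the schedule described above).
    """
    dynamics_round = step > 0 and step % interval == 0
    return {
        "image_quality": True,                  # always on
        "3d_consistency": not dynamics_round,   # paused on dynamics rounds
    }
```

The idea is that periodically optimizing for image quality alone gives non-rigid motion a chance to survive, since a pure geometric-consistency objective would favor frozen, rigid scenes.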
On 3D-consistency metrics, World-R1-Large's PSNR (peak signal-to-noise ratio) is 7.91 dB higher than the Wan 2.1 14B baseline, and the Small version is 10.23 dB higher. General video quality on VBench improves rather than degrades. In a blind test with 25 participants, the geometric-consistency win rate is 92% and the overall preference rate is 86%. The code has been open-sourced on GitHub under the CC BY-NC-SA 4.0 license.
