According to monitoring by Perceiving Beat, Xiaomi Auto has officially unveiled the Xiaomi EV World Model, a new world-modeling framework for assisted driving. For the first time, the framework deeply couples its 3D reconstruction and video generation modules internally. In traditional autonomous driving simulation, reconstruction and generation are usually handled separately: the reconstruction module can restore a scene but cannot predict how it will change, while the generation module can predict the future but is prone to distortion and drift over long sequences. The team proposes the JointWM architecture, which uses a 3D geometric structure as a physical skeleton to anchor the scene, then relies on the generation module to complete visual details and predict unobserved areas. The architecture has set multiple new performance records on mainstream benchmarks such as Waymo and nuScenes.
In terms of mechanism, the reconstruction module, WorldRec, abandons the traditional per-pixel paradigm and represents the scene with sparse 3D query points, which are incrementally fused into a cross-view 4D Gaussian spatial skeleton. This allows a 10-second video to be reconstructed in 10 seconds. Building on the geometric priors supplied by the reconstruction module, the generation module, WorldGen, is constrained by the skeleton's physical boundaries and is responsible only for generating plausible lighting and textures. For content outside the observations, namely future frames and field-of-view blind spots, the generation module performs physical prediction through two-stage temporal training and distribution-matching distillation. On an H20 GPU, the full architecture reaches a generation speed of 0.19 seconds for a single view and 0.46 seconds for three views, and supports video generation of up to 1 minute.
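The division of labor described above, where geometry anchors what was observed and generation fills in what was not, can be illustrated with a deliberately simplified toy. This is not the JointWM implementation: the function name `fill_unobserved`, the flat pixel lists, and the boolean mask are all assumptions made for illustration, and the real generation module is also described as producing lighting and textures within observed regions.

```python
def fill_unobserved(rendered, observed_mask, generated):
    """Toy compositing step (hypothetical, for illustration only).

    rendered:      pixel values rendered from the reconstructed geometry
    observed_mask: True where the geometry covers the pixel, False in blind spots
    generated:     pixel values proposed by a generative model

    Pixels covered by geometry are kept as-is; only uncovered pixels
    are taken from the generated output.
    """
    if not (len(rendered) == len(observed_mask) == len(generated)):
        raise ValueError("all inputs must have the same length")
    return [r if seen else g
            for r, seen, g in zip(rendered, observed_mask, generated)]


# The middle pixel is a blind spot, so it alone comes from the generator.
frame = fill_unobserved([1, 2, 3], [True, False, True], [9, 8, 7])
```

The point of the sketch is only that the generator's freedom is restricted to regions the geometry cannot explain, which is the intuition behind anchoring the scene to a skeleton.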
The solution achieved a PSNR of 28.48 in the Waymo reconstruction accuracy test and holds a leading position in zero-shot generalization on nuScenes. In generation efficiency, it is 5.6 times faster than the autoregressive baseline Epona, and it ranks near the top among comparable algorithms in spatiotemporal consistency. The research has already been deployed in three major Xiaomi Auto scenarios: delivering more than 100,000 clips of high-quality synthetic data for perception model training, building a highly realistic closed-loop simulation environment to reproduce long-tail road conditions, and launching an assisted driving academy that provides generative video guidance for user operations.
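For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of the standard peak signal-to-noise ratio definition, PSNR = 10 · log10(MAX² / MSE). The function name `psnr` and the flat-list image representation are assumptions for illustration; the exact evaluation protocol behind the reported number is not described in the source.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Standard PSNR in decibels between two equally sized pixel sequences.

    max_value is the largest possible pixel intensity
    (1.0 for normalized images, 255 for 8-bit images).
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the ground-truth and rendered pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] intensity scale gives about 20 dB.
score = psnr([0.0, 0.0], [0.1, 0.1])
```

Under this definition, higher values mean the rendered frame is closer to the ground truth; a PSNR of 28.48 would correspond to a mean squared error of roughly 0.0014 for intensities scaled to [0, 1].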
