According to Dynamic Beating monitoring, video generation powerhouse Sand.ai (founded in January 2024) has announced the completion of two financing rounds totaling over a billion dollars. Investors include Look Capital, Lollapalooza Capital (the family office of Wang Huiwen), Nine Heaven Ventures, Wei Capital (MSA Capital), Innovation Works, Source Code Capital, IDG, Baidu Ventures, and several other top-tier institutions. StarrySky Capital served as financial advisor for this round of financing.
Sand.ai's founder, Cao Yue, said in an interview that the team has stayed committed to the non-consensus autoregressive approach to video generation rather than the mainstream diffusion approach. The company's previously released Magi-1 model has consistently ranked first on the leaderboard of Physics-IQ, Google DeepMind's benchmark for physical realism.
To break the "impossible triangle" of cost, speed, and quality in video generation, Sand.ai shifted its focus last year to the MoE (Mixture of Experts) architecture. The company plans to release a new generation of MoE-based video generation models in July 2026 (Q3), aiming to combine efficient inference with the largest parameter scale currently available among open-source models. The model will be open-sourced.
On the commercial side, Sand.ai follows a dual-drive strategy in which models and products advance together. Its music Agent product VidMuse, launched in January of this year, reached $10 million in ARR (annual recurring revenue) within just two months. In addition, its open-source MagiAttention operator library has been used by almost all domestic multimodal model teams and is officially recommended by NVIDIA.
On the industry's hotly debated concept of the "world model," Cao Yue believes the field is still in its pre-GPT era (before GPT-1 appeared), with neither the data nor the technical path having converged. He argues that video is the most crucial data modality for progress toward a world model, and that models should learn physical laws on their own by predicting the raw observations in video (pixels and frames), rather than relying on human priors to explicitly model state variables.
