Sand.ai Secures Over $100 Million in Funding: Committed to Autoregressive Video Generation, Plans to Release Open-Source MoE Large Model in July

According to Dynamic Beating monitoring, video generation company Sand.ai (founded in January 2024) has announced the completion of two rounds of financing totaling over $100 million. Investors include Look Capital, Lollapalooza Capital (the family office of Wang Huiwen), Nine Heaven Ventures, Wei Capital (MSA Capital), Innovation Works, Source Code Capital, IDG, Baidu Ventures, and several other top-tier institutions. StarrySky Capital served as financial advisor for the financing.

Sand.ai's founder, Cao Yue, stated in an interview that the team has stayed committed to the non-consensus autoregressive approach to video generation rather than the mainstream diffusion approach. Its previously released Magi-1 model has consistently ranked first on Google DeepMind's Physics-IQ leaderboard for physical realism.

To break through the "impossible triangle" of cost, speed, and quality in video generation, Sand.ai shifted its focus last year to exploring the MoE (Mixture of Experts) architecture. The company plans to release a new generation of video generation model built on MoE in July 2026 (Q3), aiming to combine efficient inference with the largest parameter scale in the current open-source field, and will open-source the model.

On commercialization, Sand.ai follows a dual-engine strategy in which model and product drive each other. Its music Agent product VidMuse, launched in January of this year, reached $10 million in ARR in just two months. In addition, its open-source MagiAttention operator library has been used by almost all domestic multimodal model teams and is officially recommended by NVIDIA.

Regarding the industry's hotly debated concept of the "world model," Cao Yue believes the field is still in its pre-GPT era (before GPT-1 appeared), with neither data nor technical approaches having converged. He argues that video is the most crucial data modality for advancing toward a world model, and that models should learn physical laws on their own by predicting the raw observations in video (pixels/frames), rather than relying on human priors to model state variables explicitly.
