According to 1M AI News monitoring, an anonymous model named HappyHorse-1.0 topped the Video Arena leaderboard of AI evaluation platform Artificial Analysis last week, taking first place in both the text-to-video and image-to-video tracks (no-audio category) and pushing ByteDance's Seedance 2.0 into second place. In the audio category, Seedance 2.0 still holds a slight lead. There has been no press release, technical blog post, or company attribution, and to date no one has publicly claimed ownership.
The Video Arena ranking is based on blind Elo testing: users vote for their preferred video in pairwise comparisons between two generated clips without knowing the models' identities. HappyHorse has spent less time on the leaderboard, with roughly 3,500 comparisons, less than half of Seedance 2.0's sample, and a correspondingly broad confidence interval (±12-13 points). Even so, its leads in the no-audio tracks (about 76 Elo points in text-to-video and 48 in image-to-video) significantly exceed that margin of error.
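For readers unfamiliar with how blind pairwise votes turn into a leaderboard, the Elo mechanism can be sketched in a few lines. This is a generic illustration, not Artificial Analysis's actual code; the K-factor and starting rating are assumptions, as arena operators do not always publish their exact parameters.

```python
# Minimal sketch of how blind pairwise votes become Elo ratings.
# K-factor and starting rating are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 16.0):
    """Return both models' updated ratings after one blind vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Example: two models start at 1000; model A wins one vote.
ra, rb = elo_update(1000.0, 1000.0, a_won=True)
print(round(ra, 1), round(rb, 1))  # 1008.0 992.0
```

With few comparisons, each vote moves the rating noticeably, which is why a ~3,500-vote sample carries the wide ±12-13 point interval noted above.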
Based on the official website's language ordering (Chinese and Cantonese ahead of English) and the name's nod to the "HappyHorse" meme of the 2026 Year of the Horse, industry observers speculate that the model comes from a Chinese team. Two main theories have emerged:
1. Several independent industry media accounts claim the model comes from Alibaba Taotian Group's Future Life Lab, led by Director Zhang Di. Zhang Di previously served as Vice President of Technology at Kuaishou, where he led R&D of the Kling AI video model from 2024 and shipped Kling 2.0 Master Edition in April 2025, before returning to Alibaba in November of the same year.
2. X user Vigo Zhao ran a detailed comparison and found that HappyHorse matches daVinci-MagiHuman, open-sourced in March of this year by AI video startup Sand.ai, on multiple benchmark metrics; the two official websites are also structured very similarly. Sand.ai was founded by Cao Yue, first author of the Swin Transformer paper, and is known in the industry as the "DeepSeek of AI video."
According to the HappyHorse official website, the model has 15 billion parameters in a 40-layer self-attention Transformer, uses the Transfusion architecture (unifying autoregressive text prediction and diffusion-based video-audio generation in a single model), runs 8-step inference, and outputs 1080p video with synchronized audio. It supports lip-sync in seven languages (Chinese, English, Japanese, Korean, German, French, and Cantonese) and is fully open-source and licensed for commercial use.
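The claimed 8-step inference points to a few-step distilled diffusion sampler. As a rough illustration only (not HappyHorse's published code; the denoiser, latent shape, and noise schedule below are all stand-in assumptions), an 8-step Euler-style sampler over a video latent might look like:

```python
import numpy as np

# Illustrative 8-step diffusion sampling loop over a video latent.
# `denoise` is a placeholder for the real learned model; the shapes and
# the linear sigma schedule are assumptions, not HappyHorse's details.

def denoise(x: np.ndarray, sigma: float) -> np.ndarray:
    """Stand-in denoiser: returns an estimate of the clean latent."""
    return x / (1.0 + sigma)

def sample(shape=(8, 4, 16, 16), steps=8, seed=0) -> np.ndarray:
    """Euler sampler: start from Gaussian noise, take `steps` denoising steps."""
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(1.0, 0.0, steps + 1)  # simple linear noise schedule
    x = rng.standard_normal(shape) * sigmas[0]
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x0_pred = denoise(x, s_cur)       # model's estimate of the clean latent
        d = (x - x0_pred) / s_cur         # direction toward the data manifold
        x = x + d * (s_next - s_cur)      # Euler step to the next noise level
    return x

latent = sample()  # e.g. 8 frames x 4 channels x 16 x 16 toy latent
print(latent.shape)
```

The point of distilling to so few steps is latency: a standard diffusion sampler may need dozens of denoiser calls per clip, while 8 steps means 8 forward passes of the (hypothetically) 15B-parameter model per generation.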
