According to DolphinBeat monitoring, ByteDance (ByteDance Research) has officially open-sourced Lance, a native unified multimodal model. It is a lightweight model with only 3B activated parameters that handles image and video understanding, generation, and editing within a single framework.
Mainstream unified models currently rely heavily on scaling up parameters or on adopting a ViT architecture. Lance instead explores a low-compute, collaborative path. The research team trained the model entirely from scratch and kept the hardware budget for the whole training run to 128 A100 GPUs.
To address conflicts between modalities and tasks, Lance enforces two hard separations in its architecture:
- It employs a dual-stream Mixture-of-Experts (MoE) architecture to handle interleaved multimodal sequences, sharing the underlying context while decoupling the computation paths for understanding and generation.
- It introduces modality-aware rotary position encoding, which directly mitigates interference between the visual tokens of the heterogeneous image and video modalities.
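The two separations above can be illustrated with a toy sketch. It is not Lance's implementation: the report does not say how the streams are routed or how the position encoding is made modality-aware. The sketch assumes that tokens carry a stream tag selecting one of two stand-in expert functions, and that each modality gets a fixed position offset before a standard rotary rotation. Every name here (`MODALITY_OFFSET`, `rope_rotate`, `modality_aware_rope`, `dual_stream_ffn`) and the offset values are hypothetical.

```python
import math

# Hypothetical per-modality position offsets (an assumption for illustration,
# not values published for Lance).
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position encoding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of dimensions is rotated by its own angle.
        theta = position / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def modality_aware_rope(vec, index, modality):
    """Assumed variant: shift the position by a modality offset, then rotate."""
    return rope_rotate(vec, index + MODALITY_OFFSET[modality])


def understanding_expert(vec):
    # Stand-in for the understanding-path feed-forward expert.
    return [2.0 * v for v in vec]


def generation_expert(vec):
    # Stand-in for the generation-path feed-forward expert.
    return [v + 1.0 for v in vec]


def dual_stream_ffn(tokens):
    """Route each (stream, vector) token to the expert for its stream.

    In a real model the attention context would be shared across both
    streams; only the feed-forward computation is decoupled here.
    """
    experts = {"understand": understanding_expert, "generate": generation_expert}
    return [experts[stream](vec) for stream, vec in tokens]
```

Under these assumptions, an image token and a video token at the same sequence index receive different rotary phases, and understanding and generation tokens never pass through the same expert weights, even though they sit in the same interleaved sequence.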
The aggressive compute reduction has not lowered the performance ceiling. With only 3B activated parameters, Lance leads existing open-source unified models on the majority of image and video generation and editing benchmarks. Through multi-task collaboration, it demonstrates a low-cost route that balances generation and semantic understanding at a small parameter count.
