BlockBeats News, April 29th. NVIDIA has officially launched Nemotron 3 Nano Omni, the newest member of the Nemotron 3 series, which consolidates multimodal reasoning into a single efficient open-source model. NVIDIA noted that agentic systems typically need a single perception-to-action loop spanning screens, documents, audio, video, and text, yet still rely on a fragmented chain of models with separate vision, audio, and text stacks. This fragmentation increases the number of inference hops and the complexity of orchestration, driving up inference cost while weakening cross-modal contextual consistency. Nemotron 3 Nano Omni is intended to replace this fragmented vision-language-audio stack, serving as a multimodal perception and context sub-agent within agentic systems.
On accuracy, Nemotron 3 Nano Omni took the top score on the document intelligence leaderboard and also led the video and audio understanding leaderboards. In video-understanding evaluations on the open industry benchmark MediaPerf, it delivered the highest throughput on every task and the lowest inference cost on video-level annotation tasks.
On performance, at a fixed per-user interactivity threshold, Nemotron 3 Nano Omni sustained higher total system throughput: up to roughly 9.2 times the effective system capacity of other open-source omni models on video inference, and up to roughly 7.4 times on multi-document inference. NVIDIA said the model is intended to replace traditional multi-model pipeline architectures, reduce inference complexity and cost, and drive adoption of multimodal AI in finance, healthcare, research, media, and other scenarios.
