Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
Tags: models, research, infrastructure
Source: HuggingFace Date: 2026-04-28 URL: https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence
Summary
NVIDIA Nemotron 3 Nano Omni is an open-weight hybrid Mamba-Transformer-MoE model with 30B total parameters and 3B active (128 experts, top-6 routing). It is natively multimodal: text, vision, audio, and video are fused in the backbone. Reported figures: OSWorld 47.4 (GUI reasoning for computer use), 5+ hours of audio context, 100+ page documents, and an NVFP4-quantized footprint of 18 GB.
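To make "128 experts, top-6" concrete, here is a minimal sketch of top-k expert routing for a single token, in pure Python. Only the expert count and k come from the summary. The function name `route_token`, the random logits, and the choice to softmax over the selected experts only are illustrative assumptions, not details confirmed by the source.

```python
import math
import random

NUM_EXPERTS = 128  # from the summary
TOP_K = 6          # from the summary


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token.

    Returns (expert_indices, mixing_weights). Normalizing with a softmax
    over the selected logits only is an assumption made for illustration.
    """
    # Indices of the top_k largest router logits.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]


# Hypothetical router output for one token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts, weights = route_token(logits)
print(f"experts used per token: {len(experts)}/{NUM_EXPERTS}")
print(f"mixing weights sum to {sum(weights):.3f}")
```

Only 6 of 128 expert blocks run for a given token, which is why active parameters (3B) are far below total parameters (30B), even though every expert must still be stored.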
Implications
- First open-weight multimodal agent model at 3B active parameters, so per-token compute should be low, although all 30B parameters must still be held in memory
- An OSWorld score of 47.4 points to usable GUI reasoning for agentic computer use
- NVFP4 at 18 GB is marginal for Apple Silicon; community GGUF quants could bring the footprint down to ~10-12 GB, which would fit consumer hardware (M-series, 3060 class)
- The Mamba-Transformer-MoE hybrid is architecturally novel: SSM layers give efficient long context, and MoE gives sparse routing
- If community quants succeed, this would enable a local multimodal agent (screen reading, document analysis, speech) on consumer hardware
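The footprint figures above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes the file size is dominated by weights, that the total is exactly 30B parameters, and that 1 GB means 10^9 bytes. It ignores runtime state such as the KV cache, SSM state, and activations. The helper names are hypothetical.

```python
TOTAL_PARAMS = 30e9  # 30B total parameters, from the summary


def weight_gb(bits_per_weight, params=TOTAL_PARAMS):
    """Approximate weight storage in GB (10^9 bytes) at an average bit width."""
    return params * bits_per_weight / 8 / 1e9


def bits_for_target(target_gb, params=TOTAL_PARAMS):
    """Average bits per weight needed to fit the weights in target_gb."""
    return target_gb * 1e9 * 8 / params


# Footprint at a few common average bit widths.
for bits in (16, 8, 4):
    print(f"{bits:>2} bits/weight -> {weight_gb(bits):5.1f} GB")

# Average bit width implied by the sizes mentioned in the notes.
for target in (18, 12, 10):
    print(f"{target:>2} GB target  -> {bits_for_target(target):4.2f} bits/weight")
```

Under these assumptions, 18 GB corresponds to roughly 4.8 bits per weight on average, while a 10-12 GB target implies roughly 2.7 to 3.2 bits per weight. The community quants would therefore need to average around 3 bits, and the quality cost at that level is not addressed in the source.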