2026-04-28 · HuggingFace

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

models · research · infrastructure

Source: HuggingFace
Date: 2026-04-28
URL: https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence

Summary

NVIDIA Nemotron 3 Nano Omni is an open-weight hybrid Mamba-Transformer-MoE model with 30B total / 3B active parameters (128 experts, top-6 routing). It is natively multimodal: text, vision, audio, and video are fused in the backbone. It scores 47.4 on OSWorld (GUI reasoning for computer use) and handles 5+ hours of audio context and 100+ page documents. The NVFP4-quantized weights come to 18GB.
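To make "128 experts, top-6" concrete, here is a minimal sketch of top-k expert routing for a single token. Only the expert count and k come from the summary. The function name `route_token`, the random logits, and the choice to softmax over just the selected experts are illustrative assumptions; the actual model may score and normalize experts differently.

```python
import math
import random

NUM_EXPERTS = 128  # from the summary
TOP_K = 6          # from the summary


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts.

    Assumption: gate weights are a softmax over the selected logits only.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k router logits")
    # Indices of the top_k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router output for one token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

for expert, weight in selected:
    print(f"expert {expert:3d}  weight {weight:.3f}")
```

Only the 6 selected experts run for this token, which is why the active parameter count (3B) is far smaller than the total (30B).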

Implications

  • First open-weight multimodal agent model at 3B active params, so inference should be fast
  • OSWorld 47.4 indicates the model can do GUI reasoning for agentic computer use
  • NVFP4 at 18GB is marginal for Apple Silicon; community GGUF quants could bring it down to ~10-12GB, which would fit consumer hardware (M-series, 3060 class)
  • The Mamba-Transformer-MoE hybrid is architecturally novel: SSM layers give efficient long context, and MoE gives sparse routing
  • If community quants succeed, this enables local multimodal agents (screen reading, document analysis, speech) on consumer hardware
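The size figures above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a nominal 30B parameters, decimal gigabytes, and weights only (no KV cache, activations, or runtime overhead); the helper names are hypothetical and the source does not give this breakdown.

```python
GB = 1e9             # decimal gigabytes (assumption; the source does not say GB vs GiB)
TOTAL_PARAMS = 30e9  # nominal total parameter count from the summary


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone at a given average bit width."""
    return params * bits_per_weight / 8 / GB


def effective_bits_per_weight(size_gb: float, params: float) -> float:
    """Average bits per parameter implied by a given file size."""
    return size_gb * GB * 8 / params


pure_4bit_gb = weight_size_gb(TOTAL_PARAMS, 4)
nvfp4_bits = effective_bits_per_weight(18, TOTAL_PARAMS)
target_bits = [effective_bits_per_weight(s, TOTAL_PARAMS) for s in (10, 12)]

print(f"pure 4-bit weights: {pure_4bit_gb:.1f} GB")
print(f"18 GB implies {nvfp4_bits:.2f} bits/weight")
print("10-12 GB implies "
      + " to ".join(f"{b:.2f}" for b in target_bits)
      + " bits/weight")
```

Under these assumptions, 18GB works out to about 4.8 bits per parameter on average, and a 10-12GB file to roughly 2.7-3.2 bits. A community quant in that range would therefore need to go below 4 bits on average, so quality at that size is the open question.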
