NVIDIA Releases Nemotron 3 Nano Omni Multimodal AI Model
SANTA CLARA, Calif. — NVIDIA on Monday released Nemotron 3 Nano Omni, a 30-billion-parameter open-weights multimodal AI model for processing text, images, video and audio in enterprise agentic AI workflows.
The model, made available as open-weights checkpoints on Hugging Face, can handle documents exceeding 100 pages and more than five hours of audio, according to a technical blog post published by the company (https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence). It represents NVIDIA’s latest push into the competitive open-weights AI model market, where it faces rivals including Alibaba’s Qwen and Meta’s Llama families.
Architecture and Capabilities
Nemotron 3 Nano Omni uses a hybrid Mamba-Transformer architecture with a mixture-of-experts design: 23 Mamba selective state-space layers, 23 mixture-of-experts layers with 128 experts, and six grouped-query attention layers, according to the technical disclosure. The “A3B” designation indicates that roughly 3 billion of the model’s 30 billion total parameters are active for each token.
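As a rough illustration of how the reported layer mix breaks down, the short Python sketch below tallies the three layer types and their share of the stack. The layer counts come from the article; the `LayerGroup` class and the function names are hypothetical, chosen for this example, and are not taken from NVIDIA's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerGroup:
    """One family of layers in the stack (illustrative structure only)."""
    kind: str
    count: int


# Counts as reported in the article; the grouping itself is an assumption.
LAYER_GROUPS = [
    LayerGroup("mamba_selective_ssm", 23),
    LayerGroup("mixture_of_experts", 23),
    LayerGroup("grouped_query_attention", 6),
]


def total_layers(groups):
    """Sum the layer counts across all groups."""
    return sum(group.count for group in groups)


def layer_shares(groups):
    """Return each group's fraction of the total layer count."""
    total = total_layers(groups)
    return {group.kind: group.count / total for group in groups}


if __name__ == "__main__":
    print(total_layers(LAYER_GROUPS))  # 52
    for kind, share in layer_shares(LAYER_GROUPS).items():
        print(f"{kind}: {share:.1%}")
```

On these figures, attention accounts for six of 52 layers, which is consistent with the article's description of the design as a hybrid rather than a conventional Transformer.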
The model integrates a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, allowing it to natively process audio at 16 kHz rather than relying on text transcriptions — a distinction NVIDIA highlighted as a key differentiator from competitors that use speech-to-text pipelines.
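To put the 16 kHz figure in perspective, the sketch below computes how many raw samples a five-hour recording, the length cited earlier in the article, would contain. The storage estimate assumes 16-bit mono PCM, which is an assumption made for illustration and not a detail from NVIDIA's post; the function names are likewise hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate reported in the article


def sample_count(duration_seconds: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in a recording of the given length."""
    return duration_seconds * sample_rate_hz


def pcm_bytes(duration_seconds: int, bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Raw size of the recording. Assumes 16-bit mono PCM (illustrative only)."""
    return sample_count(duration_seconds) * bytes_per_sample * channels


if __name__ == "__main__":
    five_hours = 5 * 60 * 60  # seconds
    print(sample_count(five_hours))     # 288000000
    print(pcm_bytes(five_hours) / 1e6)  # 576.0 (megabytes)
```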
For document understanding, the model supports dynamic resolution vision processing from 512×512 to 1,840×1,840 pixels per image, handling complex PDFs with tables, figures and formulas. NVIDIA reported the model scored 57.5 on the MMLongBench-Doc benchmark for long document comprehension, compared to 49.5 for Alibaba’s similarly sized Qwen3-Omni 30B-A3B model, according to the blog post.
Agentic AI Focus
NVIDIA highlighted agentic computer use as a key application: the ability of AI systems to interpret graphical user interfaces, reason about screenshots and automate workflows. On the OSWorld benchmark for computer-use tasks, Nemotron 3 Nano Omni scored 47.4, compared to 29.0 for Qwen3-Omni, according to NVIDIA’s reported benchmarks.
The model also posted a large gain on ScreenSpot-Pro, a benchmark for GUI element identification, scoring 57.8, up from 5.5 for NVIDIA’s previous Nemotron Nano V2 VL model on the same test, the company said.
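The size of the reported gaps on the two agentic benchmarks can be worked out directly from the scores. The sketch below does that arithmetic in Python; the scores are the company-reported figures quoted above, while the `SCORES` table layout and helper names are hypothetical. Each benchmark has its own scale, so the gaps are not comparable across rows.

```python
# (Nemotron 3 Nano Omni score, comparison score, comparison model),
# all as reported by NVIDIA and quoted in the article.
SCORES = {
    "OSWorld": (47.4, 29.0, "Qwen3-Omni"),
    "ScreenSpot-Pro": (57.8, 5.5, "Nemotron Nano V2 VL"),
}


def point_gap(score: float, baseline: float) -> float:
    """Absolute difference in benchmark points, rounded to one decimal."""
    return round(score - baseline, 1)


def score_ratio(score: float, baseline: float) -> float:
    """How many times larger the score is than the baseline."""
    return score / baseline


if __name__ == "__main__":
    for name, (score, baseline, rival) in SCORES.items():
        gap = point_gap(score, baseline)
        ratio = score_ratio(score, baseline)
        print(f"{name}: +{gap} points vs {rival} ({ratio:.2f}x)")
    # OSWorld: +18.4 points vs Qwen3-Omni (1.63x)
    # ScreenSpot-Pro: +52.3 points vs Nemotron Nano V2 VL (10.51x)
```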
Efficiency Claims
NVIDIA emphasized throughput and efficiency gains alongside raw benchmark scores. The company claimed 7.4 times higher system efficiency for multi-document use cases, 9.2 times higher efficiency for video tasks and 2.9 times faster single-stream reasoning speed compared to competing models, according to the blog post.
Training and Data
The model was trained on NVIDIA H100 GPU clusters spanning 32 to 128 nodes, using the company’s Megatron-LM framework. NVIDIA disclosed that approximately 11.4 million synthetic question-answer pairs — roughly 45 billion tokens — were generated from PDF documents to improve document reasoning capabilities, yielding a 2.19 times improvement on the MMLongBench-Doc benchmark from synthetic data alone.
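The two synthetic-data figures imply an average length per question-answer pair, which the sketch below computes. Both inputs are the approximate numbers NVIDIA disclosed, so the result is an estimate; the constant and function names are hypothetical.

```python
QA_PAIRS = 11_400_000            # approximate count reported by NVIDIA
TOTAL_TOKENS = 45_000_000_000    # approximate token total reported by NVIDIA


def average_tokens_per_pair(total_tokens: int, pairs: int) -> float:
    """Mean number of tokens per synthetic question-answer pair."""
    return total_tokens / pairs


if __name__ == "__main__":
    average = average_tokens_per_pair(TOTAL_TOKENS, QA_PAIRS)
    print(f"about {average:,.0f} tokens per pair")  # about 3,947 tokens per pair
```

An average in the thousands of tokens suggests each pair carries substantial document context in addition to the question and answer, though the post summarized here does not break the total down.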
The training process included what NVIDIA described as “omni reinforcement learning,” which trains the model to reason across images, video, audio and text simultaneously. The approach includes intentionally unanswerable cases “to teach the model to abstain when evidence is insufficient rather than hallucinate,” according to the blog post.
Availability
NVIDIA released model checkpoints in three precision formats — BF16, FP8 and NVFP4 — on Hugging Face, along with a full technical report and training dataset. The company also published reinforcement learning guides and data synthesis recipes through its open-source NeMo framework on GitHub.
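For readers sizing hardware, the sketch below gives a back-of-the-envelope estimate of weight storage at each precision. It assumes all 30 billion parameters are stored at the nominal bit width (16, 8 and 4 bits respectively) and ignores quantization scale factors, any layers kept at higher precision, and runtime memory such as activations and caches, so actual checkpoint sizes will differ. The helper names are hypothetical.

```python
TOTAL_PARAMETERS = 30_000_000_000  # total parameter count from the article

# Nominal bits per weight. Assumption: scale factors and any
# higher-precision layers are ignored in this estimate.
BITS_PER_WEIGHT = {
    "BF16": 16,
    "FP8": 8,
    "NVFP4": 4,
}


def weight_gigabytes(parameters: int, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = parameters * bits_per_weight // 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_gigabytes(TOTAL_PARAMETERS, bits):.0f} GB")
    # BF16: ~60 GB
    # FP8: ~30 GB
    # NVFP4: ~15 GB
```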