2025-12-17 · HuggingFace

The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator

protocols · models · infrastructure

Source: HuggingFace · Date: 2025-12-17 · URL: https://huggingface.co/blog/nvidia/nemotron-3-nano-evaluation-recipe

Summary

Model release + evaluation tutorial: NVIDIA Nemotron 3 Nano 30B A3B, a mixture-of-experts (MoE) model with 30B total and 3B active parameters, is benchmarked with full methodology transparency via NeMo Evaluator. Published scores: AIME 2025 89.1%, GPQA 73.0%, MMLU-Pro 78.3%, LiveCodeBench 68.3%, BFCL v4 53.8%. NeMo Evaluator is an open-source evaluation orchestrator that integrates NeMo Skills and LM Evaluation Harness, is infrastructure-agnostic, and produces auditable logs plus a per-task results.json.
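To make the reproducibility angle concrete, here is a minimal sketch of how a rerun could be checked against the published numbers. The published scores are the ones listed above. The `compare_scores` helper, the flat benchmark-to-score dictionary, and the 1.0 percentage point tolerance are all assumptions made for illustration; the actual results.json layout is not described in this note, so mapping it into the dictionary is left open.

```python
from typing import Dict, List, Optional, Tuple

# Published scores (percent) as listed in the summary above.
PUBLISHED: Dict[str, float] = {
    "AIME 2025": 89.1,
    "GPQA": 73.0,
    "MMLU-Pro": 78.3,
    "LiveCodeBench": 68.3,
    "BFCL v4": 53.8,
}

Row = Tuple[str, float, Optional[float], Optional[float], bool]


def compare_scores(
    published: Dict[str, float],
    reproduced: Dict[str, float],
    tolerance: float = 1.0,  # assumed threshold in percentage points, not from the source
) -> List[Row]:
    """Return one row per published benchmark:
    (name, published, reproduced, delta, within_tolerance).

    Benchmarks missing from the rerun get None for the reproduced
    score and delta, and are flagged as not within tolerance.
    """
    rows: List[Row] = []
    for name, pub in published.items():
        rep = reproduced.get(name)
        if rep is None:
            rows.append((name, pub, None, None, False))
            continue
        delta = round(rep - pub, 2)
        rows.append((name, pub, rep, delta, abs(delta) <= tolerance))
    return rows
```

Calling `compare_scores(PUBLISHED, my_scores)` with scores from your own run yields one row per benchmark, which makes it easy to spot both missing tasks and results that drift beyond the chosen tolerance.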

Implications

Thread: open-weights ecosystem health / model release cadence. By publishing full evaluation recipes (configs, prompts, settings) alongside its scores, NVIDIA is making a methodological accountability move in a landscape where benchmark comparability is widely doubted. The NeMo Evaluator tool itself is the more lasting contribution: a unified evaluation harness that works across hosting environments could become the de facto standard for reproducing published benchmarks. The 89.1% on AIME 2025 at 3B active parameters is a strong math reasoning result; watch whether it gets independently reproduced. The HLE (Humanity's Last Exam) score of 10.6% is a useful calibration point on one of the hardest frontier benchmarks.
