2025-08-04 · HuggingFace

Measuring Open-Source Llama Nemotron Models on DeepResearch Bench

models · research · infrastructure

Source: HuggingFace · Date: 2025-08-04 · URL: https://huggingface.co/blog/nvidia/ai-q-top-ranking-open-portable-deep-research-agent

Summary

Model release and benchmark report: NVIDIA’s AI-Q Blueprint tops the Hugging Face DeepResearch Bench “LLM with Search” leaderboard with a score of 40.52/100. It uses a two-model stack: Llama 3.3-70B Instruct for report generation, and Llama-3.3-Nemotron-Super-49B-v1.5 (built with NAS, distillation, and RL; 128K context; single H100) for reasoning and tool use. Its strongest metrics are comprehensiveness, insightfulness, and citation quality, measured across 100+ real-world research tasks.

Implications

Thread: open-weights ecosystem health / agentic patterns. The AI-Q result illustrates a pattern: the top score on this agentic research leaderboard came from composing open models rather than from a single monolithic model. Nemotron-Super 49B reflects the more open training methodology that followed DeepSeek: NVIDIA applied NAS, distillation, and RL to derive a specialized reasoning model from a Llama base. The two-model split (fluent writer + reasoning tool-user) is a practical architecture decision, since it separates report-writing quality from reasoning and tool-use quality so that each model can be chosen for one job. Watch whether this composable-agent pattern becomes standard for research-grade open-source agents.
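The two-model split can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AI-Q's actual implementation: `ResearchPipeline`, `reasoner`, `writer`, and `search` are hypothetical names, and each model is represented as a plain function from prompt to text so that any backend could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration only (not from AI-Q or any NVIDIA API).
ModelFn = Callable[[str], str]          # prompt -> completion
SearchFn = Callable[[str], List[str]]   # query -> source snippets


@dataclass
class ResearchPipeline:
    """Illustrative two-model split: a reasoner plans and gathers, a writer drafts."""
    reasoner: ModelFn  # stands in for the reasoning / tool-use model
    writer: ModelFn    # stands in for the report-generation model
    search: SearchFn   # stands in for a web-search tool

    def run(self, task: str, max_queries: int = 3) -> str:
        # 1. Ask the reasoner for search queries, one per line.
        plan = self.reasoner(f"List search queries, one per line, for: {task}")
        queries = [q.strip() for q in plan.splitlines() if q.strip()][:max_queries]

        # 2. Run each query and number every snippet so the writer can cite it.
        notes: List[str] = []
        for query in queries:
            for snippet in self.search(query):
                notes.append(f"[{len(notes) + 1}] {snippet}")

        # 3. Hand the task and the numbered notes to the writer.
        evidence = "\n".join(notes)
        return self.writer(
            f"Task: {task}\nSources:\n{evidence}\nWrite a cited report."
        )
```

Because the two roles only meet at the prompt boundary, either model can be swapped without touching the other, which is the decoupling the note describes.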
