2025-08-20 · HuggingFace

NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset

Source: HuggingFace · Date: 2025-08-20 · URL: https://huggingface.co/blog/nvidia/multilingual-reasoning-v1

Summary

Dataset and model release: NVIDIA ships a 6M-example multilingual reasoning dataset (English reasoning data translated into French, Spanish, German, Italian, and Japanese) alongside Nemotron Nano 2 9B, a hybrid Transformer-Mamba edge model with up to 6x higher throughput and a configurable reasoning token budget (up to 60% savings on reasoning cost). Translation quality is enforced by translating line by line and then filtering with fastText; 1.1% of examples were discarded.
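The translate-then-filter step can be pictured with a minimal sketch. It assumes the fastText filter is a language-identification check on each translated example, which the summary does not spell out. The names `translate_line_by_line`, `keep_if_target_language`, `translate`, `detect_lang`, and `min_conf` are hypothetical and not taken from NVIDIA's pipeline.

```python
from typing import Callable, Iterable, Iterator, Tuple

def translate_line_by_line(text: str, translate: Callable[[str], str]) -> str:
    """Translate each non-blank line separately, preserving line order and blank lines."""
    out = []
    for line in text.split("\n"):
        # Blank lines are passed through so the structure of the reasoning chain survives.
        out.append(translate(line) if line.strip() else line)
    return "\n".join(out)

def keep_if_target_language(
    examples: Iterable[str],
    detect_lang: Callable[[str], Tuple[str, float]],  # hypothetical: returns (language code, confidence)
    target: str,
    min_conf: float = 0.5,  # assumed threshold, not from the source
) -> Iterator[str]:
    """Yield only the examples whose detected language matches the target."""
    for example in examples:
        lang, conf = detect_lang(example)
        if lang == target and conf >= min_conf:
            yield example
```

In practice `translate` would wrap a translation model and `detect_lang` could wrap a fastText language-identification model; both are left as plain callables here so the sketch does not depend on any particular model file.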

Implications

Thread: open-weights ecosystem health / model release cadence. The dataset methodology is the interesting signal: translating the full reasoning chain, rather than only the final response, preserves the logical structure across languages and is the more defensible approach. The 6x throughput claim for Nemotron Nano 2 continues the pattern of hybrid Transformer-Mamba (SSM-hybrid) models extending their inference-efficiency advantage over pure transformers. The configurable thinking budget is a practical deployment feature: it lets latency-sensitive applications trade accuracy for speed on a per-request basis.
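As a rough illustration of a per-request thinking budget, the sketch below caps the number of reasoning tokens taken from a token stream and then closes the reasoning segment. It is not Nemotron Nano 2's actual mechanism: the `</think>` marker, the `cap_reasoning` function, and the token-stream interface are all assumptions made for the example.

```python
from typing import Iterable, Iterator

END_THINK = "</think>"  # hypothetical end-of-reasoning marker

def cap_reasoning(tokens: Iterable[str], budget: int) -> Iterator[str]:
    """Yield reasoning tokens until the model ends its reasoning or the budget is spent.

    The stream always finishes with END_THINK so that a downstream decoder
    (not shown) can move on to producing the final answer.
    """
    used = 0
    for tok in tokens:
        if tok == END_THINK or used >= budget:
            break
        yield tok
        used += 1
    yield END_THINK
```

A latency-sensitive request would pass a small `budget` and an accuracy-sensitive one a larger value, which is the per-request trade-off described above.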
