2024-12-20 · HuggingFace

Evaluating Audio Reasoning with Big Bench Audio

models · research

Source: HuggingFace · Date: 2024-12-20 · URL: https://huggingface.co/blog/big-bench-audio-release

Summary

Research benchmark release: Big Bench Audio adapts 1,000 Big Bench Hard questions from four categories (Formal Fallacies, Navigate, Object Counting, Web of Lies) into audio format for evaluating reasoning in speech-to-speech models. Key finding: GPT-4o scores 92% text-to-text, while GPT-4o Realtime drops to 66% speech-to-speech and 74% text-to-speech. The 26-percentage-point difference between text-to-text and speech-to-speech is the “audio reasoning gap.” A traditional cascaded pipeline (Whisper → GPT-4o → TTS-1) nearly matches text performance, so native speech-to-speech models currently underperform transcription-then-LLM pipelines on reasoning tasks. The dataset is published on Hugging Face.
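The gap arithmetic can be checked in a few lines. This is a minimal sketch that uses only the accuracy figures quoted in the summary; the dictionary keys and the reasoning_gap helper are illustrative names, not part of the benchmark.

```python
# Accuracy figures as quoted in the summary (percent correct).
# The key names are illustrative labels, not identifiers from the dataset.
reported_accuracy = {
    "text_to_text": 92,      # GPT-4o
    "speech_to_speech": 66,  # GPT-4o Realtime
    "text_to_speech": 74,    # GPT-4o Realtime
}


def reasoning_gap(baseline: int, other: int) -> int:
    """Return the drop from the text baseline, in percentage points."""
    return baseline - other


baseline = reported_accuracy["text_to_text"]

# Gap of every non-text configuration relative to the text-to-text score.
gaps = {
    name: reasoning_gap(baseline, accuracy)
    for name, accuracy in reported_accuracy.items()
    if name != "text_to_text"
}

for name, gap in gaps.items():
    print(f"{name}: {gap} points below text_to_text")
```

On these figures the speech-to-speech configuration sits 26 points below the text baseline and the text-to-speech configuration 18 points below it.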

Implications

Open-weights ecosystem health. The 26-point gap between text and native speech-to-speech reasoning accuracy is a concrete data point for anyone choosing between audio-native and pipeline architectures for voice AI. Because traditional cascaded pipelines outperform native audio models on reasoning tasks, the finding bears directly on product decisions: building voice agents from Whisper + LLM + TTS remains the stronger choice for reasoning accuracy until native audio architectures close the gap.
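For readers unfamiliar with the cascaded design, the sketch below shows its shape: three independent stages composed in sequence, with the reasoning step operating purely on text. The stage functions here are hypothetical toy stand-ins so the example runs without any model; a real system would wrap a speech recognizer, a text LLM, and a speech synthesizer behind the same signatures.

```python
from typing import Callable

# Hypothetical stage signatures, assumed for illustration only.
Transcribe = Callable[[bytes], str]   # audio in  -> text
Reason = Callable[[str], str]         # text in   -> text answer
Synthesize = Callable[[str], bytes]   # text in   -> audio out


def make_cascaded_agent(
    transcribe: Transcribe,
    reason: Reason,
    synthesize: Synthesize,
) -> Callable[[bytes], bytes]:
    """Compose the three stages into one audio-in, audio-out callable."""

    def agent(audio_in: bytes) -> bytes:
        text_in = transcribe(audio_in)   # speech -> text
        text_out = reason(text_in)       # reasoning happens on text only
        return synthesize(text_out)      # text -> speech

    return agent


# Toy stand-ins: "audio" is just UTF-8 bytes, so no model is needed.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reason(question: str) -> str:
    return f"answer to: {question}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


agent = make_cascaded_agent(fake_transcribe, fake_reason, fake_synthesize)
reply = agent(b"Is this argument valid?")
print(reply)
```

Because each stage sits behind a plain function signature, any one of them can be swapped out and the benchmark rerun to see which stage drives a change in accuracy.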

Model release cadence. Big Bench Audio provides a concrete benchmark for comparing future native audio model releases against the Whisper-pipeline baseline. As Gemini, GPT-4o Realtime, and open-weights audio models compete, this benchmark should appear in future release evaluations, giving the community a way to measure whether native audio reasoning is actually improving.
