2024-12-05 · HuggingFace

How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

models · research · infrastructure


Source: HuggingFace · Date: 2024-12-05 · URL: https://huggingface.co/blog/keras-chatbot-arena

Summary

An informal research experiment tested 8 open-weights models (1B–9B parameters) on a simplified calendar API chatbot task, measuring both initial correctness and the ability to fix mistakes from plain-English feedback. Gemma 2 9B and Llama 3.1 8B were the only models to achieve near-perfect performance on both dimensions, while models under 3B struggled significantly on initial correctness. The experiment ran on a TPU v5e 2x4 using Keras with JAX. Notably, Google Gemini (much larger, closed) performed worse than Gemma 2 9B, so size alone does not predict instruction-following on this task.
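
The two-stage scoring described above (first-try correctness, then a retry after plain-English feedback) can be sketched roughly as follows. This is a minimal illustration, not the code from the original post: the calendar function names, `is_valid_call`, `evaluate`, and `toy_model` are all hypothetical stand-ins, and a real run would call an actual LLM instead of the toy function.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical calendar API surface; these names are illustrative only.
ALLOWED_CALLS = {"add_meeting", "remove_meeting", "list_meetings"}


@dataclass
class Outcome:
    correct_first_try: bool
    fixed_after_feedback: bool


def is_valid_call(reply: str) -> bool:
    """Toy check: the reply must look like name(...) with a known name."""
    reply = reply.strip()
    if "(" not in reply or not reply.endswith(")"):
        return False
    name = reply.split("(", 1)[0].strip()
    return name in ALLOWED_CALLS


def evaluate(model: Callable[[List[str]], str], request: str, feedback: str) -> Outcome:
    """Score one request: first attempt, then one retry after feedback."""
    history = [request]
    first = model(history)
    if is_valid_call(first):
        return Outcome(correct_first_try=True, fixed_after_feedback=False)
    # Append the bad reply and the plain-English feedback, then ask again.
    history += [first, feedback]
    second = model(history)
    return Outcome(correct_first_try=False, fixed_after_feedback=is_valid_call(second))


def toy_model(history: List[str]) -> str:
    """Stand-in for an LLM: chatty at first, correct once feedback is present."""
    if len(history) == 1:
        return "Sure! I will add the meeting."
    return 'add_meeting("standup", "2024-12-05 10:00")'


result = evaluate(
    toy_model,
    "Schedule a standup at 10am on Dec 5.",
    "Reply only with a single calendar API call.",
)
print(result)
```

Tracking the two flags separately mirrors the summary's two dimensions: a model can fail the first attempt yet still count as able to fix its mistake.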

Implications

Open-weights ecosystem health. Gemma 2 9B and Llama 3.1 8B both achieving near-perfect tool-call accuracy in late 2024 is a baseline marker: at that time, they were the practical floor for reliable function calling in the open-weights tier. The 3B cutoff for reliable behavior is a useful rule of thumb, though subsequent model generations have since pushed it down.

Transformers library trajectory. That this experiment was implemented with Keras + JAX on TPUs, rather than with PyTorch/transformers, shows that the Keras multi-backend story was gaining real traction as an alternative for research experiments, particularly among researchers with TPU access.
