2024-10-28 · HuggingFace

Expert Support case study: Bolstering a RAG app with LLM-as-a-Judge



Source: HuggingFace · Date: 2024-10-28 · URL: https://huggingface.co/blog/digital-green-llm-judge

Summary

Case study of Digital Green’s Farmer.chat, an agricultural RAG app (340K+ queries, 20K+ farmers, 6+ languages, 46K research papers). The team compared four LLMs as answer generators, using LLM-as-a-Judge scoring for faithfulness and relevance. Gemini-1.5-Flash was selected as the best tradeoff, with 89% faithfulness and a 4.5% unanswered rate. By comparison, GPT-4-turbo refused 21.9% of queries, while Llama-3-70B refused only 0.3% but reached just 78% faithfulness.
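
One way to make this tradeoff concrete is to multiply faithfulness by the share of queries that were actually answered. The sketch below does that arithmetic using only the figures quoted in this note. It assumes faithfulness was measured over answered queries only, and the names `RESULTS`, `faithful_answer_rate`, and `rank_models` are illustrative, not taken from the source. GPT-4-turbo is left out because its faithfulness score is not given here.

```python
# Figures quoted in the summary; GPT-4-turbo is omitted because its
# faithfulness score is not stated in this note.
RESULTS = {
    "Gemini-1.5-Flash": {"faithfulness": 0.89, "unanswered": 0.045},
    "Llama-3-70B": {"faithfulness": 0.78, "unanswered": 0.003},
}


def faithful_answer_rate(faithfulness: float, unanswered: float) -> float:
    """Estimated share of all queries that receive a faithful answer.

    Assumption: faithfulness is measured over answered queries only.
    """
    return faithfulness * (1.0 - unanswered)


def rank_models(results: dict) -> list:
    """Return (model, score) pairs sorted from best to worst."""
    scored = [(name, faithful_answer_rate(**r)) for name, r in results.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in rank_models(RESULTS):
    print(f"{name}: {score:.3f}")
# Gemini-1.5-Flash: 0.850
# Llama-3-70B: 0.778
```

Under this assumption Gemini-1.5-Flash stays ahead even though Llama-3-70B leaves fewer queries unanswered, because the faithfulness gap outweighs the difference in answered rate.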

Implications

Thread: open-weights ecosystem health / agentic patterns. GPT-4-turbo’s high refusal rate (21.9% “I don’t know” responses on agricultural queries) matters in practice: it suggests that frontier safety tuning can cause over-refusal on domain-specific content. Gemini-1.5-Flash coming out ahead on the combined faithfulness × answered-rate product reflects a real deployment decision, not just a benchmark result. The agreement between LLM-as-a-Judge scores and human evaluators on 360 test questions supports using the methodology for production RAG evaluation. The RAGAS-inspired binary faithfulness scoring is a reusable pattern for anyone building RAG evaluation pipelines at scale.
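
As a rough illustration of the binary faithfulness pattern, the sketch below splits an answer into statements, asks a judge for a yes/no verdict on each against the retrieved context, and averages the verdicts. This is a minimal sketch in the general RAGAS style, not the pipeline described in the source. The `Judge` signature, `split_statements`, `faithfulness_score`, and the toy `keyword_judge` are all hypothetical; a real pipeline would replace both the splitter and the judge with LLM calls.

```python
from typing import Callable, List

# Hypothetical judge signature: (statement, context) -> True if the context
# supports the statement. In practice this would wrap an LLM call.
Judge = Callable[[str, str], bool]


def split_statements(answer: str) -> List[str]:
    """Naive sentence splitter standing in for LLM-based claim extraction."""
    parts = answer.replace("!", ".").replace("?", ".").split(".")
    return [p.strip() for p in parts if p.strip()]


def faithfulness_score(answer: str, context: str, judge: Judge) -> float:
    """Fraction of answer statements with a positive binary verdict."""
    statements = split_statements(answer)
    if not statements:
        return 0.0  # an empty answer has nothing to support
    verdicts = [1 if judge(s, context) else 0 for s in statements]
    return sum(verdicts) / len(verdicts)


def keyword_judge(statement: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: every word must appear in the context."""
    context_words = set(context.lower().split())
    return all(word in context_words for word in statement.lower().split())


context = "rice needs standing water during early growth"
answer = "Rice needs standing water. Rice needs daily fertilizer."
print(faithfulness_score(answer, context, keyword_judge))  # 0.5
```

Keeping each verdict binary makes the per-statement judgment easy to audit against human labels. The averaged score then gives a per-answer number that can be aggregated across a test set.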
