Contextual Retrieval in AI Systems
Source: Anthropic Engineering
Date: 2024-09-19
URL: https://www.anthropic.com/engineering/contextual-retrieval
Summary
Anthropic introduces Contextual Retrieval: prepending chunk-specific explanatory context (generated by Claude) to each text chunk before embedding and BM25 indexing. This reduces the top-20-chunk retrieval failure rate by 35% (contextual embeddings alone), 49% (contextual embeddings plus Contextual BM25), and 67% (with reranking added). For knowledge bases under 200,000 tokens, Anthropic instead recommends including the full corpus in the prompt and using prompt caching, which reduces costs by up to 90%.
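The preprocessing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `contextualize_chunks`, the `generate_context` callable, and `stub_context` are hypothetical names, and the sample document is made up. In a real pipeline `generate_context` would ask Claude to situate the chunk within the whole document.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    generate_context: Callable[[str, str], str],
) -> List[str]:
    """Prepend a short, chunk-specific context to each chunk.

    generate_context(document, chunk) is a hypothetical callable; in a real
    pipeline it would be a model call. The returned strings are what would
    be embedded and BM25-indexed in place of the raw chunks.
    """
    contextualized = []
    for chunk in chunks:
        context = generate_context(document, chunk).strip()
        # Fall back to the raw chunk if no context was produced.
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def stub_context(document: str, chunk: str) -> str:
    # Stand-in for a model call: uses the document's first line as context.
    title = document.splitlines()[0]
    return f"This chunk is from: {title}."


doc = "ACME Corp Q2 2023 report\nRevenue grew 3% over the previous quarter."
chunks = ["Revenue grew 3% over the previous quarter."]
result = contextualize_chunks(doc, chunks, stub_context)
print(result[0])
```

Passing the context generator in as a callable keeps the sketch runnable without network access and makes the model dependency explicit.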
Implications
The context engineering thread. Contextual Retrieval applies the just-in-time retrieval idea at the RAG layer: rather than dumping a full knowledge base into context, it improves the precision of what gets retrieved. The 49% failure reduction from contextual embeddings plus Contextual BM25 is a strong result for a preprocessing-only change; adding reranking, which is a query-time step, brings the reduction to 67%.
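Combining embedding results with BM25 results requires merging two ranked lists before any reranking. One common way to do this is reciprocal rank fusion, sketched below. The choice of this particular fusion method, the function name, the constant `k = 60`, and the chunk IDs are all assumptions made for illustration and are not taken from the source.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into one deduplicated ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a conventional default, assumed here.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


embedding_hits = ["c3", "c1", "c7"]  # hypothetical semantic-search results
bm25_hits = ["c1", "c9", "c3"]       # hypothetical BM25 results
fused = reciprocal_rank_fusion([embedding_hits, bm25_hits])
print(fused)
```

Chunks found by both retrievers rise to the top of the fused list, which would then be passed to a reranker at query time.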
Prompt caching as the simpler alternative. The explicit recommendation to skip RAG entirely for knowledge bases under 200k tokens, and to rely on prompt caching for up to 90% lower cost, is significant product guidance. For many internal knowledge bases this is the right answer, because RAG infrastructure carries a complexity cost that prompt caching avoids.
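The size-based decision, and the shape of a cached request, might look like the sketch below. The helper names and the constant are hypothetical, and the token count is assumed to be supplied by the caller. The `cache_control` block follows the Anthropic Messages API prompt caching format; the blocks are only constructed here, and no API call is made.

```python
from typing import Any, Dict, List

FULL_CONTEXT_TOKEN_LIMIT = 200_000  # threshold cited in the post


def choose_strategy(corpus_tokens: int) -> str:
    """Return 'full_context' for small corpora, 'retrieval' otherwise."""
    if corpus_tokens < FULL_CONTEXT_TOKEN_LIMIT:
        return "full_context"
    return "retrieval"


def cached_system_blocks(instructions: str, knowledge_base: str) -> List[Dict[str, Any]]:
    """Build system blocks with the knowledge base marked for caching."""
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": knowledge_base,
            # Marks the prompt prefix up to and including this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        },
    ]


strategy = choose_strategy(150_000)
blocks = cached_system_blocks("Answer from the handbook only.", "<handbook text>")
print(strategy, len(blocks))
```

The resulting list would be passed as the `system` parameter of a Messages API request, so that repeated queries against the same knowledge base reuse the cached prefix.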
Claude as its own preprocessing tool. Using Claude to auto-generate chunk context is an early example of the AI-in-the-loop tool development pattern that the writing-effective-tools post later formalizes. The preprocessing quality depends on Claude’s ability to understand the document structure, which makes this approach dependent on model quality in a way that traditional BM25 is not.