RAG (Retrieval Augmented Generation)
Grounding LLM responses in external documents by retrieving relevant content at inference time. The model doesn't need to memorize everything — it looks things up when needed. Introduced by Lewis et al. in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020, arXiv:2005.11401).
The Basic Pattern
- Index documents into a vector store (embed text chunks, store embeddings)
- At query time, embed the user's question
- Retrieve the top-K chunks most similar to the question
- Inject retrieved chunks into the model's context
- Model answers based on the retrieved content + its training knowledge
This lets you use a model with a fixed training cutoff to answer questions about recent documents, proprietary knowledge, or data the model never saw.
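The steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector store" is a plain Python list, and every function name here is an assumption made for the example. The final call to an LLM is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Step 1: embed every chunk and keep (chunk, embedding) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Steps 2-3: embed the question, return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Step 4: inject the retrieved chunks into the model's context."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved, start=1))
    return (
        "Answer using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


chunks = [
    "The refund window is 30 days from the date of purchase.",
    "Support is available by email on weekdays.",
    "Shipping takes five business days within the country.",
]
index = build_index(chunks)
question = "How long is the refund window?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)  # step 5 would send this to the model
```

Swapping `embed` for a real embedding model and the list for a vector store changes the quality and scale of retrieval, but not the shape of the pipeline.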
When RAG Makes Sense
- Private knowledge: Company docs, codebases, internal wikis — anything not in the model's training data
- Recent information: Documents updated after the model's training cutoff
- Large knowledge bases: More content than fits in a single context window
- Citation requirements: RAG makes it possible to surface which documents answers came from
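Citations fall out of retrieval if each chunk carries a pointer back to its source document. The sketch below assumes a hypothetical `Chunk` record with a `source` field (a file path here, but any identifier would do); the names and paths are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical document identifier, e.g. a path or URL


def format_context(chunks: list[Chunk]) -> str:
    """Label each retrieved chunk with a number and its source."""
    return "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, start=1)
    )


def cited_sources(chunks: list[Chunk]) -> list[str]:
    """Unique sources in retrieval order, to display next to the answer."""
    seen: list[str] = []
    for c in chunks:
        if c.source not in seen:
            seen.append(c.source)
    return seen


retrieved = [
    Chunk("Refunds are accepted within 30 days.", "policies/refunds.md"),
    Chunk("Refunds go to the original payment method.", "policies/refunds.md"),
    Chunk("Support is available on weekdays.", "policies/support.md"),
]
context = format_context(retrieved)
sources = cited_sources(retrieved)
```

This only shows which documents were retrieved; whether the model actually relied on them in its answer is a separate question.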
When RAG Doesn't Make Sense
- Small, stable knowledge bases: If your documents fit in context, just include them. Retrieval adds complexity without benefit.
- Very long context windows: gemini-2-5-pro's 1M-token context window is large enough that many "RAG required" use cases can use direct context injection instead. That is simpler, and often more accurate because nothing is lost to a bad retrieval step.
- Real-time data: RAG retrieves from an index; if data changes frequently, keeping the index current is its own problem.
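A quick way to apply the "does it fit in context?" test is a rough token estimate. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text; a real tokenizer gives exact counts. The function names and the `reserved_tokens` headroom for instructions and the answer are assumptions for illustration.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; the 4 chars/token ratio is a heuristic."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    documents: list[str],
    context_window_tokens: int,
    reserved_tokens: int = 8_000,
) -> bool:
    """True if all documents fit, leaving room for prompt and answer."""
    budget = context_window_tokens - reserved_tokens
    total = sum(estimate_tokens(d) for d in documents)
    return total <= budget


docs = ["a" * 40_000, "b" * 20_000]  # about 15,000 estimated tokens in total
fits_small = fits_in_context(docs, context_window_tokens=16_000)
fits_large = fits_in_context(docs, context_window_tokens=1_000_000)
```

If the check passes comfortably, direct context injection is usually the simpler starting point; if it fails or is borderline, retrieval starts to earn its complexity.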
RAG vs. Fine-tuning
RAG: model retrieves knowledge at inference time. Flexible, updatable, works with any base model. Good for factual knowledge and documents.
Fine-tuning: knowledge baked into model weights during training. Better for style, behavior, and domain-specific reasoning patterns. Worse for updatable factual knowledge.
Most production systems want RAG, not fine-tuning, for knowledge. Reserve fine-tuning for style and behavior.
Implementation
Common stacks:
- Embeddings: text-embedding-3-small (OpenAI), or local alternatives
- Vector stores: Pinecone, Chroma, pgvector (Postgres extension), Weaviate
- Orchestration: LangChain, LlamaIndex for managing the retrieval pipeline
For code, Cursor implements RAG over your codebase natively: semantic search across repo contents is the same mechanism applied to code files.
Weaknesses
- Retrieval quality bottlenecks answer quality — garbage in, garbage out
- Chunking strategy significantly impacts what gets retrieved (too small = no context, too large = irrelevant noise)
- Adding RAG adds latency (retrieval step) and infrastructure complexity
- Multi-hop questions (answer requires connecting information from multiple documents) are hard
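The chunking trade-off is easier to see with a concrete splitter. Below is a minimal sketch of fixed-size chunking with overlap, splitting on whitespace; `chunk_words` and its default sizes are assumptions for illustration, and real pipelines often split on sentences, headings, or tokens instead.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing
    `overlap` words with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=2)
```

Smaller `chunk_size` values give more precise matches but strip surrounding context; larger values keep context but pull in unrelated text. Overlap reduces the chance that a sentence relevant to the question is cut in half at a chunk boundary.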
Related
gemini-2-5-pro · cursor · mcp · agentic-workflows