RAG (Retrieval Augmented Generation)

RAG grounds LLM responses in external documents by retrieving relevant content at inference time. The model doesn't need to memorize everything; it looks things up when needed. The technique was introduced by Lewis et al. in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020, arXiv:2005.11401).

The Basic Pattern

1. Index documents into a vector store (split the text into chunks, embed each chunk, store the embeddings)
2. At query time, embed the user's question
3. Retrieve the top-K chunks most similar to the question
4. Inject the retrieved chunks into the model's context
5. The model answers based on the retrieved content plus its training knowledge

This lets you use a model with a fixed training cutoff to answer questions about recent documents, proprietary knowledge, or data the model never saw.
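The pattern above can be sketched in a few lines. This is a toy illustration, not a production implementation: the `embed` function is a bag-of-words stand-in for a real embedding model, the "vector store" is a plain Python list, and all function names and sample documents are made up for the example. The final prompt is what would be sent to the model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Step 1: embed every chunk and keep (chunk, embedding) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Steps 2 and 3: embed the question and return the top-K most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, retrieved: list[str]) -> str:
    """Step 4: inject the retrieved chunks into the model's context."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical sample documents, already split into chunks.
chunks = [
    "The refund window is 30 days from the delivery date.",
    "Support is available on weekdays from 9am to 5pm.",
    "Shipping to Canada takes five to seven business days.",
]
index = build_index(chunks)
question = "How many days is the refund window?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)  # Step 5: send this prompt to the model
```

Swapping `embed` for a real embedding model and the list for a vector store changes the quality of retrieval, not the shape of the pipeline.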

When RAG Makes Sense

Private knowledge: Company docs, codebases, internal wikis — anything not in the model's training data
Recent information: Documents updated after the model's training cutoff
Large knowledge bases: More content than fits in a single context window
Citation requirements: RAG makes it possible to surface which documents answers came from
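One way to support the citation case is to keep a source identifier attached to every chunk and number the chunks in the prompt, so the answer can refer back to them. The sketch below assumes a hypothetical `Chunk` record and made-up file paths; it shows only the bookkeeping, not the retrieval or the model call.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str  # hypothetical document identifier, e.g. a file path
    text: str

def format_context(chunks: list[Chunk]) -> tuple[str, dict[int, str]]:
    """Number each retrieved chunk and return the context text plus a citation map."""
    lines = []
    citations = {}
    for n, chunk in enumerate(chunks, start=1):
        lines.append(f"[{n}] {chunk.text}")
        citations[n] = chunk.source
    return "\n".join(lines), citations

# Hypothetical retrieval result.
retrieved = [
    Chunk("handbook/leave.md", "Employees accrue 1.5 vacation days per month."),
    Chunk("handbook/remote.md", "Remote work requires manager approval."),
]
context, citations = format_context(retrieved)
```

If the model is asked to cite with the bracketed numbers, `citations` maps each number back to the document it came from.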

When RAG Doesn't Make Sense

Small, stable knowledge bases: If your documents fit in context, just include them. Retrieval adds complexity without benefit.
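Gemini 2.5 Pro's 1M-token context window: Large enough that many use cases once treated as "RAG required" can pass the documents directly in context instead. This is simpler, and it is often more accurate because no relevant passage can be missed by the retrieval step.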
Real-time data: RAG retrieves from an index; if data changes frequently, keeping the index current is its own problem.
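A rough check for the "just include them" case is to estimate the token count of the whole corpus and compare it with the context window. The sketch below assumes about 4 characters per token, which is only a loose heuristic for English text; a real check would use the target model's tokenizer. The function names and the `reserved_tokens` budget are illustrative.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the 4 chars/token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)

def fits_in_context(documents: list[str], context_window: int, reserved_tokens: int = 8_000) -> bool:
    """True if all documents fit, leaving reserved_tokens for the question and the answer."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= context_window - reserved_tokens

# Hypothetical corpus: 200 documents of roughly 10,000 characters each.
docs = ["x" * 10_000 for _ in range(200)]
use_direct_context = fits_in_context(docs, context_window=1_000_000)
```

When the check passes, direct context injection is worth trying before building a retrieval pipeline.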

RAG vs. Fine-tuning

RAG: model retrieves knowledge at inference time. Flexible, updatable, works with any base model. Good for factual knowledge and documents.

Fine-tuning: knowledge baked into model weights during training. Better for style, behavior, and domain-specific reasoning patterns. Worse for updatable factual knowledge.

Most production systems want RAG, not fine-tuning, for knowledge. Use fine-tuning for style and behavior.

Implementation

Common stacks:

Embeddings: text-embedding-3-small (OpenAI), or local alternatives
Vector stores: Pinecone, Chroma, pgvector (Postgres extension), Weaviate
Orchestration: LangChain, LlamaIndex for managing the retrieval pipeline

For code, Cursor implements RAG over your codebase natively: semantic search across repo contents is the same mechanism applied to code files.

Weaknesses

Retrieval quality bottlenecks answer quality — garbage in, garbage out
Chunking strategy significantly impacts what gets retrieved (too small = no context, too large = irrelevant noise)
RAG adds latency (the retrieval step) and infrastructure complexity
Multi-hop questions (answer requires connecting information from multiple documents) are hard
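The chunking trade-off mentioned above is easiest to see with a concrete splitter. The sketch below is one simple strategy, fixed-size word windows with overlap, and is not the only or necessarily the best approach; the function name and parameters are illustrative. Overlap repeats a few words between neighbouring chunks so a sentence cut at a boundary still appears whole in at least one chunk.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

pieces = chunk_words("a b c d e f g h i j", size=4, overlap=1)
# pieces == ["a b c d", "d e f g", "g h i j"]
```

A smaller `size` gives more precise matches but less surrounding context per chunk; a larger `size` does the opposite.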

Related

gemini-2-5-pro · cursor · mcp · agentic-workflows
