RAG (Retrieval Augmented Generation)

RAG grounds LLM responses in external documents by retrieving relevant content at inference time. The model doesn't need to memorize everything; it looks things up when needed. Introduced by Lewis et al. in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020, arXiv:2005.11401).

The Basic Pattern

  1. Index documents into a vector store (embed text chunks, store embeddings)
  2. At query time, embed the user's question
  3. Retrieve the top-K chunks most similar to the question
  4. Inject retrieved chunks into the model's context
  5. Model answers based on the retrieved content + its training knowledge
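The five steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and all function names and sample chunks are hypothetical. Step 5 (the model call) is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words term counts (assumed stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Step 1: embed every chunk and keep (chunk, embedding) pairs in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Steps 2 and 3: embed the question, return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, retrieved):
    """Step 4: inject the retrieved chunks into the model's context."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return (
        "Answer using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical sample data for illustration only.
chunks = [
    "The refund window is 30 days from the delivery date.",
    "Support is available on weekdays from 9am to 5pm.",
    "Shipping to Canada takes 5 to 7 business days.",
]
index = build_index(chunks)
question = "How many days is the refund window?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
```

In a real system, `embed` would call an embedding model and `build_index` / `retrieve` would be backed by a vector store, but the control flow stays the same.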

This lets you use a model with a fixed training cutoff to answer questions about recent documents, proprietary knowledge, or data the model never saw.

When RAG Makes Sense

  • Private knowledge: Company docs, codebases, internal wikis — anything not in the model's training data
  • Recent information: Documents updated after the model's training cutoff
  • Large knowledge bases: More content than fits in a single context window
  • Citation requirements: RAG makes it possible to surface which documents answers came from
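One way the citation point could work in practice is to number each retrieved chunk in the context and map the markers in the answer back to source documents. The sketch below assumes the model has been prompted to cite with `[n]` markers; the file names, the sample answer, and both helper functions are hypothetical.

```python
import re


def format_context(retrieved):
    """Label each (source_id, text) pair with a numbered marker the model can cite."""
    return "\n".join(
        f"[{i}] ({src}) {text}" for i, (src, text) in enumerate(retrieved, start=1)
    )


def cited_sources(answer, retrieved):
    """Map [n] markers found in the answer back to the source ids they refer to."""
    numbers = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [src for i, (src, _) in enumerate(retrieved, start=1) if i in numbers]


# Hypothetical retrieved chunks and model output, for illustration only.
retrieved = [
    ("handbook.md", "Employees accrue 1.5 vacation days per month."),
    ("policy.md", "Unused vacation days expire on March 31."),
]
context = format_context(retrieved)
answer = "Vacation accrues at 1.5 days per month [1]."
sources = cited_sources(answer, retrieved)
```

A marker only shows which chunk the model pointed at; whether the chunk actually supports the claim still has to be checked separately.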

When RAG Doesn't Make Sense

  • Small, stable knowledge bases: If your documents fit in context, just include them. Retrieval adds complexity without benefit.
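  • Long-context models: gemini-2-5-pro's 1M-token context window is large enough that many "RAG required" use cases can use direct context injection instead. This is simpler and can be more accurate, since there is no retrieval step that might miss the relevant passage.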
  • Real-time data: RAG retrieves from an index; if data changes frequently, keeping the index current is its own problem.
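The "does it fit in context?" check from the first two bullets can be made explicit. The sketch below is a rough heuristic, not a tokenizer: the 4-characters-per-token ratio and the 4,000-token reserve for the question and answer are assumptions, and `choose_strategy` is a hypothetical helper.

```python
def estimate_tokens(text, chars_per_token=4):
    """Crude token estimate; the 4 chars/token ratio is an assumed rule of thumb."""
    return len(text) // chars_per_token + 1


def choose_strategy(documents, context_window, reserved_tokens=4000):
    """Return "direct" if all documents fit in the context budget, else "retrieve".

    reserved_tokens is an assumed allowance for the prompt, question, and answer.
    """
    total = sum(estimate_tokens(d) for d in documents)
    budget = context_window - reserved_tokens
    return "direct" if total <= budget else "retrieve"


# Synthetic corpora for illustration only.
small_corpus = ["x" * 2_000] * 10    # roughly 5k estimated tokens
large_corpus = ["x" * 40_000] * 500  # roughly 5M estimated tokens

small_in_long_context = choose_strategy(small_corpus, context_window=1_000_000)
small_in_short_context = choose_strategy(small_corpus, context_window=8_000)
large_in_long_context = choose_strategy(large_corpus, context_window=1_000_000)
```

For a real decision, the estimate should come from the target model's own tokenizer, and cost and latency per request matter alongside raw fit.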

RAG vs. Fine-tuning

RAG: model retrieves knowledge at inference time. Flexible, updatable, works with any base model. Good for factual knowledge and documents.

Fine-tuning: knowledge baked into model weights during training. Better for style, behavior, and domain-specific reasoning patterns. Worse for updatable factual knowledge.

Most production systems want RAG, not fine-tuning, for knowledge. Reserve fine-tuning for style and behavior.

Implementation

Common stacks:

  • Embeddings: text-embedding-3-small (OpenAI), or local alternatives
  • Vector stores: Pinecone, Chroma, pgvector (Postgres extension), Weaviate
  • Orchestration: LangChain, LlamaIndex for managing the retrieval pipeline

For code, cursor implements RAG over your codebase natively: semantic search across repo contents is the same mechanism applied to code files.

Weaknesses

  • Retrieval quality bottlenecks answer quality — garbage in, garbage out
  • Chunking strategy significantly impacts what gets retrieved (too small = no context, too large = irrelevant noise)
  • Adding RAG adds latency (retrieval step) and infrastructure complexity
  • Multi-hop questions (answer requires connecting information from multiple documents) are hard
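The chunking trade-off above is usually tuned with two knobs: chunk size and overlap between neighbouring chunks. The following is a minimal word-based sketch; the function name and default sizes are illustrative assumptions, and real pipelines often split on tokens, sentences, or document structure instead.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of chunk_size words, sharing `overlap` words
    between neighbours so content cut at a boundary appears in both chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Ten placeholder words, chunked 4 at a time with 1 word of overlap.
text = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(text, chunk_size=4, overlap=1)
# pieces == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

Larger chunks carry more surrounding context per hit but dilute the embedding with unrelated text; overlap softens boundary effects at the cost of a larger index.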

Related

gemini-2-5-pro · cursor · mcp · agentic-workflows

Sources