RAG
Retrieval-Augmented Generation: a technique that combines document retrieval with LLMs to generate responses grounded in up-to-date, domain-specific information.
What it is
RAG (Retrieval-Augmented Generation) is an AI technique that:
- Searches for relevant information in your documents
- Retrieves the most useful fragments
- Augments the LLM prompt with that context
- Generates responses based on real data
RAG solves the problem that LLMs don’t know your private information or recent data.
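The four steps above can be sketched end to end with a toy in-memory retriever. This is purely illustrative: it scores documents by keyword overlap instead of real embeddings, and all the sample documents and helper names are made up.

```typescript
// Toy RAG flow: retrieve by keyword overlap, then build an augmented prompt.
const documents = [
  "Vacation policy: employees get 20 business days per year.",
  "SSO setup: enable SAML under Settings > Security.",
  "Expense reports are due by the 5th of each month.",
];

// 1-2. Search + retrieve: score each document by words shared with the question.
function retrieve(question: string, docs: string[], topK = 1): string[] {
  const qWords = new Set(question.toLowerCase().split(/\W+/));
  return docs
    .map((doc) => ({
      doc,
      score: doc.toLowerCase().split(/\W+/).filter((w) => qWords.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((r) => r.doc);
}

// 3. Augment: prepend the retrieved context to the prompt sent to the LLM.
function buildPrompt(question: string, context: string[]): string {
  return `Based on this documentation:\n${context.join("\n")}\nAnswer: ${question}`;
}

const context = retrieve("What is the vacation policy?", documents);
const prompt = buildPrompt("What is the vacation policy?", context);
// 4. Generate: `prompt` would now be sent to the LLM.
```

In a real system the keyword overlap is replaced by embedding similarity, which is what lets retrieval match on meaning rather than exact words.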
Pronunciation
IPA: /ræɡ/
Sounds like: “rag” - like the cloth, one syllable
Common mistakes:
- “R-A-G” (spelled out) - incorrect
- “rahg” (long ‘a’) - incorrect
Why RAG matters
Without RAG
User: "What is my company's vacation policy?"
LLM: "Vacation policies vary by company.
Generally they include 15-20 days per year..."
[Generic response, doesn't know YOUR company]
With RAG
User: "What is my company's vacation policy?"
RAG System:
1. Searches HR documents
2. Finds: vacation_policy_2026.pdf
3. Extracts: "20 business days + 5 for tenure"
LLM + Context: "According to current policy, you have 20
business days of vacation, plus 5 additional
days for your 3 years of tenure. The request
process is..."
[Specific response with real data]
How it works
┌─────────────────────────────────────────────────────────────┐
│ RAG ARCHITECTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. INDEXING (preparation, once) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Documents → Chunks → Embeddings → Vector DB │ │
│ │ │ │
│ │ PDF, Word, Wiki Fragments Numerical vectors │ │
│ │ Confluence, Notion of ~500 tokens representing │ │
│ │ Code, FAQs meaning │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ 2. QUERY (each question) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Question ──→ Embedding ──→ Vector Search │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ "How do [0.2, 0.8, Top 5 most │ │
│ │ I setup 0.1, ...] similar chunks │ │
│ │ SSO?" │ │
│ │ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ 3. GENERATION │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Prompt = Question + Retrieved context │ │
│ │ │ │
│ │ "Based on this documentation: │ │
│ │ [chunk1] [chunk2] [chunk3] │ │
│ │ Answer: How do I setup SSO?" │ │
│ │ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ LLM generates grounded response │ │
│ │ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
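The query stage (step 2) boils down to a nearest-neighbor search by cosine similarity. A minimal sketch, where the tiny 3-dimensional vectors stand in for real embedding vectors and are purely illustrative:

```typescript
type Chunk = { text: string; vector: number[] };

// Toy index: in practice these vectors come from an embedding model
// and live in a vector DB, not an array.
const index: Chunk[] = [
  { text: "How to set up SSO", vector: [0.9, 0.1, 0.0] },
  { text: "Vacation policy details", vector: [0.0, 0.8, 0.2] },
  { text: "Resetting your password", vector: [0.7, 0.2, 0.1] },
];

// Cosine similarity: dot product of the vectors divided by their norms.
function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Return the top-K chunks most similar to the query vector.
function vectorSearch(query: number[], topK = 2): Chunk[] {
  return [...index]
    .sort((a, b) => cosine(query, b.vector) - cosine(query, a.vector))
    .slice(0, topK);
}

// Pretend this is the embedding of "How do I setup SSO?"
const results = vectorSearch([0.85, 0.15, 0.05]);
```

Vector databases implement the same idea with approximate nearest-neighbor indexes so the search stays fast over millions of chunks.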
Key components
| Component | Function | Examples |
|---|---|---|
| Embeddings | Convert text to vectors | OpenAI Ada, Cohere, Sentence-BERT |
| Vector DB | Store and search vectors | Pinecone, Weaviate, Chroma, pgvector |
| Chunking | Split documents | By paragraphs, semantic, hybrid |
| Reranking | Improve relevance | Cohere Rerank, Cross-encoders |
| LLM | Generate response | GPT-4, Claude, Llama |
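The chunking component from the table can be as simple as splitting on paragraph boundaries and packing paragraphs up to a size budget. A sketch, using a rough word count as a stand-in for a token count:

```typescript
// Split a document into chunks of at most `maxWords` words, breaking on
// paragraph boundaries so each chunk stays semantically coherent.
function chunkByParagraphs(text: string, maxWords = 100): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (candidate.split(/\s+/).length > maxWords && current) {
      // Adding this paragraph would overflow the budget: close the chunk.
      chunks.push(current);
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const chunks = chunkByParagraphs(
  "Intro paragraph.\n\nSecond paragraph here.\n\nThird one.",
  6
);
```

Production chunkers count real tokens and often add overlap between consecutive chunks, but the packing logic is the same.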
Practical example: Support chatbot
```typescript
// Simplified example with LangChain
// (assumes `pineconeIndex` is an already-initialized Pinecone index client)
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { PineconeStore } from "langchain/vectorstores/pinecone";
import { ChatOpenAI } from "langchain/chat_models/openai";

// 1. Search relevant documents
const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  { pineconeIndex }
);

const relevantDocs = await vectorStore.similaritySearch(
  "How do I reset my password?",
  5 // top 5 results
);

// 2. Generate response with context
const llm = new ChatOpenAI({ modelName: "gpt-4" });

const response = await llm.invoke([
  {
    role: "system",
    content: `You are a support agent. Use ONLY this information:
${relevantDocs.map(d => d.pageContent).join('\n\n')}`,
  },
  {
    role: "user",
    content: "How do I reset my password?",
  },
]);
```
RAG vs. fine-tuning
| Aspect | RAG | Fine-tuning |
|---|---|---|
| Updates | Instant | Requires re-training |
| Cost | Low (inference only) | High (training) |
| Traceability | Cites sources | “Black box” |
| Private data | Stays local | Incorporated into model |
| Best for | FAQ, documentation | Tone, specific format |
Best practices
Do
- Use chunks of 200-500 tokens
- Implement reranking for better precision
- Include metadata (date, source, author)
- Version your indexes
- Filter by minimum relevance
Don’t
- Make chunks too large (relevance gets diluted)
- Make chunks too small (context is lost)
- Ignore the quality of source documents
- Omit handling for “I don’t know” cases
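Two of the practices above, filtering by minimum relevance and handling “I don’t know”, fit together naturally: if no retrieved chunk clears the relevance threshold, the system should decline rather than let the LLM guess. A minimal sketch (the threshold value and sample data are illustrative):

```typescript
type Scored = { text: string; score: number }; // e.g. cosine similarity in [0, 1]

// Keep only chunks whose retrieval score clears a minimum threshold.
function filterByRelevance(results: Scored[], minScore = 0.75): Scored[] {
  return results.filter((r) => r.score >= minScore);
}

// If nothing qualifies, answer "I don't know" instead of hallucinating.
function contextOrFallback(results: Scored[]): string {
  const usable = filterByRelevance(results);
  return usable.length > 0
    ? usable.map((r) => r.text).join("\n\n")
    : "I don't know";
}

const retrieved: Scored[] = [
  { text: "SSO setup guide", score: 0.91 },
  { text: "Office parking rules", score: 0.42 },
];
const answerContext = contextOrFallback(retrieved);
```

The right threshold depends on the embedding model and should be tuned against labeled queries, not guessed.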
Important metrics
┌────────────────────────────────────────┐
│ KPIs FOR RAG │
├────────────────────────────────────────┤
│ │
│ Retrieval: │
│ - Precision@K: % relevant in top K │
│ - Recall: % documents found │
│ - MRR: Position of first relevant │
│ │
│ Generation: │
│ - Faithfulness: Fidelity to context │
│ - Answer relevancy: Useful to user │
│ - Hallucination rate: Made-up info │
│ │
└────────────────────────────────────────┘
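The retrieval KPIs above are straightforward to compute from a ranked list of retrieved IDs and the set of IDs judged relevant. A toy sketch (the doc IDs are illustrative):

```typescript
// Precision@K: fraction of the top-K retrieved items that are relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  return retrieved.slice(0, k).filter((id) => relevant.has(id)).length / k;
}

// Reciprocal rank: 1 / rank of the first relevant item (0 if none retrieved).
// MRR is this value averaged over many queries.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const i = retrieved.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}

const relevant = new Set(["doc1", "doc4"]);
const retrieved = ["doc3", "doc1", "doc2", "doc4", "doc5"];
const p5 = precisionAtK(retrieved, relevant, 5); // 2 of the top 5 are relevant
const rr = reciprocalRank(retrieved, relevant);  // first relevant item at rank 2
```

Generation-side metrics (faithfulness, answer relevancy, hallucination rate) are harder to automate and are usually estimated with an LLM judge or human labels.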
Related terms
- [[LLM]] - The model that generates responses
- [[Agentic AI]] - Agents that use RAG for complex tasks
- [[MCP]] - Protocol to connect RAG to multiple sources
Remember: RAG is only as good as your documents. Garbage in, garbage out. Invest in the quality and organization of your knowledge base.