
RAG

Retrieval-Augmented Generation - Technique that combines document retrieval with LLMs to generate responses based on up-to-date and specific information.

What it is

RAG (Retrieval-Augmented Generation) is an AI technique that:

  1. Searches for relevant information in your documents
  2. Retrieves the most useful fragments
  3. Augments the LLM prompt with that context
  4. Generates responses based on real data

RAG addresses a core limitation of LLMs: out of the box, they know nothing about your private documents or anything published after their training cutoff.
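The four steps above can be sketched end to end. This is a toy sketch: the "retriever" scores documents by shared keywords instead of embeddings, and the final LLM call is left as a comment, since the point is the search → retrieve → augment → generate flow.

```typescript
// Minimal RAG loop sketch. The retriever is a toy keyword scorer;
// a real system would use embeddings and a vector database.
type Doc = { id: string; text: string };

const docs: Doc[] = [
  { id: "hr-1", text: "Vacation policy: 20 business days per year." },
  { id: "hr-2", text: "Expense reports are due by the 5th of each month." },
];

// Steps 1-2. Search and retrieve: rank docs by word overlap with the question
function retrieve(question: string, k: number): Doc[] {
  const words = new Set(question.toLowerCase().split(/\W+/));
  return docs
    .map(d => ({
      d,
      score: d.text.toLowerCase().split(/\W+/).filter(w => words.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.d);
}

// Step 3. Augment: splice the retrieved chunks into the prompt
function buildPrompt(question: string, context: Doc[]): string {
  return `Based on this documentation:\n${context.map(c => c.text).join("\n")}\nAnswer: ${question}`;
}

// Step 4. Generate: `prompt` would be sent to an LLM here
const question = "What is the vacation policy?";
const prompt = buildPrompt(question, retrieve(question, 1));
```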

Pronunciation

IPA: /ræɡ/

Sounds like: “rag” - like the cloth, one syllable

Common mistakes:

  • “R-A-G” (spelled out) - incorrect
  • “rahg” (long ‘a’) - incorrect

Why RAG matters

Without RAG

User: "What is my company's vacation policy?"

LLM: "Vacation policies vary by company.
      Generally they include 15-20 days per year..."

[Generic response, doesn't know YOUR company]

With RAG

User: "What is my company's vacation policy?"

RAG System:
  1. Searches HR documents
  2. Finds: vacation_policy_2026.pdf
  3. Extracts: "20 business days + 5 for tenure"

LLM + Context: "According to current policy, you have 20
                business days of vacation, plus 5 additional
                days for your 3 years of tenure. The request
                process is..."

[Specific response with real data]

How it works

┌─────────────────────────────────────────────────────────────┐
│                    RAG ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   1. INDEXING (preparation, once)                           │
│   ┌──────────────────────────────────────────────────────┐  │
│   │  Documents → Chunks → Embeddings → Vector DB         │  │
│   │                                                       │  │
│   │  PDF, Word, Wiki    Fragments    Numerical vectors   │  │
│   │  Confluence, Notion  of ~500 tokens  representing    │  │
│   │  Code, FAQs                        meaning            │  │
│   └──────────────────────────────────────────────────────┘  │
│                                                              │
│   2. QUERY (each question)                                  │
│   ┌──────────────────────────────────────────────────────┐  │
│   │                                                       │  │
│   │  Question ──→ Embedding ──→ Vector Search           │  │
│   │     │              │              │                   │  │
│   │     ▼              ▼              ▼                   │  │
│   │  "How do      [0.2, 0.8,    Top 5 most             │  │
│   │   I setup      0.1, ...]    similar chunks          │  │
│   │   SSO?"                                               │  │
│   │                                                       │  │
│   └──────────────────────────────────────────────────────┘  │
│                                                              │
│   3. GENERATION                                             │
│   ┌──────────────────────────────────────────────────────┐  │
│   │                                                       │  │
│   │  Prompt = Question + Retrieved context               │  │
│   │                                                       │  │
│   │  "Based on this documentation:                       │  │
│   │   [chunk1] [chunk2] [chunk3]                         │  │
│   │   Answer: How do I setup SSO?"                       │  │
│   │                                                       │  │
│   │           │                                           │  │
│   │           ▼                                           │  │
│   │        LLM generates grounded response               │  │
│   │                                                       │  │
│   └──────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
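The indexing stage (box 1 above) reduces to: split documents into chunks, embed each chunk, store the vectors. A minimal sketch, with a toy character-frequency "embedding" standing in for a real embedding model:

```typescript
// Indexing sketch: Documents → Chunks → Embeddings → Vector store.
// The "embedding" is a toy letter-frequency vector; real systems call
// an embedding model (OpenAI, Cohere, Sentence-BERT, etc.).

// Split text into fixed-size character windows (real chunkers often
// split on paragraph or semantic boundaries instead)
function chunk(text: string, maxChars = 200): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

function embed(text: string): number[] {
  const vec = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const idx = ch.charCodeAt(0) - 97; // 'a' → 0 ... 'z' → 25
    if (idx >= 0 && idx < 26) vec[idx] += 1;
  }
  const norm = Math.hypot(...vec) || 1;
  return vec.map(v => v / norm); // unit-normalize so dot product = cosine
}

// Vector "DB": an in-memory array of (chunk, vector) pairs
const index = chunk("example document text ".repeat(20)).map(c => ({
  chunk: c,
  vector: embed(c),
}));
```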

Key components

| Component  | Function                 | Examples                             |
|------------|--------------------------|--------------------------------------|
| Embeddings | Convert text to vectors  | OpenAI Ada, Cohere, Sentence-BERT    |
| Vector DB  | Store and search vectors | Pinecone, Weaviate, Chroma, pgvector |
| Chunking   | Split documents          | By paragraphs, semantic, hybrid      |
| Reranking  | Improve relevance        | Cohere Rerank, Cross-encoders        |
| LLM        | Generate response        | GPT-4, Claude, Llama                 |
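Under the hood, the vector-search step is usually cosine similarity between the query embedding and each stored vector. A minimal top-K search over plain number arrays:

```typescript
// Cosine similarity: the core operation behind vector search
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the indices of the k stored vectors most similar to the query
function topK(query: number[], stored: number[][], k: number): number[] {
  return stored
    .map((v, i) => ({ i, score: cosineSimilarity(query, v) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.i);
}
```

Dedicated vector DBs do the same ranking with approximate nearest-neighbor indexes so it scales to millions of vectors.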

Practical example: Support chatbot

// Simplified example with LangChain (packages @langchain/openai and
// @langchain/pinecone; assumes `pineconeIndex` was already initialized
// with the Pinecone client and populated during indexing)
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";

// 1. Search relevant documents
const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  { pineconeIndex }
);

const relevantDocs = await vectorStore.similaritySearch(
  "How do I reset my password?",
  5  // top 5 results
);

// 2. Generate response with context
const llm = new ChatOpenAI({ modelName: "gpt-4" });

const response = await llm.invoke([
  {
    role: "system",
    content: `You are a support agent. Use ONLY this information:
              ${relevantDocs.map(d => d.pageContent).join("\n\n")}`
  },
  {
    role: "user",
    content: "How do I reset my password?"
  }
]);

Advantages vs Fine-tuning

| Aspect       | RAG                  | Fine-tuning              |
|--------------|----------------------|--------------------------|
| Updates      | Instant              | Requires re-training     |
| Cost         | Low (inference only) | High (training)          |
| Traceability | Cites sources        | "Black box"              |
| Private data | Stays local          | Incorporated into model  |
| Best for     | FAQ, documentation   | Tone, specific format    |

Best practices

Do

  • Use chunks of 200-500 tokens
  • Implement reranking for better precision
  • Include metadata (date, source, author)
  • Version your indexes
  • Filter by minimum relevance

Don’t

  • Use chunks that are too large (relevance suffers)
  • Use chunks that are too small (context is lost)
  • Ignore source document quality
  • Skip handling "I don't know" cases
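Two of the practices above, relevance filtering and "I don't know" handling, fit naturally in one gate between retrieval and generation. A sketch, where the 0.75 threshold is an illustrative value to tune against your own evaluation set:

```typescript
// Gate between retrieval and generation: drop low-relevance chunks and
// refuse to answer when nothing clears the minimum score.
type Scored = { text: string; score: number };

// Keep only chunks above the threshold; null signals "no usable context"
function selectContext(results: Scored[], minScore = 0.75): string[] | null {
  const relevant = results.filter(r => r.score >= minScore);
  return relevant.length > 0 ? relevant.map(r => r.text) : null;
}

function answerOrRefuse(results: Scored[]): string {
  const context = selectContext(results);
  if (context === null) {
    return "I don't know — no sufficiently relevant documents were found.";
  }
  // In a real system, `context` + the question go to the LLM here
  return `Answering from ${context.length} relevant chunk(s).`;
}
```

Refusing up front is cheaper and safer than letting the LLM improvise over irrelevant context.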

Important metrics

┌────────────────────────────────────────┐
│   KPIs FOR RAG                          │
├────────────────────────────────────────┤
│                                         │
│   Retrieval:                           │
│   - Precision@K: % relevant in top K   │
│   - Recall: % documents found          │
│   - MRR: Position of first relevant    │
│                                         │
│   Generation:                          │
│   - Faithfulness: Fidelity to context  │
│   - Answer relevancy: Useful to user   │
│   - Hallucination rate: Made-up info   │
│                                         │
└────────────────────────────────────────┘
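The retrieval KPIs in the box above are straightforward to compute once you have ground-truth relevance labels. A sketch of Precision@K and MRR over ranked result lists:

```typescript
// Retrieval KPIs. `retrievedIds` is the ranked list one query returned;
// `relevantIds` is the ground-truth set of relevant document IDs.
function precisionAtK(
  retrievedIds: string[],
  relevantIds: Set<string>,
  k: number
): number {
  const top = retrievedIds.slice(0, k);
  return top.filter(id => relevantIds.has(id)).length / k;
}

// MRR: reciprocal rank of the first relevant result, averaged over queries
function mrr(rankedLists: string[][], relevantIds: Set<string>): number {
  const reciprocalRanks = rankedLists.map(list => {
    const pos = list.findIndex(id => relevantIds.has(id));
    return pos === -1 ? 0 : 1 / (pos + 1);
  });
  return reciprocalRanks.reduce((a, b) => a + b, 0) / rankedLists.length;
}
```

Frameworks such as Ragas automate the generation-side metrics (faithfulness, answer relevancy), which need an LLM or human judge rather than a formula.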

Related concepts

  • [[LLM]] - The model that generates responses
  • [[Agentic AI]] - Agents that use RAG for complex tasks
  • [[MCP]] - Protocol to connect RAG to multiple sources

Remember: RAG is only as good as your documents. Garbage in, garbage out. Invest in the quality and organization of your knowledge base.