
RAG

Retrieval-Augmented Generation - Technique that combines document retrieval with LLMs to generate responses based on up-to-date and specific information.

What it is

RAG (Retrieval-Augmented Generation) is an AI technique that:

  1. Searches for relevant information in your documents
  2. Retrieves the most useful fragments
  3. Augments the LLM prompt with that context
  4. Generates responses based on real data

RAG addresses a core limitation of LLMs: out of the box, they know nothing about your private documents or anything published after their training cutoff.
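The four steps above can be sketched end to end. This is a toy sketch: the "retriever" scores documents by shared keywords instead of embeddings, and the final LLM call is left as a comment, since the point is the search → retrieve → augment → generate flow.

```typescript
// Minimal RAG loop sketch. The retriever is a toy keyword scorer;
// a real system would use embeddings and a vector database.
type Doc = { id: string; text: string };

const docs: Doc[] = [
  { id: "hr-1", text: "Vacation policy: 20 business days per year." },
  { id: "hr-2", text: "Expense reports are due by the 5th of each month." },
];

// Steps 1-2. Search and retrieve: rank docs by word overlap with the question
function retrieve(question: string, k: number): Doc[] {
  const words = new Set(question.toLowerCase().split(/\W+/));
  return docs
    .map(d => ({
      d,
      score: d.text.toLowerCase().split(/\W+/).filter(w => words.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.d);
}

// Step 3. Augment: splice the retrieved chunks into the prompt
function buildPrompt(question: string, context: Doc[]): string {
  return `Based on this documentation:\n${context.map(c => c.text).join("\n")}\nAnswer: ${question}`;
}

// Step 4. Generate: `prompt` would be sent to an LLM here
const question = "What is the vacation policy?";
const prompt = buildPrompt(question, retrieve(question, 1));
```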

Pronunciation

IPA: /ræɡ/

Sounds like: “rag” - like the cloth, one syllable

Common mistakes:

  • “R-A-G” (spelled out) - incorrect
  • “rahg” (long ‘a’) - incorrect

Why RAG matters

Without RAG

User: "What is my company's vacation policy?"

LLM: "Vacation policies vary by company.
      Generally they include 15-20 days per year..."

[Generic response, doesn't know YOUR company]

With RAG

User: "What is my company's vacation policy?"

RAG System:
  1. Searches HR documents
  2. Finds: vacation_policy_2026.pdf
  3. Extracts: "20 business days + 5 for tenure"

LLM + Context: "According to current policy, you have 20
                business days of vacation, plus 5 additional
                days for your 3 years of tenure. The request
                process is..."

[Specific response with real data]

How it works

┌─────────────────────────────────────────────────────────────┐
│                    RAG ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   1. INDEXING (preparation, once)                           │
│   ┌──────────────────────────────────────────────────────┐  │
│   │  Documents → Chunks → Embeddings → Vector DB         │  │
│   │                                                       │  │
│   │  PDF, Word, Wiki    Fragments    Numerical vectors   │  │
│   │  Confluence, Notion  of ~500 tokens  representing    │  │
│   │  Code, FAQs                        meaning            │  │
│   └──────────────────────────────────────────────────────┘  │
│                                                              │
│   2. QUERY (each question)                                  │
│   ┌──────────────────────────────────────────────────────┐  │
│   │                                                       │  │
│   │  Question ──→ Embedding ──→ Vector Search           │  │
│   │     │              │              │                   │  │
│   │     ▼              ▼              ▼                   │  │
│   │  "How do      [0.2, 0.8,    Top 5 most             │  │
│   │   I setup      0.1, ...]    similar chunks          │  │
│   │   SSO?"                                               │  │
│   │                                                       │  │
│   └──────────────────────────────────────────────────────┘  │
│                                                              │
│   3. GENERATION                                             │
│   ┌──────────────────────────────────────────────────────┐  │
│   │                                                       │  │
│   │  Prompt = Question + Retrieved context               │  │
│   │                                                       │  │
│   │  "Based on this documentation:                       │  │
│   │   [chunk1] [chunk2] [chunk3]                         │  │
│   │   Answer: How do I setup SSO?"                       │  │
│   │                                                       │  │
│   │           │                                           │  │
│   │           ▼                                           │  │
│   │        LLM generates grounded response               │  │
│   │                                                       │  │
│   └──────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
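The indexing stage (box 1 above) reduces to: split documents into chunks, embed each chunk, store the vectors. A minimal sketch, with a toy character-frequency "embedding" standing in for a real embedding model:

```typescript
// Indexing sketch: Documents → Chunks → Embeddings → Vector store.
// The "embedding" is a toy letter-frequency vector; real systems call
// an embedding model (OpenAI, Cohere, Sentence-BERT, etc.).

// Split text into fixed-size character windows (real chunkers often
// split on paragraph or semantic boundaries instead)
function chunk(text: string, maxChars = 200): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

function embed(text: string): number[] {
  const vec = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const idx = ch.charCodeAt(0) - 97; // 'a' → 0 ... 'z' → 25
    if (idx >= 0 && idx < 26) vec[idx] += 1;
  }
  const norm = Math.hypot(...vec) || 1;
  return vec.map(v => v / norm); // unit-normalize so dot product = cosine
}

// Vector "DB": an in-memory array of (chunk, vector) pairs
const index = chunk("example document text ".repeat(20)).map(c => ({
  chunk: c,
  vector: embed(c),
}));
```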

Key components

| Component  | Function                 | Examples                             |
|------------|--------------------------|--------------------------------------|
| Embeddings | Convert text to vectors  | OpenAI Ada, Cohere, Sentence-BERT    |
| Vector DB  | Store and search vectors | Pinecone, Weaviate, Chroma, pgvector |
| Chunking   | Split documents          | By paragraphs, semantic, hybrid      |
| Reranking  | Improve relevance        | Cohere Rerank, Cross-encoders        |
| LLM        | Generate response        | GPT-4, Claude, Llama                 |
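Under the hood, the vector-search step is usually cosine similarity between the query embedding and each stored vector. A minimal top-K search over plain number arrays:

```typescript
// Cosine similarity: the core operation behind vector search
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the indices of the k stored vectors most similar to the query
function topK(query: number[], stored: number[][], k: number): number[] {
  return stored
    .map((v, i) => ({ i, score: cosineSimilarity(query, v) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.i);
}
```

Dedicated vector DBs do the same ranking with approximate nearest-neighbor indexes so it scales to millions of vectors.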

Practical example: Support chatbot

// Simplified example with LangChain (packages @langchain/openai and
// @langchain/pinecone; assumes `pineconeIndex` was already initialized
// with the Pinecone client and populated during indexing)
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";

// 1. Search relevant documents
const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  { pineconeIndex }
);

const relevantDocs = await vectorStore.similaritySearch(
  "How do I reset my password?",
  5  // top 5 results
);

// 2. Generate response with context
const llm = new ChatOpenAI({ modelName: "gpt-4" });

const response = await llm.invoke([
  {
    role: "system",
    content: `You are a support agent. Use ONLY this information:
              ${relevantDocs.map(d => d.pageContent).join("\n\n")}`
  },
  {
    role: "user",
    content: "How do I reset my password?"
  }
]);

Advantages vs Fine-tuning

| Aspect       | RAG                  | Fine-tuning              |
|--------------|----------------------|--------------------------|
| Updates      | Instant              | Requires re-training     |
| Cost         | Low (inference only) | High (training)          |
| Traceability | Cites sources        | "Black box"              |
| Private data | Stays local          | Incorporated into model  |
| Best for     | FAQ, documentation   | Tone, specific format    |

Best practices

Do

  • Use chunks of 200-500 tokens
  • Implement reranking for better precision
  • Include metadata (date, source, author)
  • Version your indexes
  • Filter by minimum relevance

Don’t

  • Use chunks that are too large (relevance suffers)
  • Use chunks that are too small (context is lost)
  • Ignore source document quality
  • Skip handling "I don't know" cases
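Two of the practices above, relevance filtering and "I don't know" handling, fit naturally in one gate between retrieval and generation. A sketch, where the 0.75 threshold is an illustrative value to tune against your own evaluation set:

```typescript
// Gate between retrieval and generation: drop low-relevance chunks and
// refuse to answer when nothing clears the minimum score.
type Scored = { text: string; score: number };

// Keep only chunks above the threshold; null signals "no usable context"
function selectContext(results: Scored[], minScore = 0.75): string[] | null {
  const relevant = results.filter(r => r.score >= minScore);
  return relevant.length > 0 ? relevant.map(r => r.text) : null;
}

function answerOrRefuse(results: Scored[]): string {
  const context = selectContext(results);
  if (context === null) {
    return "I don't know — no sufficiently relevant documents were found.";
  }
  // In a real system, `context` + the question go to the LLM here
  return `Answering from ${context.length} relevant chunk(s).`;
}
```

Refusing up front is cheaper and safer than letting the LLM improvise over irrelevant context.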

Important metrics

┌────────────────────────────────────────┐
│   KPIs FOR RAG                          │
├────────────────────────────────────────┤
│                                         │
│   Retrieval:                           │
│   - Precision@K: % relevant in top K   │
│   - Recall: % documents found          │
│   - MRR: Position of first relevant    │
│                                         │
│   Generation:                          │
│   - Faithfulness: Fidelity to context  │
│   - Answer relevancy: Useful to user   │
│   - Hallucination rate: Made-up info   │
│                                         │
└────────────────────────────────────────┘
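The retrieval KPIs in the box above are straightforward to compute once you have ground-truth relevance labels. A sketch of Precision@K and MRR over ranked result lists:

```typescript
// Retrieval KPIs. `retrievedIds` is the ranked list one query returned;
// `relevantIds` is the ground-truth set of relevant document IDs.
function precisionAtK(
  retrievedIds: string[],
  relevantIds: Set<string>,
  k: number
): number {
  const top = retrievedIds.slice(0, k);
  return top.filter(id => relevantIds.has(id)).length / k;
}

// MRR: reciprocal rank of the first relevant result, averaged over queries
function mrr(rankedLists: string[][], relevantIds: Set<string>): number {
  const reciprocalRanks = rankedLists.map(list => {
    const pos = list.findIndex(id => relevantIds.has(id));
    return pos === -1 ? 0 : 1 / (pos + 1);
  });
  return reciprocalRanks.reduce((a, b) => a + b, 0) / rankedLists.length;
}
```

Frameworks such as Ragas automate the generation-side metrics (faithfulness, answer relevancy), which need an LLM or human judge rather than a formula.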

Related concepts

  • [[LLM]] - The model that generates responses
  • [[Agentic AI]] - Agents that use RAG for complex tasks
  • [[MCP]] - Protocol to connect RAG to multiple sources

Remember: RAG is only as good as your documents. Garbage in, garbage out. Invest in the quality and organization of your knowledge base.