RAG (Retrieval-Augmented Generation)

Architecture

A technique that grounds LLM responses in external documents by retrieving relevant chunks from a vector store and injecting them into the prompt context at inference time.

RAG allows an LLM to answer questions about documents it was never trained on — without fine-tuning. The system searches a private knowledge base for relevant passages and stuffs them into the prompt, asking the model to "answer based on the provided context".

The RAG Pipeline

Indexing Phase (done once / on document update)

1. Load documents (PDF, DOCX, web, database).
2. Chunk into pieces (500–1000 tokens, with overlap).
3. Embed each chunk via an embedding model (BGE, all-MiniLM).
4. Store vectors + metadata in a vector DB (Chroma, Qdrant, pgvector).
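A minimal sketch of this phase, assuming chromadb and sentence-transformers as the local stack. The model name, storage path, collection name, and the handbook.txt corpus are illustrative placeholders, and the chunker is a naive character-window splitter rather than a token-aware one.

```python
# Indexing sketch: chunk -> embed locally -> store in Chroma.
import chromadb
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Naive fixed-size character windows; production systems usually split on
    # tokens, sentences, or document structure instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # runs locally, no API calls
client = chromadb.PersistentClient(path="./rag_index")    # on-disk vector store
collection = client.get_or_create_collection("docs")

corpus = {"handbook.txt": open("handbook.txt", encoding="utf-8").read()}  # hypothetical document

for source, text in corpus.items():
    pieces = chunk(text)
    collection.add(
        ids=[f"{source}-{i}" for i in range(len(pieces))],
        documents=pieces,
        embeddings=embedder.encode(pieces).tolist(),
        metadatas=[{"source": source, "chunk": i} for i in range(len(pieces))],
    )
```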

Query Phase (per user request)

1. Embed the user query.
2. ANN search for the top-K most similar chunks.
3. Optional: rerank the retrieved chunks (Cohere Rerank, BGE-Reranker).
4. Augment the prompt with the top N chunks.
5. Generate the answer with the LLM.
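A matching query-phase sketch, reusing the embedder and collection from the indexing sketch above (the optional reranking step is omitted). The prompt template and top-K value are illustrative, and ask_llm assumes a local OpenAI-compatible server (vLLM, llama.cpp, Ollama); its URL and model name are placeholders for your own deployment.

```python
import requests

def ask_llm(prompt: str) -> str:
    # Assumes a local OpenAI-compatible endpoint; URL and model name are placeholders.
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={"model": "local-model", "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

def retrieve(query: str, k: int = 5) -> list[str]:
    hits = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=k,
    )
    return hits["documents"][0]   # top-K chunk texts for this query

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the provided context, "
        "and cite the chunk numbers you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

question = "What is our data retention policy?"
answer = ask_llm(build_prompt(question, retrieve(question)))
```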

Advanced RAG Patterns

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer first, embed it, then search. The generated answer is often closer to the target documents in embedding space than the raw question.
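A short sketch of HyDE built on the same helpers (ask_llm, embedder, collection from the sketches above); the instruction wording is illustrative.

```python
def hyde_retrieve(query: str, k: int = 5) -> list[str]:
    # 1. Let the model invent a plausible answer; factual errors are acceptable here.
    hypothetical = ask_llm(f"Write a short passage that plausibly answers: {query}")
    # 2. Embed the hypothetical passage instead of the raw question.
    vec = embedder.encode([hypothetical]).tolist()
    # 3. Search with that embedding; answer-shaped text usually lands nearer to
    #    the real documents than a terse question does.
    return collection.query(query_embeddings=vec, n_results=k)["documents"][0]
```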

Parent-Child Chunking

Store small chunks for precision retrieval, but return the parent (larger) chunk to the LLM for full context. Solves the precision-vs-context tradeoff.
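A parent-child sketch with the same stack. The chunk sizes and the in-memory parent_store are illustrative; a real system would persist parents in a document store.

```python
parent_store: dict[str, str] = {}          # parent_id -> full section text

def index_section(section_id: str, section_text: str) -> None:
    parent_store[section_id] = section_text
    children = chunk(section_text, size=200, overlap=40)   # small chunks for precise matching
    collection.add(
        ids=[f"{section_id}-child-{i}" for i in range(len(children))],
        documents=children,
        embeddings=embedder.encode(children).tolist(),
        metadatas=[{"parent_id": section_id} for _ in children],
    )

def retrieve_parents(query: str, k: int = 5) -> list[str]:
    hits = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=k,
    )
    parent_ids = {m["parent_id"] for m in hits["metadatas"][0]}   # dedupe children of one parent
    return [parent_store[pid] for pid in parent_ids]              # hand the LLM the full sections
```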

Self-RAG

The model learns to decide when to retrieve (not every query needs it), critique the retrieved passages, and verify its own output against them.
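Self-RAG proper is a trained model with special reflection tokens; the sketch below is only a prompt-level approximation of the same control flow (retrieve-or-not, critique, verify), reusing the earlier helpers.

```python
def yes(prompt: str) -> bool:
    # Crude yes/no judgement via the LLM itself; a trained critic is more reliable.
    return ask_llm(prompt).strip().lower().startswith("yes")

def self_rag_answer(query: str) -> str:
    if not yes(f"Does answering this require looking up documents? Yes or No.\n{query}"):
        return ask_llm(query)                               # skip retrieval entirely
    chunks = retrieve(query)
    relevant = [c for c in chunks                           # critique each retrieved passage
                if yes(f"Is this passage relevant to the question '{query}'? Yes or No.\n{c}")]
    draft = ask_llm(build_prompt(query, relevant))
    evidence = "\n".join(relevant)
    supported = yes(                                        # verify the draft against the evidence
        "Is every claim in the answer supported by the passages? Yes or No.\n"
        f"Passages:\n{evidence}\n\nAnswer:\n{draft}"
    )
    if supported:
        return draft
    return ask_llm(build_prompt(query, relevant) + "\nState only what the context supports.")
```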

GraphRAG

Build a knowledge graph from the corpus. Retrieval uses graph traversal rather than pure vector similarity — better for multi-hop reasoning and entity relationships.
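A toy illustration of graph-based retrieval, not the full GraphRAG pipeline (which also builds community summaries). The triples are hypothetical; in practice an LLM extraction pass over the corpus produces them.

```python
import networkx as nx

graph = nx.Graph()
# Hypothetical extracted relations (entity, entity, relation).
graph.add_edge("Acme GmbH", "Project Falcon", relation="runs")
graph.add_edge("Project Falcon", "PostgreSQL", relation="uses")
graph.add_edge("Project Falcon", "Jane Doe", relation="led by")

def graph_retrieve(entities: list[str], hops: int = 2) -> list[str]:
    # Collect every fact within `hops` edges of the query entities: this is what
    # lets a single retrieval step answer multi-hop questions.
    facts = set()
    for entity in entities:
        if entity not in graph:
            continue
        nearby = nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
        for node in nearby:
            for a, b, data in graph.edges(node, data=True):
                facts.add(f"{a} {data['relation']} {b}")
    return sorted(facts)

# "Who leads the project that uses PostgreSQL?" needs two hops from the PostgreSQL node.
print(graph_retrieve(["PostgreSQL"]))
```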

RAG vs Fine-Tuning

|                               | RAG                    | Fine-Tuning        |
|-------------------------------|------------------------|--------------------|
| Needs retraining on new data? | No (re-embed docs)     | Yes                |
| Source attribution            | Built-in (cite chunks) | Difficult          |
| Hallucination                 | Lower (grounded)       | Higher risk        |
| Encyclopaedic knowledge       | Must be in index       | Baked into weights |
| Style/tone adaptation         | Via prompt only        | Deep adaptation    |

Why It Matters for On-Premise

RAG is the most practical path to adding private knowledge to an on-premise LLM. No GPU-intensive retraining. No data uploaded to cloud embedding APIs (use local BGE or all-MiniLM). Full audit trail of which documents informed each answer. Tools: LangChain, LlamaIndex, Haystack, or a custom FastAPI + Chroma stack.
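A bare-bones example of the "custom FastAPI + Chroma stack" idea, reusing retrieve, build_prompt, and ask_llm from the sketches above; the route and response shape are illustrative, not a standard.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(question: Question) -> dict:
    chunks = retrieve(question.text)
    answer = ask_llm(build_prompt(question.text, chunks))
    # Returning the retrieved chunks alongside the answer provides the audit trail.
    return {"answer": answer, "sources": chunks}
```

Served with uvicorn on the internal network, this keeps documents, embeddings, and generation entirely on-premise.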