RAG (Retrieval-Augmented Generation)

Architecture

A technique that grounds LLM responses in external documents by retrieving relevant chunks from a vector store and injecting them into the prompt context at inference time.

RAG allows an LLM to answer questions about documents it was never trained on — without fine-tuning. The system searches a private knowledge base for relevant passages and stuffs them into the prompt, asking the model to "answer based on the provided context".

The RAG Pipeline

Indexing Phase (done once / on document update)

1. Load documents (PDF, DOCX, web, database).
2. Chunk into pieces (500–1000 tokens, with overlap).
3. Embed each chunk via an embedding model (BGE, all-MiniLM).
4. Store vectors + metadata in a vector DB (Chroma, Qdrant, pgvector).
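A minimal sketch of this phase, assuming chromadb and sentence-transformers as the local stack. The model name, storage path, collection name, and the handbook.txt corpus are illustrative placeholders, and the chunker is a naive character-window splitter rather than a token-aware one.

```python
# Indexing sketch: chunk -> embed locally -> store in Chroma.
import chromadb
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Naive fixed-size character windows; production systems usually split on
    # tokens, sentences, or document structure instead.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # runs locally, no API calls
client = chromadb.PersistentClient(path="./rag_index")    # on-disk vector store
collection = client.get_or_create_collection("docs")

corpus = {"handbook.txt": open("handbook.txt", encoding="utf-8").read()}  # hypothetical document

for source, text in corpus.items():
    pieces = chunk(text)
    collection.add(
        ids=[f"{source}-{i}" for i in range(len(pieces))],
        documents=pieces,
        embeddings=embedder.encode(pieces).tolist(),
        metadatas=[{"source": source, "chunk": i} for i in range(len(pieces))],
    )
```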

Query Phase (per user request)

1. Embed the user query.
2. ANN search for the top-K most similar chunks.
3. Optional: rerank the retrieved chunks (Cohere Rerank, BGE-Reranker).
4. Augment the prompt with the top N chunks.
5. Generate the answer with the LLM.
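A matching query-phase sketch, reusing the embedder and collection from the indexing sketch above (the optional reranking step is omitted). The prompt template and top-K value are illustrative, and ask_llm assumes a local OpenAI-compatible server (vLLM, llama.cpp, Ollama); its URL and model name are placeholders for your own deployment.

```python
import requests

def ask_llm(prompt: str) -> str:
    # Assumes a local OpenAI-compatible endpoint; URL and model name are placeholders.
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={"model": "local-model", "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

def retrieve(query: str, k: int = 5) -> list[str]:
    hits = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=k,
    )
    return hits["documents"][0]   # top-K chunk texts for this query

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the provided context, "
        "and cite the chunk numbers you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

question = "What is our data retention policy?"
answer = ask_llm(build_prompt(question, retrieve(question)))
```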

Advanced RAG Patterns

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer first, embed it, then search. The generated answer is often closer to the target documents in embedding space than the raw question.
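A short sketch of HyDE built on the same helpers (ask_llm, embedder, collection from the sketches above); the instruction wording is illustrative.

```python
def hyde_retrieve(query: str, k: int = 5) -> list[str]:
    # 1. Let the model invent a plausible answer; factual errors are acceptable here.
    hypothetical = ask_llm(f"Write a short passage that plausibly answers: {query}")
    # 2. Embed the hypothetical passage instead of the raw question.
    vec = embedder.encode([hypothetical]).tolist()
    # 3. Search with that embedding; answer-shaped text usually lands nearer to
    #    the real documents than a terse question does.
    return collection.query(query_embeddings=vec, n_results=k)["documents"][0]
```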

Parent-Child Chunking

Store small chunks for precision retrieval, but return the parent (larger) chunk to the LLM for full context. Solves the precision-vs-context tradeoff.
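A parent-child sketch with the same stack. The chunk sizes and the in-memory parent_store are illustrative; a real system would persist parents in a document store.

```python
parent_store: dict[str, str] = {}          # parent_id -> full section text

def index_section(section_id: str, section_text: str) -> None:
    parent_store[section_id] = section_text
    children = chunk(section_text, size=200, overlap=40)   # small chunks for precise matching
    collection.add(
        ids=[f"{section_id}-child-{i}" for i in range(len(children))],
        documents=children,
        embeddings=embedder.encode(children).tolist(),
        metadatas=[{"parent_id": section_id} for _ in children],
    )

def retrieve_parents(query: str, k: int = 5) -> list[str]:
    hits = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=k,
    )
    parent_ids = {m["parent_id"] for m in hits["metadatas"][0]}   # dedupe children of one parent
    return [parent_store[pid] for pid in parent_ids]              # hand the LLM the full sections
```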

Self-RAG

The model learns to decide when to retrieve (not every query needs it), critique the retrieved passages, and verify its own output against them.
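Self-RAG proper is a trained model with special reflection tokens; the sketch below is only a prompt-level approximation of the same control flow (retrieve-or-not, critique, verify), reusing the earlier helpers.

```python
def yes(prompt: str) -> bool:
    # Crude yes/no judgement via the LLM itself; a trained critic is more reliable.
    return ask_llm(prompt).strip().lower().startswith("yes")

def self_rag_answer(query: str) -> str:
    if not yes(f"Does answering this require looking up documents? Yes or No.\n{query}"):
        return ask_llm(query)                               # skip retrieval entirely
    chunks = retrieve(query)
    relevant = [c for c in chunks                           # critique each retrieved passage
                if yes(f"Is this passage relevant to the question '{query}'? Yes or No.\n{c}")]
    draft = ask_llm(build_prompt(query, relevant))
    evidence = "\n".join(relevant)
    supported = yes(                                        # verify the draft against the evidence
        "Is every claim in the answer supported by the passages? Yes or No.\n"
        f"Passages:\n{evidence}\n\nAnswer:\n{draft}"
    )
    if supported:
        return draft
    return ask_llm(build_prompt(query, relevant) + "\nState only what the context supports.")
```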

GraphRAG

Build a knowledge graph from the corpus. Retrieval uses graph traversal rather than pure vector similarity — better for multi-hop reasoning and entity relationships.
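A toy illustration of graph-based retrieval, not the full GraphRAG pipeline (which also builds community summaries). The triples are hypothetical; in practice an LLM extraction pass over the corpus produces them.

```python
import networkx as nx

graph = nx.Graph()
# Hypothetical extracted relations (entity, entity, relation).
graph.add_edge("Acme GmbH", "Project Falcon", relation="runs")
graph.add_edge("Project Falcon", "PostgreSQL", relation="uses")
graph.add_edge("Project Falcon", "Jane Doe", relation="led by")

def graph_retrieve(entities: list[str], hops: int = 2) -> list[str]:
    # Collect every fact within `hops` edges of the query entities: this is what
    # lets a single retrieval step answer multi-hop questions.
    facts = set()
    for entity in entities:
        if entity not in graph:
            continue
        nearby = nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
        for node in nearby:
            for a, b, data in graph.edges(node, data=True):
                facts.add(f"{a} {data['relation']} {b}")
    return sorted(facts)

# "Who leads the project that uses PostgreSQL?" needs two hops from the PostgreSQL node.
print(graph_retrieve(["PostgreSQL"]))
```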

RAG vs Fine-Tuning

|                               | RAG                    | Fine-Tuning        |
|-------------------------------|------------------------|--------------------|
| Needs retraining on new data? | No (re-embed docs)     | Yes                |
| Source attribution            | Built-in (cite chunks) | Difficult          |
| Hallucination                 | Lower (grounded)       | Higher risk        |
| Encyclopaedic knowledge       | Must be in index       | Baked into weights |
| Style/tone adaptation         | Via prompt only        | Deep adaptation    |

Why It Matters for On-Premise

RAG is the most practical path to adding private knowledge to an on-premise LLM. No GPU-intensive retraining. No data uploaded to cloud embedding APIs (use local BGE or all-MiniLM). Full audit trail of which documents informed each answer. Tools: LangChain, LlamaIndex, Haystack, or a custom FastAPI + Chroma stack.
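A bare-bones example of the "custom FastAPI + Chroma stack" idea, reusing retrieve, build_prompt, and ask_llm from the sketches above; the route and response shape are illustrative, not a standard.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask(question: Question) -> dict:
    chunks = retrieve(question.text)
    answer = ask_llm(build_prompt(question.text, chunks))
    # Returning the retrieved chunks alongside the answer provides the audit trail.
    return {"answer": answer, "sources": chunks}
```

Served with uvicorn on the internal network, this keeps documents, embeddings, and generation entirely on-premise.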