RAG allows an LLM to answer questions about documents it was never trained on — without fine-tuning. The system searches a private knowledge base for relevant passages and stuffs them into the prompt, asking the model to "answer based on the provided context".
The RAG Pipeline
Indexing Phase (done once / on document update)
1. Load documents (PDF, DOCX, web, database).
2. Chunk into pieces (500–1000 tokens, with overlap).
3. Embed each chunk via an embedding model (BGE, all-MiniLM).
4. Store vectors + metadata in a vector DB (Chroma, Qdrant, pgvector).
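A minimal sketch of the indexing phase, using deliberately toy stand-ins: a hash-based bag-of-words `embed` in place of BGE/all-MiniLM, tiny word-count chunk sizes in place of 500–1000 tokens, and a plain list in place of a vector DB. The function and variable names are illustrative, not from any library:

```python
import hashlib
import math

def chunk(text, size=40, overlap=10):
    """Split text into overlapping word windows (toy sizes; real
    pipelines use 500-1000 tokens with some overlap)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text, dim=64):
    """Toy hashing bag-of-words embedding, normalised to unit length.
    Stands in for a real model like BGE or all-MiniLM."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# "Vector DB": a plain list of (vector, chunk_text, metadata) rows.
index = []

def add_document(doc_id, text):
    for i, c in enumerate(chunk(text)):
        index.append((embed(c), c, {"doc": doc_id, "chunk": i}))
```

Swapping `embed` for a real model and `index` for Chroma/Qdrant/pgvector keeps the same shape: every chunk becomes a (vector, text, metadata) record.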
Query Phase (per user request)
1. Embed the user query.
2. ANN search for the top-K most similar chunks.
3. Optional: rerank the retrieved chunks (Cohere Rerank, BGE-Reranker).
4. Augment the prompt with the top N chunks.
5. Generate the answer with the LLM.
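The query-phase steps above can be sketched in a few lines. This uses brute-force cosine similarity where a vector DB would use an ANN structure (HNSW, IVF), and the index rows are assumed to be (vector, text, metadata) tuples; all names are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    """Brute-force nearest-neighbour search over (vector, text, meta)
    rows; vector DBs replace this with ANN for large corpora."""
    scored = sorted(index, key=lambda row: cosine(query_vec, row[0]),
                    reverse=True)
    return scored[:k]

def build_prompt(question, chunks):
    """Augment the prompt: retrieved chunks first, then the question."""
    context = "\n---\n".join(text for _, text, _ in chunks)
    return (f"Answer based on the provided context.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The reranking step (step 3) would slot in between `top_k` and `build_prompt`, re-scoring the K candidates with a cross-encoder before keeping the top N.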
Advanced RAG Patterns
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer first, embed it, then search. The generated answer is often closer to the target documents in embedding space than the raw question.
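The HyDE flow is a thin wrapper around an ordinary retrieval call: generate first, then embed the generation instead of the question. Here `generate`, `embed`, and `search` are hypothetical callables standing in for your LLM, embedding model, and vector store:

```python
def hyde_retrieve(question, search, embed, generate, k=3):
    """HyDE: embed a hypothetical answer rather than the raw question.
    `generate` is any LLM call; `embed` and `search` are the same
    functions the normal query phase would use (names are illustrative)."""
    hypothetical = generate(f"Write a short passage answering: {question}")
    # The fake answer tends to sit closer to real answer passages in
    # embedding space than the question itself does.
    return search(embed(hypothetical), k)
```

A trade-off worth noting: HyDE adds one extra LLM call of latency per query, so it is usually reserved for cases where direct question embeddings retrieve poorly.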
Parent-Child Chunking
Store small chunks for precision retrieval, but return the parent (larger) chunk to the LLM for full context. Solves the precision-vs-context tradeoff.
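A sketch of that two-level split, with toy word-count sizes and an assumed retrieval output of `{"text": ..., "parent_id": ...}` dicts; in practice the child chunks are what get embedded, and `parent_id` lives in the vector DB metadata:

```python
def build_parent_child_index(docs, parent_size=200, child_size=50):
    """Split each doc into large parent chunks, each parent into small
    child chunks. Children get embedded and searched; parents get
    handed to the LLM. (Toy word-based sizes.)"""
    parents, children = [], []
    for doc in docs:
        words = doc.split()
        for p_start in range(0, len(words), parent_size):
            parent = " ".join(words[p_start:p_start + parent_size])
            pid = len(parents)
            parents.append(parent)
            pw = parent.split()
            for c_start in range(0, len(pw), child_size):
                children.append({"text": " ".join(pw[c_start:c_start + child_size]),
                                 "parent_id": pid})
    return parents, children

def expand_to_parents(hits, parents):
    """Swap each retrieved child for its parent, deduplicated, so the
    LLM sees full context even when several children share a parent."""
    seen, out = set(), []
    for h in hits:
        if h["parent_id"] not in seen:
            seen.add(h["parent_id"])
            out.append(parents[h["parent_id"]])
    return out
```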
Self-RAG
The model learns to decide when to retrieve (not every query needs it), critique the retrieved passages, and verify its own output against them.
GraphRAG
Build a knowledge graph from the corpus. Retrieval uses graph traversal rather than pure vector similarity — better for multi-hop reasoning and entity relationships.
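A minimal sketch of graph-based retrieval: starting from entities mentioned in the query, walk the graph a fixed number of hops and collect the facts along the way. The `entity -> [(relation, neighbour)]` schema is a toy assumption; real GraphRAG systems extract the graph with an LLM and combine traversal with community summaries:

```python
from collections import deque

def k_hop_retrieve(graph, seed_entities, hops=2):
    """Breadth-first traversal from query entities, collecting edge
    facts up to `hops` steps out. Multi-hop questions ("who employs
    A's friend?") fall out of the walk, where pure vector similarity
    would need both facts to co-occur in one chunk."""
    frontier = deque((e, 0) for e in seed_entities)
    visited, facts = set(seed_entities), []
    while frontier:
        entity, depth = frontier.popleft()
        if depth == hops:
            continue
        for relation, neighbour in graph.get(entity, []):
            facts.append(f"{entity} {relation} {neighbour}")
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return facts
```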
RAG vs Fine-Tuning
| | RAG | Fine-Tuning |
|---|---|---|
| Needs retraining on new data? | No — re-embed docs | Yes |
| Source attribution | Built-in (cite chunks) | Difficult |
| Hallucination | Lower (grounded) | Higher risk |
| Encyclopaedic knowledge | Must be in index | Baked into weights |
| Style/tone adaptation | Via prompt only | Deep adaptation |
Why It Matters for On-Premise
RAG is the most practical path to adding private knowledge to an on-premise LLM. No GPU-intensive retraining. No data uploaded to cloud embedding APIs (use local BGE or all-MiniLM). Full audit trail of which documents informed each answer. Tools: LangChain, LlamaIndex, Haystack, or a custom FastAPI + Chroma stack.