Local Vocabulary
Definitions for the local-first stack: understanding the components of AI sovereignty.
Air-Gapped
A system physically isolated from unsecured networks (the public internet). In LLM contexts, this means running inference without any callback to OpenAI, Anthropic, or HuggingFace servers. The ultimate standard for privacy.
Security
Fine-Tuning
The process of taking a pre-trained base model (like Llama 3) and training it further on a specific dataset to improve performance in a narrow domain. Unlike RAG, this changes the model's weights permanently.
Training
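A toy illustration of the key point above (not a real LLM): fine-tuning applies gradient updates directly to the weights, so the change persists after training. All shapes and data here are made up; a single step on a linear layer is enough to show the contrast with RAG, which never touches the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))        # "pre-trained" weights
x = rng.normal(size=4)             # one training example
target = rng.normal(size=4)

lr = 0.1
pred = W @ x
# Gradient of 0.5 * ||Wx - target||^2 with respect to W
grad = np.outer(pred - target, x)
W_before = W.copy()
W -= lr * grad                     # the update: weights are changed in place

print(np.allclose(W, W_before))    # False: the model itself changed
```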
FP16 / FP32
Floating Point precision. FP32 (Full Precision) is standard for training. FP16 (Half Precision) is standard for inference on GPUs. Lower precision reduces VRAM usage with minimal accuracy loss.
Hardware
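The memory claim is easy to verify: the same tensor at half precision occupies exactly half the bytes. A minimal NumPy sketch (tensor size is arbitrary):

```python
import numpy as np

# Same tensor at two precisions: FP16 uses exactly half the memory of FP32.
weights_fp32 = np.ones((1024, 1024), dtype=np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes)  # 4194304 bytes (4 MiB): 1024*1024*4
print(weights_fp16.nbytes)  # 2097152 bytes (2 MiB): 1024*1024*2
```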
GGUF
GPT-Generated Unified Format. A binary file format for storing models for inference with llama.cpp. Optimized for CPU and Apple Metal performance, efficiently mapping model weights to memory.
Format
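A GGUF file starts with a fixed binary header (per the GGUF spec in the ggml repository: a 4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata-KV counts). A minimal sketch of checking that header, using a fabricated in-memory header rather than a real model file; the counts below are made-up values:

```python
import struct

def read_gguf_header(data: bytes):
    # "<4sIQQ" = little-endian: 4-byte magic, uint32 version,
    # uint64 tensor_count, uint64 metadata_kv_count (24 bytes total)
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Fabricated header: version 3, 291 tensors, 24 metadata key-value pairs
fake_header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(fake_header))  # (3, 291, 24)
```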
Inference
The act of using a trained model to generate predictions (text). In On-Prem terms, "Inference Cost" is the electricity and hardware depreciation required to generate tokens, as opposed to API fees.
Core
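The electricity side of "inference cost" is simple arithmetic. A back-of-envelope sketch; the GPU draw, throughput, and electricity price are all illustrative assumptions, and hardware depreciation is ignored:

```python
power_watts = 350        # assumed GPU draw under load
tokens_per_second = 30   # assumed sustained throughput
price_per_kwh = 0.30     # assumed electricity price, USD

# seconds to generate 1M tokens, converted to kWh, priced out
seconds = 1_000_000 / tokens_per_second
kwh = seconds * power_watts / 3_600_000   # joules -> kWh
cost_per_million_tokens = kwh * price_per_kwh
print(round(cost_per_million_tokens, 2))  # ~0.97 USD per million tokens
```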
LoRA (Low-Rank Adaptation)
Parameter-Efficient Fine-Tuning (PEFT) technique. Instead of retraining all weights, LoRA trains small rank decomposition matrices. Allows fine-tuning 70B models on consumer hardware.
Training
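The parameter savings follow directly from the rank decomposition: instead of training a full d_out x d_in matrix, LoRA trains B (d_out x r) and A (r x d_in). A quick count, using an illustrative 4096x4096 projection and rank 16 (typical values, but assumptions here):

```python
d_out, d_in, r = 4096, 4096, 16

full_params = d_out * d_in            # every weight trainable
lora_params = d_out * r + r * d_in    # only the two low-rank factors

print(full_params)                    # 16777216
print(lora_params)                    # 131072, ~0.8% of the full matrix
```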
Quantization (Q4_K_M, Q8)
Reducing the precision of model weights (e.g., from 16-bit floats to 4-bit integers). This dramatically reduces VRAM requirements (allowing a 70B model to fit on 48GB instead of 140GB) with a small trade-off in perplexity (a rough proxy for output quality).
Optimization
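The VRAM numbers above can be sanity-checked with simple arithmetic on weights alone (no KV cache or runtime overhead). Q4_K_M is a mixed-precision scheme averaging roughly 4.5-5 bits per weight; ~4.8 is assumed below:

```python
params = 70e9  # a 70B-parameter model

def weights_gb(bits_per_weight):
    # total bits -> bytes -> gigabytes (decimal GB)
    return params * bits_per_weight / 8 / 1e9

print(round(weights_gb(16), 1))   # FP16:   140.0 GB
print(round(weights_gb(4.8), 1))  # Q4_K_M: ~42.0 GB, fits a 48GB card
```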
RAG (Retrieval-Augmented Generation)
A technique to ground LLM responses in private data. The system searches a Vector Database for relevant documents, injects them into the prompt context, and asks the LLM to "answer based on this". The model itself does not learn the data.
Architecture
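A minimal sketch of the retrieval-and-inject step, with hand-made 3-dimensional embeddings standing in for a real embedding model and vector database (the documents, vectors, and query are all invented; the actual LLM call is omitted):

```python
import numpy as np

docs = {
    "Invoices are archived for 10 years.": np.array([0.9, 0.1, 0.0]),
    "The cafeteria opens at 8am.":         np.array([0.0, 0.2, 0.9]),
}
query = "How long do we keep invoices?"
query_vec = np.array([1.0, 0.0, 0.1])  # pretend embedding of the query

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieve the most relevant document, then inject it into the prompt.
best_doc = max(docs, key=lambda d: cosine(docs[d], query_vec))
prompt = f"Answer based on this context:\n{best_doc}\n\nQuestion: {query}"
print(best_doc)
```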
Tokens/s
The standard metric for inference speed. Roughly 30 t/s (comfortably above human reading speed) is the usual target for a chat UI. Batch processing can tolerate lower speeds.
Metric
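Measuring the metric is just tokens generated divided by wall-clock time. A sketch where a `sleep` stands in for the actual generation loop (the token count is invented):

```python
import time

start = time.perf_counter()
tokens_generated = 90   # pretend the model emitted 90 tokens
time.sleep(0.05)        # stand-in for the actual generation work
elapsed = time.perf_counter() - start

print(f"{tokens_generated / elapsed:.0f} tokens/s")
```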
Vector Database
A specialized database (e.g., Chroma, Milvus, Qdrant) that stores data as high-dimensional vectors (embeddings). Essential for RAG, allowing semantic search ("find concepts like this") rather than keyword search.
Infrastructure
VRAM
Video RAM. The most critical bottleneck for On-Premise LLMs. The entire model (weights + context cache) must usually fit into VRAM for acceptable performance. If it spills to System RAM, speed drops by 10-50x.
Hardware
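A fit check makes the "weights + context cache" budget concrete. The KV cache grows linearly with context length; the architecture numbers below are assumed (roughly a Llama-3-70B-style model with grouped-query attention), as is the 42GB quantized weight footprint:

```python
vram_gb        = 48
weights_gb     = 42     # e.g. a 4-bit quantized 70B model
n_layers       = 80
n_kv_heads     = 8      # GQA: far fewer KV heads than attention heads
head_dim       = 128
bytes_per_elem = 2      # FP16 cache
context_len    = 8192

# K and V caches, per layer, per head, per token
kv_cache_gb = (2 * n_layers * n_kv_heads * head_dim
               * bytes_per_elem * context_len) / 1e9
print(round(kv_cache_gb, 1))               # ~2.7 GB at 8K context
print(weights_gb + kv_cache_gb < vram_gb)  # True: fits with headroom
```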