Local Vocabulary
Definitions for the local-first stack: understanding the components of AI sovereignty.
Air-Gapped
A system physically isolated from unsecured networks (the public internet). In LLM contexts, this means running inference without any callback to OpenAI, Anthropic, or HuggingFace servers. The ultimate standard for privacy.
Security
Fine-Tuning
The process of taking a pre-trained base model (like Llama 3) and training it further on a specific dataset to improve performance in a narrow domain. Unlike RAG, this changes the model's weights permanently.
Training
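A toy illustration of the key point above (not a real LLM): fine-tuning applies gradient updates directly to the weights, so the change persists after training. All shapes and data here are made up; a single step on a linear layer is enough to show the contrast with RAG, which never touches the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))        # "pre-trained" weights
x = rng.normal(size=4)             # one training example
target = rng.normal(size=4)

lr = 0.1
pred = W @ x
# Gradient of 0.5 * ||Wx - target||^2 with respect to W
grad = np.outer(pred - target, x)
W_before = W.copy()
W -= lr * grad                     # the update: weights are changed in place

print(np.allclose(W, W_before))    # False: the model itself changed
```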
FP16 / FP32
Floating Point precision. FP32 (Full Precision) is standard for training. FP16 (Half Precision) is standard for inference on GPUs. Lower precision reduces VRAM usage with minimal accuracy loss.
Hardware
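The memory claim is easy to verify: the same tensor at half precision occupies exactly half the bytes. A minimal NumPy sketch (tensor size is arbitrary):

```python
import numpy as np

# Same tensor at two precisions: FP16 uses exactly half the memory of FP32.
weights_fp32 = np.ones((1024, 1024), dtype=np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes)  # 4194304 bytes (4 MiB): 1024*1024*4
print(weights_fp16.nbytes)  # 2097152 bytes (2 MiB): 1024*1024*2
```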
GGUF
GPT-Generated Unified Format. A binary file format for storing models for inference with llama.cpp. Optimized for CPU and Apple Metal performance, efficiently mapping model weights to memory.
Format
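A GGUF file starts with a fixed binary header (per the GGUF spec in the ggml repository: a 4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata-KV counts). A minimal sketch of checking that header, using a fabricated in-memory header rather than a real model file; the counts below are made-up values:

```python
import struct

def read_gguf_header(data: bytes):
    # "<4sIQQ" = little-endian: 4-byte magic, uint32 version,
    # uint64 tensor_count, uint64 metadata_kv_count (24 bytes total)
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Fabricated header: version 3, 291 tensors, 24 metadata key-value pairs
fake_header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(fake_header))  # (3, 291, 24)
```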
Inference
The act of using a trained model to generate predictions (text). In On-Prem terms, "Inference Cost" is the electricity and hardware depreciation required to generate tokens, as opposed to API fees.
Core
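The electricity side of "inference cost" is simple arithmetic. A back-of-envelope sketch; the GPU draw, throughput, and electricity price are all illustrative assumptions, and hardware depreciation is ignored:

```python
power_watts = 350        # assumed GPU draw under load
tokens_per_second = 30   # assumed sustained throughput
price_per_kwh = 0.30     # assumed electricity price, USD

# seconds to generate 1M tokens, converted to kWh, priced out
seconds = 1_000_000 / tokens_per_second
kwh = seconds * power_watts / 3_600_000   # joules -> kWh
cost_per_million_tokens = kwh * price_per_kwh
print(round(cost_per_million_tokens, 2))  # ~0.97 USD per million tokens
```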
LoRA (Low-Rank Adaptation)
Parameter-Efficient Fine-Tuning (PEFT) technique. Instead of retraining all weights, LoRA trains small rank decomposition matrices. Allows fine-tuning 70B models on consumer hardware.
Training
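The parameter savings follow directly from the rank decomposition: instead of training a full d_out x d_in matrix, LoRA trains B (d_out x r) and A (r x d_in). A quick count, using an illustrative 4096x4096 projection and rank 16 (typical values, but assumptions here):

```python
d_out, d_in, r = 4096, 4096, 16

full_params = d_out * d_in            # every weight trainable
lora_params = d_out * r + r * d_in    # only the two low-rank factors

print(full_params)                    # 16777216
print(lora_params)                    # 131072, ~0.8% of the full matrix
```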
Quantization (Q4_K_M, Q8)
Reducing the precision of model weights (e.g., from 16-bit floats to 4-bit integers). This dramatically reduces VRAM requirements (allowing a 70B model to fit on 48GB instead of 140GB) with a small trade-off in perplexity (a rough proxy for output quality).
Optimization
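The VRAM numbers above can be sanity-checked with simple arithmetic on weights alone (no KV cache or runtime overhead). Q4_K_M is a mixed-precision scheme averaging roughly 4.5-5 bits per weight; ~4.8 is assumed below:

```python
params = 70e9  # a 70B-parameter model

def weights_gb(bits_per_weight):
    # total bits -> bytes -> gigabytes (decimal GB)
    return params * bits_per_weight / 8 / 1e9

print(round(weights_gb(16), 1))   # FP16:   140.0 GB
print(round(weights_gb(4.8), 1))  # Q4_K_M: ~42.0 GB, fits a 48GB card
```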
RAG (Retrieval-Augmented Generation)
A technique to ground LLM responses in private data. The system searches a Vector Database for relevant documents, injects them into the prompt context, and asks the LLM to "answer based on this". The model itself does not learn the data.
Architecture
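A minimal sketch of the retrieval-and-inject step, with hand-made 3-dimensional embeddings standing in for a real embedding model and vector database (the documents, vectors, and query are all invented; the actual LLM call is omitted):

```python
import numpy as np

docs = {
    "Invoices are archived for 10 years.": np.array([0.9, 0.1, 0.0]),
    "The cafeteria opens at 8am.":         np.array([0.0, 0.2, 0.9]),
}
query = "How long do we keep invoices?"
query_vec = np.array([1.0, 0.0, 0.1])  # pretend embedding of the query

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieve the most relevant document, then inject it into the prompt.
best_doc = max(docs, key=lambda d: cosine(docs[d], query_vec))
prompt = f"Answer based on this context:\n{best_doc}\n\nQuestion: {query}"
print(best_doc)
```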
Tokens/s
The standard metric for inference speed. Roughly 30 t/s (comfortably above human reading speed) is the usual target for a chat UI. Batch processing can tolerate lower speeds.
Metric
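Measuring the metric is just tokens generated divided by wall-clock time. A sketch where a `sleep` stands in for the actual generation loop (the token count is invented):

```python
import time

start = time.perf_counter()
tokens_generated = 90   # pretend the model emitted 90 tokens
time.sleep(0.05)        # stand-in for the actual generation work
elapsed = time.perf_counter() - start

print(f"{tokens_generated / elapsed:.0f} tokens/s")
```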
Vector Database
A specialized database (e.g., Chroma, Milvus, Qdrant) that stores data as high-dimensional vectors (embeddings). Essential for RAG, allowing semantic search ("find concepts like this") rather than keyword search.
Infrastructure
VRAM
Video RAM. The most critical bottleneck for On-Premise LLMs. The entire model (weights + context cache) must usually fit into VRAM for acceptable performance. If it spills to System RAM, speed drops by 10-50x.
Hardware
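A fit check makes the "weights + context cache" budget concrete. The KV cache grows linearly with context length; the architecture numbers below are assumed (roughly a Llama-3-70B-style model with grouped-query attention), as is the 42GB quantized weight footprint:

```python
vram_gb        = 48
weights_gb     = 42     # e.g. a 4-bit quantized 70B model
n_layers       = 80
n_kv_heads     = 8      # GQA: far fewer KV heads than attention heads
head_dim       = 128
bytes_per_elem = 2      # FP16 cache
context_len    = 8192

# K and V caches, per layer, per head, per token
kv_cache_gb = (2 * n_layers * n_kv_heads * head_dim
               * bytes_per_elem * context_len) / 1e9
print(round(kv_cache_gb, 1))               # ~2.7 GB at 8K context
print(weights_gb + kv_cache_gb < vram_gb)  # True: fits with headroom
```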