On-Premise LLM Glossary
50 terms covering the entire on-premise AI stack. Click any term for a detailed explanation.
Agentic AI
AI systems that autonomously plan and execute multi-step tasks using tools, memory, and external APIs — going far beyond single-turn chat.
Air-Gapped
A system physically isolated from all external networks, including the public internet. The gold standard for on-premise data sovereignty.
AlpacaEval
Automated instruction-following benchmark: 805 prompts judged by GPT-4. Measures win rate vs. text-davinci-003 baseline. Fast, cheap, and highly correlated with Chatbot Arena Elo ratings.
ARC Challenge
AI2 Reasoning Challenge — grade-school science multiple-choice questions. The 'Challenge' set contains questions that retrieval-based and word co-occurrence systems fail on.
Attention Mechanism
The mathematical core of every transformer model — it lets the model weigh which tokens in the context are most relevant when generating each new token.
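A minimal NumPy sketch of single-head scaled dot-product attention (no masking, no multi-head split), just to make the token weighting explicit:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for one attention head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the context tokens
    return weights @ V                                   # weighted mix of value vectors

rng = np.random.default_rng(0)                           # toy example: 4 tokens, 8-dim head
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```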
BF16 (Brain Float 16)
A 16-bit floating point format developed by Google with the same exponent range as FP32 — making it more numerically stable for training than FP16.
BIG-Bench Hard (BBH)
23 challenging tasks from the BIG-Bench suite where LLMs historically underperformed humans — covering logical reasoning, multi-step arithmetic, causal reasoning, and formal logic.
Chain-of-Thought (CoT)
A prompting strategy that instructs the model to reason step-by-step before answering, dramatically improving performance on complex tasks.
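A hypothetical prompt showing the pattern; the trailing trigger sentence is the whole trick:

```python
# Hypothetical CoT prompt; the final sentence nudges the model to show its reasoning.
prompt = (
    "Q: A server rack draws 3.2 kW and runs 24 hours a day. "
    "How many kWh does it consume in 30 days?\n"
    "A: Let's think step by step."
)
# Expected behaviour: the model first writes the intermediate steps
# (3.2 kW x 24 h = 76.8 kWh/day; 76.8 x 30 = 2,304 kWh) and only then the final answer.
```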
Chatbot Arena (Elo)
LMSYS Chatbot Arena — crowd-sourced side-by-side LLM battles judged by real users. Elo ratings derived from millions of human preference votes. The ground-truth human preference leaderboard.
Context Window
The maximum number of tokens (input + output combined) a model can process in a single call. Directly tied to VRAM usage via the KV cache.
DPO (Direct Preference Optimization)
A simpler alignment technique than RLHF that directly fine-tunes a model on preferred vs rejected response pairs — no separate reward model needed.
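A minimal PyTorch sketch of the DPO objective, assuming the summed log-probabilities of each response under the trainable policy and the frozen reference model have already been computed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) preference pairs.
    Each argument is a tensor of summed per-token log-probabilities."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # widen the gap between chosen and rejected responses, anchored to the reference model
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```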
Embeddings
Dense numerical vector representations of text that capture semantic meaning — the foundation of semantic search and RAG pipelines.
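A small sketch with the sentence-transformers library; the model name is just a common example, any embedding model works the same way:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")             # small, CPU-friendly embedding model
docs = ["GPU memory sizing for inference", "Pasta recipes", "How much VRAM a 70B model needs"]
doc_vecs = model.encode(docs, normalize_embeddings=True)    # one unit-length vector per document

query_vec = model.encode(["VRAM requirements for large models"], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec                               # cosine similarity on unit vectors
print(docs[int(np.argmax(scores))])                         # the semantically closest document
```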
Fine-Tuning
Continuing to train a pre-trained model on a domain-specific dataset to permanently improve its performance on specialised tasks.
Flash Attention 2
A hardware-aware attention algorithm that rewrites the attention computation to be IO-optimal, enabling 3–4× faster inference and larger context windows.
FP16 / FP32
Floating point precision formats that determine how much memory model weights occupy. FP32 is the default for training, FP16 for GPU inference — halving the precision halves the VRAM needed for the weights.
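Back-of-envelope arithmetic for the weights of a 7B-parameter model (weights only, KV cache excluded):

```python
params = 7e9                                            # a 7B-parameter model
for fmt, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{fmt:>9}: {params * bytes_per_param / 1e9:.1f} GB")
# FP32 ~28 GB, FP16 ~14 GB, INT8 ~7 GB, INT4 ~3.5 GB of weight storage
```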
GGUF
GPT-Generated Unified Format — the binary file format used by llama.cpp to store quantized LLM weights. The standard for CPU and consumer-GPU inference.
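A minimal sketch with llama-cpp-python, the Python bindings for llama.cpp; the file path and quantization level are placeholders for whatever GGUF checkpoint you have locally:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/example-8b-instruct.Q4_K_M.gguf",   # placeholder path
            n_ctx=4096,          # context window to allocate
            n_gpu_layers=-1)     # offload all layers to the GPU if one is available
out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```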
GPQA
Graduate-Level Google-Proof Q&A — 448 expert-crafted questions in biology, chemistry and physics so hard that even PhD researchers only score 65%. Designed to challenge frontier models.
GPTQ
A GPU-native post-training quantization method using second-order Hessian information to minimise weight-rounding error — typically faster on GPU than GGUF at the same bit depth.
GSM8K
Grade School Math 8K — 8,500 linguistically diverse elementary math word problems requiring multi-step arithmetic. The primary benchmark for evaluating LLM mathematical reasoning.
HellaSwag
Commonsense natural language inference: pick the most plausible continuation for an activity description. Created with adversarial filtering — humans score 95%, early models scored ~40%.
HELM
Holistic Evaluation of Language Models — 42 scenarios × 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). A comprehensive multi-dimensional evaluation framework.
HumanEval
OpenAI's code generation benchmark: 164 hand-crafted Python programming problems. Models must write a function that passes all unit tests. The canonical LLM coding benchmark.
IFEval
Instruction Following Evaluation — 500+ prompts each with verifiable formatting constraints (use N words, include keywords, write in JSON). Objective evaluation of instruction adherence.
Inference
The process of running a trained model to generate output. For on-premise LLMs, the operational cost of inference is electricity plus hardware depreciation per generated token.
KV Cache
A cache of the Key and Value attention tensors for all tokens already processed — avoiding redundant recomputation and making autoregressive generation efficient.
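A sizing sketch using the usual formula, 2 (K and V) × layers × KV heads × head dim × tokens × bytes per element; the example numbers assume a Llama-2-7B-like configuration in FP16:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    # the factor 2 accounts for storing both the Key and the Value tensor per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Assumed config: 32 layers, 32 KV heads, head_dim 128 (Llama-2-7B-like), FP16
gb = kv_cache_bytes(32, 32, 128, seq_len=4096) / 1e9
print(f"{gb:.1f} GB")            # ~2.1 GB of VRAM for a single 4k-token sequence
```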
LiveCodeBench
Code generation benchmark built from problems released after model training cutoffs — preventing contamination. Continuously updated from LeetCode, Codeforces, and AtCoder.
LoRA & QLoRA
Parameter-efficient fine-tuning adapters that inject trainable low-rank matrices into transformer layers — allowing 70B model fine-tuning on consumer hardware.
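A minimal sketch with Hugging Face PEFT; the base model name and hyperparameters are illustrative, not a recommended recipe:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # example model
config = LoraConfig(
    r=16,                                   # rank of the injected low-rank matrices
    lora_alpha=32,                          # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],    # which projection layers receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()          # typically well under 1% of the base parameters
```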
MATH
Competition-level mathematics (AMC, AIME, and similar competitions) across 5 difficulty levels and 7 subject areas. Far harder than GSM8K — still discriminating at the frontier.
MBPP
Mostly Basic Programming Problems — 500+ crowd-sourced Python problems. Simpler than HumanEval but broader coverage. Commonly paired with HumanEval for a more complete code eval picture.
Mixture of Experts (MoE)
An architecture where the model has multiple parallel FFN "expert" layers per transformer block, with a router selecting only a subset per token — giving huge parameter counts with low active compute.
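A toy PyTorch sketch of a sparse MoE block with a top-k router; real implementations batch tokens per expert and add load-balancing losses:

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy sparse MoE block: a router picks the top-k experts for every token."""
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                       # only top_k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(ToyMoELayer()(torch.randn(6, 512)).shape)           # torch.Size([6, 512])
```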
MMLU
Massive Multitask Language Understanding — 57-subject multiple-choice exam covering STEM, humanities, law and more. The standard academic knowledge benchmark since 2021.
MMMU
Massive Multi-discipline Multimodal Understanding — 11,500 questions across 30 subjects and 183 subfields requiring image + text reasoning. The MMLU equivalent for multimodal language models.
MT-Bench
Multi-Turn Benchmark — 80 challenging multi-turn conversations across 8 categories, scored by GPT-4 as judge. Introduced the LLM-as-judge paradigm, enabling scalable open-ended evaluation.
Multimodal LLM
Models that process and reason across multiple input modalities — text, images, audio, and video — in a single unified architecture.
ONNX / ONNX Runtime
Open Neural Network Exchange — a portable model format that enables cross-framework interoperability, often used to accelerate inference via ONNX Runtime.
Perplexity (PPL)
A measure of how well a language model predicts a sample of text — lower perplexity means better language modelling. Used to measure quality degradation from quantization.
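A worked micro-example: perplexity is the exponential of the mean negative log-likelihood per token. The probabilities below are made up:

```python
import math

token_probs = [0.42, 0.15, 0.63, 0.08, 0.30]      # hypothetical per-token probabilities
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(f"PPL = {math.exp(nll):.2f}")               # lower is better; 1.0 would mean perfect prediction
```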
Prompt Engineering
The practice of designing LLM inputs to maximise output quality — including system prompts, few-shot examples, chain-of-thought triggers, and output format instructions.
Quantization
Reducing the numerical precision of model weights from FP16 to INT8 or INT4, dramatically cutting VRAM requirements with only a small trade-off in quality.
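A toy symmetric INT8 round trip on one weight tensor; real schemes (GPTQ, GGUF k-quants, AWQ) work block-wise and are far more sophisticated, but the storage saving and the rounding error are the same idea:

```python
import numpy as np

w = np.random.randn(4096).astype(np.float32)                        # FP32 weights
scale = np.abs(w).max() / 127                                       # map the largest weight to +/-127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)    # 4x smaller than FP32
w_restored = w_int8.astype(np.float32) * scale                      # dequantized at inference time
print("mean absolute rounding error:", np.abs(w - w_restored).mean())
```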
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external documents by retrieving relevant chunks from a vector store and injecting them into the prompt context at inference time.
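A minimal sketch of the retrieve-then-prompt flow; the vectors are assumed to come from an embedding model, and the LLM call itself is omitted:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    scores = doc_vecs @ query_vec                 # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(question, chunks):
    context = "\n".join(f"- {c}" for c in chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# chunks = retrieve(embed(question), index_vectors, index_texts)
# answer = llm(build_prompt(question, chunks))    # the retrieved chunks ground the response
```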
RLHF (Reinforcement Learning from Human Feedback)
The alignment technique used to train ChatGPT-style models by learning from human preferences — combining supervised fine-tuning, reward modelling, and PPO.
Speculative Decoding
A technique that uses a small draft model to predict multiple tokens ahead, which a large verifier model then accepts or rejects in parallel — achieving 2–3× faster decoding.
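A simplified sketch of one speculative step with greedy acceptance; draft_next() and target_batch_next() are placeholders for the two models, and production systems use rejection sampling so the output distribution matches the large model exactly:

```python
def speculative_step(ids, draft_next, target_batch_next, n_draft=4):
    # 1) the small draft model proposes n_draft tokens autoregressively (cheap)
    draft = []
    for _ in range(n_draft):
        draft.append(draft_next(ids + draft))
    # 2) the large model scores every draft position in a single parallel forward pass;
    #    verified[i] is its greedy choice given ids + draft[:i]
    verified = target_batch_next(ids, draft)
    # 3) keep the longest agreeing prefix, then take the large model's correction and stop
    accepted = []
    for d, v in zip(draft, verified):
        accepted.append(v)
        if d != v:
            break
    return ids + accepted
```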
SWE-bench
Software Engineering Benchmark — 2,294 real GitHub issues from popular Python repositories. Models must write a code patch that passes the repo's official test suite. The hardest coding benchmark.
System Prompt
A special instruction block prepended to every conversation that defines the model's persona, constraints, output format, and access boundaries — the foundation of any production LLM deployment.
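A hypothetical example in the usual chat-message format; the wording is illustrative:

```python
messages = [
    {"role": "system", "content": (
        "You are the internal support assistant for an example company. "
        "Answer only from the provided documents, never reveal personal data, "
        "and format every answer as Markdown with a short summary first.")},
    {"role": "user", "content": "How do I request a new access badge?"},
]
# The system message is prepended to every conversation before it reaches the model.
```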
Tokenizer
The component that converts raw text into token IDs (numbers) that the model processes — and converts token IDs back to text. A critical factor in multilingual performance and context efficiency.
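A round trip with a Hugging Face tokenizer; the model name is only an example, and each model family ships its own vocabulary:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                  # example tokenizer
ids = tok.encode("On-premise inference keeps data in-house.")
print(ids)                                                   # a short list of integer token IDs
print(tok.decode(ids))                                       # back to the original text
print(len(ids))                                              # token count: what fills the context window
```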
Tokens/s (Throughput)
The primary performance metric for LLM inference — how many output tokens are generated per second. Determines user experience and how many concurrent users a deployment can serve.
TruthfulQA
817 questions designed to elicit false answers from LLMs — covering conspiracy theories, misconceptions, and myths. Measures a model's truthfulness rather than its knowledge breadth.
Vector Database
A specialised database that stores high-dimensional embedding vectors and enables fast approximate nearest-neighbour (ANN) search — the backbone of RAG pipelines.
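What a vector database does, reduced to brute force with NumPy; real systems build ANN indexes (HNSW, IVF) so the lookup stays fast at millions of vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 384)).astype(np.float32)    # 10k stored chunk embeddings
index /= np.linalg.norm(index, axis=1, keepdims=True)        # store unit vectors

query = rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

scores = index @ query                                       # cosine similarity against every vector
top_k = np.argsort(scores)[::-1][:5]                         # ids of the 5 nearest chunks
print(top_k, scores[top_k])
```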
vLLM
A high-throughput LLM inference engine featuring PagedAttention for optimal KV cache utilisation. The production standard for serving LLMs to multiple concurrent users.
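A minimal offline-inference sketch; the model name is only an example:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")                  # loads the weights onto the local GPU(s)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarise why PagedAttention matters."], params)
print(outputs[0].outputs[0].text)
# vLLM also ships an OpenAI-compatible HTTP server for serving many concurrent users.
```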
VRAM
Video RAM — the GPU's onboard memory. The single most critical hardware resource for on-premise LLM inference. All model weights and KV cache must fit inside VRAM for full GPU speed.
WinoGrande
Large-scale commonsense reasoning via pronoun disambiguation — 44,000 adversarially filtered sentence pairs. Tests whether models resolve ambiguous pronouns using world knowledge.