On-Premise LLM Glossary
50 terms covering the entire on-premise AI stack. Click any term for a detailed explanation.
Agentic AI
AI systems that autonomously plan and execute multi-step tasks using tools, memory, and external APIs — going far beyond single-turn chat.
Air-Gapped
A system physically isolated from all external networks, including the public internet. The gold standard for on-premise data sovereignty.
AlpacaEval
Automated instruction-following benchmark: 805 prompts judged by GPT-4. Measures win rate vs. text-davinci-003 baseline. Fast, cheap, and highly correlated with Chatbot Arena Elo ratings.
ARC Challenge
AI2 Reasoning Challenge — grade-school science multiple-choice questions. The 'Challenge' set contains questions that retrieval-based and word co-occurrence systems fail on.
Attention Mechanism
The mathematical core of every transformer model — it lets the model weigh which tokens in the context are most relevant when generating each new token.
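A minimal NumPy sketch of single-head scaled dot-product attention (no masking, no multi-head split), just to make the token weighting explicit:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for one attention head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # relevance of each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the context tokens
    return weights @ V                                   # weighted mix of value vectors

rng = np.random.default_rng(0)                           # toy example: 4 tokens, 8-dim head
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```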
BF16 (Brain Float 16)
A 16-bit floating point format developed by Google with the same exponent range as FP32 — making it more numerically stable for training than FP16.
BIG-Bench Hard (BBH)
23 challenging tasks from the BIG-Bench suite where LLMs historically underperformed humans — covering logical reasoning, multi-step arithmetic, causal reasoning, and formal logic.
Chain-of-Thought (CoT)
A prompting strategy that instructs the model to reason step-by-step before answering, dramatically improving performance on complex tasks.
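A hypothetical prompt showing the pattern; the trailing trigger sentence is the whole trick:

```python
# Hypothetical CoT prompt; the final sentence nudges the model to show its reasoning.
prompt = (
    "Q: A server rack draws 3.2 kW and runs 24 hours a day. "
    "How many kWh does it consume in 30 days?\n"
    "A: Let's think step by step."
)
# Expected behaviour: the model first writes the intermediate steps
# (3.2 kW x 24 h = 76.8 kWh/day; 76.8 x 30 = 2,304 kWh) and only then the final answer.
```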
Chatbot Arena (Elo)
LMSYS Chatbot Arena — crowd-sourced side-by-side LLM battles judged by real users. Elo ratings derived from millions of human preference votes. The ground-truth human preference leaderboard.
Context Window
The maximum number of tokens (input + output combined) a model can process in a single call. Directly tied to VRAM usage via the KV cache.
DPO (Direct Preference Optimization)
A simpler alignment technique than RLHF that directly fine-tunes a model on preferred vs rejected response pairs — no separate reward model needed.
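A minimal PyTorch sketch of the DPO objective, assuming the summed log-probabilities of each response under the trainable policy and the frozen reference model have already been computed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of (chosen, rejected) preference pairs.
    Each argument is a tensor of summed per-token log-probabilities."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # widen the gap between chosen and rejected responses, anchored to the reference model
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```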
Embeddings
Dense numerical vector representations of text that capture semantic meaning — the foundation of semantic search and RAG pipelines.
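A small sketch with the sentence-transformers library; the model name is just a common example, any embedding model works the same way:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")             # small, CPU-friendly embedding model
docs = ["GPU memory sizing for inference", "Pasta recipes", "How much VRAM a 70B model needs"]
doc_vecs = model.encode(docs, normalize_embeddings=True)    # one unit-length vector per document

query_vec = model.encode(["VRAM requirements for large models"], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec                               # cosine similarity on unit vectors
print(docs[int(np.argmax(scores))])                         # the semantically closest document
```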
Fine-Tuning
Continuing to train a pre-trained model on a domain-specific dataset to permanently improve its performance on specialised tasks.
Flash Attention 2
A hardware-aware attention algorithm that rewrites the attention computation to be IO-optimal, enabling 3–4× faster inference and larger context windows.
FP16 / FP32
Floating point precision formats that determine how much memory model weights occupy. FP32 is the default for training, FP16 for GPU inference — halving the precision halves the VRAM needed for the weights.
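Back-of-envelope arithmetic for the weights of a 7B-parameter model (weights only, KV cache excluded):

```python
params = 7e9                                            # a 7B-parameter model
for fmt, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{fmt:>9}: {params * bytes_per_param / 1e9:.1f} GB")
# FP32 ~28 GB, FP16 ~14 GB, INT8 ~7 GB, INT4 ~3.5 GB of weight storage
```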
GGUF
GPT-Generated Unified Format — the binary file format used by llama.cpp to store quantized LLM weights. The standard for CPU and consumer-GPU inference.
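A minimal sketch with llama-cpp-python, the Python bindings for llama.cpp; the file path and quantization level are placeholders for whatever GGUF checkpoint you have locally:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/example-8b-instruct.Q4_K_M.gguf",   # placeholder path
            n_ctx=4096,          # context window to allocate
            n_gpu_layers=-1)     # offload all layers to the GPU if one is available
out = llm("Explain GGUF in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```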
GPQA
Graduate-Level Google-Proof Q&A — 448 expert-crafted questions in biology, chemistry and physics so hard that even PhD researchers only score 65%. Designed to challenge frontier models.
GPTQ
A GPU-native post-training quantization method using second-order Hessian information to minimise weight-rounding error — typically faster on GPU than GGUF at the same bit depth.
GSM8K
Grade School Math 8K — 8,500 linguistically diverse elementary math word problems requiring multi-step arithmetic. The primary benchmark for evaluating LLM mathematical reasoning.
HellaSwag
Commonsense natural language inference: pick the most plausible continuation for an activity description. Created with adversarial filtering — humans score 95%, early models scored ~40%.
HELM
Holistic Evaluation of Language Models — 42 scenarios × 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency). A comprehensive multi-dimensional evaluation framework.
HumanEval
OpenAI's code generation benchmark: 164 hand-crafted Python programming problems. Models must write a function that passes all unit tests. The canonical LLM coding benchmark.
IFEval
Instruction Following Evaluation — 500+ prompts each with verifiable formatting constraints (use N words, include keywords, write in JSON). Objective evaluation of instruction adherence.
Inference
The process of running a trained model to generate output. For on-premise LLMs, the operational cost of inference is electricity plus hardware depreciation per generated token.
KV Cache
A cache of the Key and Value attention tensors for all tokens already processed — avoiding redundant recomputation and making autoregressive generation efficient.
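A sizing sketch using the usual formula, 2 (K and V) × layers × KV heads × head dim × tokens × bytes per element; the example numbers assume a Llama-2-7B-like configuration in FP16:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    # the factor 2 accounts for storing both the Key and the Value tensor per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Assumed config: 32 layers, 32 KV heads, head_dim 128 (Llama-2-7B-like), FP16
gb = kv_cache_bytes(32, 32, 128, seq_len=4096) / 1e9
print(f"{gb:.1f} GB")            # ~2.1 GB of VRAM for a single 4k-token sequence
```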
LiveCodeBench
Code generation benchmark built from problems released after model training cutoffs — preventing contamination. Continuously updated from LeetCode, Codeforces, and AtCoder.
LoRA & QLoRA
Parameter-efficient fine-tuning adapters that inject trainable low-rank matrices into transformer layers — allowing 70B model fine-tuning on consumer hardware.
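A minimal sketch with Hugging Face PEFT; the base model name and hyperparameters are illustrative, not a recommended recipe:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # example model
config = LoraConfig(
    r=16,                                   # rank of the injected low-rank matrices
    lora_alpha=32,                          # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],    # which projection layers receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()          # typically well under 1% of the base parameters
```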
MATH
Competition-level mathematics (AMC, AIME, and similar competitions) across 5 difficulty levels and 7 subject areas. Far harder than GSM8K — still discriminating at the frontier.
MBPP
Mostly Basic Programming Problems — 500+ crowd-sourced Python problems. Simpler than HumanEval but broader coverage. Commonly paired with HumanEval for a more complete code eval picture.
Mixture of Experts (MoE)
An architecture where the model has multiple parallel FFN "expert" layers per transformer block, with a router selecting only a subset per token — giving huge parameter counts with low active compute.
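A toy PyTorch sketch of a sparse MoE block with a top-k router; real implementations batch tokens per expert and add load-balancing losses:

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy sparse MoE block: a router picks the top-k experts for every token."""
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                       # only top_k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

print(ToyMoELayer()(torch.randn(6, 512)).shape)           # torch.Size([6, 512])
```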
MMLU
Massive Multitask Language Understanding — 57-subject multiple-choice exam covering STEM, humanities, law and more. The standard academic knowledge benchmark since 2021.
MMMU
Massive Multi-discipline Multimodal Understanding — 11,500 questions across 30 subjects and 183 subfields requiring image + text reasoning. The MMLU equivalent for multimodal language models.
MT-Bench
Multi-Turn Benchmark — 80 challenging multi-turn conversations across 8 categories, scored by GPT-4 as judge. Introduced the LLM-as-judge paradigm, enabling scalable open-ended evaluation.
Multimodal LLM
Models that process and reason across multiple input modalities — text, images, audio, and video — in a single unified architecture.
ONNX / ONNX Runtime
Open Neural Network Exchange — a portable model format that enables cross-framework interoperability, often used to accelerate inference via ONNX Runtime.
Perplexity (PPL)
A measure of how well a language model predicts a sample of text — lower perplexity means better language modelling. Used to measure quality degradation from quantization.
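A worked micro-example: perplexity is the exponential of the mean negative log-likelihood per token. The probabilities below are made up:

```python
import math

token_probs = [0.42, 0.15, 0.63, 0.08, 0.30]      # hypothetical per-token probabilities
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(f"PPL = {math.exp(nll):.2f}")               # lower is better; 1.0 would mean perfect prediction
```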
Prompt Engineering
The practice of designing LLM inputs to maximise output quality — including system prompts, few-shot examples, chain-of-thought triggers, and output format instructions.
Quantization
Reducing the numerical precision of model weights from FP16 to INT8 or INT4, dramatically cutting VRAM requirements with only a small trade-off in quality.
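A toy symmetric INT8 round trip on one weight tensor; real schemes (GPTQ, GGUF k-quants, AWQ) work block-wise and are far more sophisticated, but the storage saving and the rounding error are the same idea:

```python
import numpy as np

w = np.random.randn(4096).astype(np.float32)                        # FP32 weights
scale = np.abs(w).max() / 127                                       # map the largest weight to +/-127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)    # 4x smaller than FP32
w_restored = w_int8.astype(np.float32) * scale                      # dequantized at inference time
print("mean absolute rounding error:", np.abs(w - w_restored).mean())
```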
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external documents by retrieving relevant chunks from a vector store and injecting them into the prompt context at inference time.
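A minimal sketch of the retrieve-then-prompt flow; the vectors are assumed to come from an embedding model, and the LLM call itself is omitted:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    scores = doc_vecs @ query_vec                 # cosine similarity on normalized vectors
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(question, chunks):
    context = "\n".join(f"- {c}" for c in chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# chunks = retrieve(embed(question), index_vectors, index_texts)
# answer = llm(build_prompt(question, chunks))    # the retrieved chunks ground the response
```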
RLHF (Reinforcement Learning from Human Feedback)
The alignment technique used to train ChatGPT-style models by learning from human preferences — combining supervised fine-tuning, reward modelling, and PPO.
Speculative Decoding
A technique that uses a small draft model to predict multiple tokens ahead, which a large verifier model then accepts or rejects in parallel — achieving 2–3× faster decoding.
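A simplified sketch of one speculative step with greedy acceptance; draft_next() and target_batch_next() are placeholders for the two models, and production systems use rejection sampling so the output distribution matches the large model exactly:

```python
def speculative_step(ids, draft_next, target_batch_next, n_draft=4):
    # 1) the small draft model proposes n_draft tokens autoregressively (cheap)
    draft = []
    for _ in range(n_draft):
        draft.append(draft_next(ids + draft))
    # 2) the large model scores every draft position in a single parallel forward pass;
    #    verified[i] is its greedy choice given ids + draft[:i]
    verified = target_batch_next(ids, draft)
    # 3) keep the longest agreeing prefix, then take the large model's correction and stop
    accepted = []
    for d, v in zip(draft, verified):
        accepted.append(v)
        if d != v:
            break
    return ids + accepted
```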
SWE-bench
Software Engineering Benchmark — 2,294 real GitHub issues from popular Python repositories. Models must write a code patch that passes the repo's official test suite. The hardest coding benchmark.
System Prompt
A special instruction block prepended to every conversation that defines the model's persona, constraints, output format, and access boundaries — the foundation of any production LLM deployment.
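A hypothetical example in the usual chat-message format; the wording is illustrative:

```python
messages = [
    {"role": "system", "content": (
        "You are the internal support assistant for an example company. "
        "Answer only from the provided documents, never reveal personal data, "
        "and format every answer as Markdown with a short summary first.")},
    {"role": "user", "content": "How do I request a new access badge?"},
]
# The system message is prepended to every conversation before it reaches the model.
```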
Tokenizer
The component that converts raw text into token IDs (numbers) that the model processes — and converts token IDs back to text. A critical factor in multilingual performance and context efficiency.
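A round trip with a Hugging Face tokenizer; the model name is only an example, and each model family ships its own vocabulary:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                  # example tokenizer
ids = tok.encode("On-premise inference keeps data in-house.")
print(ids)                                                   # a short list of integer token IDs
print(tok.decode(ids))                                       # back to the original text
print(len(ids))                                              # token count: what fills the context window
```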
Tokens/s (Throughput)
The primary performance metric for LLM inference — how many output tokens are generated per second. Determines user experience and how many concurrent users a deployment can serve.
TruthfulQA
817 questions designed to elicit false answers from LLMs — covering conspiracy theories, misconceptions, and myths. Measures a model's truthfulness rather than its knowledge breadth.
Vector Database
A specialised database that stores high-dimensional embedding vectors and enables fast approximate nearest-neighbour (ANN) search — the backbone of RAG pipelines.
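What a vector database does, reduced to brute force with NumPy; real systems build ANN indexes (HNSW, IVF) so the lookup stays fast at millions of vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(10_000, 384)).astype(np.float32)    # 10k stored chunk embeddings
index /= np.linalg.norm(index, axis=1, keepdims=True)        # store unit vectors

query = rng.normal(size=384).astype(np.float32)
query /= np.linalg.norm(query)

scores = index @ query                                       # cosine similarity against every vector
top_k = np.argsort(scores)[::-1][:5]                         # ids of the 5 nearest chunks
print(top_k, scores[top_k])
```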
vLLM
A high-throughput LLM inference engine featuring PagedAttention for optimal KV cache utilisation. The production standard for serving LLMs to multiple concurrent users.
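A minimal offline-inference sketch; the model name is only an example:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")                  # loads the weights onto the local GPU(s)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarise why PagedAttention matters."], params)
print(outputs[0].outputs[0].text)
# vLLM also ships an OpenAI-compatible HTTP server for serving many concurrent users.
```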
VRAM
Video RAM — the GPU's onboard memory. The single most critical hardware resource for on-premise LLM inference. All model weights and KV cache must fit inside VRAM for full GPU speed.
WinoGrande
Large-scale commonsense reasoning via pronoun disambiguation — 44,000 adversarially filtered sentence pairs. Tests whether models resolve ambiguous pronouns using world knowledge.