Speculative decoding (Leviathan et al., 2022) exploits the fact that autoregressive decoding is bottlenecked by memory bandwidth, not compute. A small "draft" model generates K candidate tokens cheaply; the large target model verifies all K in a single forward pass. Accepted tokens are kept; the first rejected token is replaced by a sample drawn from the target model, and drafting resumes from there.
How It Works
1. The draft model generates K tokens speculatively (e.g., K = 5).
2. The target model processes all K drafted tokens in parallel; one forward pass over the K positions costs roughly the same as generating a single token.
3. Each drafted token is accepted if its target probability is at least its draft probability, and otherwise accepted with probability p_target / p_draft (rejection sampling), so the output distribution matches the target model exactly.
4. The per-token acceptance rate α depends on how well the draft model matches the target.
5. Speed-up ≈ tokens produced per verification pass, which under a constant acceptance rate α is (1 − α^(K+1)) / (1 − α), ignoring the draft model's own cost.
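The sketch below shows the acceptance rule and the expected-tokens formula from the steps above in plain Python. Function and variable names are illustrative, and the caller is assumed to resample the first rejected position from the adjusted target distribution.

```python
import random

def verify_draft(draft_tokens, p_draft, p_target):
    """Accept or reject K drafted tokens with the speculative sampling rule.

    p_draft[i]  -- draft-model probability of draft_tokens[i]
    p_target[i] -- target-model probability of the same token, taken from the
                   single parallel forward pass over all K positions
    Returns the accepted prefix; the caller then resamples the first rejected
    position from the adjusted target distribution (not shown here).
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_target):
        # Accept outright when the target is at least as confident as the draft,
        # otherwise accept with probability p / q (rejection sampling).
        if p >= q or random.random() < p / q:
            accepted.append(tok)
        else:
            break
    return accepted

def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per target forward pass under a constant
    per-token acceptance rate alpha: 1 + alpha + alpha^2 + ... + alpha^k."""
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Second drafted token has p < q, so its acceptance depends on the random draw.
print(verify_draft([5, 9, 2], [0.30, 0.50, 0.40], [0.60, 0.20, 0.90]))
print(round(expected_tokens_per_step(0.8, 5), 2))   # ~3.69 tokens per pass
```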
Variants
Model-based (Classic)
A separate small model drafts for a large target (e.g., Llama 8B drafting for a Llama 70B target). This gives the best speed-up (~2–3×), but the draft must share the target's tokenizer (in practice, a smaller model from the same family) to reach high acceptance rates.
n-gram / Prompt Lookup
Draft by repeating n-grams found in the prompt (a simple copy mechanism). The draft is essentially free to produce and works well for summarisation and code continuation, where the output mirrors the input. Supported natively in vLLM via --speculative-model ngram.
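A minimal sketch of the idea, assuming the context is a flat list of token IDs; the function name and parameters are illustrative, not vLLM internals.

```python
def ngram_draft(context_ids, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the last `ngram_size` tokens of the
    context against earlier positions and copying what followed the match."""
    if len(context_ids) < ngram_size:
        return []
    tail = context_ids[-ngram_size:]
    # Search earlier occurrences of the tail, most recent first,
    # excluding the tail's own position at the end of the context.
    for start in range(len(context_ids) - ngram_size - 1, -1, -1):
        if context_ids[start:start + ngram_size] == tail:
            follow = context_ids[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow   # zero-cost draft: just copied from the context
    return []                   # no match -> fall back to normal decoding

# Example: the model is re-quoting a phrase that appeared earlier in the prompt.
ctx = [11, 42, 7, 99, 3, 5, 11, 42, 7]   # ...the trigram 11 42 7 appeared before
print(ngram_draft(ctx))                   # -> [99, 3, 5, 11, 42], copied from the earlier match
```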
MEDUSA
Add multiple decoding heads directly to the target model (no separate draft model); each head predicts the token at a larger offset into the future. The heads require a short fine-tuning pass, but deployment is simpler since there is no second model to serve; speed-up is typically 1.5–2×.
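A toy sketch of the idea, assuming a decoder whose final hidden states are available. Names, sizes, and the plain linear heads are illustrative; the real MEDUSA heads use residual blocks and are trained on top of the frozen target model.

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    def __init__(self, hidden_size, vocab_size, num_heads=4):
        super().__init__()
        # Head i predicts the token at offset i + 2 (the base LM head covers offset + 1).
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)
        )

    def forward(self, last_hidden):
        # One logits tensor per lookahead offset, all computed from the same
        # hidden state of the current position -- no separate draft model.
        return [head(last_hidden) for head in self.heads]

heads = MedusaHeads(hidden_size=4096, vocab_size=32000)
h = torch.randn(1, 1, 4096)                               # hidden state for the latest token
candidates = [logits.argmax(-1) for logits in heads(h)]   # greedy draft token per offset
```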
Self-Speculation
Use early-exit layers of the target model as the draft; no second model is needed. For example, Llama 70B can self-speculate by drafting with the first 32 of its 80 layers.
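A toy, self-contained sketch of the draft/verify split: the same parameter stack serves as both draft (a shallow prefix of layers) and verifier (full depth). The tiny model and its dimensions are illustrative, not a real Llama.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab=100, dim=32, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.lm_head = nn.Linear(dim, vocab)

    def logits(self, ids, depth=None):
        # depth=None -> full model (verifier); depth=k -> first k layers only (draft)
        h = self.embed(ids)
        for layer in self.layers[: depth or len(self.layers)]:
            h = torch.relu(layer(h))
        return self.lm_head(h)

model = ToyDecoder()
ids = torch.tensor([[1, 2, 3]])
draft_logits  = model.logits(ids, depth=2)   # cheap shallow pass drafts tokens
target_logits = model.logits(ids)            # full-depth pass verifies them
```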
Why It Matters for On-Premise
Speculative decoding is effectively a free latency improvement if you run a model family with matching small and large variants. vLLM enables it with a pair of flags: --speculative-model meta-llama/Llama-3.2-1B-Instruct --num-speculative-tokens 5. For single-GPU setups where the draft model won't fit alongside the target model, ngram speculation is a zero-cost alternative that still helps on repetitive content.
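For completeness, a hedged sketch of the equivalent offline usage. The keyword arguments mirror the CLI flags above as they appeared in older vLLM releases; the spelling of the speculative-decoding options has changed across versions, and the 70B target model name is an assumption, so treat this as illustrative and check your version's documentation.

```python
# Illustrative only: kwarg names follow older vLLM releases and may differ in
# your version; the target model choice is an assumption for the example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",              # large target (assumed)
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",   # small same-family draft
    num_speculative_tokens=5,                                # K drafted tokens per pass
)
outputs = llm.generate(["Summarise the incident report:"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```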