# Faster LLM Inference with Speculative Decoding
## Speculative Decoding: More Efficient LLM Inference
LLM inference is often assumed to be slow because of the sheer volume of matrix multiplications involved. In reality, for local inference or chat at batch size 1, the main bottleneck is memory bandwidth: transferring the model weights from VRAM to the compute units takes up most of each decoding step, leaving the Arithmetic Logic Units (ALUs) largely idle.
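To see why, a rough back-of-the-envelope calculation helps (the 70B parameter count, fp16 precision, and 2 TB/s bandwidth figure below are illustrative assumptions, not measurements):

```python
# Illustrative numbers only: a 70B-parameter model in fp16 (~2 bytes per weight)
# streamed from HBM at an assumed ~2 TB/s of memory bandwidth.
weights_bytes = 70e9 * 2                              # ~140 GB of weights
bandwidth_bytes_per_s = 2e12                          # ~2 TB/s (assumed)
t_per_token = weights_bytes / bandwidth_bytes_per_s   # each token re-reads every weight
print(f"~{t_per_token * 1e3:.0f} ms/token, ~{1 / t_per_token:.0f} tokens/s at batch size 1")
# -> ~70 ms/token, i.e. ~14 tokens/s, no matter how fast the ALUs are
```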
Speculative Decoding exploits this idle compute to deliver a 2x to 3x speedup while keeping the output distribution mathematically identical to that of the large model.
### How it works
1. **Setup: Drafter vs. Target**
A small "Drafter" model (e.g., 100M parameters) is run alongside the larger "Target" model (e.g., Llama-70B). The Drafter quickly proposes a short sequence of candidate tokens, the draft.
2. **Parallel Verification**
The tokens generated by the Drafter model are fed into the Target model in a single pass. Since inference is memory-bound, loading the weights for one token takes about the same time as loading for multiple tokens.
3. **Rejection Sampling**
Rejection sampling guarantees that the output distribution exactly matches the Target model's. Writing p(x) for the Drafter's probability of a proposed token and q(x) for the Target's, a token is accepted outright when p(x) ≤ q(x); otherwise it is accepted only with probability q(x)/p(x), and on rejection a replacement token is sampled from the normalized residual distribution max(q − p, 0). Even when only some of the draft tokens survive, each Target pass still yields at least one token, so there is a net efficiency gain (see the sketch after this list).
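The three steps above fit in a short function. Below is a minimal PyTorch sketch of a single speculative step, assuming two Hugging Face-style causal LMs that expose `.logits`; the draft length `k=4`, batch size 1, the absence of KV caching, and the omission of the extra "bonus" token that the full algorithm samples when every proposal is accepted are all simplifications for readability, not features of any particular implementation.

```python
# Minimal sketch of one speculative decoding step (assumptions: HF-style
# causal LMs returning `.logits`, batch size 1, no KV cache, no bonus token).
import torch
import torch.nn.functional as F


@torch.no_grad()
def speculative_step(target, drafter, input_ids, k=4, temperature=1.0):
    """Propose k tokens with the drafter, verify them with one target forward
    pass, and keep a prefix via rejection sampling so that the accepted tokens
    follow the target model's distribution exactly."""
    n_prompt = input_ids.shape[1]
    draft_ids = input_ids
    draft_dists = []  # drafter distribution used to sample each proposal

    # 1. Drafting: the small model proposes k tokens autoregressively.
    for _ in range(k):
        logits = drafter(draft_ids).logits[:, -1, :] / temperature
        p = F.softmax(logits, dim=-1)                 # p: drafter distribution
        tok = torch.multinomial(p, num_samples=1)
        draft_dists.append(p)
        draft_ids = torch.cat([draft_ids, tok], dim=-1)

    # 2. Parallel verification: a single target pass scores all k proposals.
    target_logits = target(draft_ids).logits / temperature
    # q_i: target distribution at the position that predicts proposal i
    target_dists = F.softmax(
        target_logits[:, n_prompt - 1 : n_prompt - 1 + k, :], dim=-1
    )

    # 3. Rejection sampling: accept proposal x with probability min(1, q(x)/p(x)).
    accepted = input_ids
    for i in range(k):
        tok = draft_ids[:, n_prompt + i : n_prompt + i + 1]   # shape (1, 1)
        p_x = draft_dists[i].gather(-1, tok)
        q_x = target_dists[:, i].gather(-1, tok)
        if torch.rand(()) < torch.clamp(q_x / p_x, max=1.0):
            accepted = torch.cat([accepted, tok], dim=-1)
        else:
            # On rejection, resample from the residual norm(max(q - p, 0)):
            # this correction is what keeps the overall output distribution
            # identical to sampling from the target model alone.
            residual = torch.clamp(target_dists[:, i] - draft_dists[i], min=0.0)
            residual = residual / residual.sum(dim=-1, keepdim=True)
            tok = torch.multinomial(residual, num_samples=1)
            accepted = torch.cat([accepted, tok], dim=-1)
            break
    return accepted
```

In the worst case a step returns a single (resampled) token for one Target forward pass plus k cheap Drafter passes; in the best case all k proposals survive, which is where the 2x to 3x speedup comes from.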
Speculative Decoding raises the arithmetic intensity of a memory-bound decoding loop, putting otherwise idle compute to work without requiring any model retraining.