## Top-K Optimization for LLMs: Superior Performance to PyTorch

A developer has announced an optimized Top-K implementation for sampling in large language models (LLMs), achieving significant speedups over PyTorch on CPU. The implementation leverages AVX2, an extension of the x86 instruction set, to accelerate the computation.

## Benchmarks and Results

Benchmarks show the optimized implementation running 4 to 20 times faster than PyTorch, depending on vocabulary size. For example, with a vocabulary of 256,000 tokens, the custom implementation completed in 0.079 ms, compared to PyTorch's 1.56 ms. Integration into llama.cpp yielded a 63% increase in prompt-processing speed on a 120-billion-parameter MoE model (from 81 to 142 tokens per second).

## Technical Details

The implementation combines adaptive sampling techniques, AVX2 SIMD, and cache-optimized scanning, and includes fast paths for already-sorted or constant inputs. The algorithm runs in a single pass and requires no GPU. Pre-built DLLs and a llama.cpp integration (for Windows) are included, and the source code is available on GitHub.

Large language models require enormous computational power for training and inference, so optimizing key algorithms such as Top-K is critical to improving performance and reducing computational cost.
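The announcement does not reproduce the project's algorithm, but the "single-pass" property mentioned above can be illustrated with a standard scalar reference: a size-k min-heap retains the k largest logits while streaming over the input once. This sketch is a baseline for comparison, not the repository's AVX2 code; the function name `topk_scalar` is ours.

```cpp
// Minimal scalar reference for single-pass Top-K selection. This is NOT the
// project's algorithm (which uses adaptive thresholds and AVX2); it only
// shows why one pass over the logits suffices: a size-k min-heap keeps the
// k largest values seen so far, evicting the current minimum on each insert.
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns (value, index) pairs of the k largest logits.
std::vector<std::pair<float, int>> topk_scalar(const float* logits,
                                               std::size_t n,
                                               std::size_t k) {
    // Min-heap ordered by value: heap.top() is the smallest of the kept k.
    std::priority_queue<std::pair<float, int>,
                        std::vector<std::pair<float, int>>,
                        std::greater<>> heap;
    for (std::size_t i = 0; i < n; ++i) {
        if (heap.size() < k) {
            heap.emplace(logits[i], static_cast<int>(i));
        } else if (logits[i] > heap.top().first) {
            heap.pop();  // evict the current minimum of the kept k
            heap.emplace(logits[i], static_cast<int>(i));
        }
    }
    std::vector<std::pair<float, int>> out;
    out.reserve(heap.size());
    while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); }
    return out;  // ascending by value; reverse if descending order is needed
}
```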
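The AVX2 side of the design is not detailed in the announcement either. As a hedged illustration of the kind of compare-and-movemask scan a cache-friendly, threshold-based Top-K kernel might use, the sketch below counts logits above a candidate threshold, processing eight floats per iteration; the surrounding threshold-refinement logic and the function name `count_above` are assumptions, not the project's code.

```cpp
// Sketch of an AVX2 threshold scan: count the logits strictly greater than a
// candidate threshold, 8 floats per iteration. A Top-K kernel could refine
// the threshold until exactly k survivors remain; that refinement loop is
// omitted here. Compile with -mavx2 (GCC/Clang).
#include <immintrin.h>
#include <cstddef>

std::size_t count_above(const float* logits, std::size_t n, float threshold) {
    const __m256 t = _mm256_set1_ps(threshold);
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v    = _mm256_loadu_ps(logits + i);      // load 8 logits
        __m256 mask = _mm256_cmp_ps(v, t, _CMP_GT_OQ);  // lanewise v > t
        // movemask packs each lane's sign bit into an 8-bit integer mask;
        // popcount gives the number of lanes that passed the comparison.
        // (__builtin_popcount is GCC/Clang; use std::popcount in C++20.)
        count += static_cast<std::size_t>(
            __builtin_popcount(_mm256_movemask_ps(mask)));
    }
    for (; i < n; ++i)                                  // scalar tail
        count += (logits[i] > threshold);
    return count;
}
```

A scan like this touches each cache line once in order, which is consistent with the "cache-optimized scanning" the announcement describes, though the project's actual kernel layout is not shown.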
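Finally, the announcement reports per-call timings but not the benchmark methodology. Below is a minimal sketch of how such CPU timings are commonly taken with `std::chrono`, reusing the `topk_scalar` sketch above; the vocabulary size matches the article's largest case, while `k = 40` and the repetition count are illustrative choices, not the project's settings.

```cpp
// Illustrative timing harness (not the project's benchmark): times a Top-K
// call over a 256,000-entry logit vector, averaged across repetitions.
// Assumes the topk_scalar sketch above is in scope.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t vocab = 256000;  // matches the article's largest case
    const std::size_t k     = 40;      // illustrative sampling top-k
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> logits(vocab);
    for (auto& x : logits) x = dist(rng);

    const int reps = 100;
    float sink = 0.0f;  // consume results so the calls aren't optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        auto result = topk_scalar(logits.data(), vocab, k);
        sink += result.back().first;
    }
    auto t1 = std::chrono::steady_clock::now();
    double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
    std::printf("top-%zu over %zu logits: %.3f ms/call (checksum %.3f)\n",
                k, vocab, ms, static_cast<double>(sink));
}
```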