Llama.cpp: 50% faster token generation on M3 Max by cutting a useless softmax

A 50% leap in tokens per second on an M3 Max MacBook Pro, and it comes not from a new chip but from a single change to the code of llama.cpp, the inference engine behind many self-hosted LLM setups. In pull request #22645, TimNN identified a spot where the Top-N-Sigma sampler was unconditionally running a softmax and sort, only for the result to be discarded when the next sampler in the pipeline was Dist.

The impact is measurable: with google_gemma-4-E4B-it-Q8_0, an 8-bit quantized model with 4 billion active parameters, throughput jumps from roughly 30 to about 45 tokens per second, shaving roughly 10 milliseconds off the time per token. On consumer hardware, 10 milliseconds per token can be the difference between a smooth interaction and a frustrating one.
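The per-token figure follows directly from the throughput numbers. A quick check in Python, using only the rounded 30 and 45 tokens per second quoted above (the function name is purely illustrative):

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/s into a latency in ms per token."""
    return 1000.0 / tokens_per_second

before = ms_per_token(30.0)    # latency at roughly 30 tokens/s
after = ms_per_token(45.0)     # latency at about 45 tokens/s
saved = before - after         # time saved per generated token
speedup = 45.0 / 30.0 - 1.0    # relative throughput gain

print(f"{before:.1f} ms -> {after:.1f} ms, saved {saved:.1f} ms ({speedup:.0%} faster)")
# prints: 33.3 ms -> 22.2 ms, saved 11.1 ms (50% faster)
```

With these rounded inputs the saving works out to about 11 ms, in line with the roughly 10 ms reported.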

The sampler chain: where the waste hides

In llama.cpp’s architecture, generating the next token is handled by a chain of samplers, each narrowing or reshaping the distribution over candidate tokens. Top-N-Sigma is one of them: it keeps the tokens whose logit lies within n standard deviations of the highest logit, a test that needs neither probabilities nor a sorted candidate list. Until now, however, it always finished its work with a full softmax and sort, operations that are expensive in terms of both compute and memory access, especially on devices without the massive bandwidth of a server GPU.
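To make the filtering step concrete, here is a minimal Python sketch. It assumes the rule from the published Top-nσ method (keep every token whose logit is at least the maximum logit minus n times the standard deviation of the logits) and uses the population standard deviation. The function names are hypothetical, and this is not llama.cpp's C++ code.

```python
import math

def top_n_sigma_mask(logits: list[float], n: float) -> list[float]:
    """Keep logits within n standard deviations of the maximum; mask the rest."""
    mean = sum(logits) / len(logits)
    # Population standard deviation (an assumption of this sketch).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max(logits) - n * sigma
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(logits: list[float]) -> list[float]:
    """Turn logits into probabilities; masked (-inf) entries become 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

masked = top_n_sigma_mask([5.0, 4.5, 1.0, 0.5, 0.0], n=1.0)
probs = softmax(masked)
print(masked)  # the two highest logits survive, the rest are -inf
print([round(p, 3) for p in probs])
```

Note that the mask itself only needs a maximum, a mean, and a standard deviation, each computable in a single pass over the logits. The softmax is a separate step, which is why it can be left to whichever sampler actually needs probabilities.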

The trouble is that Top-N-Sigma is often paired with the Dist sampler, which draws the final token from the distribution, recomputes the probabilities it needs, and doesn’t depend on that sorting. Doing a softmax at the end of Top-N-Sigma and then recalculating everything in the next step is like repainting a room before tearing down the wall: wasted effort.

What exactly changes in PR #22645

The change adds a check: if the next sampler in the chain is Dist (or any sampler that doesn’t use the sorted output), Top-N-Sigma skips the final softmax and sort. The result is a computational saving that translates directly into higher inference speed. The code has not yet been merged into the main branch, and the author himself cautions that he doesn’t fully know the API contract between chained samplers, leaving open the possibility of side effects on other sampling configurations.
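The idea of the check can be sketched in Python as follows. Everything here is hypothetical: the class names, the `next_sampler` parameter, and the `Candidates` structure are invented for illustration and are not the llama.cpp API or the actual diff in the pull request. The sketch assumes that Dist recomputes probabilities from the logits on its own, and it verifies that the token distribution is the same whether or not the final softmax and sort are skipped.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidates:
    tokens: list
    logits: list
    probs: Optional[list] = None  # filled in only by a softmax pass

def softmax_sorted(c: Candidates) -> None:
    """Sort candidates by logit (descending) and fill in probabilities."""
    order = sorted(range(len(c.logits)), key=lambda i: c.logits[i], reverse=True)
    c.tokens = [c.tokens[i] for i in order]
    c.logits = [c.logits[i] for i in order]
    exps = [math.exp(x - c.logits[0]) for x in c.logits]
    total = sum(exps)
    c.probs = [e / total for e in exps]

class Dist:
    """Final sampler (hypothetical): computes its own probabilities, draws a token."""
    def __init__(self, seed: int):
        self.rng = random.Random(seed)

    def probabilities(self, c: Candidates) -> dict:
        m = max(c.logits)
        exps = [math.exp(x - m) for x in c.logits]
        total = sum(exps)
        return {t: e / total for t, e in zip(c.tokens, exps)}

    def apply(self, c: Candidates):
        p = self.probabilities(c)
        return self.rng.choices(list(p), weights=list(p.values()), k=1)[0]

class TopNSigma:
    def __init__(self, n: float):
        self.n = n

    def apply(self, c: Candidates, next_sampler=None) -> None:
        mean = sum(c.logits) / len(c.logits)
        sigma = math.sqrt(sum((x - mean) ** 2 for x in c.logits) / len(c.logits))
        threshold = max(c.logits) - self.n * sigma
        c.logits = [x if x >= threshold else -math.inf for x in c.logits]
        if isinstance(next_sampler, Dist):
            return  # skip: Dist recomputes probabilities from the logits anyway
        softmax_sorted(c)  # otherwise keep the original behaviour

def run(skip: bool) -> dict:
    c = Candidates(tokens=["a", "b", "c", "d", "e"], logits=[0.0, 5.0, 1.0, 4.5, 0.5])
    dist = Dist(seed=0)
    TopNSigma(n=1.0).apply(c, next_sampler=dist if skip else None)
    return dist.probabilities(c)

fast, slow = run(skip=True), run(skip=False)
same = all(math.isclose(fast[t], slow[t]) for t in fast)
print(same)  # True: skipping the softmax and sort leaves the distribution unchanged
```

The equivalence holds in this toy setup only because the sampler that follows ignores both the ordering and any precomputed probabilities. A follow-up sampler that reads either of them would see different input, which is exactly the contract question the author raises.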

Real impact for local model deployments

For anyone running on-premise stacks – whether a single developer on an Apple Silicon laptop or an organization keeping data within its own servers – every percentage point of efficiency matters. This isn’t just about comfort: fewer milliseconds per token mean lower energy consumption on the same workload, and the ability to use larger models or more conservative quantizations without dipping below a usability threshold. On an M3 Max MacBook, the observed gain is 10 ms per token, but on less powerful hardware or in contexts where inference is multiplied across hundreds of requests, the aggregate saving can become significant.

This optimization also fits into a broader trend: the community around llama.cpp is systematically shaving inefficiencies from local inference, bringing consumer hardware performance closer to cloud solutions. Each improvement shifts the balance toward hybrid or fully self-hosted setups, where data sovereignty and predictable total cost of ownership (TCO) are top priorities.

Limits and unknowns

Not all models and not all sampling chains will benefit from this change. If Top-N-Sigma is used alone or paired with samplers that expect an already sorted distribution, skipping the softmax could break the sampling logic. The author has provided no benchmarks on other hardware, on other backends such as CUDA or Vulkan, or on other model architectures, and the community is waiting for broader testing. It remains to be seen whether the behavior is identical across all backends and whether the change can be generalized to every sampler combination without adverse effects.

What is certain is that the direction is right: pay close attention to what gets computed only to be thrown away, because on-premise inference cannot afford waste. For those evaluating on-premise deployments, AI-RADAR provides analytical frameworks to compare performance and cost trade-offs, starting precisely from incremental improvements like this one.