Optimizing LLM Inference on Dedicated Hardware: MTP vs DFlash Comparison

Optimizing Large Language Model (LLM) inference on self-hosted infrastructure is a critical challenge for CTOs and system architects. The ability to maximize throughput and reduce latency on dedicated hardware, such as a single GPU, directly impacts Total Cost of Ownership (TCO) and the feasibility of on-premise deployments. In this context, speculative decoding has emerged as a key technique for accelerating token generation.

A recent benchmark compared two speculative decoding approaches, Google's Multi-Token Prediction (MTP) and z-lab's DFlash, applying them to Google's Gemma 4 models. The analysis focused on inference performance for both the dense and Mixture-of-Experts (MoE) versions of the models, using a single NVIDIA H100 80GB GPU. This type of study provides concrete data for those making informed decisions about LLM deployments in controlled environments with specific data sovereignty requirements.

Benchmark Setup and Methodology

The tests were run on a single NVIDIA H100 80GB GPU, a common hardware configuration for on-premise deployments that require high performance. vLLM, a framework known for its efficient LLM inference, served as the runtime. The evaluation dataset was NVIDIA SPEED-Bench, comprising 880 prompts distributed across 11 categories to ensure diverse workload coverage.
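
The article does not include the benchmark harness itself, but throughput at a fixed concurrency is typically measured by driving the prompt set against a vLLM server while capping the number of in-flight requests (the results below report concurrency 1 and 16). The sketch below illustrates that pattern in Python; the endpoint, model name, and prompt loader are placeholders, not details taken from the study.

    # Illustrative throughput harness: measure output tok/s at a fixed concurrency
    # against a vLLM OpenAI-compatible server. Endpoint, model ID, and prompt
    # loading are placeholders, not details from the benchmark.
    import asyncio
    import time
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    async def run_one(sem: asyncio.Semaphore, prompt: str) -> int:
        async with sem:
            resp = await client.completions.create(
                model="google/gemma-4-31B-it",
                prompt=prompt,
                max_tokens=1024,
                temperature=0.0,          # greedy decoding, as in the benchmark
            )
            return resp.usage.completion_tokens

    async def measure(prompts: list[str], concurrency: int) -> float:
        sem = asyncio.Semaphore(concurrency)      # cap in-flight requests
        start = time.perf_counter()
        tokens = await asyncio.gather(*(run_one(sem, p) for p in prompts))
        elapsed = time.perf_counter() - start
        return sum(tokens) / elapsed              # aggregate output tok/s

    # prompts = load_speed_bench_prompts()        # hypothetical loader for the 880 prompts
    # print(asyncio.run(measure(prompts, concurrency=16)))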

The models benchmarked included google/gemma-4-31B-it (dense version) and google/gemma-4-26B-A4B-it (MoE version). For MTP, Google's assistant models were used with num_speculative_tokens=8, while for DFlash, z-lab's DFlash models were employed with num_speculative_tokens=15. The context length and maximum model length were set to 32768 tokens, with a temperature of 0 and prefix caching disabled. This configuration aims to simulate real-world usage scenarios, providing a useful reference point for IT specialists.
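
As a rough sketch, that configuration maps onto a vLLM offline-inference setup along the lines below. The draft-model identifier is a placeholder (the study does not list exact repository names for the assistant and DFlash models), and the speculative-decoding argument names vary between vLLM releases, so treat this as illustrative rather than a drop-in script.

    # Illustrative vLLM setup mirroring the benchmark configuration.
    # The draft model ID is a placeholder; speculative-decoding argument names
    # differ between vLLM versions, so check the release you are running.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="google/gemma-4-31B-it",       # dense target model from the benchmark
        max_model_len=32768,                 # context / max model length used in the tests
        enable_prefix_caching=False,         # prefix caching disabled, as in the benchmark
        speculative_config={
            "model": "<draft-model-id>",     # MTP assistant model or DFlash draft model
            "num_speculative_tokens": 8,     # 8 for MTP, 15 for DFlash in this study
        },
    )

    sampling = SamplingParams(temperature=0.0, max_tokens=1024)   # temperature 0, as in the tests
    outputs = llm.generate(["Explain speculative decoding in one paragraph."], sampling)
    print(outputs[0].outputs[0].text)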

Analysis of Results and Technical Implications

The benchmark results revealed significant differences between the two approaches and the two model architectures. For the dense Gemma 4 31B model at concurrency 1, MTP delivered a 3.11x speedup and DFlash a 3.03x speedup over baseline decoding: the baseline achieved 40.3 output tok/s, MTP 125.3 output tok/s, and DFlash 122.1 output tok/s. At concurrency 16, the baseline reached 375 tok/s, MTP 953 tok/s, and DFlash 725 tok/s. On this model, MTP outperformed DFlash, with the gap widening at higher concurrency.

The picture flipped for the Gemma 4 26B-A4B MoE model. At concurrency 1, DFlash was 1.73x faster and MTP 1.49x faster than baseline decoding: the baseline recorded 177.1 output tok/s, MTP 264.2 output tok/s, and DFlash 306.4 output tok/s. At concurrency 16, the baseline reached 975 tok/s, MTP 1808 tok/s, and DFlash 1957 tok/s. The speedups for the MoE model were generally smaller than for the dense model: with only 3.8 billion of its 25.2 billion parameters active during inference, the MoE model is already comparatively fast at baseline, leaving less room for speculative decoding gains.
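
The reported speedups follow directly from the quoted throughput figures; the short Python snippet below reproduces the concurrency-1 numbers as a sanity check.

    # Reproduce the reported concurrency-1 speedups from the quoted throughput figures.
    results = {
        "Gemma 4 31B dense":   {"baseline": 40.3,  "MTP": 125.3, "DFlash": 122.1},
        "Gemma 4 26B-A4B MoE": {"baseline": 177.1, "MTP": 264.2, "DFlash": 306.4},
    }
    for model, tps in results.items():
        for method in ("MTP", "DFlash"):
            print(f"{model}: {method} {tps[method] / tps['baseline']:.2f}x")
    # Gemma 4 31B dense:   MTP 3.11x, DFlash 3.03x
    # Gemma 4 26B-A4B MoE: MTP 1.49x, DFlash 1.73x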

It is noteworthy that gains were not uniform across workload types. Tasks such as coding, math, STEM, and reasoning benefited more, thanks to more predictable token patterns. Conversely, activities like creative writing, summarization, and roleplay showed smaller improvements, given the greater variability of plausible text continuations. Furthermore, a higher acceptance rate of draft tokens did not always translate into higher throughput: although MTP accepted more draft tokens, DFlash achieved better throughput on the MoE model. This is because DFlash generates the entire draft block in a single forward pass, whereas MTP drafts tokens one at a time. When the target model is already very fast, DFlash's cheaper draft path can tip the balance, even with a lower acceptance rate.
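
That trade-off can be made concrete with a toy per-step cost model. The numbers below are illustrative only, not measurements from the benchmark: they simply show how a drafter with lower acceptance can still deliver more tokens per unit of target-model time when its drafting overhead is smaller.

    # Toy cost model: tokens emitted per unit of target-model time.
    # "accepted" is the average number of draft tokens accepted per verification step;
    # "draft_overhead" is the drafting cost per step, as a fraction of one target forward pass.
    # All values are illustrative, not measurements from the benchmark.
    def tokens_per_target_pass(accepted: float, draft_overhead: float) -> float:
        # Each step yields the accepted draft tokens plus one token from the target model,
        # at the cost of one target forward pass plus the drafting overhead.
        return (accepted + 1) / (1 + draft_overhead)

    # Higher acceptance, but token-by-token drafting (higher per-step overhead)...
    print(tokens_per_target_pass(accepted=5.0, draft_overhead=0.9))   # ~3.2 tokens per pass
    # ...versus lower acceptance with a cheaper, single-pass draft block.
    print(tokens_per_target_pass(accepted=4.0, draft_overhead=0.3))   # ~3.8 tokens per pass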

Outlook for On-Premise Deployments and Final Considerations

The results of this benchmark offer valuable insights for professionals managing AI infrastructure. The clear indication is that there is no one-size-fits-all solution. The choice between MTP and DFlash, or other optimization techniques, heavily depends on the specific model, the prompts used, the available hardware, and the serving configuration. For those evaluating on-premise deployments, it is crucial to conduct thorough testing with their own technology stack and real-world workloads.

This empirical approach is essential for optimizing TCO and ensuring data sovereignty, which are central aspects for AI-RADAR. The availability of GitHub repositories with benchmark scripts, like the one used in this study, facilitates reproducibility and adaptation of tests to specific needs. Decisions based on concrete data tested in a controlled environment are key to successful and high-performing LLM implementations, both in air-gapped environments and hybrid configurations.