DeepSeek strengthens the open-source ecosystem with DeepSpec, a toolkit that tackles one of the most pressing problems in large-model inference: cutting latency without additional hardware investment. The codebase is designed to train and evaluate auxiliary (draft) models for speculative decoding, and it ships with ready-made checkpoints for Qwen3-4B, 8B, and 14B as well as Gemma-4-12B-it. Three algorithmic variants (Eagle3, DFlash, and DSpark) offer different trade-offs depending on the target model and operational constraints.

What’s inside DeepSpec

The repository includes everything needed to reproduce the paper’s results: data preparation utilities, draft model implementations, training code, and evaluation scripts. The released checkpoints were trained on “open-perfectblend” data generated by the corresponding target model in non-thinking mode, using the standard configurations under config/. An important caveat: if the target model will run in thinking mode (extended reasoning), DeepSeek strongly recommends fine-tuning the draft model again to keep it aligned with the target. A draft trained on non-thinking outputs is likely to predict thinking-mode tokens less accurately, so performance comparisons made without that step may not be meaningful.

Why speculative decoding matters for self-hosters

Speculative decoding reduces inference latency by having a smaller draft model propose a sequence of tokens that the large target model then verifies in parallel. Proposed tokens that the target agrees with are accepted together, so a single expensive forward pass can yield several tokens instead of one. The result is higher throughput on the same GPU resources. In on-premise settings, where every gigabyte of VRAM and every watt counts, this technique can lower TCO without forcing a move to smaller models. For enterprises, a standardized, transparent, open-source pipeline like DeepSpec makes it possible to train draft models for their own target models and on their own data, preserving sovereignty across the entire stack.
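The propose-and-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration of greedy speculative decoding and not DeepSpec's code: the toy_target and toy_draft functions are hypothetical stand-ins for real models, and the name speculative_decode and the draft length k are assumptions made for the example. A real system would score all drafted positions in one batched forward pass of the target; here that pass is simulated with a loop and counted once.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 3,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every proposed position. This loop stands in
        #    for a single batched forward pass, so it is counted once.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # always keep the target's own token
            if expected != proposal[i]:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every proposal matched, so the same pass yields one bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def toy_target(prefix: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Toy draft: agrees with the target except right after the token 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=12)
    print("tokens:", out)
    print("target passes:", passes, "vs", 12, "for plain greedy decoding")
```

Because the target's own token is kept at every position, the output matches plain greedy decoding exactly; the draft only changes how many target passes are needed. The better the draft predicts the target, the more tokens each pass accepts, which is why a draft model is trained against one specific target.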

DeepSeek’s multi-algorithm approach

Not all draft models are equal. Eagle3, DFlash, and DSpark adopt different architectures and alignment strategies, and the right choice depends on the target model and on latency or energy goals. DeepSpec supplies ready configurations for each combination, simplifying comparison and adoption. Moreover, training on synthetic data generated by the target model itself, in non-thinking mode, reduces the need for external datasets and lowers the barrier to entry for those wanting to experiment with speculative decoding in-house.
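When comparing draft variants, a rough back-of-the-envelope estimate can help frame the numbers. The sketch below uses the standard simplifying assumption from the speculative decoding literature that each drafted token is accepted independently with probability alpha; under that assumption, a draft of length k yields on average (1 - alpha^(k+1)) / (1 - alpha) tokens per target pass. The function name and the alpha values are illustrative only and are not measurements of Eagle3, DFlash, or DSpark.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha (a simplification; real acceptance varies by position).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the target's bonus token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative acceptance rates, not measured values.
    for alpha in (0.5, 0.7, 0.9):
        print(f"alpha={alpha:.1f}, k=4 -> {expected_tokens_per_pass(alpha, 4):.2f} tokens/pass")
```

The estimate ignores the cost of running the draft model itself, so it is an upper bound on the achievable speedup rather than a prediction; measured acceptance rates from the evaluation scripts are the figures to rely on.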

A push toward mature local inference

With DeepSpec, DeepSeek signals that speculative decoding is no longer an academic experiment but a component that can be integrated into production pipelines. Checkpoints for popular models like Qwen and Gemma speed up experimentation, while the code’s modular structure allows adaptation to target models that are not yet covered. For teams evaluating on-premise deployment, such tools are increasingly central: they enable performance gains without chasing cutting-edge hardware. AI-RADAR follows these developments closely because they redefine the boundaries of what is technically possible on owned stacks, balancing latency, cost, and control.