The promise of running a large language model with a one-million-token context locally usually collapses under VRAM demands. DeepSeek V4 Flash, with its DSA lightning indexer, was no exception: without optimization, reaching that context length required roughly 256 GB of video memory, a figure beyond even the best-equipped workstations. Then a patch arrived.

A community contributor spotted a gap in llama.cpp’s support for the model’s lightning indexer. Upstream pull request #24231 (by u/fairydreaming) laid the groundwork, but the indexer was not integrated into the model graph and lacked a CUDA path. The contributor finished the work by wiring the component into the graph and writing a custom CUDA kernel. Testing was done on an RTX 5090 with a 9950X3D CPU and 96 GB of DDR5, using a DeepSeek-V4-Flash Q8/Q4/Q2 mixed quantization prepared by antirez.

The numbers tell the story. At 256K context, the compute buffer plunged from 67 GiB (an out-of-memory failure) to 3.2 GiB; prefill jumped from 56 to roughly 263 tokens/s, while decode remained stable near 14 tokens/s. The most striking result is operation at 1 million tokens: previously impossible (needing ~256 GB), the compute buffer now fits in 3.75 GiB with ubatch 768 and about 6 GiB with ubatch 2048, with VRAM use peaking at 31 GB. Prefill speed dips to 159 tokens/s only because of the reduced ubatch needed on a 32 GB GPU; with more VRAM it would return to full speed.
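To put those figures in proportion, the ratios can be worked out directly. The snippet below is only arithmetic on the numbers quoted above; the variable names are illustrative and nothing here comes from the patch itself.

```python
# Figures quoted in the article (256K context unless noted).
BUFFER_BEFORE_GIB = 67.0    # compute buffer before the patch (out of memory)
BUFFER_AFTER_GIB = 3.2      # compute buffer after the patch
PREFILL_BEFORE_TPS = 56.0   # prefill tokens/s before
PREFILL_AFTER_TPS = 263.0   # prefill tokens/s after
PREFILL_1M_TPS = 159.0      # prefill tokens/s at 1M context, ubatch 768

# How many times smaller the compute buffer became.
buffer_reduction = BUFFER_BEFORE_GIB / BUFFER_AFTER_GIB

# How many times faster prefill became at 256K.
prefill_speedup = PREFILL_AFTER_TPS / PREFILL_BEFORE_TPS

# Fraction of the 256K prefill speed retained at 1M context.
prefill_1m_fraction = PREFILL_1M_TPS / PREFILL_AFTER_TPS

if __name__ == "__main__":
    print(f"compute buffer reduction: {buffer_reduction:.1f}x")
    print(f"prefill speedup:          {prefill_speedup:.1f}x")
    print(f"1M prefill vs 256K:       {prefill_1m_fraction:.0%}")
```

In other words, the compute buffer shrank by a factor of roughly 21, prefill at 256K sped up about 4.7 times, and the 1M-token run keeps about 60% of that prefill speed.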

Reliability was checked with a needle-in-haystack test: a random fact was inserted into documents of 100K, 512K, and 1M tokens and was retrieved correctly every time, even at 50% depth in the hardest case.
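For readers unfamiliar with the method, a needle-in-haystack check can be sketched in a few lines. This is a minimal illustration, not the harness used for the patch: the filler vocabulary, the "secret code" wording, and the `ask_model` callable are all hypothetical, whitespace-separated words stand in for real tokens, and `fake_model` is a stand-in that simply searches the prompt so the script runs without a model.

```python
import random
import re

# Hypothetical filler vocabulary; a real test would use natural text.
FILLER = ["alpha", "beta", "gamma", "delta", "epsilon"]


def build_haystack(n_words, needle, depth, rng):
    """Return n_words filler words with `needle` inserted at fractional `depth`."""
    words = [rng.choice(FILLER) for _ in range(n_words)]
    pos = int(n_words * depth)  # 0.0 = start of document, 1.0 = end
    return " ".join(words[:pos] + [needle] + words[pos:])


def run_trial(ask_model, n_words, depth, seed=0):
    """Hide a random code in a haystack and report whether the model returns it."""
    rng = random.Random(seed)
    secret = str(rng.randint(100000, 999999))
    needle = f"The secret code is {secret}."
    prompt = build_haystack(n_words, needle, depth, rng)
    prompt += "\n\nWhat is the secret code?"
    return secret in ask_model(prompt)


def fake_model(prompt):
    """Stand-in for a real model call (assumption): finds the needle by regex."""
    match = re.search(r"The secret code is (\d+)\.", prompt)
    return match.group(1) if match else ""


if __name__ == "__main__":
    for depth in (0.1, 0.5, 0.9):
        print(depth, run_trial(fake_model, 1000, depth))
```

To test an actual deployment, `fake_model` would be replaced by a function that sends the prompt to the running server and returns its reply, and `n_words` would be scaled up toward the target context length.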

For those evaluating on-premise deployment, the implications are tangible. Running million-token corpora on a single consumer GPU, a $2,000 RTX 5090, upends the total cost of ownership calculation and brings ultra-long-context inference within reach of law firms, insurance companies, and research labs that need to keep data inside their own walls. AI-RADAR has built analytical frameworks on /llm-onpremise for weighing these trade-offs, but the trend is clear: the open-source community is closing the gap ahead of vendors, dismantling hardware barriers that seemed insurmountable only yesterday. The patch is not yet officially merged and has been tested on just one GPU, but the signal is strong: the race for context length is no longer the exclusive preserve of data centers.