Optimizing LLM Prefill on Local Hardware

Optimizing Large Language Model (LLM) performance on consumer hardware is a key challenge for on-premise deployments. A particularly critical aspect is the "prefill" phase: the processing of the entire prompt, which must complete before the model can generate its first response token. The resulting latency can severely degrade the user experience, especially with contexts running to tens of thousands of tokens.

Against this backdrop, Luce-Org has introduced PFlash, a solution aimed at dramatically improving prefill efficiency. The project, built on a C++/CUDA stack, reports prefill up to 10 times faster than standard implementations such as llama.cpp, running on an NVIDIA RTX 3090 GPU with a 128,000-token context. This result is particularly relevant for organizations that want to keep control over their data and optimize the Total Cost of Ownership (TCO) of their AI infrastructure.

Technical and Architectural Details of PFlash

PFlash implements a "speculative prefill" approach for long-context prompts on quantized 27-billion-parameter models. The method relies on a small "drafter" model (a Qwen3-0.6B in BF16) which, loaded in-process, scores the importance of every token across the entire prompt. The larger "target" model (a Qwen3.6-27B in Q4_K_M) then prefills only the portions of the prompt the drafter deemed relevant.
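In rough terms, the control flow looks like the sketch below. All names and the constant-score stub are illustrative assumptions, not PFlash's actual API; the real drafter derives importance scores from the model's attention rather than returning placeholders.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Stub standing in for the small BF16 drafter: one cheap forward pass that
// scores the importance of every prompt token (placeholder values here).
std::vector<float> drafter_score(const std::vector<int>& prompt_ids) {
    return std::vector<float>(prompt_ids.size(), 1.0f);
}

// Stub standing in for the quantized 27B target: prefills the KV cache only
// at the kept positions instead of over the full prompt.
void target_prefill(const std::vector<int>& /*prompt_ids*/,
                    const std::vector<size_t>& /*kept_positions*/) {}

// Keep the top `keep_ratio` fraction of tokens by drafter score, then
// restore prompt order so positional information is preserved.
std::vector<size_t> select_tokens(const std::vector<float>& scores,
                                  float keep_ratio) {
    std::vector<size_t> order(scores.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return scores[a] > scores[b]; });
    order.resize(static_cast<size_t>(scores.size() * keep_ratio));
    std::sort(order.begin(), order.end());
    return order;
}

void speculative_prefill(const std::vector<int>& prompt_ids, float keep_ratio) {
    auto scores = drafter_score(prompt_ids);          // small drafter pass
    auto kept   = select_tokens(scores, keep_ratio);  // e.g. keep 30% of tokens
    target_prefill(prompt_ids, kept);                 // large quantized target
}
```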

This composition happens entirely in C++/CUDA, with no Python, Triton, or PyTorch in the inference loop. Luce-Org's implementation stands out for combining two recent research papers into a single open-source solution: "Speculative Prefill" (Liu et al.) and "FlashPrefill" (Fan et al.). The latter, in particular, addresses the O(S²) scaling of the drafter at long context lengths by using block-sparse attention.
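The intuition behind block-sparse attention can be shown with a simplified sketch (not FlashPrefill's actual kernel): the prompt is partitioned into fixed-size blocks, each key block is reduced to a cheap summary, and exact attention is computed only inside blocks whose coarse score clears a threshold. The 128-token block size, mean-key summary, and threshold test below are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Block size is an assumed constant; real kernels tune this to the hardware.
constexpr size_t kBlock = 128;

float dot(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// keys: one d-dimensional vector per token. Returns one mean vector per
// block, a cheap stand-in summary for deciding which blocks matter.
std::vector<std::vector<float>> block_key_summaries(
        const std::vector<std::vector<float>>& keys) {
    std::vector<std::vector<float>> summaries;
    for (size_t start = 0; start < keys.size(); start += kBlock) {
        size_t end = std::min(start + kBlock, keys.size());
        std::vector<float> mean(keys[0].size(), 0.0f);
        for (size_t t = start; t < end; ++t)
            for (size_t d = 0; d < mean.size(); ++d) mean[d] += keys[t][d];
        for (float& v : mean) v /= static_cast<float>(end - start);
        summaries.push_back(std::move(mean));
    }
    return summaries;
}

// For one query, return the indices of key blocks worth attending to; the
// exact (dense) attention then runs only inside these surviving blocks, so
// cost scales with kept blocks rather than all S x S token pairs.
std::vector<size_t> select_blocks(
        const std::vector<float>& query,
        const std::vector<std::vector<float>>& summaries,
        float threshold) {
    std::vector<size_t> kept;
    for (size_t b = 0; b < summaries.size(); ++b)
        if (dot(query, summaries[b]) >= threshold) kept.push_back(b);
    return kept;
}
```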

A further innovation concerns VRAM orchestration, essential for fitting both models on a single 24 GB consumer GPU such as the RTX 3090. The system loads and unloads weights between stages, allowing the entire pipeline to stay within the memory budget at a cost of approximately 3 seconds per request for park/unpark/free operations.
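Conceptually, the park/unpark cycle resembles the following CUDA runtime sketch. The function names and the one-buffer-at-a-time scheme are assumptions for illustration, not PFlash's actual orchestrator, but the three operations map onto the park/unpark/free cost mentioned above.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

struct ParkedWeights {
    void*  host  = nullptr;  // pinned host copy of the weights
    size_t bytes = 0;
};

// "Park": copy a device weight buffer to pinned host memory, then release
// its VRAM so the next stage's model can use that space.
ParkedWeights park(void* device_ptr, size_t bytes) {
    ParkedWeights p{nullptr, bytes};
    cudaMallocHost(&p.host, bytes);                                 // pinned alloc
    cudaMemcpy(p.host, device_ptr, bytes, cudaMemcpyDeviceToHost);  // D2H copy
    cudaFree(device_ptr);                                           // free VRAM
    return p;
}

// "Unpark": re-allocate VRAM and restore the parked weights onto the device.
void* unpark(ParkedWeights& p) {
    void* device_ptr = nullptr;
    cudaMalloc(&device_ptr, p.bytes);
    cudaMemcpy(device_ptr, p.host, p.bytes, cudaMemcpyHostToDevice);
    return device_ptr;
}

// "Free": drop the host copy once no later stage needs these weights.
void free_parked(ParkedWeights& p) {
    cudaFreeHost(p.host);
    p.host = nullptr;
}
```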

Implications for On-Premise Deployments

The optimizations introduced by PFlash matter directly to CTOs, DevOps leads, and infrastructure architects evaluating on-premise LLM deployments. The drastic reduction in Time To First Token (TTFT) improves the user experience of interactive applications and makes long-context models far more practical on local infrastructure. This is fundamental in scenarios requiring data sovereignty, regulatory compliance, or air-gapped environments, where cloud solutions are not an option.

From a TCO perspective, the ability to run large LLMs on existing consumer hardware such as an RTX 3090 can reduce the need to invest in more expensive enterprise-grade GPUs or adopt pay-as-you-go cloud services. The choice of a native C++/CUDA stack, free of high-level dependencies such as Python or PyTorch in the critical inference path, contributes to more efficient execution and lower overhead. For those weighing the trade-offs between self-hosted and cloud solutions, AI-RADAR offers analytical frameworks and insights at /llm-onpremise to support informed decisions.

Future Prospects and Optimizations

While initial single-needle NIAH (needle-in-a-haystack) benchmarks show excellent quality preservation, Luce-Org's developers acknowledge the need for further testing with more demanding metrics such as RULER and multi-needle NIAH for a comprehensive evaluation. Currently, the dominant cost in the 24.8-second TTFT at 128K tokens is drafter scoring, at approximately 12 seconds; target prefill on the selected tokens takes about 10 seconds, and memory orchestration accounts for the remaining roughly 3 seconds.

This suggests that future optimizations could focus on shrinking or distilling the drafter, an avenue the team has not yet explored. Luce-Org's work shows how integrating existing algorithms can unlock new capabilities for efficient LLM execution on local infrastructure. In the meantime, tuning parameters such as keep_ratio and DFLASH_FP_ALPHA let users balance performance against quality for their specific needs, as sketched below.
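As a minimal illustration of how such knobs might be wired in, the sketch below assumes they are exposed as environment variables. The PFLASH_KEEP_RATIO name, the default values, and the clamping range are invented for this example and are not PFlash's documented interface; only DFLASH_FP_ALPHA is mentioned by the project.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>

// Read a float from the environment, falling back to a default when unset.
float env_or_default(const char* name, float fallback) {
    const char* raw = std::getenv(name);
    return raw ? std::stof(raw) : fallback;
}

int main() {
    // Fraction of prompt tokens the target actually prefills
    // (higher = better quality, slower prefill). Name and range assumed.
    float keep_ratio = std::clamp(env_or_default("PFLASH_KEEP_RATIO", 0.3f),
                                  0.05f, 1.0f);
    // Sparsity knob for the drafter's block-sparse attention; the default
    // here is a placeholder, not a recommended value.
    float fp_alpha = env_or_default("DFLASH_FP_ALPHA", 0.5f);
    // ... pass both into the pipeline configuration ...
    (void)keep_ratio; (void)fp_alpha;
    return 0;
}
```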