GLM 5.2 local speeds: 7.8 tokens/sec with six RTX 3090s and 90K context

The community call and the first benchmark

On Reddit, user u/neverbyte made a simple, pragmatic request: “If you are able to run GLM 5.2 locally, can you share your inference engine, system specs, quantization, context size, and tokens per second?” The goal is clear: to map the model’s real-world performance outside the controlled conditions of official benchmarks, collecting data from concrete systems often assembled with consumer or refurbished hardware.

The first to answer was u/neverbyte themselves, with a configuration that already invites discussion: the llama.cpp inference engine, six RTX 3090 cards, 128 GB of DDR5 RAM, an i7-13700K processor, UD-IQ2_M quantization, and a context window pushed to 90,000 tokens with the K/V cache at Q8_0. Generation came in at 7.8 tokens per second, while prompt processing reached roughly 40 tokens per second.

An extreme configuration for an extreme context

The most striking choice is running the model with an ultra-wide context of 90K tokens alongside quantization at roughly 2 bits per weight. The UD-IQ2_M variant (from the Unsloth project) is one of the most aggressive precision reductions available, drastically compressing the parameters so the LLM fits into the available video memory. Six RTX 3090s provide 144 GB of VRAM (6 × 24 GB), and storing the K/V cache at 8 bits rather than 16 roughly halves its footprint, which suggests that the author of this setup prioritized context length over the quality of individual generated tokens.
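The memory arithmetic behind this trade-off can be sketched in a few lines of Python. The 24 GB per card and the Q8_0 block layout (32 values stored in 34 bytes) are known facts; the layer count, K/V head count, and head dimension in the example call are hypothetical placeholders, not GLM 5.2's real dimensions, and the formula assumes a conventional attention cache with one K and one V vector per layer and token.

```python
# Rough memory arithmetic for the six-GPU setup described above.
# Known: six RTX 3090 cards (24 GB each), K/V cache stored as Q8_0.
# ggml's Q8_0 packs 32 values into 34 bytes (32 int8 values + one fp16 scale).

NUM_GPUS = 6
VRAM_PER_GPU_GB = 24

F16_BITS_PER_VALUE = 16.0
Q8_0_BITS_PER_VALUE = 34 * 8 / 32  # 8.5 bits per cached value


def total_vram_gb(num_gpus: int = NUM_GPUS, per_gpu_gb: int = VRAM_PER_GPU_GB) -> int:
    """Aggregate VRAM across all cards, in GB."""
    return num_gpus * per_gpu_gb


def kv_cache_gib(n_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bits_per_value: float) -> float:
    """Size of a conventional K/V cache in GiB (K and V per layer, per token)."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value / 8 / 2**30


if __name__ == "__main__":
    print("Total VRAM (GB):", total_vram_gb())
    print("Q8_0 / f16 cache size ratio:", Q8_0_BITS_PER_VALUE / F16_BITS_PER_VALUE)

    # HYPOTHETICAL model dimensions, for illustration only.
    dims = dict(n_tokens=90_000, n_layers=60, n_kv_heads=8, head_dim=128)
    f16 = kv_cache_gib(**dims, bits_per_value=F16_BITS_PER_VALUE)
    q8 = kv_cache_gib(**dims, bits_per_value=Q8_0_BITS_PER_VALUE)
    print(f"K/V cache at f16: {f16:.1f} GiB, at Q8_0: {q8:.1f} GiB")
```

Whatever the real dimensions are, the ratio between the two cache formats stays the same, so a Q8_0 cache leaves room for almost twice as many tokens in the same memory.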

At 7.8 tokens per second, this is no speed record: in real-time chat, generation below about 10 tokens/s is noticeably slow and makes the interaction feel less fluid. It is worth remembering, however, that this is an entirely local system, with no network latency and full data sovereignty.

What it means for those evaluating on-premise deployment

For those considering bringing high-end models into a company or lab, the GLM 5.2 case is instructive. Six RTX 3090s on the used market carry a non-trivial aggregate cost, and with each card rated at 350 W the GPUs alone can draw up to roughly 2 kW under full load, which weighs on total cost of ownership (TCO). Extreme quantization helps contain the hardware investment, but introduces a trade-off: it reduces model accuracy, especially on complex tasks or long reasoning chains.
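To make the energy side of the TCO discussion concrete, here is a minimal sketch. The 350 W figure is NVIDIA's rated board power for the RTX 3090 and 7.8 tokens/s is the reported generation speed; the average power draw and electricity price in the example call are assumptions chosen purely for illustration, since the post does not report measured power and multi-GPU llama.cpp setups often run well below the rated maximum.

```python
# Power and energy arithmetic for a six-GPU inference rig.

GPU_COUNT = 6
RTX_3090_BOARD_POWER_W = 350   # rated board power per card
TOKENS_PER_SECOND = 7.8        # generation speed reported in the post


def peak_gpu_power_w(gpu_count: int = GPU_COUNT,
                     board_power_w: int = RTX_3090_BOARD_POWER_W) -> int:
    """Upper bound on GPU power draw if every card ran at its rated limit."""
    return gpu_count * board_power_w


def kwh_per_million_tokens(avg_power_w: float,
                           tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Energy needed to generate one million tokens at a given average draw."""
    seconds = 1_000_000 / tokens_per_second
    return avg_power_w * seconds / 3600 / 1000


def energy_cost_per_million_tokens(avg_power_w: float, price_per_kwh: float,
                                   tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Electricity cost for one million generated tokens."""
    return kwh_per_million_tokens(avg_power_w, tokens_per_second) * price_per_kwh


if __name__ == "__main__":
    print("Peak GPU power (W):", peak_gpu_power_w())

    # ASSUMPTIONS for illustration: 1000 W average system draw, 0.30 per kWh.
    assumed_avg_power_w = 1000.0
    assumed_price_per_kwh = 0.30
    kwh = kwh_per_million_tokens(assumed_avg_power_w)
    cost = energy_cost_per_million_tokens(assumed_avg_power_w, assumed_price_per_kwh)
    print(f"Energy per million tokens: {kwh:.1f} kWh, cost: {cost:.2f}")
```

Because energy per token scales inversely with throughput, a slow rig pays for each token twice: once in waiting time and once in electricity.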

AI-RADAR closely follows the real metrics provided by the community precisely because they are the litmus test of on-premise deployment choices. The question is not just “can it be run?” but “at what cost in terms of quality and latency?” – and numbers like these help shape a more realistic TCO analysis.

Long context vs. speed: the balancing point

Pushing the context window to 90,000 tokens on consumer GPUs is a significant technical achievement. It means being able to process the equivalent of hundreds of pages in a single request, enabling applications such as analyzing large legal documents, summarizing entire codebases, or searching extended knowledge bases. The price, however, is slow generation (7.8 tokens/s) and prompt processing at 40 tokens/s – numbers that, in production, might not be acceptable for interactive applications.
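A back-of-the-envelope calculation shows what these two rates mean for a single request. The sketch below uses only the reported figures (40 tokens/s for prompt processing, 7.8 tokens/s for generation) and assumes both rates stay constant over the whole request, which is optimistic because throughput usually drops as the context fills. The function names and the 500-token answer length are illustrative choices, not values from the post.

```python
# Latency estimate for one request, using the rates reported in the post.

PROMPT_TOKENS_PER_SECOND = 40.0      # prompt processing (prefill)
GENERATION_TOKENS_PER_SECOND = 7.8   # token generation (decode)


def prefill_seconds(prompt_tokens: int,
                    rate: float = PROMPT_TOKENS_PER_SECOND) -> float:
    """Time before the first output token appears."""
    return prompt_tokens / rate


def generation_seconds(output_tokens: int,
                       rate: float = GENERATION_TOKENS_PER_SECOND) -> float:
    """Time spent producing the answer itself."""
    return output_tokens / rate


def total_request_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill plus generation, assuming constant rates."""
    return prefill_seconds(prompt_tokens) + generation_seconds(output_tokens)


if __name__ == "__main__":
    full_context = 90_000   # the context size from the post
    answer_length = 500     # illustrative answer length

    wait = prefill_seconds(full_context)
    print(f"Prefill for {full_context} tokens: {wait:.0f} s ({wait / 60:.1f} min)")
    print(f"Generating {answer_length} tokens: {generation_seconds(answer_length):.0f} s")
    print(f"Total: {total_request_seconds(full_context, answer_length) / 60:.1f} min")
```

Under these assumptions, filling the full 90,000-token window takes 2,250 seconds, or about 37.5 minutes, before the first output token appears. That makes prompt processing, rather than generation speed, the dominant cost for long-document workloads on this setup.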

The debate opened by u/neverbyte is therefore not just a tinkerer’s curiosity: it shows that the next frontier for on-premise LLM deployment will be finding a balance between context, speed, and accuracy, leveraging advanced quantization like that offered by Unsloth, while never forgetting that every bit lost means lost information.

A look at the bigger picture

This first community benchmark reminds us that the local inference ecosystem is built on incremental choices: from the engine selection (llama.cpp, vLLM, TGI) to the GPU combination, through quantization and memory management. GLM 5.2, part of the GLM family developed by Zhipu AI (Z.ai), a Tsinghua University spin-off, and distributed under an open license, lends itself well to such experiments precisely because its architecture is well-suited to being compressed and adapted to extended contexts.

The most useful takeaway for the AI-RADAR audience is confirmation that even with dated consumer-grade hardware (the RTX 3090 launched in 2020) it is possible to work with very large contexts, provided one accepts speed compromises and invests in a multi-GPU configuration. The Reddit discussion, with the promise of new metrics from other users, will be a useful gauge of whether 7.8 tokens/sec is an outlier or the typical value for this class of deployment.