GLM-4.7-Flash performance drop with extended contexts

A user reports a performance drop in the GLM-4.7-Flash model as the context length increases. The tests were performed on a system with three NVIDIA GeForce RTX 3090 GPUs, each with compute capability 8.6 and VMM enabled.

Benchmarks and results

Benchmarks performed with llama-bench show a steep decrease in tokens per second (t/s) as the context size grows. For example, with a 200-token prompt the prompt-processing speed is around 1,985 t/s, but it falls to around 350 t/s once the context reaches 50,000 tokens, a slowdown of roughly 5.7x. This suggests that processing longer contexts introduces significant overhead.
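As a rough illustration of what that slowdown means in wall-clock terms, the small Python sketch below works only from the two throughput figures quoted above; the variable names are illustrative and are not part of llama-bench output.

```python
# Back-of-the-envelope check based on the llama-bench figures quoted above.
# These names are illustrative only; they do not come from any tool output.

short_ctx_tps = 1985.0   # prompt-processing speed with a short (~200 token) prompt
long_ctx_tps = 350.0     # prompt-processing speed at a 50,000-token context depth

slowdown = short_ctx_tps / long_ctx_tps
print(f"Prompt processing is roughly {slowdown:.1f}x slower at 50k context")

# Implied time to process a 50,000-token prompt at each rate:
print(f"{50_000 / short_ctx_tps:.0f} s at the short-context rate")   # ~25 s
print(f"{50_000 / long_ctx_tps:.0f} s at the long-context rate")     # ~143 s
```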

Resource consumption analysis

Resource-consumption analysis during real-world use with a 200,000-token context window showed a prompt evaluation time of 10238.44 ms for 3136 tokens (approximately 306.30 tokens per second) and a generation (eval) time of 11570.90 ms for 355 tokens (approximately 30.68 tokens per second).
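For reference, the per-second rates follow directly from the reported token counts and millisecond timings. The short sketch below is only a sanity check of that arithmetic, using the figures quoted above.

```python
# Sanity check: convert the reported timings into tokens per second.

def tokens_per_second(n_tokens: int, time_ms: float) -> float:
    """Token count divided by elapsed time in seconds."""
    return n_tokens / (time_ms / 1000.0)

print(f"prompt eval: {tokens_per_second(3136, 10238.44):.2f} t/s")  # ~306.30
print(f"eval:        {tokens_per_second(355, 11570.90):.2f} t/s")   # ~30.68
```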