GLM-4.7-Flash performance drop with extended contexts
A user reports that the GLM-4.7-Flash model slows down noticeably as context length increases. The tests ran on a system with three NVIDIA GeForce RTX 3090 GPUs, each with compute capability 8.6 and VMM enabled.
Benchmarks and results
Benchmarks run with llama-bench show tokens per second (t/s) falling sharply as the context size grows. With a 200 token prompt, processing speed is around 1985 t/s, but it drops to around 350 t/s with a 50000 token context. That is less than a fifth of the short-prompt speed, which suggests that the per-token cost of processing rises substantially with context length.
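To put the drop in proportion, the two figures quoted above can be compared directly. The following Python sketch uses only the approximate numbers from the text. The helper name `slowdown` is illustrative and is not part of llama-bench.

```python
def slowdown(baseline_tps: float, degraded_tps: float) -> float:
    """Return how many times slower the degraded rate is than the baseline."""
    if baseline_tps <= 0 or degraded_tps <= 0:
        raise ValueError("throughput values must be positive")
    return baseline_tps / degraded_tps


# Approximate figures quoted in the text (tokens per second).
short_prompt_tps = 1985.0   # 200 token prompt
long_context_tps = 350.0    # 50000 token context

factor = slowdown(short_prompt_tps, long_context_tps)
retained = long_context_tps / short_prompt_tps

print(f"slowdown: {factor:.2f}x, throughput retained: {retained:.1%}")
```

With these inputs the slowdown is roughly 5.7x, meaning the model keeps a little under 18% of its short-prompt throughput at 50000 tokens.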
Resource consumption analysis
Measurements taken during real use of the model with a 200000 token context showed a prompt evaluation time of 10238.44 ms for 3136 tokens (approximately 306.30 t/s) and an evaluation (token generation) time of 11570.90 ms for 355 tokens (approximately 30.68 t/s).
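The per-second rates above follow directly from the reported token counts and timings. The short Python sketch below reproduces that arithmetic. The function name `tokens_per_second` is an illustrative helper, not something taken from the llama.cpp tooling.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and an elapsed time in milliseconds to tokens/s."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return tokens / (elapsed_ms / 1000.0)


# Figures reported in the text.
prompt_tps = tokens_per_second(3136, 10238.44)      # prompt evaluation
generation_tps = tokens_per_second(355, 11570.90)   # token generation

print(f"prompt eval: {prompt_tps:.2f} t/s")
print(f"generation:  {generation_tps:.2f} t/s")
```

Rounded to two decimals, the results match the quoted 306.30 and 30.68 t/s.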