GLM-4.7-Flash performance drop with extended contexts
A user has experienced a performance drop in the GLM-4.7-Flash model as the context length increases. The tests were performed on a system equipped with three NVIDIA GeForce RTX 3090 GPUs, each with compute capability 8.6 and VMM enabled.
Benchmarks and results
Benchmarks performed with llama-bench show a significant decrease in tokens per second (t/s) as the context size grows. For example, with a 200-token prompt the processing speed is around 1985 t/s, but it drops to around 350 t/s once the context reaches 50,000 tokens. This indicates that processing longer contexts introduces substantial overhead.
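A similar sweep can be reproduced by running llama-bench over increasing prompt sizes and comparing throughput against the shortest one. The sketch below is a minimal example, assuming llama-bench is on the PATH, a hypothetical GGUF path, and JSON output field names (n_prompt, avg_ts) as produced by a recent llama.cpp build; these may differ in other versions.

import json
import subprocess

MODEL = "glm-4.7-flash.gguf"  # hypothetical path: point this at your local GGUF
PROMPT_SIZES = "200,2048,8192,32768,50000"  # prompt lengths to benchmark

# Run llama-bench once, testing only prompt processing (-n 0 skips generation),
# and ask for machine-readable JSON output.
result = subprocess.run(
    ["llama-bench", "-m", MODEL, "-p", PROMPT_SIZES, "-n", "0", "-o", "json"],
    capture_output=True, text=True, check=True,
)

runs = json.loads(result.stdout)
baseline = None
for run in runs:
    n_prompt = run["n_prompt"]   # prompt length of this run (assumed field name)
    tps = run["avg_ts"]          # average tokens/second (assumed field name)
    baseline = baseline or tps
    print(f"prompt={n_prompt:>6} tokens  {tps:8.1f} t/s  ({tps / baseline:.2f}x of baseline)")

If the reported slowdown holds, the ratio column should fall well below 1.0 as the prompt length approaches 50,000 tokens.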
Resource consumption analysis
Analysis of resource consumption during real-world use with a 200,000-token context showed a prompt evaluation time of 10238.44 ms for 3136 tokens (approximately 306.30 tokens per second) and a generation (eval) time of 11570.90 ms for 355 tokens (approximately 30.68 tokens per second).
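The throughput figures follow directly from the reported timings; a minimal sketch to reproduce the arithmetic (values taken verbatim from the report):

# Derive tokens/second from the reported (milliseconds, tokens) pairs.
timings = {
    "prompt eval": (10238.44, 3136),
    "eval":        (11570.90, 355),
}
for phase, (ms, tokens) in timings.items():
    tps = tokens / (ms / 1000.0)
    print(f"{phase:>11}: {tokens} tokens in {ms:.2f} ms -> {tps:.2f} t/s")

This yields roughly 306.30 t/s for prompt evaluation and 30.68 t/s for generation, matching the figures above.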