# GLM-4.7-Flash: impressive benchmarks on H200 and RTX 6000 Ada
## GLM-4.7-Flash Benchmarks: High Performance on Various GPUs
New benchmarks of the GLM-4.7-Flash model highlight its capabilities across different hardware configurations. The tests, run with vLLM and llama.cpp, show strong results on both high-end data-center GPUs and more accessible workstation cards.
On a single H200 SXM GPU, GLM-4.7-Flash reached a peak throughput of 4,398 tokens per second (tok/s) with unbounded concurrency. With 32 concurrent users, aggregate generation throughput was 2,267 tok/s at a time to first token (TTFT) of 85 ms.
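To make the metrics concrete, here is a minimal sketch of how TTFT and aggregate throughput can be measured against a vLLM server through its OpenAI-compatible streaming API. The endpoint URL, served model name, prompt, and the one-token-per-chunk approximation are all assumptions; vLLM's own benchmark_serving.py script is the more rigorous tool.

```python
# Sketch: measure mean TTFT and approximate aggregate tok/s for N
# concurrent streaming requests against a vLLM OpenAI-compatible server.
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumed local vLLM endpoint; vLLM ignores the API key by default.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_stream(prompt: str) -> tuple[float, int, float]:
    """Return (ttft_seconds, completion_chunks, elapsed_seconds) for one request."""
    start = time.perf_counter()
    ttft = None
    chunks = 0
    stream = await client.chat.completions.create(
        model="GLM-4.7-Flash",  # assumed served model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start
            chunks += 1  # roughly one token per streamed chunk
    return ttft or 0.0, chunks, time.perf_counter() - start


async def main(concurrency: int = 32) -> None:
    results = await asyncio.gather(
        *[one_stream("Explain KV caching in one paragraph.") for _ in range(concurrency)]
    )
    total_tokens = sum(r[1] for r in results)
    wall = max(r[2] for r in results)  # streams start together, so max ~ wall time
    print(f"mean TTFT: {sum(r[0] for r in results) / len(results) * 1000:.0f} ms")
    print(f"aggregate throughput: {total_tokens / wall:.0f} tok/s (approx.)")


asyncio.run(main())
```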
On the RTX 6000 Ada (48GB) GPU, using Unsloth dynamic quantization and llama.cpp with a 16K context, the model generated 112 tok/s with the Q4_K_XL quant. Throughput remained high under heavier quantization schemes as well: 100 tok/s with Q6_K_XL and 91 tok/s with Q8_K_XL.
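For the llama.cpp side, single-stream tok/s can be reproduced with the bundled llama-bench tool, or from Python via the llama-cpp-python bindings as in the sketch below. The GGUF file name is a placeholder (Unsloth publishes its dynamic quants as GGUF files on Hugging Face), and the exact numbers will depend on hardware and build flags.

```python
# Sketch: single-stream generation throughput with llama-cpp-python.
import time

from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="GLM-4.7-Flash-UD-Q4_K_XL.gguf",  # assumed local GGUF file
    n_ctx=16384,      # 16K context, matching the reported setup
    n_gpu_layers=-1,  # offload all layers to the GPU
    verbose=False,
)

start = time.perf_counter()
out = llm("Summarize the benefits of quantization.", max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.0f} tok/s")
```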
These results suggest that GLM-4.7-Flash is a versatile model, delivering solid performance across contexts: from high-throughput serving on data-center GPUs to local inference on workstation-class cards.