GLM 4.7 and llama.cpp: Usage Instructions
A user has shared a guide for running the GLM 4.7 model correctly on llama.cpp, with Flash Attention enabled to speed up inference. The configuration was tested on an RTX 6000 Blackwell GPU.
Configuration
To enable Flash Attention on CUDA, build llama.cpp from this Git branch:
https://github.com/am17an/llama.cpp/tree/glm_4.7_headsize
In addition, you need to launch with the following option:
--override-kv deepseek2.expert_gating_func=int:2
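The steps above can be sketched as a shell session. The CMake invocation and the llama-server binary follow the standard llama.cpp build workflow and are assumptions not stated in the guide; the model path is a placeholder, and the exact Flash Attention flag spelling (`-fa` vs. `--flash-attn`) may vary by version:

```shell
# Build the patched branch with CUDA support (standard llama.cpp CMake flow;
# assumed, not specified by the original guide)
git clone --branch glm_4.7_headsize https://github.com/am17an/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run with Flash Attention and the gating-function override;
# the model path below is a placeholder
./build/bin/llama-server \
  -m /path/to/glm-4.7.gguf \
  --flash-attn \
  --override-kv deepseek2.expert_gating_func=int:2
```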
Performance
With this configuration, you can achieve over 2000 tokens per second during prompt processing and 97 tokens per second during token generation.
Quantization Warning
The user warns that the existing quantized models may have been created with the wrong gating function. If so, you will need to wait for them to be regenerated; otherwise the model produces nonsensical output.