## GLM 4.7 and llama.cpp: Usage Instructions

A user has shared a guide to get the GLM 4.7 model working correctly on llama.cpp, using Flash Attention to accelerate performance. The configuration was tested on an RTX 6000 Blackwell GPU.

## Configuration

To enable Flash Attention on CUDA, you need to build from this Git branch: [https://github.com/am17an/llama.cpp/tree/glm_4.7_headsize](https://github.com/am17an/llama.cpp/tree/glm_4.7_headsize)

In addition, you need to pass the following option when loading the model (see the sketch at the end of this section):

`--override-kv deepseek2.expert_gating_func=int:2`

## Performance

With this configuration, throughput exceeds 2000 tokens per second for prompt processing and reaches 97 tokens per second during generation.

## Quantization Warning

The user warns that the existing quants may have been created with the wrong gating function. If so, you will need to wait for them to be recreated to avoid nonsensical outputs.
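
As a rough illustration, the commands below sketch the full sequence under the assumptions above: clone and build the linked branch with CUDA enabled, then launch `llama-server` with Flash Attention and the gating-function override. The model path, context size, and GPU-layer count are placeholders, and the exact spelling of the Flash Attention flag may differ between llama.cpp versions.

```sh
# Clone the branch that adds GLM 4.7 head-size support for Flash Attention
git clone --branch glm_4.7_headsize https://github.com/am17an/llama.cpp.git
cd llama.cpp

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Launch the server; model path, -ngl, and -c values are placeholders to adapt
./build/bin/llama-server \
  -m /path/to/GLM-4.7.gguf \
  -ngl 99 \
  -c 32768 \
  -fa on \
  --override-kv deepseek2.expert_gating_func=int:2
```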