# GLM 4.7: How to Run with llama.cpp and Flash Attention
## GLM 4.7 and llama.cpp: Usage Instructions
A user has shared a guide for running the GLM 4.7 model correctly on llama.cpp, using Flash Attention to speed up inference. The configuration was tested on an RTX 6000 Blackwell GPU.
## Configuration
To enable Flash Attention on CUDA, you need to build llama.cpp from this Git branch:
[https://github.com/am17an/llama.cpp/tree/glm_4.7_headsize](https://github.com/am17an/llama.cpp/tree/glm_4.7_headsize)
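As a point of reference, here is a minimal build sketch, assuming a Linux shell with git, CMake, and the CUDA toolkit available; the branch name comes from the URL above, and the rest follows llama.cpp's standard CUDA build procedure:

```bash
# Clone the fork at the branch that carries the GLM 4.7 head-size fix
git clone --branch glm_4.7_headsize https://github.com/am17an/llama.cpp.git
cd llama.cpp

# Standard llama.cpp CUDA build; GGML_CUDA=ON enables the CUDA backend
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```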
In addition, you need to add the following option:
`--override-kv deepseek2.expert_gating_func=int:2`
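For illustration, a hedged sketch of a full `llama-server` launch that combines this override with Flash Attention; the GGUF filename and context size are placeholders, and on older builds the Flash Attention switch is a bare `-fa` rather than `-fa on`:

```bash
# Placeholder model path: substitute whichever GLM 4.7 quant you actually have.
# -fa on enables Flash Attention (older builds expect a bare -fa),
# -ngl 99 offloads all layers to the GPU, -c sets the context size,
# and --override-kv sets the expert gating function metadata to 2 (sigmoid).
./build/bin/llama-server \
  -m ./models/GLM-4.7-Q4_K_M.gguf \
  -fa on \
  -ngl 99 \
  -c 32768 \
  --override-kv deepseek2.expert_gating_func=int:2
```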
## Performance
With this configuration, you can achieve over 2,000 tokens per second for prompt processing and around 97 tokens per second for generation.
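To get rough numbers on your own hardware, a sketch using `llama-cli`, which prints prompt-processing and generation rates in its timing summary at the end of a run; the model path is the same placeholder as above:

```bash
# Same placeholder model path as above; the timing summary at the end of the run
# reports tokens/s for prompt processing and for generation separately.
./build/bin/llama-cli \
  -m ./models/GLM-4.7-Q4_K_M.gguf \
  -fa on \
  -ngl 99 \
  --override-kv deepseek2.expert_gating_func=int:2 \
  -n 256 \
  -p "Explain Flash Attention in one paragraph."
```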
## Quantization Warning
The user warns that the existing quants may have been created with the wrong gating function. If so, they will produce nonsensical output, and you will need to wait for requantized versions.
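One way to check a downloaded quant before running it (an assumption on my part, not part of the original post) is to inspect its metadata with the `gguf-dump` script from llama.cpp's `gguf` Python package and look for the gating-function key:

```bash
# The gguf package ships with llama.cpp under gguf-py/ and provides gguf-dump.
pip install gguf

# Dump the metadata of the (placeholder) quant and look for the gating-function key;
# a value of 2 matches the sigmoid gating that the override above forces.
gguf-dump ./models/GLM-4.7-Q4_K_M.gguf | grep expert_gating_func
```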