GLM 4.7 and llama.cpp: Usage Instructions
A user has shared a guide for running the GLM 4.7 model correctly on llama.cpp, with Flash Attention enabled to speed up inference. The configuration was tested on an RTX 6000 Blackwell GPU.
Configuration
To enable Flash Attention on CUDA, build llama.cpp from this Git branch:
https://github.com/am17an/llama.cpp/tree/glm_4.7_headsize
In addition, you need to launch with the following option:
--override-kv deepseek2.expert_gating_func=int:2
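The steps above can be sketched as a shell session. The CMake invocation and the llama-server binary follow the standard llama.cpp build workflow and are assumptions not stated in the guide; the model path is a placeholder, and the exact Flash Attention flag spelling (`-fa` vs. `--flash-attn`) may vary by version:

```shell
# Build the patched branch with CUDA support (standard llama.cpp CMake flow;
# assumed, not specified by the original guide)
git clone --branch glm_4.7_headsize https://github.com/am17an/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run with Flash Attention and the gating-function override;
# the model path below is a placeholder
./build/bin/llama-server \
  -m /path/to/glm-4.7.gguf \
  --flash-attn \
  --override-kv deepseek2.expert_gating_func=int:2
```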
Performance
With this configuration, you can achieve over 2000 tokens per second during prompt processing and 97 tokens per second during token generation.
Quantization Warning
The user warns that the existing quantized models may have been created with the wrong gating function. If so, you will need to wait for them to be regenerated; otherwise the model produces nonsensical output.