A user on Reddit described how to run a Claude Code-style workflow entirely locally, using OpenCode as the coding client, llama.cpp as the inference server, and the GLM-4.7 Flash model. The goal is to reproduce the development experience offered by Claude Code while relying only on local computing resources.

Configuration and parameters

The setup runs llama.cpp's llama-server with CUDA GPU acceleration. The key parameters are listed below; the full invocation is reconstructed after the list.

  • CUDA_VISIBLE_DEVICES=0,1,2: restricts the server to GPUs 0, 1, and 2.
  • llama-server --jinja --host 0.0.0.0: starts the llama.cpp server, applies the model's built-in Jinja chat template (required for tool calling), and binds to all network interfaces.
  • -m /mnt/models1/GLM/GLM-4.7-Flash-Q8_0.gguf: loads the GLM-4.7 Flash model in 8-bit (Q8_0) GGUF quantization.
  • --ctx-size 200000: sets the context window to 200,000 tokens.
  • --parallel 1 --batch-size 2048 --ubatch-size 1024: serves one request at a time and sets the logical and physical batch sizes used during prompt processing.
  • --flash-attn on: enables FlashAttention to speed up attention and reduce memory use.
  • --cache-ram 61440: allows up to 61,440 MB of RAM (about 60 GB) for the prompt cache.
  • --context-shift: enables context shifting, so the oldest tokens are dropped when the context window fills instead of the request failing.

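Put together, the flags above correspond to a single llama-server invocation along the following lines. This is a reconstruction from the parameters quoted in the post, not a verbatim command: the flag order and line breaks are editorial, and the listening port is assumed to be llama-server's default (8080), since none is specified.

    # Reconstructed from the flags listed above; ordering and line breaks are
    # editorial, and the default port 8080 is assumed because no --port is given.
    CUDA_VISIBLE_DEVICES=0,1,2 llama-server \
      --jinja \
      --host 0.0.0.0 \
      -m /mnt/models1/GLM/GLM-4.7-Flash-Q8_0.gguf \
      --ctx-size 200000 \
      --parallel 1 \
      --batch-size 2048 \
      --ubatch-size 1024 \
      --flash-attn on \
      --cache-ram 61440 \
      --context-shift
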
The configuration shows that, with the right tools and a working knowledge of the parameters, workflows typically associated with cloud services can be reproduced locally, while keeping full control over data and infrastructure.
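As a practical note, OpenCode talks to this server through llama.cpp's OpenAI-compatible HTTP API. A quick way to verify that the endpoint is reachable before wiring up the client is a request like the one below; the hostname, port, and model name are assumptions, since the post does not show the client-side configuration.

    # Hypothetical smoke test against llama-server's OpenAI-compatible endpoint,
    # assumed to be listening on the default port 8080 on the local machine.
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "Hello"}]
      }'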