Local LLM Inference with Limited Resources
A user described their setup for running the Qwen 3.5 35B language model locally on an RTX 4060m GPU with only 8GB of VRAM. The goal is an efficient agentic development environment that overcomes the limitations they encountered with cloud-based solutions.
Hardware Configuration and Optimizations
The system is a Lenovo Legion equipped with an Intel i9-14900HX processor (with E-cores disabled) and 32GB of DDR5 RAM. To optimize model performance, the user ran llama.cpp with the following parameters:
-ngl 99 --n-cpu-moe 40 -c 192000 -t 12 -tb 16 -b 4096 --ubatch-size 2048 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --mlock
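Assuming these flags are passed to llama.cpp's server binary, the full invocation would look roughly like the sketch below; the model file name is a placeholder, since the source does not give one, and only the flags themselves come from the article.

```shell
# Sketch only: "llama-server" and the model file name are assumptions;
# the flags are the ones reported in the article.
llama-server -m qwen-model.gguf \
  -ngl 99 --n-cpu-moe 40 \
  -c 192000 -t 12 -tb 16 \
  -b 4096 --ubatch-size 2048 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --mlock
```

In broad terms, `-ngl 99` offloads as many layers as fit to the GPU while `--n-cpu-moe 40` keeps the MoE expert weights of the first 40 layers in system RAM, `-c 192000` requests a 192k-token context, `-t 12 -tb 16` set generation and batch-processing thread counts, `-b`/`--ubatch-size` set logical and physical batch sizes, the `q8_0` cache types quantize the KV cache to 8 bits, and `--mlock` pins the model in memory to prevent swapping.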
These settings yield roughly 700 tokens/s during prompt processing and 42 tokens/s during token generation. The user is weighing whether this local configuration is preferable to smaller, faster cloud-hosted models, given that data privacy is not a top priority in their use case.
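The q8_0 KV-cache flags are what make a 192k-token context plausible next to an 8GB card. A back-of-the-envelope sketch of the cache size follows; the layer count, KV-head count, and head dimension are illustrative assumptions, not Qwen's published configuration, and q8_0 is approximated as 1 byte per element (its block scales add a small overhead on top of that).

```python
# Rough KV-cache size estimate for a long context with a quantized cache.
# Model dimensions below are illustrative assumptions, not Qwen's actual config.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # K and V each store n_ctx * n_kv_heads * head_dim elements per layer,
    # hence the factor of 2.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

ctx = 192_000
fp16 = kv_cache_bytes(48, 4, 128, ctx, 2)  # 16-bit cache
q8 = kv_cache_bytes(48, 4, 128, ctx, 1)    # q8_0 at ~1 byte/element
print(f"fp16: {fp16 / 2**30:.1f} GiB, q8_0: {q8 / 2**30:.1f} GiB")
```

Under these assumed dimensions the 16-bit cache would need roughly twice the memory of the q8_0 cache, which is why quantizing both K and V caches is a common lever when VRAM is the binding constraint.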