A user reported successfully running the Qwen3 30B language model on an 8 GB Raspberry Pi 5, reaching 7-8 tokens per second.
Implementation Details
The implementation includes:
- An SSD for faster storage.
- The official active cooler for Raspberry Pi 5.
- A custom build of ik_llama.cpp.
- Prompt caching.
The model used is byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF, specifically the Q3_K_S quantization at roughly 2.66 bits per weight. With a 4-bit quantization of the same model family, the user reports 4-5 tokens per second.
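A quick back-of-the-envelope check makes the hardware choices plausible. As a rough rule of thumb (ignoring file metadata and the KV cache), a quantized model file occupies about parameters × bits-per-weight / 8 bytes. For a ~30B-parameter model at 2.66 bits per weight that is close to 10 GB, more than the Pi's 8 GB of RAM, which is one reason a fast SSD matters: llama.cpp-family runtimes memory-map the weight file by default. The function below is an illustrative sketch, not taken from the article:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized model file size in gigabytes (1 GB = 1e9 bytes).

    Ignores metadata, embeddings kept at higher precision, and KV cache,
    so treat the result as a lower-bound estimate.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Q3_K_S at ~2.66 bits per weight for a ~30B-parameter model
print(f"{quantized_size_gb(30e9, 2.66):.2f} GB")  # ~9.98 GB
```

Since Qwen3-30B-A3B is a mixture-of-experts model with only ~3B parameters active per token, the hot working set per step is far smaller than the full file, which helps explain the usable speed despite the tight RAM budget.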
Potato OS
The whole thing is packaged as a flashable headless Debian image called Potato OS. On first boot it automatically downloads Qwen3.5 2B with a vision encoder. Through the web interface, you can select a different model, paste a Hugging Face URL, or upload a model file over the LAN. The system exposes an OpenAI-compatible API on the local network.
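Because the server speaks the OpenAI chat-completions wire format, any OpenAI-style client on the LAN can talk to it. The sketch below only builds the request an OpenAI-compatible endpoint expects; the host, port, and model name are hypothetical placeholders, so substitute whatever the Potato OS web interface actually reports:

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build an OpenAI-style chat-completions request for a local server.

    base_url and model are placeholders for whatever Potato OS serves;
    send the returned body as a POST with Content-Type: application/json.
    """
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return url, json.dumps(payload).encode("utf-8")

# Hypothetical address for a Pi on the local network
url, body = build_chat_request("http://raspberrypi.local:8080", "qwen", "Hello!")
print(url)
```

Pointing an existing OpenAI SDK at the same base URL (with a dummy API key) should also work, since that is the point of exposing the OpenAI-compatible surface.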
Considerations
For those evaluating on-premise deployments, there are trade-offs among performance, cost, and data control. AI-RADAR offers analytical frameworks at /llm-onpremise to evaluate these aspects.