A user reported successfully running the Qwen3 30B language model on an 8GB Raspberry Pi 5, achieving a speed of 7-8 tokens per second.

Implementation Details

The implementation includes:

  • An SSD for faster storage.
  • The official active cooler for Raspberry Pi 5.
  • A custom build of ik_llama.cpp.
  • Prompt caching.

The model used is byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF, specifically the Q3_K_S quantization at 2.66 bits per weight. According to the user, a 4-bit quantization of the same model family yields roughly 4-5 tokens per second.
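A back-of-the-envelope calculation suggests why the SSD matters: at these bit widths the weights do not fit in the Pi's 8 GB of RAM, so llama.cpp-style runtimes memory-map the GGUF file and page weights in from disk. This is a sketch; the ~30.5B parameter count and the "pure" bits-per-weight figures are assumptions, and real GGUF files carry extra metadata and mixed-precision tensors.

```python
# Rough size estimate for a quantized model (sketch; the 30.5e9
# parameter count and the bits-per-weight values are assumptions).
def model_size_gb(params: float, bits_per_weight: float) -> float:
    # bits -> bytes -> gigabytes
    return params * bits_per_weight / 8 / 1e9

q3 = model_size_gb(30.5e9, 2.66)  # Q3_K_S as reported
q4 = model_size_gb(30.5e9, 4.5)   # a typical ~4-bit K-quant

print(f"Q3_K_S ~{q3:.1f} GB, 4-bit ~{q4:.1f} GB, vs 8 GB RAM")
```

Both estimates exceed 8 GB, consistent with the SSD requirement. Note also that the model is a mixture of experts (the "A3B" suffix indicates roughly 3B active parameters per token), so only a fraction of the weights is touched at each step, which helps explain the usable token rate.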

Potato OS

The whole setup is packaged as a flashable, headless Debian image called Potato OS. On first boot it automatically downloads Qwen3.5 2B with a vision encoder; through the web interface it is possible to select a different model, paste a Hugging Face URL, or upload one over the LAN. The image exposes an OpenAI-compatible API on the local network.
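Because the API is OpenAI-compatible, any standard client can talk to the Pi. A minimal sketch using only the Python standard library (the hostname `potato.local`, port `8080`, and model name are assumptions; substitute whatever the web interface reports):

```python
import json
import urllib.request

# Hypothetical endpoint -- adjust host and port to your Pi's address.
BASE_URL = "http://potato.local:8080/v1"

def chat_request(prompt: str, model: str = "qwen") -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request (not yet sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("Hello from the LAN")
# To actually send it:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

The same endpoint works with the official `openai` client library by setting its `base_url` to the Pi's address.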

Considerations

For those evaluating on-premise deployments, the trade-offs are between performance, cost, and data control; AI-RADAR's /llm-onpremise page offers analytical frameworks for weighing these aspects.