A user reported successfully running the Qwen3 30B language model on an 8 GB Raspberry Pi 5, reaching 7-8 tokens per second.
Implementation Details
The implementation includes:
- An SSD for faster storage.
- The official active cooler for Raspberry Pi 5.
- A custom build of ik_llama.cpp.
- Prompt caching.
The model used is byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF, specifically the Q3_K_S quantization at roughly 2.66 bits per weight. With a 4-bit quantization of the same model family, the user reports 4-5 tokens per second.
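A quick back-of-the-envelope check makes the hardware choices plausible. As a rough rule of thumb (ignoring file metadata and the KV cache), a quantized model file occupies about parameters × bits-per-weight / 8 bytes. For a ~30B-parameter model at 2.66 bits per weight that is close to 10 GB, more than the Pi's 8 GB of RAM, which is one reason a fast SSD matters: llama.cpp-family runtimes memory-map the weight file by default. The function below is an illustrative sketch, not taken from the article:

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized model file size in gigabytes (1 GB = 1e9 bytes).

    Ignores metadata, embeddings kept at higher precision, and KV cache,
    so treat the result as a lower-bound estimate.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Q3_K_S at ~2.66 bits per weight for a ~30B-parameter model
print(f"{quantized_size_gb(30e9, 2.66):.2f} GB")  # ~9.98 GB
```

Since Qwen3-30B-A3B is a mixture-of-experts model with only ~3B parameters active per token, the hot working set per step is far smaller than the full file, which helps explain the usable speed despite the tight RAM budget.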
Potato OS
The whole thing is packaged as a flashable headless Debian image called Potato OS. On first boot it automatically downloads Qwen3.5 2B with a vision encoder. Through the web interface, you can select a different model, paste a Hugging Face URL, or upload a model file over the LAN. The system exposes an OpenAI-compatible API on the local network.
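Because the server speaks the OpenAI chat-completions wire format, any OpenAI-style client on the LAN can talk to it. The sketch below only builds the request an OpenAI-compatible endpoint expects; the host, port, and model name are hypothetical placeholders, so substitute whatever the Potato OS web interface actually reports:

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build an OpenAI-style chat-completions request for a local server.

    base_url and model are placeholders for whatever Potato OS serves;
    send the returned body as a POST with Content-Type: application/json.
    """
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return url, json.dumps(payload).encode("utf-8")

# Hypothetical address for a Pi on the local network
url, body = build_chat_request("http://raspberrypi.local:8080", "qwen", "Hello!")
print(url)
```

Pointing an existing OpenAI SDK at the same base URL (with a dummy API key) should also work, since that is the point of exposing the OpenAI-compatible surface.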
Considerations
For those evaluating on-premise deployments, there are trade-offs among performance, cost, and data control. AI-RADAR offers analytical frameworks at /llm-onpremise to evaluate these aspects.