An LLM enthusiast described how they simplified their homelab, consolidating from three running models down to one, all on a Ryzen AI MAX+ system with 128GB of RAM and 96GB of shared VRAM via Vulkan/RADV.

Previous Setup

The original configuration included three distinct models:

  • GLM-4.7-Flash (30B MoE, 3B active): for daily tasks such as email management.
  • Qwen3.5-35B-A3B (35B MoE, 3B active): for reasoning and coding.
  • Qwen3-VL-8B (8B dense): for computer vision and video stream management.

The move to a single model was motivated by the complexity of routing requests between the three models.
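
The routing overhead the author describes can be pictured as a dispatch table matching each task type to one of the three model endpoints. A minimal sketch of that pattern (the ports and task categories are illustrative assumptions, not details from the post):

```python
# Hypothetical task-to-model routing table for the original three-model setup.
# Endpoint ports and task categories are illustrative assumptions.
ROUTES = {
    "email": "http://localhost:8081/v1",   # GLM-4.7-Flash
    "coding": "http://localhost:8082/v1",  # Qwen3.5-35B-A3B
    "vision": "http://localhost:8083/v1",  # Qwen3-VL-8B
}

def route(task_type: str) -> str:
    """Return the endpoint for a task type, falling back to the general model."""
    return ROUTES.get(task_type, ROUTES["email"])

print(route("coding"))  # http://localhost:8082/v1
```

Every new workload means another entry and another judgment call about which model handles it; collapsing to one general model removes this table entirely (vision stays separate, as noted below).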

New Setup

The new configuration is based on a single model:

  • Qwen3.5-122B-A10B UD-IQ3_S (10B active, 44GB).
  • VL-8B kept separate to avoid conflicts with camera management.
  • Nomic-embed for retrieval augmented generation (RAG).
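
The RAG piece boils down to embedding lookup plus cosine-similarity ranking. In the sketch below, the vectors are toy stand-ins for what a Nomic-embed endpoint would return; the helper names, document names, and vectors are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Rank document embeddings by similarity to the query embedding."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy 3-dimensional embeddings standing in for nomic-embed output
# (real nomic-embed vectors have hundreds of dimensions).
docs = {
    "grocery_list": [0.9, 0.1, 0.0],
    "tax_report": [0.1, 0.9, 0.1],
    "camera_log": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.1], docs, k=1))  # ['grocery_list']
```

The retrieved chunks are then prepended to the prompt sent to the main model, which is the standard RAG flow.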

This setup powers email classification, a food-management app, financial dashboards, camera-based person detection, Open WebUI, SearXNG, OpenCode, and an OpenClaw agent.

Surprising Findings

  • IQ3 quantization matched Q4_K_M quality at roughly half the VRAM and with higher speed.
  • MoE models proved more efficient than dense models, especially in concurrent scenarios.
  • Unsloth Dynamic quants work fine on Strix Halo.