An LLM enthusiast described how they simplified their homelab, consolidating from three running models down to one, all on a Ryzen AI MAX+ system with 128GB of RAM and 96GB of shared VRAM via Vulkan/RADV.

Previous Setup

The original configuration included three distinct models:

  • GLM-4.7-Flash (30B MoE, 3B active): for daily tasks such as email management.
  • Qwen3.5-35B-A3B (35B MoE, 3B active): for reasoning and coding.
  • Qwen3-VL-8B (8B dense): for computer vision and video stream management.

The move to a single model was motivated by the complexity of routing requests between the three models.
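
The routing overhead the author describes can be pictured as a dispatch table matching each task type to one of the three model endpoints. A minimal sketch of that pattern (the ports and task categories are illustrative assumptions, not details from the post):

```python
# Hypothetical task-to-model routing table for the original three-model setup.
# Endpoint ports and task categories are illustrative assumptions.
ROUTES = {
    "email": "http://localhost:8081/v1",   # GLM-4.7-Flash
    "coding": "http://localhost:8082/v1",  # Qwen3.5-35B-A3B
    "vision": "http://localhost:8083/v1",  # Qwen3-VL-8B
}

def route(task_type: str) -> str:
    """Return the endpoint for a task type, falling back to the general model."""
    return ROUTES.get(task_type, ROUTES["email"])

print(route("coding"))  # http://localhost:8082/v1
```

Every new workload means another entry and another judgment call about which model handles it; collapsing to one general model removes this table entirely (vision stays separate, as noted below).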

New Setup

The new configuration is based on a single model:

  • Qwen3.5-122B-A10B UD-IQ3_S (10B active, 44GB).
  • VL-8B kept separate to avoid conflicts with camera management.
  • Nomic-embed for retrieval augmented generation (RAG).
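
The RAG piece boils down to embedding lookup plus cosine-similarity ranking. In the sketch below, the vectors are toy stand-ins for what a Nomic-embed endpoint would return; the helper names, document names, and vectors are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Rank document embeddings by similarity to the query embedding."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy 3-dimensional embeddings standing in for nomic-embed output
# (real nomic-embed vectors have hundreds of dimensions).
docs = {
    "grocery_list": [0.9, 0.1, 0.0],
    "tax_report": [0.1, 0.9, 0.1],
    "camera_log": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.1], docs, k=1))  # ['grocery_list']
```

The retrieved chunks are then prepended to the prompt sent to the main model, which is the standard RAG flow.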

This setup powers email classification, a food-management app, financial dashboards, camera-based person detection, Open WebUI, SearXNG, OpenCode, and an OpenClaw agent.

Surprising Findings

  • IQ3 quantization matched Q4_K_M quality at roughly half the VRAM and with higher speed.
  • MoE models proved more efficient than dense models, especially in concurrent scenarios.
  • Unsloth Dynamic quants work fine on Strix Halo.