Qwen3.5 122B: Optimization on Consumer Hardware
A user tested the Qwen3.5 122B A10B model, quantized with Unsloth, on a system consisting of an RTX 4090, an RTX 3090, an Intel i7-13700K, and 128 GB of DDR5-5600 RAM. The goal was stable performance after difficulties encountered with previous quantizations.
Configuration and Performance
The user found that manual tensor placement via explicit parameters (--n-cpu-moe 33 -ts 4,1 -c 32000) offers superior performance compared to the automatic --fit flag. Prompt processing speed rose from 30.8 tokens/s to 143.4 tokens/s (roughly 4.7x), while generation improved from 9.1 tokens/s to 18.6 tokens/s (roughly 2x). The manual configuration also reportedly degrades less as the context grows.
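Assuming the model was launched through llama.cpp's llama-server (the binary choice, model path, and quant filename below are illustrative placeholders; the flags are the ones reported above), the manual placement might look like:

```shell
# Sketch of a llama-server launch with the reported flags.
# Model path/filename is hypothetical; adjust to the actual Unsloth GGUF.
llama-server \
  --model ./Qwen3.5-122B-A10B.gguf \
  --n-cpu-moe 33 \
  -ts 4,1 \
  -c 32000
# --n-cpu-moe 33 : keep the MoE expert tensors of 33 layers in CPU RAM
# -ts 4,1        : split the remaining GPU layers ~4:1 across the two GPUs
#                  (here, the RTX 4090 and the RTX 3090)
# -c 32000       : 32k-token context window
```

Keeping only the large expert tensors on the CPU while dense layers stay on the GPUs is the usual rationale for --n-cpu-moe on MoE models: the experts dominate memory but only a fraction are active per token.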
BF16 Cache and Repeat Penalty
Using a BF16 KV cache (--cache-type-k bf16 --cache-type-v bf16) improved the model's reasoning quality, avoiding the logical loops encountered with the default FP16 configuration. Furthermore, repetition-penalty settings (--presence-penalty 1.1 --repeat-penalty 1.05 --repeat-last-n 512) proved necessary to prevent repeated patterns in the generated text, a behavior not observed in other models the user tested.
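In practice these cache and sampling settings would be appended to the same launch command; a sketch (binary and model path are again placeholders, flags are those from the post):

```shell
# Cache and repetition-penalty flags reported in the post,
# added to a hypothetical llama-server launch.
llama-server \
  --model ./Qwen3.5-122B-A10B.gguf \
  --cache-type-k bf16 \
  --cache-type-v bf16 \
  --presence-penalty 1.1 \
  --repeat-penalty 1.05 \
  --repeat-last-n 512
# --cache-type-k/-v bf16 : store the KV cache in BF16 instead of the default FP16
# --repeat-last-n 512    : penalties look back over the last 512 generated tokens
```

BF16 keeps FP16's memory footprint while trading mantissa precision for FP32's exponent range, which is one plausible reason it behaved differently from FP16 here.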
Final Impressions
Despite the gains from these optimizations, the user still considers Qwen3.5 122B A10B too slow for effective agentic use, preferring alternatives such as Minimax M2.5 IQ4_NL for both reasoning capability and speed. The user speculates that llama.cpp may not yet be fully optimized for this specific model.