Qwen 3.5 35B MoE: Performance on RTX 5060 Ti

A user reported impressive performance results for the Qwen 3.5 35B MoE language model running on an NVIDIA GeForce RTX 5060 Ti graphics card with 16GB of VRAM. The test used a prompt of roughly 100,000 tokens.

Configuration Details

  • Model: Qwen 3.5 35B MoE
  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
  • CPU: AMD Ryzen 7 9700X
  • Backend: CUDA and Vulkan
  • Context Length: 131,072 tokens configured; ~100,000-token test prompt

Results

Both backends generated at roughly 40 tokens per second (tps): CUDA achieved 44.32 tps, while Vulkan reached 41.35 tps. Prompt processing (prefill) of a 99,961-token prompt ran at 1154.31 tps.
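The reported prefill throughput and prompt length imply how long the user waited before generation started. A minimal sketch of that arithmetic, using the article's figures (the helper name is ours):

```python
# Figures from the article: 99,961-token prompt, 1154.31 tps prefill.
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Wall-clock time to process the prompt at the measured rate."""
    return prompt_tokens / prefill_tps

elapsed = prefill_seconds(99_961, 1154.31)
print(f"{elapsed:.1f} s")  # roughly 86.6 seconds of prefill before the first generated token
```

In other words, at these rates a full ~100k-token prompt costs under a minute and a half of prefill on this card.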

llama.cpp command used

llama-server.exe -m "/Qwen3.5-35B-A3B-MXFP4_MOE.gguf" --port 6789 --ctx-size 131072 -n 32768 --flash-attn on -ngl 40 --n-cpu-moe 24 -b 2048 -ub 2048 -t 8 --kv-offload --cont-batching --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1.0
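The same invocation can be written as an argument list, for example to launch the server from a Python script. This is a sketch: the values are copied from the command above, and the flag comments are our reading of them, not documentation from the report.

```python
import subprocess  # only needed if you actually launch the server

# The article's llama-server invocation as an argument list.
# Comments are our interpretation of each flag; values are copied verbatim.
args = [
    "llama-server", "-m", "/Qwen3.5-35B-A3B-MXFP4_MOE.gguf",
    "--port", "6789",
    "--ctx-size", "131072",       # context window the server allocates
    "-n", "32768",                # generation cap per request
    "--flash-attn", "on",
    "-ngl", "40",                 # transformer layers offloaded to the GPU
    "--n-cpu-moe", "24",          # MoE expert weights for 24 layers kept on the CPU
    "-b", "2048", "-ub", "2048",  # logical / physical batch size
    "-t", "8",                    # CPU threads, matching the 8-core 9700X
    "--kv-offload",               # copied verbatim from the reported command
    "--cont-batching",            # continuous batching for concurrent requests
    "--temp", "1.0", "--top-p", "0.95", "--top-k", "20",
    "--min-p", "0.0", "--presence-penalty", "1.5", "--repeat-penalty", "1.0",
]
# subprocess.run(args)  # uncomment to launch
```

The `-ngl 40` / `--n-cpu-moe 24` split is what makes the 16GB card workable: dense layers go to VRAM while most of the large but sparsely activated expert weights stay in system RAM.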

These results suggest that large language model inference is becoming increasingly accessible on consumer hardware. For teams evaluating on-premise deployments, the trade-offs still need careful weighing; AI-RADAR offers analytical frameworks for that evaluation at /llm-onpremise.