A Reddit user shared their experience with the Qwen3.5-35B-A3B-UD-Q6_K_XL model, expressing enthusiasm for its performance in real-world usage scenarios.
Performance and speed
In the benchmarks run, the model reached 1504 tokens per second on prompt processing with a 2048-token prompt (pp2048) and 47.71 tokens per second when generating 256 tokens (tg256). Generation speed was highest when the model ran entirely on a single GPU, reaching 80 tokens per second.
Testing on real projects
The user tested the model on several projects, using Git worktrees to try out specific changes and features in isolated checkouts. The results were positive: most issues were solvable with minimal tweaks or follow-up prompts.
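The worktree-based workflow above can be sketched as follows. The repository and branch names here are illustrative, not from the original post; the point is that each experiment gets its own checkout while sharing one object store.

```shell
# Create a scratch repository to demonstrate (names are hypothetical)
git init demo
git -C demo config user.email "dev@example.com"
git -C demo config user.name "Dev"
git -C demo commit --allow-empty -m "init"

# Add an isolated worktree on a new branch for one experiment;
# the path is resolved relative to the repo, so demo-feature
# lands next to demo/
git -C demo worktree add ../demo-feature -b feature-x

# Each worktree is a full working copy; list them all
git -C demo worktree list
```

Changes made by the model in `demo-feature` can then be reviewed and merged or discarded without disturbing the main checkout.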
Hybrid model and hardware considerations
This experience led the user to consider a hybrid setup: APIs for state-of-the-art models to generate specs, and local models to execute the work. Weighing the subscription costs of cloud services against the steady improvement of local models, the user is also considering purchasing an RTX 6000 Pro.
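The hybrid split described above can be sketched as a trivial router. The backend names and the task taxonomy are illustrative assumptions, not details from the original post:

```python
# Minimal sketch of the hybrid idea: planning-heavy tasks go to a
# frontier API model, routine execution goes to a local model.
# Backend labels and task categories are hypothetical.
def pick_backend(task: str) -> str:
    """Return which model backend should handle a given task type."""
    spec_tasks = {"spec", "architecture", "review"}  # assumed taxonomy
    return "cloud-api" if task in spec_tasks else "local-qwen"
```

For example, `pick_backend("spec")` would route to the paid API, while `pick_backend("implement")` stays on the local model, keeping the bulk of token volume off metered services.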
Conclusions
The user expressed great satisfaction with the performance of Qwen3.5, emphasizing its potential for use in production environments and the possibility of reducing reliance on paid cloud services.