Optimizing Qwen3.5 for Local Inference
A community user shared the sampling parameters they use with the Qwen3.5 model, hoping to find an optimal configuration for local inference. The discussion focuses on general conversation tasks, not programming-related use cases.
Parameters and Configuration
The specified parameters include:
- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Min-p: 0.00
- Presence penalty: 1.5
- Repeat penalty: 1.0
- Reasoning-budget: 1000
- Reasoning-budget-message: "... reasoning budget exceeded, need to answer.\n"
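To see how the sampling settings above interact, here is a minimal sketch in plain Python. It is not llama.cpp's actual implementation; the ordering of the samplers and the toy logits are assumptions for illustration. It applies the presence penalty, temperature scaling, then top-k, top-p, and min-p filtering to a toy logit vector:

```python
import math

def sample_filter(logits, temperature=0.7, top_k=20, top_p=0.8,
                  min_p=0.0, presence_penalty=1.5, seen_tokens=()):
    """Return (token, prob) candidates after applying the listed samplers.

    Illustrative only: llama.cpp's real sampler chain is configurable and
    may apply these steps in a different order.
    """
    # Presence penalty: flat penalty on any token already generated.
    logits = [l - presence_penalty if i in seen_tokens else l
              for i, l in enumerate(logits)]
    # Temperature scaling (lower temperature sharpens the distribution).
    logits = [l / temperature for l in logits]
    # Softmax to probabilities.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = sorted(((i, e / z) for i, e in enumerate(exps)),
                   key=lambda p: p[1], reverse=True)
    # Top-k: keep only the k most probable tokens.
    probs = probs[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Min-p: drop tokens below min_p * (highest kept prob); a no-op at 0.0.
    cutoff = min_p * kept[0][1]
    kept = [(t, p) for t, p in kept if p >= cutoff]
    # Renormalise the surviving candidates.
    z = sum(p for _, p in kept)
    return [(t, p / z) for t, p in kept]
```

With these defaults a strongly peaked distribution collapses to one or two candidates, which is why a presence penalty of 1.5 can noticeably reshape repeated tokens.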
The user employs Q4_K_M quantization and the llama.cpp v8400 inference engine. Even with this configuration, they find that the model tends to "think too much", slowing down responses.
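Assuming the user runs llama.cpp's llama-cli, the parameter list maps onto command-line flags roughly as follows. The model filename is a placeholder, and the reasoning-budget settings are omitted because how (and whether) they are exposed depends on the llama.cpp build and frontend:

```shell
# Sketch: passing the listed sampling parameters to llama-cli.
# Model path is a placeholder; reasoning-budget options are
# build/frontend-specific and not shown here.
./llama-cli -m qwen3.5-Q4_K_M.gguf \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0
```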