Comparison between Qwen3.5 27B and Devstral Small 2

A user compared two large language models (LLMs), Qwen3.5 27B and Devstral Small 2, on practical development tasks. The goal was to determine which model is better suited to development work involving Next.js and Solidity.

Test Setup and Methodology

The tests were performed on a workstation equipped with:

  • Ryzen 9 9950X processor
  • 96GB of DDR5 RAM at 6000 MHz
  • RTX 5090 GPU
  • Fedora 43 operating system

llama.cpp (build b8149) was used in a Docker container with CUDA 13.1.0. The models were quantized to Q6_K (Qwen3.5 27B) and IQ4_XS (Devstral Small 2).

The tests consisted of 78 agentic tasks (39 Next.js and 39 Hardhat). Each task was executed in a fresh session so that context compression from earlier tasks could not affect later ones.
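The per-task isolation described above can be sketched as a simple harness loop. This is a minimal illustration, not the user's actual tooling: `run_agentic_task` and the task names are hypothetical stand-ins for a real agent loop against a llama.cpp server.

```python
# Hypothetical harness: each task gets a brand-new session (empty history),
# so earlier tasks can never trigger context compression in later ones.

def run_agentic_task(task: str, history: list) -> dict:
    # Stand-in for a real agent loop; a real version would call the model.
    history.append({"role": "user", "content": task})
    return {"task": task, "turns": len(history)}

def run_suite(tasks: list) -> list:
    results = []
    for task in tasks:
        history = []  # fresh session per task: no carried-over context
        results.append(run_agentic_task(task, history))
    return results

results = run_suite(["nextjs-task-01", "hardhat-task-01"])
assert all(r["turns"] == 1 for r in results)  # every session starts empty
```

The key point is that `history` is recreated inside the loop; reusing one session across 78 tasks would eventually force the context window to be compressed, changing what the model sees.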

The scoring system evaluated:

  • Correctness (60 points): the patch completely resolves the task.
  • Compatibility (20 points): the patch preserves the required integrations.
  • Scope Discipline (20 points): the model modifies only the relevant files.
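The 60/20/20 rubric above can be expressed as a small scoring function. The boolean inputs are an assumption about how each criterion was judged; the source does not describe the grading mechanics (e.g. whether partial credit was possible).

```python
# Hypothetical scorer for the 60/20/20 rubric; assumes pass/fail judging
# of each criterion, with no partial credit.
def score_task(correct: bool, compatible: bool, in_scope: bool) -> int:
    score = 0
    if correct:
        score += 60   # Correctness: the patch completely resolves the task
    if compatible:
        score += 20   # Compatibility: required integrations preserved
    if in_scope:
        score += 20   # Scope discipline: only relevant files modified
    return score

# A patch that resolves the task but also touches unrelated files:
assert score_task(True, True, False) == 80
```

Under this rubric a task can score at most 100 points, so the 53.00 and 40.49 per-task averages reported below are directly comparable as percentages of the maximum.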

Results

  • Qwen3.5-27B.i1-Q6_K.gguf:
    • Total score: 4134
    • Average score per task: 53.00
    • Tasks passed: 48/78 (61.54%)
    • Prompt processing speed: 1326.80 tok/s (average), 1596.20 tok/s (token-weighted)
    • Token generation speed: 45.24 tok/s (average), 45.03 tok/s (token-weighted)
  • Devstral-Small-2-24B-Instruct-2512-IQ4_XS-4.04bpw.gguf:
    • Total score: 3158
    • Average score per task: 40.49
    • Tasks passed: 33/78 (42.31%)
    • Prompt processing speed: 2777.02 tok/s (average), 4200.64 tok/s (token-weighted)
    • Token generation speed: 90.49 tok/s (average), 89.31 tok/s (token-weighted)
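The derived figures above follow directly from the reported totals, as a quick check confirms. The `token_weighted` helper is an assumption about what "token-weighted" means here: each task's tok/s weighted by the number of tokens that task processed, so long-context tasks count more than short ones.

```python
# Recompute the derived figures from the reported totals.
TASKS = 78

qwen_total, qwen_passed = 4134, 48
devstral_total, devstral_passed = 3158, 33

assert round(qwen_total / TASKS, 2) == 53.00
assert round(devstral_total / TASKS, 2) == 40.49
assert round(100 * qwen_passed / TASKS, 2) == 61.54
assert round(100 * devstral_passed / TASKS, 2) == 42.31

# Assumed definition of the "token-weighted" speed: weight each task's
# tok/s by that task's token count, unlike the plain per-task mean.
def token_weighted(speeds, tokens):
    return sum(s * t for s, t in zip(speeds, tokens)) / sum(tokens)
```

This distinction explains why the two averages diverge most for prompt processing (2777.02 vs 4200.64 tok/s for Devstral): prompt-heavy tasks, which dominate the token-weighted figure, are processed faster than the per-task mean suggests.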

Qwen3.5 27B performed better on the Hardhat/Solidity tasks, while Devstral Small 2 was stronger on the Next.js tasks. Devstral Small 2 was also substantially faster, roughly doubling Qwen3.5 27B's prompt processing and token generation speeds.