MiniMax has introduced M2.7, its latest model version, and evaluated it on benchmarks focused on autonomous coding.

Benchmark Results

M2.7 was evaluated using two main benchmarks:

  • PinchBench: This test focuses on standardized OpenClaw agent tasks. M2.7 scored 86.2%, placing fifth overall, close to models such as GLM-5 and GPT-5.4.
  • Kilo Bench: This benchmark, composed of 89 tasks, evaluates autonomous coding capabilities across varied domains, from Git operations to cryptanalysis. M2.7 passed 47% of the tasks and showed a distinctive behavioral profile.

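As a quick sanity check, the reported pass rate can be converted back into an absolute task count (a minimal sketch; the 89-task total and 47% rate come from the article, the rounding choice is ours):

```python
# Convert Kilo Bench's reported pass rate back into an approximate task count.
TOTAL_TASKS = 89   # size of the Kilo Bench suite (from the article)
PASS_RATE = 0.47   # M2.7's reported pass rate

passed = round(PASS_RATE * TOTAL_TASKS)
print(f"M2.7 passed roughly {passed} of {TOTAL_TASKS} tasks")  # → roughly 42 of 89
```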
A more in-depth analysis of the Kilo Bench results revealed that M2.7 tends to examine the context extensively before intervening, analyzing dependencies and tracing call chains. This approach pays off in tasks that require thorough understanding, but it can lead to timeouts under tight time limits. Notably, each model tested solved some tasks that no other model did, highlighting the complementarity of different architectures.
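That complementarity observation can be made concrete with simple set arithmetic over the task IDs each model solved (the model names other than M2.7 and all task IDs below are purely hypothetical; real Kilo Bench results are not listed in the article):

```python
# Hypothetical solved-task sets for three models (illustrative data only).
solved = {
    "M2.7":    {1, 2, 5, 8, 13},
    "model_a": {1, 2, 3, 8},
    "model_b": {2, 5, 9},
}

# A task is "uniquely solved" by a model if no other model solved it.
unique_solves = {}
for name, tasks in solved.items():
    others = set().union(*(t for n, t in solved.items() if n != name))
    unique_solves[name] = tasks - others

for name, unique in unique_solves.items():
    print(f"{name} uniquely solved: {sorted(unique)}")
```

Each model contributing a non-empty unique set is exactly the complementarity the benchmark analysis points to: an ensemble of agents covers more tasks than any single one.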

Token Efficiency and Costs

Compared to other available models, M2.7 stands out for its lower cost ($0.30 per million input tokens and $1.20 per million output tokens) while offering competitive performance in certain scenarios. However, its tendency toward deeper context exploration translates into longer execution times than its predecessors.
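Under the quoted pricing, the dollar cost of a run is a straightforward function of token counts (a sketch; the prices come from the article, while the 2M-input / 0.5M-output example volumes are invented for illustration):

```python
# Per-million-token prices quoted for M2.7 in the article.
INPUT_PRICE = 0.30   # USD per 1M input tokens
OUTPUT_PRICE = 1.20  # USD per 1M output tokens

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single run at M2.7's quoted rates."""
    return (input_tokens / 1e6) * INPUT_PRICE + (output_tokens / 1e6) * OUTPUT_PRICE

# Example: a hypothetical agent run consuming 2M input and 0.5M output tokens.
print(f"${run_cost(2_000_000, 500_000):.2f}")  # → $1.20
```

Note that the longer, exploration-heavy trajectories described above inflate the input-token term in particular, so the low per-token price does not automatically mean the lowest per-task cost.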

For those evaluating on-premise deployments, the trade-offs deserve careful consideration. AI-RADAR offers analytical frameworks at /llm-onpremise to evaluate these aspects.