MT-Bench (Zheng et al., 2023, from LMSYS) consists of 80 multi-turn conversations (2 turns each) across eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM and humanities. A GPT-4 judge scores each response on a 1–10 scale. The accompanying paper introduced the LLM-as-judge paradigm, which has since become the dominant scalable methodology for evaluating open-ended model output.
Categories
Writing
Essays, stories, arguments. Tests fluency, structure and creativity.
Roleplay
Persona adherence, improvisation, and maintaining character across turns.
Reasoning
Puzzles, logic problems, step-by-step analysis.
Math
Word problems and algebraic reasoning in conversational context.
Coding
Code generation and debugging across Python, JavaScript, SQL.
STEM Knowledge
Scientific explanation and analysis across physics, chemistry, biology.
Humanities
History, economics, philosophy and law in a conversational setting.
Extraction
Pulling structured information (entities, values, formats) out of supplied text.
LLM-as-Judge: Strengths and Risks
Using a language model (GPT-4) to rate language model outputs enables fast, cheap evaluation of open-ended text that previously required costly human annotation. Known limitations: position bias (the first answer tends to be rated higher), verbosity bias (longer answers score better), and self-enhancement bias (a GPT-4 judge favours GPT-4-style output). When publishing results, these biases should be controlled with randomised answer ordering and multi-judge voting.
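The standard mitigation for position bias in pairwise judging can be sketched as follows: query the judge twice with the answer order swapped, and only award a win when both orderings agree. The `judge` callable and the `"first"`/`"second"`/`"tie"` verdict labels are hypothetical stand-ins for a real judge-model call.

```python
def debiased_pairwise(judge, question, ans1, ans2):
    """Mitigate position bias by judging both answer orderings.

    `judge(question, first, second)` is a hypothetical callable that
    returns "first", "second", or "tie" depending on which position's
    answer it prefers. A model only wins if it wins in BOTH orderings;
    any disagreement between the two runs counts as a tie.
    """
    v1 = judge(question, ans1, ans2)  # ans1 shown in first position
    v2 = judge(question, ans2, ans1)  # ans2 shown in first position
    if v1 == "first" and v2 == "second":
        return "model1"
    if v1 == "second" and v2 == "first":
        return "model2"
    return "tie"
```

A judge that always prefers whichever answer appears first produces contradictory verdicts across the two runs, so its bias collapses into a tie rather than a spurious win.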
Scores (avg. across turns)
| Model | MT-Bench Score (/10) |
|---|---|
| GPT-4 (original) | 8.99 |
| Claude 3.5 Sonnet | 9.18 |
| GPT-4o | 9.32 |
| Llama 3 70B Chat | 8.68 |
| Mixtral 8×7B Instruct | 8.30 |
| Llama 2 70B Chat | 6.86 |