MT-Bench

Benchmark

Multi-Turn Benchmark — 80 challenging multi-turn conversations across 8 categories, scored by GPT-4 as judge. Introduced the LLM-as-judge paradigm, enabling scalable open-ended evaluation.

MT-Bench (Zheng et al., LMSYS 2023) consists of 80 multi-turn conversations (2 turns each) across eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM knowledge, and humanities/social science. A GPT-4 judge scores each response on a scale of 1 to 10. The paper introduced the LLM-as-judge paradigm, which has since become the dominant approach to scalable evaluation of open-ended model outputs.
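
Since the grading step is just a prompted model call, single-answer grading is easy to sketch. Below is a minimal example using the OpenAI Python client; the prompt wording and the rating-extraction regex follow the general shape of the MT-Bench judge template but are illustrative, not the exact FastChat implementation.

```python
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt in the spirit of the MT-Bench single-answer template.
JUDGE_TEMPLATE = """Please act as an impartial judge and evaluate the quality of
the response provided by an AI assistant to the user question displayed below.
After your explanation, output your rating strictly in this format: "Rating: [[5]]".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]"""


def judge_single_answer(question: str, answer: str, judge_model: str = "gpt-4") -> float | None:
    """Return the judge's 1-10 rating for one answer, or None if unparseable."""
    completion = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(question=question, answer=answer)}],
    )
    verdict = completion.choices[0].message.content
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None
```

In the full benchmark a call like this runs once per turn, so a model's score averages over 160 judged responses (80 conversations, 2 turns each).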

Categories

Writing

Essays, stories, arguments. Tests fluency, structure and creativity.

Roleplay

Persona adherence, improvisation, and maintaining character across turns.

Reasoning

Puzzles, logic problems, step-by-step analysis.

Math

Word problems and algebraic reasoning in conversational context.

Coding

Code generation and debugging across Python, JavaScript, SQL.

STEM Knowledge

Scientific explanation and analysis across physics, chemistry, biology.

Humanities Knowledge

History, social science, and philosophy questions testing breadth of knowledge.

Extraction

Pulling requested information (names, numbers, structured fields) out of a provided text.

LLM-as-Judge: Strengths and Risks

Using a language model (GPT-4) to rate language-model outputs makes evaluation of open-ended text fast and cheap, where it previously required costly human annotation. It also introduces known biases: position bias (in pairwise comparison, the answer shown first tends to be rated higher), verbosity bias (longer answers tend to score better regardless of quality), and self-enhancement bias (a GPT-4 judge tends to favour GPT-4-style outputs). Published results should control for these, for example by randomising answer order and aggregating verdicts across multiple judges.
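
The position-bias control is mechanical enough to sketch: judge each pair twice with the answer order swapped, and accept a verdict only when both passes agree, treating any disagreement as a tie. This mirrors the swap-and-compare procedure described in the MT-Bench paper; judge_pair below stands for any pairwise judging function, such as a GPT-4 call like the one sketched earlier.

```python
from typing import Callable, Literal

Verdict = Literal["first", "second", "tie"]


def debiased_pairwise_verdict(
    question: str,
    answer_a: str,
    answer_b: str,
    judge_pair: Callable[[str, str, str], Verdict],
) -> str:
    """Judge a pair in both presentation orders to control for position bias."""
    forward = judge_pair(question, answer_a, answer_b)   # A is shown first
    backward = judge_pair(question, answer_b, answer_a)  # B is shown first

    if forward == "first" and backward == "second":
        return "A"  # A wins in both orders: a position-robust win
    if forward == "second" and backward == "first":
        return "B"  # B wins in both orders
    return "tie"    # any inconsistency across orders is collapsed to a tie
```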

Scores (avg. across turns)

Model                   MT-Bench Score (/10)
GPT-4 (original)        8.99
Claude 3.5 Sonnet       9.18
GPT-4o                  9.32
Llama 3 70B Chat        8.68
Mixtral 8×7B Instruct   8.30
Llama 2 70B Chat        6.86