MT-Bench (Zheng et al., 2023, from LMSYS) consists of 80 multi-turn conversations (2 turns each) across eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM and humanities. A GPT-4 judge scores each response on a 1–10 scale. The accompanying paper introduced the LLM-as-judge paradigm, which has since become the dominant scalable methodology for evaluating open-ended model output.
Categories
Writing
Essays, stories, arguments. Tests fluency, structure and creativity.
Roleplay
Persona adherence, improvisation, and maintaining character across turns.
Reasoning
Puzzles, logic problems, step-by-step analysis.
Math
Word problems and algebraic reasoning in conversational context.
Coding
Code generation and debugging across Python, JavaScript, SQL.
STEM Knowledge
Scientific explanation and analysis across physics, chemistry, biology.
Humanities
History, economics, philosophy and law in a conversational setting.
Extraction
Pulling structured information (entities, values, formats) out of supplied text.
LLM-as-Judge: Strengths and Risks
Using a language model (GPT-4) to rate language model outputs enables fast, cheap evaluation of open-ended text that previously required costly human annotation. Known limitations: position bias (the first answer tends to be rated higher), verbosity bias (longer answers score better), and self-enhancement bias (a GPT-4 judge favours GPT-4-style output). When publishing results, these biases should be controlled with randomised answer ordering and multi-judge voting.
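The standard mitigation for position bias in pairwise judging can be sketched as follows: query the judge twice with the answer order swapped, and only award a win when both orderings agree. The `judge` callable and the `"first"`/`"second"`/`"tie"` verdict labels are hypothetical stand-ins for a real judge-model call.

```python
def debiased_pairwise(judge, question, ans1, ans2):
    """Mitigate position bias by judging both answer orderings.

    `judge(question, first, second)` is a hypothetical callable that
    returns "first", "second", or "tie" depending on which position's
    answer it prefers. A model only wins if it wins in BOTH orderings;
    any disagreement between the two runs counts as a tie.
    """
    v1 = judge(question, ans1, ans2)  # ans1 shown in first position
    v2 = judge(question, ans2, ans1)  # ans2 shown in first position
    if v1 == "first" and v2 == "second":
        return "model1"
    if v1 == "second" and v2 == "first":
        return "model2"
    return "tie"
```

A judge that always prefers whichever answer appears first produces contradictory verdicts across the two runs, so its bias collapses into a tie rather than a spurious win.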
Scores (avg. across turns)
| Model | MT-Bench Score (/10) |
|---|---|
| GPT-4 (original) | 8.99 |
| Claude 3.5 Sonnet | 9.18 |
| GPT-4o | 9.32 |
| Llama 3 70B Chat | 8.68 |
| Mixtral 8×7B Instruct | 8.30 |
| Llama 2 70B Chat | 6.86 |