LiveCodeBench (Jain et al., 2024) addresses a fundamental flaw in static code benchmarks: training contamination. Because LeetCode and Codeforces problems are widely distributed online, models trained after their release can memorise solutions. LiveCodeBench continuously collects new problems released only after each model's training cutoff, ensuring evaluation reflects genuine coding ability.
## Design Philosophy
| Property | Detail |
|---|---|
| Sources | LeetCode (contest), Codeforces, AtCoder |
| Update frequency | Monthly (new contests added continuously) |
| Contamination prevention | For each model, only problems released after its training cutoff are evaluated |
| Difficulty range | Easy / Medium / Hard (per platform ratings) |
| Metric | pass@1 (execution against hidden test cases) |
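The pass@1 metric in the table above can be computed with the standard unbiased pass@k estimator (Chen et al., 2021); for k = 1 it reduces to the fraction of sampled solutions that pass all hidden tests. A minimal sketch (the function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n -- number of solution samples generated for a problem
    c -- number of those samples that pass all hidden test cases
    """
    if n - c < k:
        return 1.0  # too few failing samples to fill a k-subset; guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 passing, pass@1 is simply c/n = 0.3
print(round(pass_at_k(10, 3, 1), 6))
```

Averaging this per-problem value over the benchmark gives the headline pass@1 score.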
## Why Contamination Matters
A model trained on data up to December 2024 may have seen solutions to every LeetCode problem published before that date, making its HumanEval and static LeetCode performance misleadingly high. LiveCodeBench's rolling-window approach makes it one of the most trustworthy coding benchmarks available.
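The rolling-window idea can be sketched as a simple date filter: each problem carries its contest release date, and a model is evaluated only on problems released strictly after its training cutoff. The record fields and problem IDs below are illustrative, not LiveCodeBench's actual schema:

```python
from datetime import date

# Hypothetical problem records; in practice each LiveCodeBench problem is
# tagged with the date of the contest it appeared in.
problems = [
    {"id": "lc-weekly-412", "released": date(2024, 9, 15)},
    {"id": "cf-round-995",  "released": date(2025, 1, 4)},
    {"id": "atcoder-abc390", "released": date(2025, 3, 22)},
]

def eval_window(problems: list[dict], model_cutoff: date) -> list[dict]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > model_cutoff]

clean = eval_window(problems, model_cutoff=date(2024, 12, 31))
print([p["id"] for p in clean])  # the September 2024 problem is excluded
```

Because the window is per-model, two models with different cutoffs are scored on different (but individually uncontaminated) problem subsets.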
## Scores (pass@1, Hard subset)
| Model | Hard pass@1 |
|---|---|
| o3 (2025) | 69.8% |
| Claude 3.7 Sonnet | 56.1% |
| DeepSeek-R1 | 52.7% |
| GPT-4o | 35.1% |
| Llama 3.1 70B | 22.4% |