LiveCodeBench

Code-generation benchmark built from problems released after model training cutoffs, preventing contamination. Continuously updated with new problems from LeetCode, Codeforces, and AtCoder.

LiveCodeBench (Jain et al., 2024) addresses a fundamental flaw in static code benchmarks: training contamination. Because LeetCode and Codeforces problems are widely distributed online, models trained after their release can memorise solutions. LiveCodeBench continuously collects new problems released only after each model's training cutoff, ensuring evaluation reflects genuine coding ability.

Design Philosophy

| Property | Detail |
|---|---|
| Sources | LeetCode (contests), Codeforces, AtCoder |
| Update frequency | Monthly (new contests added continuously) |
| Contamination prevention | Only problems released after each model's training cutoff are used for that model |
| Difficulty range | Easy / Medium / Hard (per platform ratings) |
| Metric | pass@1 (execution against hidden test cases) |
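The pass@1 metric above is typically computed with the unbiased pass@k estimator of Chen et al. (2021): generate n samples per problem, count the c that pass all hidden tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch (the exact sampling setup LiveCodeBench uses may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c passed all tests.

    Returns the expected probability that at least one of k samples
    drawn without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # too few failures for any k-subset to be all-failing
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the pass fraction c / n:
print(pass_at_k(10, 3, 1))  # 0.3
```

For k = 1 the estimator collapses to the plain fraction of passing samples, which is why pass@1 tables like the one below can be read as "share of first attempts that pass".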

Why Contamination Matters

A model trained on data up to December 2024 may have seen solutions to every LeetCode problem published before that date, making its HumanEval and static LeetCode scores misleadingly high. LiveCodeBench's rolling-window approach, which scores each model only on problems newer than its cutoff, makes it one of the most trustworthy coding benchmarks available.
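The rolling-window idea reduces to a simple date filter: each problem carries a release date, and a model is evaluated only on problems released strictly after its training cutoff. A minimal sketch with hypothetical record fields (LiveCodeBench's actual schema and API differ):

```python
from datetime import date

# Illustrative problem records; "id" values and field names are made up.
problems = [
    {"id": "lc-weekly-430-q4", "source": "LeetCode",   "released": date(2025, 2, 8)},
    {"id": "cf-1923B",         "source": "Codeforces", "released": date(2024, 11, 2)},
    {"id": "abc-382-e",        "source": "AtCoder",    "released": date(2025, 1, 11)},
]

def eval_window(problems: list[dict], cutoff: date) -> list[dict]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > cutoff]

# A model with a December 2024 cutoff is scored only on 2025 problems:
subset = eval_window(problems, date(2024, 12, 31))
print([p["id"] for p in subset])  # ['lc-weekly-430-q4', 'abc-382-e']
```

Because the window is per model, two models evaluated on "LiveCodeBench" may be scored on different problem subsets; published comparisons therefore usually fix a shared date range.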

Scores (pass@1, Hard subset)

| Model | Hard pass@1 |
|---|---|
| o3 (2025) | 69.8% |
| Claude 3.7 Sonnet | 56.1% |
| DeepSeek-R1 | 52.7% |
| GPT-4o | 35.1% |
| Llama 3.1 70B | 22.4% |