HumanEval

Benchmark

OpenAI's code generation benchmark: 164 hand-crafted Python programming problems. Given a function signature and docstring, the model must produce a completion that passes all unit tests. The canonical LLM coding benchmark.

HumanEval (Chen et al., OpenAI 2021) introduced the pass@k metric: given k generated code samples per problem, the fraction of problems for which at least one sample passes all unit tests. pass@1 (single-attempt success rate) is the standard comparison metric.
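
As a sketch, the unbiased pass@k estimator from Chen et al. (2021) can be computed per problem from n generated samples and c passing samples; the function name and example numbers below are chosen here for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that pass all unit tests
    k: attempt budget
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    # Numerically stable form of 1 - C(n - c, k) / C(n, k).
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples, 37 passing -> pass@1 = 0.185, pass@10 ~ 0.87
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```

The benchmark-level score is the mean of this estimate over all 164 problems.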

Structure

Task type: Python function completion from a signature and docstring
Problems: 164 hand-crafted
Evaluation: Execution against held-out unit tests
Metrics: pass@1 (primary), pass@10, pass@100
Language: Python only
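
For illustration, a task has roughly the shape sketched below: the model sees only the signature and docstring, its completion is appended, and the harness executes a held-out check. The function and tests here are simplified stand-ins, not verbatim dataset entries.

```python
# Prompt given to the model (signature + docstring only):
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards.

    >>> is_palindrome("level")
    True
    >>> is_palindrome("hello")
    False
    """
    # --- model completion is appended below this line ---
    cleaned = text.lower()
    return cleaned == cleaned[::-1]

# Held-out check executed by the harness (simplified):
def check(candidate):
    assert candidate("level") is True
    assert candidate("hello") is False

check(is_palindrome)
```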

Scores (pass@1)

GPT-4o: 90.2%
Claude 3.5 Sonnet: 92.0%
Llama 3.1 70B: 80.5%
Llama 3.1 8B: 72.6%
Codex (2021 baseline): 28.8%

Why HumanEval Is Now Considered "Too Easy"

Frontier models score 90%+ on HumanEval, making it a poor discriminator at the top. Real-world software tasks require multi-file context, debugging, test writing, and repository-scale reasoning — none of which HumanEval tests. Use SWE-bench or LiveCodeBench for frontier evaluation.

Security Note (On-Premise Evaluation)

Evaluating HumanEval means executing model-generated code. When evaluating local models, run that code in an isolated container with no network access and no filesystem write permissions; never execute generated code directly on your production host.
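
As one possible setup (not part of the official harness), each candidate can be executed in a throwaway Docker container with networking disabled and a read-only root filesystem. The image name, resource limits, and helper name below are illustrative assumptions.

```python
import subprocess

def run_candidate_in_sandbox(code_path: str, timeout_s: int = 10) -> bool:
    """Execute a candidate solution inside an isolated container.

    code_path should be an absolute host path to the assembled
    prompt + completion + test file. The 'python:3.11-slim' image
    and the limits below are placeholder defaults, not official
    harness settings.
    """
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",          # no network access
        "--read-only",                # no filesystem writes
        "--memory", "256m",           # cap memory
        "--pids-limit", "64",         # cap process count
        "-v", f"{code_path}:/task.py:ro",
        "python:3.11-slim",
        "python", "/task.py",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A nonzero exit code or a timeout counts the sample as failing; the per-problem pass counts then feed the pass@k estimator above.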