HumanEval (Chen et al., OpenAI 2021) introduced the pass@k metric: the fraction of problems for which at least one of k generated code samples passes the unit tests. pass@1 (the single-sample success rate) is the standard comparison metric.
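The paper computes an unbiased estimate of pass@k by generating n >= k samples per problem, counting the c that pass, and averaging 1 - C(n-c, k)/C(n, k) over all problems. Below is a minimal sketch of that per-problem estimator; the formula follows the paper, while the function and variable names here are our own:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem pass@k estimate.

    n: total samples generated, c: samples that passed, k: sampling budget.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable product form,
    avoiding large binomial coefficients.
    """
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Benchmark score = mean over problems, e.g. (passes_per_problem is hypothetical):
# score = np.mean([pass_at_k(n=200, c=c_i, k=1) for c_i in passes_per_problem])
```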
Structure
| Property | Detail |
|---|---|
| Task type | Python function completion from docstring |
| Problems | 164 hand-written problems |
| Evaluation | Execution against held-out unit tests (hidden from the prompt, though the dataset itself is public) |
| Metric | pass@1 (primary), pass@10, pass@100 |
| Language | Python only |
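For concreteness, each task gives the model a function signature plus docstring as the prompt, and the harness then runs a check function against the completed body. The example below is illustrative only, written in the spirit of the dataset's format (real problems are stored as prompt, canonical_solution, test, and entry_point fields); it is not an actual HumanEval item:

```python
# Prompt shown to the model (signature + docstring only):
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards.
    >>> is_palindrome("level")
    True
    >>> is_palindrome("python")
    False
    """
    # --- model completion starts here ---
    return s == s[::-1]

# Held-out test harness executed against the completed function:
def check(candidate):
    assert candidate("level") is True
    assert candidate("python") is False
    assert candidate("") is True

check(is_palindrome)
```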
Scores (pass@1)
| Model | pass@1 |
|---|---|
| GPT-4o | 90.2% |
| Claude 3.5 Sonnet | 92.0% |
| Llama 3.1 70B | 80.5% |
| Llama 3.1 8B | 72.6% |
| Codex (2021 baseline) | 28.8% |
Why HumanEval Is Now Considered "Too Easy"
Frontier models score 90%+ on HumanEval, making it a poor discriminator at the top. Real-world software tasks require multi-file context, debugging, test writing, and repository-scale reasoning — none of which HumanEval tests. Use SWE-bench or LiveCodeBench for frontier evaluation.
Security Note (On-Premise Evaluation)
HumanEval runs generated code in a sandbox. When evaluating local models, ensure code execution happens in an isolated container with no network access or filesystem write permissions — never run generated code directly on your production host.
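One way to get that isolation is to write each candidate program to a temporary file and execute it in a throwaway Docker container with networking disabled and a read-only filesystem. This is a sketch, not HumanEval's own harness; the helper name, image choice, and resource limits are placeholder assumptions:

```python
import os
import subprocess
import tempfile

def run_sandboxed(code: str, timeout_s: int = 10) -> bool:
    """Execute untrusted generated code in an isolated container.

    Returns True if the program (including its test assertions) exits cleanly.
    Assumes Docker is installed; 'python:3.11-slim' is an arbitrary image choice.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w") as f:
            f.write(code)
        cmd = [
            "docker", "run", "--rm",
            "--network", "none",        # no network access
            "--read-only",              # read-only root filesystem
            "--memory", "256m",         # cap memory usage
            "--pids-limit", "64",       # cap process count
            "-v", f"{path}:/sandbox/candidate.py:ro",  # mount code read-only
            "python:3.11-slim",
            "python", "/sandbox/candidate.py",
        ]
        try:
            result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False
```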