HumanEval

Benchmark

OpenAI's code generation benchmark: 164 hand-crafted Python programming problems. Given a function signature and docstring, the model must produce a completion that passes all unit tests. The canonical LLM coding benchmark.

HumanEval (Chen et al., OpenAI 2021) introduced the pass@k metric: given k generated code samples per problem, the fraction of problems for which at least one sample passes all unit tests. pass@1 (single-attempt success rate) is the standard comparison metric.
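
As a sketch, the unbiased pass@k estimator from Chen et al. (2021) can be computed per problem from n generated samples and c passing samples; the function name and example numbers below are chosen here for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that pass all unit tests
    k: attempt budget
    """
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    # Numerically stable form of 1 - C(n - c, k) / C(n, k).
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples, 37 passing -> pass@1 = 0.185, pass@10 ~ 0.87
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```

The benchmark-level score is the mean of this estimate over all 164 problems.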

Structure

Task type: Python function completion from a signature and docstring
Problems: 164 hand-crafted
Evaluation: Execution against held-out unit tests
Metrics: pass@1 (primary), pass@10, pass@100
Language: Python only
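
For illustration, a task has roughly the shape sketched below: the model sees only the signature and docstring, its completion is appended, and the harness executes a held-out check. The function and tests here are simplified stand-ins, not verbatim dataset entries.

```python
# Prompt given to the model (signature + docstring only):
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards.

    >>> is_palindrome("level")
    True
    >>> is_palindrome("hello")
    False
    """
    # --- model completion is appended below this line ---
    cleaned = text.lower()
    return cleaned == cleaned[::-1]

# Held-out check executed by the harness (simplified):
def check(candidate):
    assert candidate("level") is True
    assert candidate("hello") is False

check(is_palindrome)
```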

Scores (pass@1)

GPT-4o: 90.2%
Claude 3.5 Sonnet: 92.0%
Llama 3.1 70B: 80.5%
Llama 3.1 8B: 72.6%
Codex (2021 baseline): 28.8%

Why HumanEval Is Now Considered "Too Easy"

Frontier models score 90%+ on HumanEval, making it a poor discriminator at the top. Real-world software tasks require multi-file context, debugging, test writing, and repository-scale reasoning — none of which HumanEval tests. Use SWE-bench or LiveCodeBench for frontier evaluation.

Security Note (On-Premise Evaluation)

Evaluating HumanEval means executing model-generated code. When evaluating local models, run that code in an isolated container with no network access and no filesystem write permissions; never execute generated code directly on your production host.
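
As one possible setup (not part of the official harness), each candidate can be executed in a throwaway Docker container with networking disabled and a read-only root filesystem. The image name, resource limits, and helper name below are illustrative assumptions.

```python
import subprocess

def run_candidate_in_sandbox(code_path: str, timeout_s: int = 10) -> bool:
    """Execute a candidate solution inside an isolated container.

    code_path should be an absolute host path to the assembled
    prompt + completion + test file. The 'python:3.11-slim' image
    and the limits below are placeholder defaults, not official
    harness settings.
    """
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",          # no network access
        "--read-only",                # no filesystem writes
        "--memory", "256m",           # cap memory
        "--pids-limit", "64",         # cap process count
        "-v", f"{code_path}:/task.py:ro",
        "python:3.11-slim",
        "python", "/task.py",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A nonzero exit code or a timeout counts the sample as failing; the per-problem pass counts then feed the pass@k estimator above.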