MBPP (Mostly Basic Programming Problems; Austin et al., Google Research, 2021) contains 974 crowd-sourced Python programming tasks, of which a 500-problem subset serves as the standard test split. Tasks range from simple string operations to basic algorithm implementations. Each task includes a natural-language description and three assert-based test cases used for evaluation, with one test case shown in the prompt to pin down the function signature. MBPP is routinely paired with HumanEval to give a broader picture of code-generation capability.
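To make the format concrete, here is a hypothetical MBPP-style task (illustrative only, not an actual dataset entry): a one-line description, a reference solution, and three assert-based tests.

```python
# Hypothetical MBPP-style task (illustrative, not an actual dataset entry).
# Description: "Write a function to find the maximum of two numbers."

def max_of_two(a, b):
    """Return the larger of a and b."""
    return a if a >= b else b

# The three assert-style test cases used for evaluation.
# In the standard prompt, one such assert is shown to fix the signature.
assert max_of_two(10, 20) == 20
assert max_of_two(5, 5) == 5
assert max_of_two(-1, -7) == -1
```

A generated solution is scored as correct only if all asserts pass.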
MBPP vs HumanEval
| | MBPP | HumanEval |
|---|---|---|
| Origin | Crowd-sourced (diverse) | Hand-crafted by OpenAI |
| Problems | 500+ | 164 |
| Difficulty | Entry–Intermediate | Intermediate |
| Language | Python | Python |
| Format | Description → function | Docstring → function |
| Test cases per problem | 3 (one shown in the prompt) | Variable (hidden) |
MBPP+ (Extended)
MBPP+ (from the EvalPlus suite) is a cleaned and augmented version of MBPP with far denser test coverage (80 tests per problem vs. 3), which reduces the chance that a trivially wrong solution slips through. It is the preferred variant for serious evaluation.
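The motivation is easy to demonstrate with a hypothetical example (not from the dataset): a wrong solution can satisfy three hand-picked tests yet fail as soon as coverage widens.

```python
# Hypothetical example of why 3 tests are too few (not an actual MBPP task).
# Task: "Write a function that returns the sum of the digits of n."

def digit_sum_wrong(n):
    # Wrong: only handles numbers with at most two digits.
    return n // 10 + n % 10

# Passes a small, MBPP-sized 3-test suite:
assert digit_sum_wrong(12) == 3
assert digit_sum_wrong(7) == 7
assert digit_sum_wrong(45) == 9

# A denser MBPP+-style suite exposes it:
def digit_sum(n):
    return sum(int(d) for d in str(n))

assert digit_sum(123) == 6          # correct reference behavior
assert digit_sum_wrong(123) != 6    # wrong solution caught: 12 + 3 = 15
```

With only the first three inputs, the broken solution is indistinguishable from a correct one; one extra three-digit case is enough to separate them.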
Scores (pass@1)
| Model | MBPP+ pass@1 |
|---|---|
| GPT-4o | 89.7% |
| Claude 3.5 Sonnet | 91.6% |
| Llama 3.1 70B | 83.1% |
| Llama 3.1 8B | 72.8% |
| Code Llama 7B | 55.4% |
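pass@1 here is the fraction of problems solved by a single generated sample. More generally, benchmarks report the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); a minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass all tests
    k: sample budget
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With n = k = 1, pass@1 reduces to the plain pass rate:
assert pass_at_k(1, 1, 1) == 1.0
assert pass_at_k(1, 0, 1) == 0.0
```

Per-problem estimates are averaged across the benchmark to produce the scores in the table above.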