MBPP

Benchmark

Mostly Basic Python Problems — 500+ crowd-sourced Python problems. Simpler than HumanEval but with broader coverage. Commonly paired with HumanEval for a more complete picture of code-generation ability.

MBPP (Austin et al., Google Brain, 2021) contains 500+ Python programming tasks collected via crowd-sourcing, ranging from simple string operations to basic algorithm implementations. Each task includes a natural-language description, 3 test cases used for evaluation, and 1 example test case shown in the prompt. It is commonly reported alongside HumanEval to give a broader picture of code-generation capability.
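To make the task format concrete, here is a sketch of an MBPP-style record and a minimal checker. The task wording is paraphrased and the `passes_all_tests` helper is illustrative, not part of the official harness (real evaluations sandbox this `exec` step):

```python
# MBPP-style task record (paraphrased, illustrative; field names follow the
# published dataset: a text description, a reference solution, and asserts).
task = {
    "text": "Write a function to find the shared elements from two given lists.",
    "code": "def similar_elements(a, b):\n    return tuple(set(a) & set(b))",
    "test_list": [
        "assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == {4, 5}",
    ],
}

def passes_all_tests(candidate_code: str, tests: list) -> bool:
    """Run the candidate, then each assert; any exception means the task is unsolved.
    Hypothetical helper for illustration -- real harnesses sandbox execution."""
    namespace = {}
    try:
        exec(candidate_code, namespace)
        for t in tests:
            exec(t, namespace)
    except Exception:
        return False
    return True
```

A model's generation is scored as solved only if every hidden assert passes, e.g. `passes_all_tests(task["code"], task["test_list"])` returns `True`.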

MBPP vs HumanEval

| | MBPP | HumanEval |
|---|---|---|
| Origin | Crowd-sourced (diverse) | Hand-crafted by OpenAI |
| Problems | 500+ | 164 |
| Difficulty | Entry–Intermediate | Intermediate |
| Language | Python | Python |
| Format | Description → function | Docstring → function |
| Test cases per problem | 3 (hidden) | Variable (hidden) |
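The "description → function" vs. "docstring → function" distinction is easiest to see side by side. The prompts below are illustrative shapes only; exact wording varies across papers and evaluation harnesses:

```python
# Illustrative prompt shapes only (not verbatim from either dataset).

# MBPP: a plain-English task description plus an example assert.
mbpp_style_prompt = (
    "Write a function to reverse the order of words in a given string.\n"
    "Your code should pass this test:\n"
    'assert reverse_words("python program") == "program python"'
)

# HumanEval: a function signature with a docstring; the model completes the body.
humaneval_style_prompt = '''def reverse_words(s: str) -> str:
    """Reverse the order of words in s.

    >>> reverse_words("python program")
    'program python'
    """
'''
```

In the MBPP shape the model writes the whole function from scratch; in the HumanEval shape it continues from a fixed signature, which is why the two benchmarks stress slightly different skills.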

MBPP+ (Extended)

MBPP+ (from the EvalPlus suite, Liu et al., 2023) is a cleaned and augmented version with improved test coverage (80 tests per problem vs. 3), reducing the chance of passing with trivially wrong solutions. It is the preferred variant for serious evaluation.
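Why extra tests matter: a solution that is wrong in general can still pass a small, weak test set. A minimal sketch with a made-up "min of three" task (not an actual MBPP problem):

```python
# Illustrative buggy solution: ignores its third argument entirely.
def buggy_min(a, b, c):
    return a if a < b else b

# Weak coverage (3 tests, like original MBPP): c never happens to be the minimum.
weak_tests = [(1, 2, 3), (2, 5, 9), (4, 6, 7)]
# Extended coverage (like MBPP+): adds cases where c is the minimum.
strong_tests = weak_tests + [(5, 6, 2), (9, 9, 1)]

passes_weak = all(buggy_min(a, b, c) == min(a, b, c) for a, b, c in weak_tests)
passes_strong = all(buggy_min(a, b, c) == min(a, b, c) for a, b, c in strong_tests)
```

Here `passes_weak` is `True` while `passes_strong` is `False`: the bug is invisible to the weak suite, which is exactly the failure mode MBPP+'s augmented tests are designed to catch.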

Scores (pass@1)

| Model | MBPP+ pass@1 |
|---|---|
| GPT-4o | 89.7% |
| Claude 3.5 Sonnet | 91.6% |
| Llama 3.1 70B | 83.1% |
| Llama 3.1 8B | 72.8% |
| Code Llama 7B | 55.4% |
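For reference, pass@1 (and pass@k generally) is usually computed with the unbiased estimator introduced in the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k samples passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), where n samples were drawn and c passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to c/n, the fraction of samples that pass; scores like the 91.6% above are averages of this quantity over all problems.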