MBPP

Benchmark

Mostly Basic Python Problems — 500+ crowd-sourced Python problems. Simpler than HumanEval but with broader coverage. Commonly paired with HumanEval for a more complete picture of code-generation ability.

MBPP (Austin et al., Google Brain, 2021) contains 500+ Python programming tasks collected via crowd-sourcing, ranging from simple string operations to basic algorithm implementations. Each task includes a natural-language description, 3 test cases used for evaluation, and 1 example test case shown in the prompt. It is commonly reported alongside HumanEval to give a broader picture of code-generation capability.
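To make the task format concrete, here is a sketch of an MBPP-style record and a minimal checker. The task wording is paraphrased and the `passes_all_tests` helper is illustrative, not part of the official harness (real evaluations sandbox this `exec` step):

```python
# MBPP-style task record (paraphrased, illustrative; field names follow the
# published dataset: a text description, a reference solution, and asserts).
task = {
    "text": "Write a function to find the shared elements from two given lists.",
    "code": "def similar_elements(a, b):\n    return tuple(set(a) & set(b))",
    "test_list": [
        "assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == {4, 5}",
    ],
}

def passes_all_tests(candidate_code: str, tests: list) -> bool:
    """Run the candidate, then each assert; any exception means the task is unsolved.
    Hypothetical helper for illustration -- real harnesses sandbox execution."""
    namespace = {}
    try:
        exec(candidate_code, namespace)
        for t in tests:
            exec(t, namespace)
    except Exception:
        return False
    return True
```

A model's generation is scored as solved only if every hidden assert passes, e.g. `passes_all_tests(task["code"], task["test_list"])` returns `True`.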

MBPP vs HumanEval

| | MBPP | HumanEval |
|---|---|---|
| Origin | Crowd-sourced (diverse) | Hand-crafted by OpenAI |
| Problems | 500+ | 164 |
| Difficulty | Entry–Intermediate | Intermediate |
| Language | Python | Python |
| Format | Description → function | Docstring → function |
| Test cases per problem | 3 (hidden) | Variable (hidden) |
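The "description → function" vs. "docstring → function" distinction is easiest to see side by side. The prompts below are illustrative shapes only; exact wording varies across papers and evaluation harnesses:

```python
# Illustrative prompt shapes only (not verbatim from either dataset).

# MBPP: a plain-English task description plus an example assert.
mbpp_style_prompt = (
    "Write a function to reverse the order of words in a given string.\n"
    "Your code should pass this test:\n"
    'assert reverse_words("python program") == "program python"'
)

# HumanEval: a function signature with a docstring; the model completes the body.
humaneval_style_prompt = '''def reverse_words(s: str) -> str:
    """Reverse the order of words in s.

    >>> reverse_words("python program")
    'program python'
    """
'''
```

In the MBPP shape the model writes the whole function from scratch; in the HumanEval shape it continues from a fixed signature, which is why the two benchmarks stress slightly different skills.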

MBPP+ (Extended)

MBPP+ (from the EvalPlus suite, Liu et al., 2023) is a cleaned and augmented version with improved test coverage (80 tests per problem vs. 3), reducing the chance of passing with trivially wrong solutions. It is the preferred variant for serious evaluation.
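Why extra tests matter: a solution that is wrong in general can still pass a small, weak test set. A minimal sketch with a made-up "min of three" task (not an actual MBPP problem):

```python
# Illustrative buggy solution: ignores its third argument entirely.
def buggy_min(a, b, c):
    return a if a < b else b

# Weak coverage (3 tests, like original MBPP): c never happens to be the minimum.
weak_tests = [(1, 2, 3), (2, 5, 9), (4, 6, 7)]
# Extended coverage (like MBPP+): adds cases where c is the minimum.
strong_tests = weak_tests + [(5, 6, 2), (9, 9, 1)]

passes_weak = all(buggy_min(a, b, c) == min(a, b, c) for a, b, c in weak_tests)
passes_strong = all(buggy_min(a, b, c) == min(a, b, c) for a, b, c in strong_tests)
```

Here `passes_weak` is `True` while `passes_strong` is `False`: the bug is invisible to the weak suite, which is exactly the failure mode MBPP+'s augmented tests are designed to catch.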

Scores (pass@1)

| Model | MBPP+ pass@1 |
|---|---|
| GPT-4o | 89.7% |
| Claude 3.5 Sonnet | 91.6% |
| Llama 3.1 70B | 83.1% |
| Llama 3.1 8B | 72.8% |
| Code Llama 7B | 55.4% |
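For reference, pass@1 (and pass@k generally) is usually computed with the unbiased estimator introduced in the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k samples passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), where n samples were drawn and c passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to c/n, the fraction of samples that pass; scores like the 91.6% above are averages of this quantity over all problems.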