SWE-bench (Jimenez et al., Princeton 2024) constructs tasks from real GitHub pull requests: the model receives a repository snapshot and an issue description, then must produce a code diff that resolves the issue and passes the test suite written by the original PR author. This requires understanding large codebases, not just completing isolated functions.
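Concretely, each task is a small record pointing at a frozen repository state. A sketch of one instance follows; the field names follow the public princeton-nlp/SWE-bench dataset release, but every value here is invented for illustration:

```python
# Sketch of one SWE-bench instance. Field names follow the public
# princeton-nlp/SWE-bench dataset release; all values are hypothetical.
instance = {
    "instance_id": "django__django-12345",        # illustrative ID only
    "repo": "django/django",
    "base_commit": "abc1234",                     # repo snapshot the model starts from
    "problem_statement": "Issue text: QuerySet.union() raises ...",
    "patch": "diff --git a/django/db/models/ ...",  # gold patch from the real PR
    "test_patch": "diff --git a/tests/ ...",        # tests added by the PR author
    "FAIL_TO_PASS": ["tests.queries.test_union"],   # must flip from failing to passing
    "PASS_TO_PASS": ["tests.queries.test_basic"],   # must keep passing (no regressions)
}
```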
Structure
| Property | Detail |
|---|---|
| Total instances | 2,294 tasks (full SWE-bench), 500 (Verified subset), 300 (Lite subset) |
| Repos | 12 popular Python repos (Django, Flask, Matplotlib, Requests, Scikit-learn…) |
| Metric | % Resolved (patch passes all tests) |
| Context window required | Typically 32K–200K tokens for full repo context |
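"% Resolved" is a binary per-instance decision: apply the model's patch together with the PR's test patch, run the designated tests, and count the instance as resolved only if every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes. A minimal sketch of that decision, assuming a `run_tests` helper that stands in for the benchmark's containerized harness:

```python
def is_resolved(instance, model_patch, run_tests):
    """Per-instance "% Resolved" decision (sketch).

    `run_tests(repo, commit, patches, test_ids)` is an assumed helper that
    applies the patches at the base commit, runs the named tests, and
    returns {test_id: passed}; the real harness does this inside Docker.
    """
    results = run_tests(
        repo=instance["repo"],
        commit=instance["base_commit"],
        patches=[model_patch, instance["test_patch"]],
        test_ids=instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"],
    )
    # Resolved only if the new tests flip to passing AND nothing regresses.
    return (all(results[t] for t in instance["FAIL_TO_PASS"])
            and all(results[t] for t in instance["PASS_TO_PASS"]))
```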
Why SWE-bench Changed Coding Evaluation
HumanEval tests toy functions. SWE-bench mirrors real engineering: identifying the root cause in thousands of lines of existing code, writing a minimal correct patch, and leaving the rest of the test suite green. Resolution typically requires an agentic loop in which the model searches files, reads tests, patches iteratively, and re-runs the test suite.
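The core of such a loop fits in a few lines. A sketch, where `llm` and the tool functions are assumed interfaces rather than any real scaffold's API:

```python
def agent_loop(instance, llm, tools, max_steps=30):
    """Minimal search/edit/test loop. Every name here (llm.next_action,
    the tool registry, action fields) is an assumed interface, not the
    API of any published scaffold."""
    history = [f"Issue:\n{instance['problem_statement']}"]
    for _ in range(max_steps):
        action = llm.next_action(history)   # e.g. search / read / edit / run_tests / submit
        if action.name == "submit":
            return action.patch             # candidate diff handed to the evaluator
        observation = tools[action.name](**action.args)
        history.append(f"{action.name}({action.args}) -> {observation}")
    return None                             # step budget exhausted: counted as unresolved
```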
Scores (SWE-bench Verified, % Resolved)
| Model / System | Resolved % |
|---|---|
| o3 (high compute) | 71.7% |
| Claude 3.7 Sonnet (agentic) | 70.3% |
| GPT-4o (agentic scaffold) | 49.0% |
| Claude 3.5 Sonnet (agentic) | 49.0% |
| Llama 3.1 405B (scaffold) | 28.1% |
| Unassisted GPT-4 (2024 baseline) | 1.7% |
Agentic Scaffolding
Most SWE-bench results use an agent framework (tools for file reading, search, bash execution, test running) wrapped around the LLM; bare model performance is much lower. This makes SWE-bench a benchmark for entire agentic systems, not just models, which complicates apples-to-apples comparisons.
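In practice the scaffold is often little more than a tool registry plus the loop sketched earlier. A simplified sketch, where the tool names and shapes are illustrative and real harnesses sandbox all of this in Docker:

```python
import subprocess

REPO = "repo"  # assumed checkout of instance["repo"] at instance["base_commit"]

def _sh(args):
    """Run a command in the repo checkout and return truncated output."""
    out = subprocess.run(args, cwd=REPO, capture_output=True, text=True, timeout=300)
    return (out.stdout + out.stderr)[-4000:]  # keep the tail to fit the context window

def _apply_patch(patch: str) -> str:
    """Feed a unified diff to `git apply` on stdin (no sandboxing in this sketch)."""
    out = subprocess.run(["git", "apply", "-"], input=patch, cwd=REPO,
                         capture_output=True, text=True)
    return out.stderr or "patch applied"

# Typical tool surface handed to the model; names and shapes are illustrative.
TOOLS = {
    "search":    lambda query: _sh(["grep", "-rn", "--include=*.py", query, "."]),
    "read":      lambda path: _sh(["cat", path]),
    "edit":      _apply_patch,
    "run_tests": lambda test_id=None: _sh(
        ["python", "-m", "pytest", "-x", "-q"] + ([test_id] if test_id else [])),
}
```

Everything about this surface, from which tools exist to how output is truncated and how many steps are allowed, feeds into the final score, which is why scaffold details matter as much as the underlying model.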