SWE-bench (Jimenez et al., Princeton 2024) constructs tasks from real GitHub pull requests: the model receives a repository snapshot and an issue description, then must produce a code diff that resolves the issue and passes the test suite written by the original PR author. This requires understanding large codebases, not just completing isolated functions.
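Concretely, each task is a small record pointing at a frozen repository state. A sketch of one instance follows; the field names follow the public princeton-nlp/SWE-bench dataset release, but every value here is invented for illustration:

```python
# Sketch of one SWE-bench instance. Field names follow the public
# princeton-nlp/SWE-bench dataset release; all values are hypothetical.
instance = {
    "instance_id": "django__django-12345",        # illustrative ID only
    "repo": "django/django",
    "base_commit": "abc1234",                     # repo snapshot the model starts from
    "problem_statement": "Issue text: QuerySet.union() raises ...",
    "patch": "diff --git a/django/db/models/ ...",  # gold patch from the real PR
    "test_patch": "diff --git a/tests/ ...",        # tests added by the PR author
    "FAIL_TO_PASS": ["tests.queries.test_union"],   # must flip from failing to passing
    "PASS_TO_PASS": ["tests.queries.test_basic"],   # must keep passing (no regressions)
}
```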
Structure
| Property | Detail |
|---|---|
| Total instances | 2,294 tasks (full SWE-bench), 500 (Verified subset), 300 (Lite subset) |
| Repos | 12 popular Python repos (Django, Flask, Matplotlib, Requests, Scikit-learn…) |
| Metric | % Resolved (patch passes all tests) |
| Context window required | Typically 32K–200K tokens for full repo context |
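"% Resolved" is a binary per-instance decision: apply the model's patch together with the PR's test patch, run the designated tests, and count the instance as resolved only if every FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes. A minimal sketch of that decision, assuming a `run_tests` helper that stands in for the benchmark's containerized harness:

```python
def is_resolved(instance, model_patch, run_tests):
    """Per-instance "% Resolved" decision (sketch).

    `run_tests(repo, commit, patches, test_ids)` is an assumed helper that
    applies the patches at the base commit, runs the named tests, and
    returns {test_id: passed}; the real harness does this inside Docker.
    """
    results = run_tests(
        repo=instance["repo"],
        commit=instance["base_commit"],
        patches=[model_patch, instance["test_patch"]],
        test_ids=instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"],
    )
    # Resolved only if the new tests flip to passing AND nothing regresses.
    return (all(results[t] for t in instance["FAIL_TO_PASS"])
            and all(results[t] for t in instance["PASS_TO_PASS"]))
```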
Why SWE-bench Changed Coding Evaluation
HumanEval tests toy functions. SWE-bench mirrors real engineering: identifying the root cause in thousands of lines of existing code, writing a minimal correct patch, and leaving the rest of the test suite green. Resolution typically requires an agentic loop in which the model searches files, reads tests, patches iteratively, and re-runs the test suite.
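The core of such a loop fits in a few lines. A sketch, where `llm` and the tool functions are assumed interfaces rather than any real scaffold's API:

```python
def agent_loop(instance, llm, tools, max_steps=30):
    """Minimal search/edit/test loop. Every name here (llm.next_action,
    the tool registry, action fields) is an assumed interface, not the
    API of any published scaffold."""
    history = [f"Issue:\n{instance['problem_statement']}"]
    for _ in range(max_steps):
        action = llm.next_action(history)   # e.g. search / read / edit / run_tests / submit
        if action.name == "submit":
            return action.patch             # candidate diff handed to the evaluator
        observation = tools[action.name](**action.args)
        history.append(f"{action.name}({action.args}) -> {observation}")
    return None                             # step budget exhausted: counted as unresolved
```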
Scores (SWE-bench Verified, % Resolved)
| Model / System | Resolved % |
|---|---|
| o3 (high compute) | 71.7% |
| Claude 3.7 Sonnet (agentic) | 70.3% |
| GPT-4o (agentic scaffold) | 49.0% |
| Claude 3.5 Sonnet (agentic) | 49.0% |
| Llama 3.1 405B (scaffold) | 28.1% |
| Unassisted GPT-4 (2024 baseline) | 1.7% |
Agentic Scaffolding
Most SWE-bench results use an agent framework (tools for file reading, search, bash execution, test running) wrapped around the LLM; bare model performance is much lower. This makes SWE-bench a benchmark for entire agentic systems, not just models, which complicates apples-to-apples comparisons.
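In practice the scaffold is often little more than a tool registry plus the loop sketched earlier. A simplified sketch, where the tool names and shapes are illustrative and real harnesses sandbox all of this in Docker:

```python
import subprocess

REPO = "repo"  # assumed checkout of instance["repo"] at instance["base_commit"]

def _sh(args):
    """Run a command in the repo checkout and return truncated output."""
    out = subprocess.run(args, cwd=REPO, capture_output=True, text=True, timeout=300)
    return (out.stdout + out.stderr)[-4000:]  # keep the tail to fit the context window

def _apply_patch(patch: str) -> str:
    """Feed a unified diff to `git apply` on stdin (no sandboxing in this sketch)."""
    out = subprocess.run(["git", "apply", "-"], input=patch, cwd=REPO,
                         capture_output=True, text=True)
    return out.stderr or "patch applied"

# Typical tool surface handed to the model; names and shapes are illustrative.
TOOLS = {
    "search":    lambda query: _sh(["grep", "-rn", "--include=*.py", query, "."]),
    "read":      lambda path: _sh(["cat", path]),
    "edit":      _apply_patch,
    "run_tests": lambda test_id=None: _sh(
        ["python", "-m", "pytest", "-x", "-q"] + ([test_id] if test_id else [])),
}
```

Everything about this surface, from which tools exist to how output is truncated and how many steps are allowed, feeds into the final score, which is why scaffold details matter as much as the underlying model.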