SWE-bench

Software Engineering Benchmark: 2,294 real GitHub issues from popular Python repositories. Models must write a code patch that passes the repo's official test suite. Widely regarded as the hardest mainstream coding benchmark.

SWE-bench (Jimenez et al., Princeton 2024) constructs tasks from real GitHub pull requests: the model receives a repository snapshot and an issue description, then must produce a code diff that resolves the issue and passes the test suite written by the original PR author. This requires understanding large codebases, not just completing isolated functions.
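
Each task instance ships with everything needed to reproduce the setup. A minimal sketch of loading and inspecting one instance, assuming the Hugging Face `datasets` package and the public `princeton-nlp/SWE-bench_Verified` dataset; the field names follow the published schema but should be checked against it:

```python
# Sketch: load SWE-bench Verified and inspect one task instance.
# Assumes the `datasets` package and the public
# `princeton-nlp/SWE-bench_Verified` dataset on Hugging Face.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = ds[0]

print(task["instance_id"])        # unique task ID, e.g. "repo__repo-12345"
print(task["repo"])               # source repository, e.g. "astropy/astropy"
print(task["base_commit"])        # commit to check the repository out at
print(task["problem_statement"])  # the GitHub issue text shown to the model

# Held out for evaluation: the gold patch and the PR's tests.
# task["patch"], task["test_patch"],
# task["FAIL_TO_PASS"], task["PASS_TO_PASS"]
```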

Structure

Property | Detail
Total instances | 2,294 tasks (full SWE-bench); 500 (Verified subset)
Repos | 12 popular Python repos (Django, Flask, Matplotlib, Requests, Scikit-learn…)
Metric | % Resolved (patch passes all tests)
Context window required | Typically 32K–200K tokens for full-repo context
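
The % Resolved criterion is concrete: after the model's patch is applied, the tests the original PR added (`FAIL_TO_PASS`) must now pass, and the pre-existing tests (`PASS_TO_PASS`) must keep passing. A sketch of that check, assuming a `test_results` mapping already parsed from test logs (the official harness derives this from pytest output inside pinned Docker environments):

```python
import json

def is_resolved(task: dict, test_results: dict[str, str]) -> bool:
    """Resolution check: every FAIL_TO_PASS test now passes and every
    PASS_TO_PASS test still passes. `test_results` maps a test ID to a
    status string, e.g. {"test_foo": "PASSED"} (hypothetical shape)."""
    # In the dataset these fields are JSON-encoded lists of test IDs.
    fail_to_pass = json.loads(task["FAIL_TO_PASS"])  # tests the PR added
    pass_to_pass = json.loads(task["PASS_TO_PASS"])  # regression tests
    return all(
        test_results.get(t) == "PASSED"
        for t in fail_to_pass + pass_to_pass
    )
```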

Why SWE-bench Changed Coding Evaluation

HumanEval tests toy functions. SWE-bench mirrors real engineering: identifying the root cause in thousands of lines of existing code, writing a minimal correct patch, and not breaking other tests. Resolution typically requires an agentic loop: the model searches files, reads tests, iteratively patches, and re-runs the test suite, as sketched below.
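
In code, such a loop looks roughly like the following. The `llm_propose_patch` call is a hypothetical stand-in for whatever model and prompt format a scaffold uses; the shell commands mirror what common scaffolds run inside the task's repo checkout:

```python
import subprocess

def sh(cmd: str, cwd: str, stdin: str | None = None) -> subprocess.CompletedProcess:
    """Run a shell command in the repo checkout and capture its output."""
    return subprocess.run(cmd, shell=True, cwd=cwd, input=stdin,
                          capture_output=True, text=True)

def agent_loop(repo_dir: str, issue: str, max_iters: int = 5) -> str | None:
    """Minimal search / patch / re-test loop. `llm_propose_patch` is a
    hypothetical model call returning a unified diff, given the issue
    text and the previous iteration's test output."""
    feedback = ""
    for _ in range(max_iters):
        diff = llm_propose_patch(issue, repo_dir, feedback)  # hypothetical
        sh("git checkout -- .", cwd=repo_dir)       # reset the last attempt
        applied = sh("git apply -", cwd=repo_dir, stdin=diff)
        if applied.returncode != 0:
            feedback = f"Patch failed to apply:\n{applied.stderr}"
            continue
        tests = sh("python -m pytest -x -q", cwd=repo_dir)  # repo's own suite
        if tests.returncode == 0:
            return diff                             # tests pass: resolved
        feedback = tests.stdout[-4000:]             # feed failures back in
    return None                                     # give up after max_iters
```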

Scores (SWE-bench Verified, % Resolved)

Model / System | Resolved %
o3 (high compute) | 71.7%
Claude 3.7 Sonnet (agentic) | 70.3%
GPT-4o (agentic scaffold) | 49.0%
Claude 3.5 Sonnet (agentic) | 49.0%
Llama 3.1 405B (scaffold) | 28.1%
Unassisted GPT-4 (2024 baseline) | 1.7%

Agentic Scaffolding

Most SWE-bench results use an agent framework (tools for file reading, search, bash execution, test running) wrapped around the LLM. Bare model performance is much lower. This makes SWE-bench a benchmark for entire agentic systems, not just models — complicating apples-to-apples comparisons.
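
In practice, generation and grading are separated: the scaffold writes one JSONL prediction per instance, then the official Docker-based harness replays each patch against the pinned test environment. A sketch, assuming the `swebench` package's documented prediction format; the module path and flag names in the comment should be verified against your installed version:

```python
import json

# One JSONL record per task. This three-field shape (instance ID,
# free-form system label, unified diff) is the prediction format the
# official harness consumes.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",  # hypothetical example ID
        "model_name_or_path": "my-agent-v1",      # label for your system
        "model_patch": "diff --git a/...",        # the generated diff
    }
]
with open("preds.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")

# Then grade with the official harness (names per the swebench repo's
# docs; check them against your installed version):
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Verified \
#       --predictions_path preds.jsonl \
#       --max_workers 8 --run_id my-agent-eval
```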