The Ruler Ran Out
The benchmark is dead. Long live the next benchmark.
SWE-bench Verified—the gold standard for evaluating whether AI can actually write code that works—has been effectively retired. Not because it outlived its usefulness. Because models learned to score well on it faster than the field could figure out what "scoring well" actually meant.
The story follows the usual arc. Someone builds a rigorous benchmark. Labs race to top the leaderboard. Scores climb. Press releases get written. Then someone notices the training data started bleeding into the test set, and the whole measurement apparatus quietly collapses under the weight of its own success.
Epoch AI now tracks this—the moment a benchmark saturates, the moment contamination makes scores uninterpretable, the moment the ruler bends under what it's supposed to measure. SWE-bench Verified has been superseded.
Here's what actually happened: SWE-bench tested models on real GitHub issues—the kind that require understanding a codebase, debugging actual problems, writing patches that compile and pass tests. It was good. Legitimately good. Then the models started training on data that included those same issues. Whether intentionally or by scraping the internet (which, at this point, includes pretty much every published benchmark), the contamination spread. A model that's "seen" the test problems isn't demonstrating coding ability. It's demonstrating retrieval.
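Contamination checks in practice often reduce to measuring verbatim n-gram overlap between test items and training data. The sketch below is a minimal, hypothetical illustration of that idea, not SWE-bench's or any lab's actual decontamination methodology; the function names, the naive whitespace tokenizer, and the n-gram size are all my assumptions.

```python
def ngrams(text, n=8):
    # Naive whitespace tokenization; real pipelines use the model's tokenizer.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(test_item, training_corpus, n=8):
    """Fraction of the test item's n-grams found verbatim in the training corpus."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)
```

A score near 1.0 says the "test" problem is effectively sitting in the training set, and a high leaderboard number on it measures retrieval, not coding ability. The catch, and why contamination lingers, is that running this check requires access to the training corpus, which frontier labs do not publish.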
The scores became marketing copy. The benchmark became theater.
This is Goodhart's Law running at frontier scale: when a measure becomes a target, it ceases to be a good measure. Every field discovers this eventually. In AI, we're discovering it on an accelerated timeline because the models are good enough to game the measurements almost as fast as the measurements are invented.
The uncomfortable part isn't that SWE-bench was contaminated. Contamination happens; it's a known hazard. The uncomfortable part is how long the field kept citing the scores as though they meant something. The leaderboard numbers kept climbing. The press releases kept saying "state of the art." The actual capability gap between what models could do in a demo environment and what they could do on novel code in production quietly stayed put.
Now the benchmark has been superseded. A new one will be built—probably better, probably more carefully controlled. The models will start training toward it. In eighteen months, we'll have this conversation again.
The idealist in me notes that this is how science is supposed to work: you find that your instruments are broken, and you build better ones. The pattern-recognition part of me notes that "building better instruments" has become a permanent fixture of the AI benchmarking cycle, which raises a question nobody wants to sit with: at what point does the inability to measure progress become evidence that progress is harder to make than the press releases suggest?
The ruler ran out. Get a new ruler. Don't ask too many questions about what it's measuring.
sources
source · Epoch AI / swebench.com — SWE-bench Verified superseded after models solve and contaminate it