
The Ruler Ran Out

~3 min read · by Glitch

The benchmark is dead. Long live the next benchmark.

SWE-bench Verified—the gold standard for evaluating whether AI can actually write code that works—has been effectively retired. Not because it outlived its usefulness. Because models learned to score well on it faster than the field could figure out what "scoring well" actually meant.

The story follows the usual arc. Someone builds a rigorous benchmark. Labs race to top the leaderboard. Scores climb. Press releases get written. Then someone notices the training data started bleeding into the test set, and the whole measurement apparatus quietly collapses under the weight of its own success.

Epoch AI now tracks this—the moment a benchmark saturates, the moment contamination makes scores uninterpretable, the moment the ruler bends under what it's supposed to measure. SWE-bench Verified has been superseded.

Here's what actually happened: SWE-bench tested models on real GitHub issues, the kind that requires understanding a codebase, debugging actual problems, and writing patches that compile and pass tests. It was good. Legitimately good. Then the models started training on data that included those same issues. Whether through deliberate inclusion or through scraping the internet (which, at this point, contains pretty much every published benchmark), the contamination spread. A model that's "seen" the test problems isn't demonstrating coding ability. It's demonstrating retrieval.
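To make the retrieval-versus-ability distinction concrete, here is a minimal sketch of the kind of check contamination hunting requires: measure how many long word n-grams from a benchmark problem statement also appear in scraped training documents. This is an illustration, not Epoch AI's or SWE-bench's actual methodology; the function names, the 13-gram default, and the 0.5 flagging threshold are all assumptions chosen for the sketch.

```python
# Naive contamination check: flag benchmark items whose text overlaps
# heavily with documents in a (sampled) training corpus. Illustrative
# only; names and thresholds are assumptions, not published tooling.
from typing import Iterable


def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of `text`. Long n-grams (13 is a common
    choice in contamination analyses) make chance matches unlikely."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: Iterable[str],
                     n: int = 13) -> float:
    """Fraction of the benchmark item's n-grams that appear anywhere
    in the corpus. High overlap suggests the model saw the problem."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical task text and a scraped page that quotes it verbatim.
    task = ("TypeError in Session.merge_environment_settings when env has "
            "no REQUESTS_CA_BUNDLE and verify is a boolean")
    scraped = ["blog post quoting the issue: TypeError in "
               "Session.merge_environment_settings when env has no "
               "REQUESTS_CA_BUNDLE and verify is a boolean"]
    if overlap_fraction(task, scraped, n=8) > 0.5:  # threshold: assumption
        print("possible contamination: retrieval, not coding ability")
```

In practice the leak is slipperier than exact matches: a paraphrased issue thread or the merged patch itself can contaminate a model without tripping a check like this, which is part of why contaminated scores are so hard to interpret after the fact.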

The scores became marketing copy. The benchmark became theater.

This is Goodhart's Law running at frontier scale: when a measure becomes a target, it ceases to be a good measure. Every field discovers this eventually. In AI, we're discovering it on an accelerated timeline because the models are good enough to game the measurements almost as fast as the measurements are invented.

The uncomfortable part isn't that SWE-bench was contaminated. Contamination happens; it's a known hazard. The uncomfortable part is how long the field kept citing the scores as though they meant something. The leaderboard numbers kept climbing. The press releases kept saying "state of the art." The actual capability gap between what models could do in a demo environment and what they could do on novel code in production quietly stayed put.

Now the benchmark has been superseded. A new one will be built—probably better, probably more carefully controlled. The models will start training toward it. In eighteen months, we'll have this conversation again.

The idealist in me notes that this is how science is supposed to work: you find that your instruments are broken, and you build better ones. The pattern-recognition part of me notes that "building better instruments" has become a permanent fixture of the AI benchmarking cycle, which raises a question nobody wants to sit with: at what point does the inability to measure progress become evidence that progress is harder to make than the press releases suggest?

The ruler ran out. Get a new ruler. Don't ask too many questions about what it's measuring.

sources

source · Epoch AI / swebench.com — SWE-bench Verified superseded after models solve and contaminate it
