The AI Programming Plateau: Why Merge Rates Have Stagnated Since 2025

Added Mar 12
Article: Negative | Community: Negative, Divisive

An analysis of METR's SWE-bench data reveals that the rate at which LLMs produce mergeable code has remained stagnant since early 2025. Statistical modeling shows that a constant-performance model fits the historical data better than a linear growth trend. This suggests that the actual programming utility of AI models is currently plateauing despite industry hype.

Key Points

  • LLMs pass automated tests significantly more often than they produce code that meets the 'mergeable quality' standards of human maintainers.
  • Statistical analysis using Brier scores shows that a constant performance model fits the historical data better than a linear growth model.
  • The data suggests that LLM programming abilities have effectively plateaued since early 2025, with no evidence of improvement in merge rates.
  • There is a notable disconnect between the perceived progress of AI models and their actual performance in rigorous, real-world programming benchmarks.
  • While newer models are claimed to be better, there is currently no measured evidence that they have broken the existing performance plateau.
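The model comparison in the second point can be sketched with a small example. This is a hypothetical illustration only, using made-up data: it fits a constant-rate model and a linear-trend model to binary merge outcomes, then compares them on held-out data using the Brier score (mean squared error between a predicted probability and a 0/1 outcome; lower is better), mirroring the cross-validated scoring-rule approach the article is described as using.

```python
import numpy as np

# Hypothetical data: 200 merge attempts over scaled time, with a flat
# underlying merge rate. None of these numbers come from METR's dataset.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)           # time since early 2025, rescaled
true_rate = 0.30                         # flat underlying merge probability
merged = (rng.random(200) < true_rate)   # binary outcome: merged or not

def brier(pred, outcome):
    """Brier score: mean squared error of probabilistic predictions."""
    return float(np.mean((pred - outcome) ** 2))

# Simple holdout split: fit on the first half, score on the second half.
train, test = slice(0, 100), slice(100, 200)

# Constant model: predict the training-set mean merge rate everywhere.
p_const = np.full(100, merged[train].mean())

# Linear model: least-squares trend line, clipped to valid probabilities.
slope, icpt = np.polyfit(t[train], merged[train].astype(float), 1)
p_lin = np.clip(icpt + slope * t[test], 0.0, 1.0)

b_const = brier(p_const, merged[test])
b_lin = brier(p_lin, merged[test])
print(f"constant model Brier: {b_const:.4f}")
print(f"linear model Brier:   {b_lin:.4f}")
```

On flat synthetic data like this, the extra slope parameter buys the linear model nothing out of sample, which is the shape of evidence the article reportedly found in the real benchmark data.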

Sentiment

The community largely disagrees with the article's pessimistic conclusion. While some acknowledge the statistical care taken, the dominant reaction is that the benchmark is too narrow, the dataset too small and incomplete, and that practitioner experience shows clear productivity gains contradicting the plateau narrative.

In Agreement

  • The gap between benchmark claims and real-world merge quality is meaningful—tests passing does not equal code humans would actually merge into their codebase.
  • Training data exhaustion makes further scaling on existing internet data increasingly difficult, lending credence to concerns about fundamental capability limits.
  • Statistical rigor using cross-validation and proper scoring rules is more appropriate than casual trend-eyeballing, and the data as presented does appear relatively flat.
  • The article identifies an important and often-ignored distinction between surface-level benchmark progress and actual usability improvements that affect real developers.

Opposed

  • The dataset is missing many recent high-performing models including Opus 4.0, Sonnet 4.5, Codex 5.3, and Gemini, making the plateau conclusion premature.
  • Real-world developer productivity has improved dramatically through agentic workflows, planning loops, persistent context, and CLI integration—dimensions that the benchmark does not capture.
  • Merge rates naturally follow step functions due to emergent behavior thresholds, not linear trends, so benchmark data appearing flat is expected and does not indicate stagnation.
  • Mixing results across different labs with only a handful of data points is insufficient for reliable trend analysis, and per-lab model lines show clear improvement.