Claude Code Opus 4.5 Shows Significant 30-Day Performance Dip
An independent tracker finds that Claude Code Opus 4.5 has slipped from a 58% baseline pass rate to 54% over the past 30 days, a statistically significant −4.1% drop. The daily and 7-day pass rates (50% and 53%) are also down, but stay within their significance thresholds. Results come from daily SWE-Bench-Pro runs through the Claude Code CLI, with significance judged against 95% confidence intervals.
Key Points
- Baseline pass rate is 58%; current 30-day pass rate is 54% across 655 evaluations.
- A statistically significant 30-day degradation of −4.1% is detected, exceeding the ±3.4% significance threshold.
- Daily (50%) and 7-day (53%) declines are not statistically significant given their larger thresholds (±14.0% and ±5.6%).
- Evaluations run daily on a curated subset of SWE-Bench-Pro via Claude Code CLI with the latest Opus 4.5 model, without custom harnesses.
- The statistical methodology models each task pass as a Bernoulli trial and uses 95% confidence intervals; the tracker emails alerts when a significant drop is detected.
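
As a rough illustration of how such thresholds scale with sample size, the sketch below computes normal-approximation 95% confidence-interval half-widths for a Bernoulli pass rate at the approximate daily, 7-day, and 30-day sample sizes cited in the article. The `ci_half_width` and `significant_drop` helpers are hypothetical, and the tracker's actual test likely differs (its published thresholds are slightly tighter), so treat the numbers as approximate.

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for a Bernoulli pass rate."""
    return z * math.sqrt(p * (1 - p) / n)

def significant_drop(baseline: float, observed: float, n: int) -> bool:
    """Flag a drop only if it exceeds the CI half-width at this sample size."""
    return (baseline - observed) > ci_half_width(baseline, n)

# Approximate sample sizes: ~50 tasks/day, ~350/week, 655 over 30 days (per the article).
for label, n in [("daily", 50), ("7-day", 350), ("30-day", 655)]:
    print(f"{label:>6}: ±{ci_half_width(0.58, n):.1%}")

# The reported 58% -> 54% drop over 655 evaluations clears the 30-day threshold.
print(significant_drop(baseline=0.58, observed=0.54, n=655))
```

With only about 50 tasks per day, the half-width lands near ±14%, which is why a single-day reading such as 50% is not by itself evidence of a change, while the same gap over 655 evaluations is.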
Sentiment
The community is largely skeptical of Anthropic's assurances and believes some form of quality degradation is occurring, whether intentional or not. While there is strong support for the concept of independent benchmarking, many technically sophisticated commenters note the current methodology needs significant improvement. The overall tone is one of frustrated trust erosion rather than outright hostility, with users wanting transparency and accountability from AI providers they depend on professionally.
In Agreement
- Many users report tangible, reproducible degradation in Claude Code quality, especially during peak US hours, validating the benchmark's findings
- Third-party independent benchmarking of AI services is crucial because providers have financial incentives to quietly reduce quality through quantization, reduced thinking tokens, or model swapping
- Anthropic's carefully worded denial about never reducing 'model quality' leaves room for reducing output quality through other means, such as reduced thinking time or harness changes
- The benchmark caught a real harness regression that Anthropic confirmed, demonstrating the value of continuous external monitoring
- Context compaction in Claude Code renders sessions nearly useless, contributing to perceived quality drops
Opposed
- The benchmark methodology is flawed: running only ~50 tasks once daily yields a ±14% daily significance threshold, making the daily metric essentially meaningless (see the sketch after this list)
- Perceived degradation likely reflects the honeymoon-hangover effect where users notice flaws more over time, natural LLM non-determinism, or rising expectations rather than actual model changes
- GPU floating-point non-determinism and batch processing effects are natural sources of variance that do not indicate intentional degradation
- Antirez argued the oscillating percentage pattern is inconsistent with secret model swapping, which would produce sharp square-wave drops to a lower model's baseline
- Multiple Anthropic employees stated on record that they never reduce model intelligence, and the confirmed issue was a harness bug, not a model quality reduction
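
To make the small-sample objection concrete, here is a minimal Monte Carlo sketch, assuming independent tasks and a constant true pass rate of 58% (the article's baseline, with no model change at all). It estimates how often a 50-task daily run would read 50% or lower from sampling noise alone; under these assumptions that happens on roughly one simulated day in six.

```python
import random

random.seed(0)
TRUE_PASS_RATE = 0.58   # baseline from the article; assumed constant (no model change)
TASKS_PER_DAY = 50      # approximate daily sample size cited by critics
DAYS = 10_000

# Simulate daily pass rates from an unchanged model and count how often
# pure sampling noise produces a reading of 50% or lower.
low_days = 0
for _ in range(DAYS):
    passes = sum(random.random() < TRUE_PASS_RATE for _ in range(TASKS_PER_DAY))
    if passes / TASKS_PER_DAY <= 0.50:
        low_days += 1

print(f"Days at or below 50% despite no change: {low_days / DAYS:.1%}")
```

This is only a sketch of the critics' argument, not a claim about what actually happened; it shows why the debate centers on the 30-day aggregate rather than any single daily reading.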