Postmortem: Three Overlapping Infra Bugs Degraded Claude—Fixes Shipped, Evals and Tooling Upgraded

Anthropic diagnosed and fixed three overlapping infrastructure bugs—context window misrouting, TPU output corruption, and an XLA approximate top‑k miscompile—that intermittently degraded Claude’s responses. Amplified by an Aug 29 load-balancing change, the issues varied by model and platform; fixes were deployed between Sept 2 and Sept 16, with Bedrock routing remediation still in progress. The company is instituting more sensitive, continuous production evaluations and better privacy-preserving debugging tools, and it emphasizes that it never reduces model quality in response to demand.
Key Points
- Three overlapping infrastructure bugs (routing error, TPU output corruption, XLA approximate top‑k miscompilation) caused intermittent quality degradation; no intentional quality reductions occurred.
- A load-balancing change on Aug 29 dramatically amplified the routing bug’s user impact, peaking at 16% of Sonnet 4 requests in one hour; sticky routing repeatedly sent affected users back to the same misrouted servers, compounding per-session degradation.
- Output corruption on TPU servers sporadically boosted the probability of improbable tokens (e.g., Thai characters in English responses); the change was rolled back Sept 2 and new detection tests were added.
- A latent XLA:TPU compiler issue in approximate top‑k caused incorrect token selection under certain conditions; Anthropic switched to exact top‑k with enhanced precision and is working with XLA on a fix.
- Detection was hindered by noisy evals, models self-recovering in ways that masked errors, and privacy constraints; Anthropic is rolling out more sensitive, continuous production evals and better debugging tools, and is soliciting user feedback.
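
The approximate top‑k point above can be illustrated with a minimal, hypothetical sketch in plain Python (the function names, logits, and "buggy" result are invented for illustration; this is not Anthropic's or XLA's actual code). Exact top‑k keeps the k highest-scoring tokens before sampling; if a miscompiled approximate top‑k returns the wrong candidate set, generation can emit a token the model itself scored as highly improbable:

```python
def top_k_filter(logits, k):
    """Return indices of the k largest logits (exact top-k)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def pick(logits, allowed):
    """Greedy selection among the allowed candidate set (deterministic for the sketch)."""
    return max(allowed, key=lambda i: logits[i])

# Hypothetical next-token scores; index 0 is the model's clear favorite.
logits = [4.0, 3.5, 0.1, -2.0, -5.0]

exact = top_k_filter(logits, k=2)   # [0, 1]: the two genuinely best tokens
buggy = [2, 3]                      # hypothetical miscompiled approximate top-k output

assert pick(logits, exact) == 0     # correct: most probable token is chosen
assert pick(logits, buggy) == 2     # wrong candidate set: an improbable token "wins"
```

This is why the failure looked like occasional bizarre tokens rather than uniformly worse output: selection is only wrong when the approximate candidate set happens to miss the true top tokens, which matches the intermittent, condition-dependent behavior described in the postmortem.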
Sentiment
Mixed but leaning skeptical. Many commenters appreciate the transparency and technical depth of the postmortem, but a significant contingent doubts the disclosure is complete and is frustrated that no compensation was offered. There is tension between those who defend Anthropic's openness as exceeding industry norms and those who read the postmortem as carefully worded minimization of the issues' scope and impact.
In Agreement
- Appreciation for Anthropic's willingness to share detailed technical infrastructure information publicly, something they normally do not do
- Recognition that these are genuinely difficult problems to detect and fix, especially given the non-deterministic nature of LLM outputs and the intermittent nature of the bugs
- Agreement that the bugs were infrastructure-level issues (routing, token sampling, compiler bugs) rather than intentional model degradation to save costs
- Acknowledgment that Claude remains a valuable enough product that users continue paying despite the issues, suggesting the core model quality is strong
- Praise for Anthropic's privacy practices, including the explicit consent mechanism for the thumbs-down feedback button and internal data access controls
- Recognition that all major AI providers are facing similar scaling challenges and reliability issues, not just Anthropic
Opposed
- Skepticism that response quality could drop due to infrastructure bugs and remain undetected for weeks, with some commenters calling it implausible
- Criticism that the postmortem lacked key details such as actual quality impact metrics and the rate of impact for the XLA bug, suggesting the omitted numbers were unfavorable
- Frustration that no credits or compensation were offered to paying users who received degraded service, especially those paying premium subscription prices
- Argument that the postmortem's remediation plan amounts to vague promises of better testing rather than concrete systemic solutions
- Concern that there are no real SLAs or accountability mechanisms for LLM output quality, making it impossible for customers to independently verify they are getting what they paid for
- Suspicion that the sticky routing issue meant affected users experienced far more degradation than the headline statistics suggest, with wording designed to minimize the apparent impact
- Criticism that the postmortem reveals a lack of basic unit testing for deterministic infrastructure components, suggesting insufficient SRE discipline at Anthropic