AWS US-EAST-1 Incident: NLB Health Subsystem and DNS Issues, Recovery Underway
AWS US-EAST-1 experienced broad service disruptions caused by DNS resolution issues (notably affecting DynamoDB) and an impairment in an internal subsystem that monitors NLB health. AWS stabilized the region by throttling new EC2 instance launches, adjusting Lambda's SQS polling, and working through backlogs, leading to steady recovery across services. As of the latest update, Lambda has fully recovered, EC2 launch throttles are being reduced, and queued events are expected to be fully processed soon.
Key Points
- Root causes included DNS resolution failures for DynamoDB and an issue in a broader internal subsystem that monitors NLB health, driving widespread connectivity and API errors in US-EAST-1.
- AWS throttled new EC2 instance launches to stabilize the region, impacting dependent services like ECS, RDS, and Glue.
- Lambda experienced invocation errors, and AWS temporarily slowed SQS Event Source Mapping polling; invocations later recovered and polling was restored to normal (see the sketch after this list).
- Backlogs accumulated for services (e.g., CloudTrail, EventBridge, SQS via Lambda) and were processed as recovery advanced.
- By early afternoon, most services showed significant recovery; Lambda was fully recovered, EC2 launch throttles were being reduced, and full normalization was ongoing.
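To make the Lambda SQS polling adjustment above concrete, here is a minimal sketch of the customer-facing controls over the same polling mechanism: listing a function's SQS Event Source Mappings and pausing or resuming them. The function name `my-queue-consumer` is hypothetical, and AWS's change during the incident was internal; this is only an illustration of how that polling is surfaced to customers, assuming boto3 and credentials are configured.

```python
# Minimal sketch (hypothetical function name): inspect and pause/resume the
# SQS Event Source Mappings that control how Lambda polls a queue.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

FUNCTION_NAME = "my-queue-consumer"  # hypothetical

# List the mappings (SQS pollers) attached to the function.
mappings = lambda_client.list_event_source_mappings(
    FunctionName=FUNCTION_NAME
)["EventSourceMappings"]

for m in mappings:
    print(m["UUID"], m["EventSourceArn"], m["State"])

# Pause polling during an incident (messages remain in the queue)...
# lambda_client.update_event_source_mapping(UUID=mappings[0]["UUID"], Enabled=False)

# ...and resume it once invocations recover; the backlog is then drained.
# lambda_client.update_event_source_mapping(UUID=mappings[0]["UUID"], Enabled=True)
```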
Sentiment
The overall sentiment of the Hacker News discussion is highly critical, cynical, and frustrated regarding AWS's reliability, particularly in the US-EAST-1 region. Users express deep skepticism about AWS's official status updates and perceived efforts to downplay the severity of the outage. There is a strong undercurrent of resignation to US-EAST-1's chronic unreliability, and exasperation that so many 'global' services critically depend on it. While acknowledging the difficulty of cloud operations, many users criticize AWS's business practices, internal culture, and perceived decline in engineering excellence. The discussion reflects a significant lack of trust in AWS's claims of 'rock-solid reliability' and a widespread belief that the incident exposes fundamental flaws in cloud centralization.
In Agreement
- Commenters confirmed the incident's widespread impact across many AWS services, aligning with AWS's own list of affected services and with user reports of API errors and increased latency.
- The initial identification of DNS resolution issues affecting DynamoDB, and later of the internal NLB health-monitoring subsystem, was largely accepted as the technical root cause, even if its broader implications were debated.
- AWS's throttling of new EC2 instance launches and its adjustment of Lambda SQS polling rates were visible to users as persistent failures to launch new instances and as function invocation errors.
- Many users corroborated the advice to flush DNS caches if DynamoDB resolution problems persisted, indicating a shared understanding of the technical issue (a quick verification sketch follows this list).
- The US-EAST-1 region was widely acknowledged as frequently experiencing outages, supporting the article's context of an incident occurring in this specific region.
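To make the DNS advice above actionable, the sketch below resolves the DynamoDB regional endpoint so you can check whether a local resolver or cache is still returning failures after a flush. This is a quick diagnostic sketch, not AWS guidance; the endpoint name is the standard public one for US-EAST-1.

```python
# Minimal diagnostic sketch: check whether the DynamoDB regional endpoint
# resolves locally (useful before/after flushing a stale DNS cache).
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    addrs = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    ips = sorted({a[4][0] for a in addrs})
    print(f"{ENDPOINT} resolves to: {', '.join(ips)}")
except socket.gaierror as exc:
    # Resolution failure: either the upstream issue persists or a local
    # resolver/cache is still serving the bad result.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```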
Opposed
- Many users strongly disagreed with AWS's characterization of the event as mere 'degradation' or 'recovering,' calling the status messaging a 'total lie' and asserting that, for their operations, it was a 'full outage' lasting much longer than AWS indicated.
- The discussion highlighted AWS's perceived failure to uphold its reliability promises and SLAs, particularly for new instance launches, and suggested that AWS often 'weasels out' of formal downtime classifications.
- A central opposing view was that US-EAST-1 acts as a critical single point of failure (SPOF) for many AWS 'global' services (e.g., IAM, STS, Route53, CloudFront, billing/support systems), which contradicts the promised regional independence and impacts services even outside US-EAST-1.
- Users criticized AWS's internal architecture, suggesting tech debt, cultural shifts (e.g., focus on sales over engineering, AI push, layoffs), and insufficient investment in core reliability were contributing factors to persistent outages.
- AWS's frequent recommendation to build multi-region or multi-cloud resilience was seen as disingenuous given its high cost and complexity, and given that AWS itself allegedly fails to fully implement such redundancy for its own foundational services (a minimal failover sketch follows this list).
- The notion that customers are 'off the hook' if AWS goes down (because 'everyone else is down too') was challenged by some, who argued for greater corporate responsibility in ensuring uptime for critical services, even if it costs more.
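To illustrate what client-level multi-region resilience of the kind discussed above can look like, the sketch below tries a DynamoDB read in US-EAST-1 and falls back to a replica region on connection or API errors. It is a minimal sketch under stated assumptions: a hypothetical table `orders` replicated via DynamoDB global tables to `us-west-2`. Real failover must also handle writes, consistency, and control-plane dependencies, which is part of the cost and complexity commenters object to.

```python
# Minimal sketch (hypothetical table/regions): read from us-east-1, fall back
# to a global-table replica in another region if the primary is unreachable.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # assumes a global table in both regions
TABLE = "orders"                      # hypothetical table name


def get_item_with_fallback(key):
    last_error = None
    for region in REGIONS:
        try:
            table = boto3.resource("dynamodb", region_name=region).Table(TABLE)
            return table.get_item(Key=key).get("Item")
        except (EndpointConnectionError, ClientError) as exc:
            # Primary region unreachable or erroring; try the next replica.
            last_error = exc
    raise last_error


if __name__ == "__main__":
    print(get_item_with_fallback({"order_id": "1234"}))
```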