AWS US-EAST-1 Incident: NLB Health Subsystem and DNS Issues, Recovery Underway
AWS US-EAST-1 experienced broad service disruptions due to DNS resolution issues (most notably affecting DynamoDB) and a failure in an internal subsystem that monitors Network Load Balancer (NLB) health. AWS stabilized the region by throttling new EC2 instance launches, adjusting Lambda's SQS polling, and working through event backlogs, leading to steady recovery across services. As of the latest update, Lambda has fully recovered, EC2 launch throttles are being relaxed, and the remaining queued events are expected to be processed soon.
Key Points
- Root causes included DNS resolution failures for DynamoDB and an issue in the internal subsystem that monitors NLB health, driving widespread connectivity and API errors in US-EAST-1.
- AWS throttled new EC2 instance launches to stabilize the region, impacting dependent services like ECS, RDS, and Glue.
- Lambda experienced invocation errors; AWS temporarily slowed SQS Event Source Mapping polling to reduce load, then restored it to normal as the service recovered.
- Backlogs accumulated in services such as CloudTrail, EventBridge, and SQS (consumed via Lambda), and were drained as recovery advanced.
- By early afternoon, most services showed significant recovery; Lambda was fully recovered, EC2 launch throttles were being reduced, and full normalization was ongoing.
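One practical lesson from the DynamoDB DNS failures above is that clients should tolerate transient resolution errors rather than fail immediately. The sketch below shows a generic retry with exponential backoff and full jitter around a lookup; `resolver` and `flaky_resolver` are hypothetical stand-ins (a real client would wrap `socket.gethostbyname` or rely on the SDK's built-in retries), not anything AWS described doing.

```python
import random
import time


def resolve_with_backoff(resolver, hostname, max_attempts=5, base_delay=0.5):
    """Retry a DNS lookup with exponential backoff and full jitter.

    `resolver` is any callable that returns an address for `hostname`
    or raises OSError on failure (a hypothetical stand-in for a real
    resolver such as socket.gethostbyname).
    """
    for attempt in range(max_attempts):
        try:
            return resolver(hostname)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Full jitter keeps a fleet of clients from retrying in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


# Demo: a fake resolver that fails twice, then succeeds.
calls = {"n": 0}

def flaky_resolver(hostname):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary DNS failure")
    return "198.51.100.7"

print(resolve_with_backoff(flaky_resolver, "dynamodb.us-east-1.amazonaws.com",
                           base_delay=0.01))
```

Full jitter (a random delay in `[0, cap]`) is generally preferred over fixed backoff here, since synchronized retries from many clients can themselves prolong an outage.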
Sentiment
The discussion is overwhelmingly critical of AWS, particularly regarding us-east-1's reliability history, AWS's communication during the incident, and the vendor lock-in that made failover impossible for many companies. However, there's a pragmatic undercurrent acknowledging that for most companies, AWS remains the practical choice despite incidents like this.
In Agreement
- us-east-1 is the most unreliable AWS region and this incident demonstrates the urgent need to either avoid it or have robust failover plans
- AWS's communication during the outage downplayed severity, describing it as "degradation" while many users experienced complete outages for an entire business day
- The incident exposes dangerous circular dependencies in AWS architecture where DNS depends on DynamoDB and auth depends on the systems being authenticated
- Companies need to regularly test failover procedures — untested failover plans are functionally useless
- The cascading impact across dozens of dependent services demonstrates the fragility of modern cloud-dependent infrastructure
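The point about untested failover plans can be made concrete: a drill should simulate the primary being down and verify that traffic actually shifts. The sketch below is a minimal illustration with stub health checks; `pick_endpoint`, `run_failover_drill`, and the example hostnames are all hypothetical, not part of any AWS tooling.

```python
def pick_endpoint(check_health, primary, secondary):
    """Return the first healthy endpoint, preferring the primary.

    `check_health` is a hypothetical callable: endpoint -> bool.
    """
    if check_health(primary):
        return primary
    if check_health(secondary):
        return secondary
    raise RuntimeError("no healthy endpoint")


def run_failover_drill(check_health, primary, secondary):
    """Simulate a primary outage and verify traffic actually shifts.

    A failover path that is only exercised during a real incident is
    effectively untested; a drill like this can run on a schedule.
    """
    # Force the primary to look unhealthy, keep real checks for the rest.
    degraded = lambda ep: ep != primary and check_health(ep)
    return pick_endpoint(degraded, primary, secondary) == secondary


# Example with stub health checks standing in for real probes.
healthy = {"api.us-east-1.example.com": True, "api.us-west-2.example.com": True}
check = lambda ep: healthy.get(ep, False)

assert run_failover_drill(check, "api.us-east-1.example.com",
                          "api.us-west-2.example.com")
```

A real drill would also exercise data-layer replication and DNS TTLs, which is where untested plans most often break down.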
Opposed
- Other AWS regions were completely unaffected, suggesting the problem is specific to us-east-1, not AWS reliability in general
- Multi-cloud doesn't solve the problem and adds complexity — multi-region within AWS is sufficient
- Despite occasional outages, AWS is still more reliable than most organizations could achieve with self-hosted infrastructure
- Being on the same provider as everyone else is actually advantageous — shared outages are more forgivable than unique ones
- Running in us-east-1 is rational because it's cheapest, has the most features, and when it goes down everything goes down anyway