AWS Outages

This is a list of the most critical AWS outages that have happened over the years since the cloud provider was created. As it turns out, even though AWS has existed since 2006, it's not a long list! Let's review it together...
July 20, 2008: Amazon S3 Availability Event (global)
This event affected Amazon S3 availability soon after the European regions were released. The root cause was an invalid state replicated between servers (a single corrupted bit), which severely degraded the gossip protocol and prevented users' requests from completing.
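The lesson from a single flipped bit poisoning the whole gossip fleet is to validate every message before acting on it. Below is a minimal, hypothetical sketch (not AWS's actual implementation) of checksumming gossip state with CRC32 so a receiver drops corrupted state instead of propagating it:

```python
import zlib

def encode_state(payload: bytes) -> bytes:
    """Prepend a CRC32 checksum so receivers can detect bit corruption."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def decode_state(message: bytes):
    """Return the payload only if the checksum matches; otherwise reject it."""
    checksum, payload = int.from_bytes(message[:4], "big"), message[4:]
    if zlib.crc32(payload) != checksum:
        return None  # drop corrupted state instead of gossiping it onward
    return payload

msg = encode_state(b"server-state-v1")
assert decode_state(msg) == b"server-state-v1"
# flip a single bit in the payload, as in the 2008 event
corrupted = msg[:4] + bytes([msg[4] ^ 0x01]) + msg[5:]
assert decode_state(corrupted) is None
```

AWS's own corrective action after this event was, in fact, to add checksums to detect corruption of system-state messages.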
April 29, 2011: Amazon EC2 and Amazon RDS Service Disruption in the US East Region (us-east-1)
The root cause of this outage in a single Availability Zone of the infamous us-east-1 was an Amazon EBS failure that blocked read and write operations and created "stuck" volumes. Yikes, luckily they unstuck them quickly! 😉
June 4, 2016: AWS Service Event in the Sydney Region (ap-southeast-2)
The service disruption primarily affected EC2 instances and their associated Amazon EBS volumes running in a single Availability Zone. Root cause? Loss of power. Sounds typical? Maybe, but it was a loss of power at a regional substation due to severe weather in the area. In one of the facilities, AWS power redundancy didn't work as designed, and power was lost to a significant number of instances in that Availability Zone. Years later, during AWS re:Invent, James Hamilton would refer to that situation as the trigger for designing novel, more robust power redundancy mechanisms in AWS data centers.
June 29 to July 2, 2012: AWS Service Event in the US East Region (us-east-1)
This event was triggered by a large-scale electrical storm that swept through the Northern Virginia area. That wouldn't have been a problem if it hadn't coincided with failures of generators and electrical switching equipment in the data centers (all the same brand, installed in 2010 and 2011).
October 22, 2012: AWS Service Event in the US-East Region (us-east-1)
Another failure of the infamous Amazon EBS in the infamous us-east-1 region! The root cause was a latent bug in an operational data collection agent that runs on the EBS storage servers. Each EBS storage server has an agent that contacts a set of data collection servers and reports information used for fleet maintenance. The data collected this way is important, but the collection is not time-sensitive, and the system is designed to tolerate late or missing data. A week before the event, one of the data collection servers in the affected Availability Zone had a hardware failure and was replaced. As part of replacing that server, a DNS record was updated to remove the failed server and add the replacement. Unnoticed at the time, the DNS update did not propagate to all of the internal DNS servers, so a fraction of the storage servers never got the updated address and kept trying to contact the failed data collection server. Because the collection system tolerates missing data, this caused no immediate issues and set off no alarms. However, the inability to reach a data collection server triggered a latent memory leak in the reporting agent: rather than dealing gracefully with the failed connection, the agent kept retrying in a way that slowly consumed system memory. AWS monitors aggregate memory consumption on each EBS server, but the monitoring failed to alarm on this leak. Now you get the idea of what happened next!
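The leak here came from retry state accumulating without bound while the collector was unreachable. A minimal sketch (hypothetical, not AWS's code) of the defensive pattern: since the data is explicitly non-time-sensitive, buffer it in a hard-capped queue and let old records fall off rather than let memory grow:

```python
from collections import deque

class ReportingAgent:
    """Hypothetical sketch: buffer fleet-maintenance reports with a hard cap,
    so an unreachable collector cannot grow memory without bound."""

    def __init__(self, max_buffered: int = 1000):
        # deque with maxlen silently evicts the oldest entry when full
        self.buffer = deque(maxlen=max_buffered)

    def report(self, record: dict) -> None:
        # Losing old records is acceptable (collection tolerates missing
        # data); leaking memory on every failed connection is not.
        self.buffer.append(record)

agent = ReportingAgent(max_buffered=3)
for i in range(10):
    agent.report({"seq": i})
assert len(agent.buffer) == 3        # capped at 3, not 10
assert agent.buffer[0]["seq"] == 7   # oldest surviving record
```

The design choice mirrors the stated property of the real system: tolerance for missing data should extend to the client's own buffering, not just the server side.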
December 24, 2012: Amazon ELB Service Event in the US-East Region (us-east-1)
This had to be a sad Christmas for some customers! According to the post-event summary: "While the service disruption only affected applications using the ELB service (and only a fraction of the ELB load balancers were affected), the impacted load balancers saw significant impact for a prolonged period of time." Root cause? An internal bug that surfaced after a specific ELB reconfiguration.
December 17, 2013: Summary of the December 17th Event in the South America Region (sa-east-1)
Power failures hit the AWS cloud really hard! This time: São Paulo. Root cause? A cascading failure of primary and backup generators. Instances in the second Availability Zone in the Region did not experience any power-related issues; however, instances in both Availability Zones did experience 20 minutes of degraded network connectivity, due to an error made while bringing the network back online once power was restored.
June 13, 2014: Amazon SimpleDB Service Disruption (us-east-1)
Due to a power outage that affected critical components of the Amazon SimpleDB distributed system architecture (the distributed lock engine), AWS observed a cascading failure caused by an increased number and duration of request timeouts. That resulted in an elevated number of 500 responses to end users.
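The cascade pattern here is classic: a slow dependency makes requests pile up in timeouts, which consumes capacity and slows everything else. A common mitigation (a hypothetical sketch, not what SimpleDB did) is to shed load early when a dependency is degraded, returning an error fast instead of queuing work that will time out anyway:

```python
def handle_request(lock_latency_ms: float, shed_threshold_ms: float = 500.0) -> int:
    """Hypothetical load-shedding sketch: when the lock service is degraded,
    fail fast with 503 rather than hold the request until it times out.
    Holding requests is what turns one slow dependency into a cascade."""
    if lock_latency_ms > shed_threshold_ms:
        return 503  # shed load early, freeing capacity for healthy traffic
    return 200      # dependency healthy: serve the request normally

assert handle_request(lock_latency_ms=40.0) == 200
assert handle_request(lock_latency_ms=900.0) == 503
```

The threshold value is illustrative; in practice it would be derived from the dependency's normal latency distribution.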
August 7, 2014: Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region (eu-west-1)
A utility power outage, caused by the failure of a 110kV 10MW transformer, resulted in a longer-than-usual recovery of the Amazon EBS storage nodes. Some nodes recovered correctly, but others stayed stuck, and some were left in an inconsistent state that required restoring data from recovery snapshots and consolidating volumes, which affected Amazon EC2 and Amazon RDS in that region.
September 20, 2015: Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region (us-east-1)
A brief network disruption impacted a portion of DynamoDB's storage servers. Usually this type of networking disruption is handled seamlessly, without any change in DynamoDB's performance: affected storage servers query the metadata service for their membership, process any updates, and reconfirm their availability to accept requests. But on that Sunday morning, a portion of the metadata service responses exceeded the retrieval and transmission time allowed by the storage servers. As a result, some storage servers could not obtain their membership data and removed themselves from taking requests. Why? The massive success of a new feature, Global Secondary Indexes (GSIs), combined with a lack of the capacity needed to handle reconciliation after a network failure. This outage also affected Amazon Route 53, Amazon SQS, EC2 Auto Scaling, Amazon CloudWatch, and other services.
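The failure mode hinged on a hard deadline for membership retrieval: answers that arrived too slowly counted as failures, so slow metadata responses made healthy servers take themselves out of service. A minimal, hypothetical sketch of that deadline check (names and the one-second budget are assumptions for illustration):

```python
import time

MEMBERSHIP_DEADLINE_S = 1.0  # assumed time budget for one membership fetch

def fetch_membership(fetch_fn):
    """Return membership data, or None if retrieval blew the deadline,
    in which case the storage server stops accepting requests."""
    start = time.monotonic()
    data = fetch_fn()
    if time.monotonic() - start > MEMBERSHIP_DEADLINE_S:
        return None  # too slow: treated the same as a failed fetch
    return data

fast = lambda: {"partitions": [1, 2, 3]}
slow = lambda: (time.sleep(1.2), {"partitions": [1, 2, 3]})[1]
assert fetch_membership(fast) == {"partitions": [1, 2, 3]}
assert fetch_membership(slow) is None  # correct answer, but past the deadline
```

Note the trap: the slow fetch returns a perfectly valid answer, yet the deadline turns it into an outage signal. AWS's fixes included increasing the metadata service's capacity and relaxing exactly this kind of time budget.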
June 4, 2016: Amazon EC2 Service Disruption in the Sydney Region (ap-southeast-2)
This was a big one in the Asia-Pacific region! Many major companies in Australia spent Sunday night scrambling after bad weather fried hardware in one of Amazon's Sydney data centres (a power outage), sending Amazon EC2 instances and Amazon EBS volumes in one of its Availability Zones offline and creating problems for other AWS services, including Amazon Elasticsearch and internal DNS. API call failures in the affected Availability Zone also meant that workloads hosted there were unable to fail over elsewhere, despite having multi-zone redundancy in place for exactly such events.
February 28, 2017: Amazon S3 Service Disruption in the Northern Virginia Region (us-east-1)
One of the biggest failures in the history of cloud computing! The number of affected users and services is hard to imagine. Root cause? A combination of an invalid parameter accepted by an internal tool, an operator's mistake, untested recovery procedures, outdated playbooks, and an unprecedented scale to operate at. The outage affected many services: Amazon CloudFront, Amazon EBS, Amazon EC2, EC2 Auto Scaling, Amazon CloudWatch, the Personal Health Dashboard, and even the main status page, which was hosted on Amazon S3. How did it start? The Amazon S3 team was debugging an issue that was causing the S3 billing system to progress more slowly than expected...
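"An invalid parameter accepted by an internal tool" is the key phrase: the capacity-removal command did what it was told, and it was told to remove far too much. A hypothetical sketch of the guardrail that prevents this class of mistake, rejecting any single step that removes more than a small fraction of the fleet (the 5% cap is an assumed value):

```python
def remove_capacity(fleet_size: int, to_remove: int,
                    max_fraction: float = 0.05) -> int:
    """Hypothetical guardrail for an ops tool: refuse a request that takes
    out more than max_fraction of the fleet in one step, so a fat-fingered
    argument fails loudly instead of executing."""
    if to_remove > fleet_size * max_fraction:
        raise ValueError(
            f"refusing to remove {to_remove} of {fleet_size} servers in one step"
        )
    return fleet_size - to_remove

assert remove_capacity(1000, 30) == 970  # small, sanctioned removal
try:
    remove_capacity(1000, 400)           # a 2017-style oversized input
    raise AssertionError("guardrail did not fire")
except ValueError:
    pass
```

This mirrors AWS's stated corrective action: the tool was changed to remove capacity more slowly and to block removals that would take any subsystem below its minimum required capacity.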
November 22, 2018: Amazon EC2 DNS Resolution Issues in the Seoul Region (ap-northeast-2)
Ugh, DNS issues are the worst! The root cause was a configuration update that incorrectly removed the setting specifying the minimum number of healthy hosts for the EC2 DNS resolver fleet in the AP-NORTHEAST-2 Region. The missing setting was interpreted as a very low default value, which resulted in fewer in-service healthy hosts. With the reduced healthy host capacity of the EC2 DNS resolver fleet, DNS queries from within EC2 instances began to fail.
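The dangerous part was the fallback behavior: a missing safety-critical key silently became "a very low default value". A minimal, hypothetical sketch of the safer pattern, where an absent setting falls back to a conservative value derived from the fleet size instead of a near-zero constant (the 80% floor is an assumption for illustration):

```python
def min_healthy_hosts(config: dict, fleet_size: int) -> int:
    """Hypothetical sketch: when the safety-critical setting is missing
    from a config update, fail safe instead of defaulting to almost zero."""
    value = config.get("min_healthy_hosts")
    if value is None:
        # The 2018 failure mode was a tiny implicit default. A conservative
        # fallback keeps most of the fleet in service until humans look.
        return max(1, int(fleet_size * 0.8))
    return value

assert min_healthy_hosts({"min_healthy_hosts": 50}, fleet_size=100) == 50
assert min_healthy_hosts({}, fleet_size=100) == 80  # safe fallback, not ~0
```

An alternative design rejects the config update outright when a required key is absent; either way, "key missing" should never quietly mean "almost no hosts required".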
August 24, 2019: Amazon EC2 and Amazon EBS Service Event in the Tokyo Region (ap-northeast-1)
What happened? A cooling failure caused a small percentage of EC2 servers in a single Availability Zone in Tokyo to shut down due to overheating. This resulted in impaired EC2 instances and degraded EBS volume performance for some resources in the affected part of the Availability Zone. A control system failure had caused multiple redundant cooling systems to fail in parts of that Availability Zone.
November 25, 2020: Amazon Kinesis Event in the Northern Virginia Region (us-east-1)
This had never happened before! It all started after a relatively small addition of capacity to the service. At first it looked like memory pressure; however, after narrowing down the root cause, it turned out not to be. Instead, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. The damage was severe, affecting Amazon ECS, Amazon EKS, Amazon CloudWatch (including Events), Amazon EventBridge, EC2 Auto Scaling, AWS Lambda, the Service Health Dashboard, and Amazon Cognito. The timing was unfortunate as well: it happened during Black Week, right before Black Friday. 😱
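Per AWS's post-event summary, each Kinesis front-end server keeps a thread for each of its peers, so every server's thread count grows with the fleet itself, and adding servers pushes the whole fleet toward the OS limit at once. A hypothetical pre-flight check sketching that arithmetic (the function, base thread count, and limit are illustrative assumptions):

```python
def fleet_growth_is_safe(current_fleet: int, new_servers: int,
                         os_thread_limit: int, base_threads: int = 200) -> bool:
    """Hypothetical sketch: with one peer thread per fleet member, a
    server's thread count is base_threads plus the fleet size. Verify the
    post-scaling count against the OS limit before adding capacity."""
    threads_per_server = base_threads + (current_fleet + new_servers)
    return threads_per_server <= os_thread_limit

# with assumed numbers, a "relatively small addition of capacity"
# tips every server in the fleet over the limit at the same time
assert fleet_growth_is_safe(3800, 0, os_thread_limit=4096)
assert not fleet_growth_is_safe(3800, 200, os_thread_limit=4096)
```

The broader lesson is that any per-peer resource (threads, file descriptors, connections) scales O(fleet size) on every member, so fleet growth needs the same capacity review as traffic growth.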
September 2, 2021: AWS Direct Connect Event in the Tokyo Region (ap-northeast-1)
An interesting case: due to the failure of a subset of network devices in one of the network layers, customers in the Tokyo region observed intermittent connectivity issues and elevated packet loss for traffic to their destinations. The failure occurred along the network path from AWS Direct Connect edge locations to the data center network in the Tokyo Region, where customers' Virtual Private Clouds (VPCs) reside. The real reason was hidden much deeper: engineers suspected the failure might be related to a new protocol introduced to optimize the network's reaction time to infrequent network convergence events and fiber cuts. That protocol had been introduced many months earlier and had run in production without any issues since, so it wasn't an obvious culprit; however, disabling it resolved the event. Afterwards, the engineering teams kept working to identify the underlying root cause. They confirmed that the event was caused by a latent defect in the network device operating system: the OS version in use enabled the new protocol meant to improve the network's failover time, and triggering the defect required a very specific set of packet attributes and contents. Really unlucky conditions!
September 26, 2021: Eight-hour service impairment caused by "Stuck IO" in Amazon Elastic Block Store (EBS) in the Northern Virginia Region (us-east-1)
This one is an example of a cascading failure: it started in one place and impaired several dependent services (such as Amazon RDS, Amazon ElastiCache, and Amazon Redshift clusters). On Sunday, September 26, 2021, Amazon EBS experienced degraded performance in one Availability Zone of the infamous us-east-1 region. An update described the issue as "Stuck IO" and warned that existing EC2 instances might "experience impairment" and that launching new EC2 instances could fail. This naturally affected the dependent services that rely on Amazon EBS. A friendly reminder that friends don't let friends use us-east-1, and that an Amazon EBS volume is a disk attached to your instances over the network, dependent on a single AZ.
December 7, 2021: AWS Networking Control Plane Outage in the Northern Virginia Region (us-east-1)
This outage is like the old haiku: "It's not DNS / There's no way it's DNS / It was DNS". At least partially: it was a failure of internal DNS and monitoring systems, caused by traffic congestion from unexpected behavior right after an automated scaling activity. The lack of monitoring did not help and severely impaired the debugging and fixing activities. To avoid such events in the future, AWS is preparing a more isolated environment for these internal networking components.
December 15, 2021: AWS Networking Control Plane Congestion in the US West Regions (us-west-1 and us-west-2)
At first glance, it looked like another full-on AWS outage, like the one we saw earlier that month. On the status page, we could see that at 07:48 PT (15:48 UTC) the us-west-2 region was experiencing connectivity problems, and similarly for us-west-1 at 07:52 PT (15:52 UTC). Within ten minutes, AWS engineers found the root cause of the loss of connectivity, introduced remediation, and we saw some recovery. The total outage lasted around 30 minutes and affected connectivity between the AWS backbone and external ISPs; connectivity within the regions was not affected by this event. Phew! 😅