It all began at 20:11 PDT, when the AWS status page announced that the platform was suffering from degraded performance in a single availability zone of its US-EAST-1 region.

“Existing EC2 instances within the affected availability zone that use EBS volumes may also experience impairment due to stuck IO to the attached EBS volume(s),” a notice said 30 minutes later. “Newly launched EC2 instances within the affected availability zone may fail to launch due to the degraded volume performance.”

“We continue to make progress in determining the root cause of the issue causing degraded performance for some EBS volumes in a single availability zone (USE1-AZ2) in the US-EAST-1 region. We have made several changes to address the increased resource contention within the subsystem responsible for coordinating storage hosts with the EBS service,” the notice at 22:16 PDT said. “While these changes have led to some improvement, we have not yet seen full recovery for the affected EBS volumes.”

After a further 25 minutes, AWS said its mitigation had worked, that it was in the process of deploying it fully, and that EBS volumes should return to normal within the next hour.

In the final report, at 4:21 AM PDT, AWS reported: “The issue was caused by increased resource contention within the EBS subsystem responsible for coordinating EBS storage hosts. Engineering worked to identify the root cause and resolve the issue within the affected subsystem. At 11:20 PM PDT, after deploying an update to the affected subsystem, IO performance for the affected EBS volumes began to return to normal levels. By 12:05 AM on September 27th, IO performance for the vast majority of affected EBS volumes in the USE1-AZ2 Availability Zone were operating normally. However, starting at 12:12 AM PDT, we saw recovery slow down for a smaller set of affected EBS volumes as well as seeing degraded performance for a small number of additional volumes in the USE1-AZ2 Availability Zone.”

AWS continued: “Engineering investigated the root cause and put in place mitigations to restore performance for the smaller set of remaining affected EBS volumes. These mitigations slowly improved the performance for the remaining smaller set of affected EBS volumes, with full operations restored by 3:45 AM PDT. While almost all of EBS volumes have fully recovered, we continue to work on recovering a remaining small set of EBS volumes. We will communicate the recovery status of these volumes via the Personal Health Dashboard. While the majority of affected services have fully recovered, we continue to recover some services, including RDS databases and Elasticache clusters. We will also communicate the recovery status of these services via the Personal Health Dashboard.”

While AWS was experiencing issues, other sites were also hit with performance problems. “Hold tight, folks! Signal is currently down, due to a hosting outage affecting parts of our service. We’re working on bringing it back up,” the messaging service tweeted. Nest said its users had trouble logging in, but the situation was resolved. At the time of writing, Xero said it was suffering from slowness.

To sum up, as Thaddeus E. Grugq snarkily tweeted, “The internet was designed to survive nuclear wars, not AWS going down.”

Update at 10 AM EDT, 27 September: Added further status update.