We want to provide you with some additional data on last night's EC2 connectivity issue.
This morning, at 1:46AM PDT, we began a maintenance change with one of our redundant internet access points.
Under normal circumstances, our network routing protocols automatically shift traffic away from these internet access points until a change is completed.
In this case, a latent, incorrect configuration caused affected instances to route outbound traffic to the degraded internet access point.
As a result, the outbound internet traffic routed via this internet access point was not successfully forwarded to the internet.
Our monitoring correctly detected the initial loss of connectivity and the engineering teams were fully engaged within minutes.
Because this unidirectional failure was not an anticipated failure mode of our network topology, our monitoring and debugging did not help us identify the problem quickly.
The problem was correctly identified at 3:18AM PDT and traffic was forced away from the internet access point at 3:21AM PDT.
At this point, affected instances fully recovered.
Unlike previous EC2 networking issues, this issue affected instances in multiple Availability Zones.
It's worth reiterating that our Availability Zones are engineered to fail independently.
For example, EC2 Availability Zones do not share power transformers, generators or common cooling.
Each Availability Zone is physically separated from other Availability Zones to prevent correlated failure from events like fires, floods or physical damage to a datacenter.
Additionally, each Availability Zone has physically and logically redundant connections to multiple internet access points, and utilizes routing protocols to independently choose which internet access point to use.
This helps ensure high availability for each individual Availability Zone.
Today's event affected multiple Availability Zones because more than one Availability Zone independently routed some traffic to the faulty internet access point.
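As a rough illustration only, and not a description of the actual routing systems involved, the following Python sketch models how independent route selection can still produce a correlated failure. Every name in it (AccessPoint, Zone, honors_maintenance, ap-1, zone-a and so on) is hypothetical. Each zone chooses its own access point, but any zone carrying the same latent misconfiguration keeps selecting the degraded one.

```python
from dataclasses import dataclass


@dataclass
class AccessPoint:
    """Hypothetical internet access point."""
    name: str
    in_maintenance: bool = False  # True while the point is degraded


@dataclass
class Zone:
    """Hypothetical Availability Zone that picks its own access point."""
    name: str
    preferences: list  # access point names, most preferred first
    # False models a latent misconfiguration: the zone ignores the
    # maintenance signal and keeps using its preferred access point.
    honors_maintenance: bool = True

    def choose(self, points):
        """Return the name of the first usable access point, or None."""
        for ap_name in self.preferences:
            ap = points[ap_name]
            if ap.in_maintenance and self.honors_maintenance:
                continue  # correctly shift traffic away
            return ap_name
        return None


def affected_zones(zones, points):
    """List zones whose chosen access point is currently degraded."""
    affected = []
    for zone in zones:
        chosen = zone.choose(points)
        if chosen is not None and points[chosen].in_maintenance:
            affected.append(zone.name)
    return affected


# Assumed example data: one access point under maintenance, three zones.
points = {
    "ap-1": AccessPoint("ap-1", in_maintenance=True),
    "ap-2": AccessPoint("ap-2"),
}
zones = [
    Zone("zone-a", ["ap-1", "ap-2"], honors_maintenance=False),
    Zone("zone-b", ["ap-1", "ap-2"], honors_maintenance=False),
    Zone("zone-c", ["ap-1", "ap-2"]),
]

print(affected_zones(zones, points))
```

In this toy model, zone-c shifts to ap-2 as intended, while zone-a and zone-b each independently continue to send traffic to ap-1. The zones share no infrastructure, yet the same configuration fault produces the same outcome in more than one of them.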
As with any operational issue, our internal post mortem process is helping us identify numerous ways that we can prevent this sort of issue in the future, and more generally, reduce our recovery time when the unexpected does happen.
In addition to correcting the immediate routing protocol issue, we have identified a failsafe that will help ensure automatic failover of Availability Zones away from degraded internet access points.
We have also identified a series of improvements to our internal networking monitoring that will help us more quickly isolate the cause of networking issues in this part of our infrastructure.
Sincerely,
The Amazon Web Services Team