Discussion Forums



Thread: unreachable instance

Replies: 80 - Last Post: Apr 8, 2008 12:38 AM by: Mr Ross Cooney
Thorsten von Eicken
RealName(TM)


Posts: 633
Registered: 3/24/06
Re: unreachable instance
Posted: Apr 7, 2008 9:40 AM PDT   in response to: Jorge Oliveira

I assume it's going to take some time for the EC2 and networking teams to go through the internal root cause analysis, but we're obviously very curious about how the different availability zones fared / helped during the outage.
Thorsten - http://www.rightscale.com


C. Oliver
RealName(TM)


Posts: 36
Registered: 4/22/07
Re: unreachable instance
Posted: Apr 7, 2008 11:26 AM PDT   in response to: Thorsten von Ei...

It seemed to affect all zones. I lost everything at the time. I keep 5 EC2 instances up at all times, in a mix of zones, and was unable to contact them for 15 minutes to an hour for some.

imcaws

Posts: 7
Registered: 4/7/08
Re: unreachable instance
Posted: Apr 7, 2008 11:32 AM PDT   in response to: C. Oliver

Out of 17 active nodes we only had 4 become unreachable, so I think the issue was localized.


Kathrin@AWS

Posts: 163
Registered: 2/8/06
Re: unreachable instance
Posted: Apr 7, 2008 7:15 PM PDT   in response to: Kathrin@AWS

We want to provide you with some additional data on last night's EC2 connectivity issue.

This morning, at 1:46AM PDT, we began a maintenance change with one of our redundant internet access points.   Under normal circumstances, our network routing protocols automatically shift traffic away from these internet access points until a change is completed.   In this case, a latent, incorrect configuration caused affected instances to route outbound traffic to the degraded internet access point.   As a result, the outbound internet traffic routed via this internet access point was not successfully forwarded to the internet.  

Our monitoring correctly detected the initial loss of connectivity and the engineering teams were fully engaged within minutes.   Because this unidirectional failure was not an anticipated failure mode of our network topology, our monitoring and debugging did not help us identify the problem quickly.   The problem was correctly identified at 3:18AM PDT and traffic was forced away from the internet access point at 3:21AM PDT.   At this point, affected instances fully recovered.

Unlike previous EC2 networking issues, this issue affected instances in multiple Availability Zones.   It's worth reiterating that our Availability Zones are engineered to fail independently.   For example, EC2 Availability Zones do not share power transformers, generators or common cooling.   Each Availability Zone is physically separated from other Availability Zones to prevent correlated failure from events like fires, floods or physical damage to a datacenter.   Additionally, each Availability Zone has physically and logically redundant connections to multiple internet access points, and utilizes routing protocols to independently choose which internet access point to use.   This helps ensure high availability for each individual Availability Zone.   Today's event affected multiple Availability Zones because more than one Availability Zone independently routed some traffic to the faulty internet access point.

As with any operational issue, our internal post mortem process is helping us identify numerous ways that we can prevent this sort of issue in the future, and more generally, reduce our recovery time when the unexpected does happen.   In addition to correcting the immediate routing protocol issue, we have identified a failsafe that will help assure automatic failover of Availability Zones from degraded internet access points.   We have also identified a series of improvements to our internal networking monitoring that will help us more quickly isolate the cause of networking issues in this part of our infrastructure.

Sincerely,
The Amazon Web Services Team



Thorsten von Eicken
RealName(TM)


Posts: 633
Registered: 3/24/06
Re: unreachable instance
Posted: Apr 7, 2008 8:50 PM PDT   in response to: Kathrin@AWS

Ouch, and this happens just a few days after exposing availability zones, how humiliating. It proves again that anything that can go wrong will do so at the most inopportune time.
On the positive side, communication about the outage was very good and the post mortem was prompt and to the point. Proof that you're listening and improving. We still hear comments about the lack of an SLA. All I can say is that for me the best SLA is a track record and demonstrated commitment and ability to fix issues promptly and to eradicate root causes. So far you've done a very good job at that. Thank you and keep it up!
Thorsten - CTO, RightScale


Mr Ross Cooney

Posts: 8
Registered: 11/29/07
Re: unreachable instance
Posted: Apr 8, 2008 12:38 AM PDT   in response to: Kathrin@AWS

Thanks for the concise and detailed explanation, Kathrin.

