Discussion Forums



Thread: Unresponsive EC2 instance

This question is answered. Helpful answers available: 1. Correct answers available: 1.



Replies: 36 - Last Post: Jun 11, 2009 6:29 PM by: JoeJ@AWS
Luke@AWS

Posts: 84
Registered: 3/26/08
Re: Unresponsive EC2 instance
Posted: Jun 11, 2009 2:11 AM PDT   in response to: pjanakiraman
Helpful

This issue has been resolved, and power and connectivity have been restored to the affected instances. We have confirmed that all instances listed in this thread have been restored, with the exception of one instance that was terminated by the user and a second instance that was running on a degraded host. The owner of the latter instance was contacted via email. The rest of the instances lost power and experienced a reboot. Recovered instances have both their EBS and instance stores. If you have any remaining issues or questions about your instances, please feel free to contact us on this thread.


Shlomo Swidler

Posts: 2,112
Registered: 7/10/08
Re: Unresponsive EC2 instance
Posted: Jun 11, 2009 5:05 AM PDT   in response to: rtdev
 

The /mnt disk should remain intact across reboots. Perhaps it *used to* be wiped, but no longer.


rtdev

Posts: 147
Registered: 3/28/08
Re: Unresponsive EC2 instance
Posted: Jun 11, 2009 6:06 AM PDT   in response to: Luke@AWS
 

> Recovered instances have both their EBS and instance stores.

Hi Luke - in our case we lost one of our five EBS volumes when our instance was rebooted during the recovery operation performed by Amazon.  I have an open item that Danny is researching here: http://developer.amazonwebservices.com/connect/thread.jspa?messageID=131633&#131633

Thank you.


rtdev

Posts: 147
Registered: 3/28/08
Re: Unresponsive EC2 instance
Posted: Jun 11, 2009 6:07 AM PDT   in response to: pjanakiraman
 

Luke and all - It is disappointing to see a failure like this, but I think I speak for many when I say it is great to see how responsive the Amazon team was in resolving this.

That being said - I was under the impression that your architecture had more resiliency built into it. Yes we can use multiple availability zones to help with a single point of failure, but I thought that even within a single availability zone there was not a single point of failure for hardware / power.  Can you please let us know if this will be addressed to prevent such issues in the future?

Also, can you elaborate on whether the instances themselves lost power or if it was just the equipment that connects the instances to the net?  It was unclear from the Dashboard update.  If it was the instances themselves that lost power, then I am surprised they could be recovered with /mnt still intact (I thought that /mnt and all non-persistent parts of an image would be lost should a box lose power, no?)

Also, special thanks to your team for the frequent updates on the dashboard. In the future it would be helpful if a rough ETA could be provided - even messages like "it could be several hours" or "it's possible things will be restored within a couple of hours" would help us to plan.  Lastly, I don't know if you've thought of this, but it would be great if Amazon used Twitter to keep us posted.  Obviously when everyone is hard at work to bring a system up, the last thing they want is to take time away from fixing the issue to talk about it, but some occasional quick blurbs like "the tech just arrived - replacing bad parts" would be incredible updates to have during an outage.

Thanks again.





filife

Posts: 71
Registered: 7/17/08
Re: Unresponsive EC2 instance
Posted: Jun 11, 2009 7:27 AM PDT   in response to: Luke@AWS
 

Luke,

Sorry to be a pain, but you guys still haven't answered my question with regard to PDU redundancy.  I'm going to be doing a post-mortem on this and need to explain to my management team why our "cloud" was taken out by "lightning".

Thanks

Chris


C. Oliver
RealName(TM)


Posts: 36
Registered: 4/22/07
Re: Unresponsive EC2 instance
Posted: Jun 11, 2009 10:47 AM PDT   in response to: pjanakiraman
 

I'd like to hear more about what happened as well. How does a single PDU have redundancy? What was it doing controlling 2+ racks? This is the only "support" I have; I don't pay extra for Amazon's "support", but thanks for calling me and asking me to sign up for it.

This forum and the stat page are all we get for our $2k a month Amazon hosting.

Message was edited by: C. Oliver

JoeJ@AWS

Posts: 158
Registered: 5/30/08
Re: Unresponsive EC2 instance
Posted: Jun 11, 2009 6:29 PM PDT   in response to: filife
 


We wanted to provide some additional clarity in response to the questions around the power loss that affected some instances yesterday.  We have redundant power in our datacenters, which performed as designed when we lost power in the lightning storm, except for a single component powering the affected hosts.  Within each of our Availability Zones there are many Power Distribution Units ("PDUs") that independently provide redundant power to subsets of hosts in that zone.  The PDUs use Uninterruptible Power Supplies ("UPSs") to smoothly switch between utility and backup power without interruption.  When this Availability Zone lost utility power during yesterday's storm, our power failover worked as expected, and all but one of the PDUs successfully switched to backup power with no impact.  However, one UPS on one PDU sustained damage during the storm, which prevented it from properly switching power sources; thus the PDU to which it was attached lost power.  Because of the nature of the failure and the design of the particular UPS, we were unable to manually switch power over immediately.  This delayed our ability to bring power back to the affected hosts and to reboot the affected instances.  While we have not seen this failure mode on this model of UPS before, we will be inspecting all similar devices to ensure other UPSs cannot malfunction in the same manner.



