|
Discussion Forums
|
Thread: Unresponsive EC2 instance
 |
This question is answered.
Helpful answers available: 1.
Correct answers available: 1.
|
|
|
Posts:
84
Registered:
3/26/08
|
|
|
|
Re: Unresponsive EC2 instance
Posted:
Jun 11, 2009 2:11 AM PDT
in response to: pjanakiraman
|
 |
Helpful |
|
|
This issue has been resolved, and power and connectivity have been restored to the affected instances. We have confirmed that all instances listed in this thread have been restored, with the exception of one instance that was terminated by the user and a second instance that was running on a degraded host. The owner of the latter instance was contacted via email. The rest of the instances lost power and experienced a reboot. Recovered instances have both their EBS and instance stores. If you have any remaining issues or questions about your instances, please feel free to contact us on this thread.
|
|
Posts:
2,112
Registered:
7/10/08
|
|
|
|
Re: Unresponsive EC2 instance
Posted:
Jun 11, 2009 5:05 AM PDT
in response to: rtdev
|
|
|
The /mnt disk should remain intact across reboots. Perhaps it *used to* be wiped, but no longer.
|
|
Posts:
147
Registered:
3/28/08
|
|
|
|
Re: Unresponsive EC2 instance
Posted:
Jun 11, 2009 6:07 AM PDT
in response to: pjanakiraman
|
|
|
Luke and all - It is disappointing to see a failure like this, but I think I speak for many when I say it is great to see how responsive the Amazon team was at resolving this.
That being said - I was under the impression that your architecture had more resiliency built into it. Yes we can use multiple availability zones to help with a single point of failure, but I thought that even within a single availability zone there was not a single point of failure for hardware / power. Can you please let us know if this will be addressed to prevent such issues in the future?
Also, can you elaborate on whether the instances themselves lost power or if it was just the equipment that connects the instances to the net? It was unclear from the Dashboard update. If it was the instances themselves that lost power, then I am surprised they could be recovered with /mnt still intact (I thought that /mnt and all non-persistent parts of an image would be lost should a box lose power, no?)
Also, special thanks to your team for the frequent updates on the dashboard. In the future it would be helpful if a rough ETA could be provided - for instance, even an "it could be several hours" or "it's possible things will be restored within a couple of hours" type of message would help us plan. Lastly, I don't know if you've thought of this, but it would be great if Amazon used Twitter to keep us posted. Obviously when everyone is hard at work to bring a system up, the last thing they want is to take time away from fixing the issue to talk about it, but some occasional quick blurbs like "the tech just arrived - replacing bad parts" would be incredible updates to have during an outage.
Thanks again.
|
|
Posts:
71
Registered:
7/17/08
|
|
|
|
Re: Unresponsive EC2 instance
Posted:
Jun 11, 2009 7:27 AM PDT
in response to: Luke@AWS
|
|
|
Luke,
Sorry to be a pain, but you guys still haven't answered my question regarding PDU redundancy. I'm going to be doing a post-mortem on this and need to explain to my management team why our "cloud" was taken out by "lightning".
Thanks
Chris
|
|
Posts:
36
Registered:
4/22/07
|
|
|
|
Re: Unresponsive EC2 instance
Posted:
Jun 11, 2009 10:47 AM PDT
in response to: pjanakiraman
|
|
|
I'd like to hear more about what happened as well. How does a single PDU have redundancy? What was it doing controlling 2+ racks? This is the only "support" I have; I don't pay extra for Amazon's "support", but thanks for calling me and asking me to sign up for it.
This forum and the status page are all we get for our $2k a month of Amazon hosting.
Message was edited by: C. Oliver
|
|
Posts:
158
Registered:
5/30/08
|
|
|
|
Re: Unresponsive EC2 instance
Posted:
Jun 11, 2009 6:29 PM PDT
in response to: filife
|
|
|
We wanted to provide some additional clarity in response to the questions around the power loss that affected some instances yesterday. We have redundant power in our datacenters, which performed as designed when we lost power in the lightning storm, except for a single component powering the affected hosts. Within each of our Availability Zones there are many Power Distribution Units ("PDUs") that independently provide redundant power to subsets of hosts in that zone. The PDUs use Uninterruptible Power Supplies ("UPSs") to smoothly switch between utility and backup power without interruption. When this Availability Zone lost utility power during yesterday's storm, our power failover worked as expected, and all but one of the PDUs successfully switched to backup power with no impact. However, one UPS on one PDU sustained damage during the storm, which prevented it from properly switching power sources; thus the PDU to which it was attached lost power. Because of the nature of the failure and the design of the particular UPS, we were unable to manually switch power over immediately. This delayed our ability to bring power back to the affected hosts and to reboot the affected instances. While we have not seen this failure mode on this model of UPS before, we will be inspecting all similar devices to ensure that other UPSs cannot malfunction in the same manner.
|
|
|
|