Thread: EC2 API outage

Replies: 40 - Pages: 3 - Last Post: Oct 4, 2008 10:59 AM by: Colin Rhodes
Brian

Posts: 67
Registered: 7/25/07
Re: EC2 API outage
Posted: Sep 29, 2007 8:50 AM PDT   in response to: Attila@AWS

Priceless

Dan Kearns

Posts: 11
Registered: 1/25/07
Re: EC2 API outage
Posted: Sep 29, 2007 9:11 AM PDT   in response to: James@AWS

Hello,

My instance i-38cc2f51 disappeared this morning as well. Any hope for it?


Thorsten von Eicken
RealName(TM)


Posts: 633
Registered: 3/24/06
Outage follow-up
Posted: Sep 29, 2007 9:21 AM PDT   in response to: James@AWS

Could you guys post some details about the outage? Some of the questions I have are:

1. Were the instances that were lost during the outage localized, as in physically localized? Or was it a "random" set of instances across EC2? The reason for the question is failure isolation: if I had 2 servers in different "parts" of EC2, did I have a high chance of not losing both?

2. Were all EC2 API front-ends affected, or only some?

3. Did instances go down at the same time as the API front-ends died? Was there a window where an automatic system could have restarted instances before everything went down?

Thanks much,
Thorsten - www.rightscale.com


M. KOGLIN
RealName(TM)

Posts: 1
Registered: 1/12/07
Re: Outage follow-up
Posted: Sep 29, 2007 10:30 AM PDT   in response to: Thorsten von Ei...

One of my instances, i-38866651, has not been accessible this morning via HTTP, Ping or SSH. I am unable to bundle an image. I fear rebooting the instance. Was this also affected by the outage? Any hope that I can recover the data?

I'm also interested in hearing the answers to Thorsten's questions (above).

Regards,
Matt Koglin


enomaly

Posts: 444
Registered: 9/3/06
Re: EC2 API outage
Posted: Sep 29, 2007 10:40 AM PDT   in response to: Brian

To be blunt, this scares the hell out of me. What kind of redundancy does the current EC2 API have to prevent this from happening again? Does EC2 practice what it preaches and use SQS or some other queue service?

Reuven
http://www.enomalylabs.com


mediawonder

Posts: 47
Registered: 9/29/06
Re: EC2 API outage
Posted: Sep 29, 2007 10:50 AM PDT   in response to: Attila@AWS

Can you please check my instance i-e0bc5b89?
I cannot reach it by SSH.

I much appreciate all the help.
MW



Peter@AWS

Posts: 147
Registered: 6/20/06
Re: EC2 API outage
Posted: Sep 29, 2007 12:04 PM PDT   in response to: mediawonder


This is an update on the EC2 issues experienced today. A software deployment caused our management software to erroneously terminate a small number of users' instances. When our monitoring detected this issue, the EC2 management software and APIs were disabled to prevent further terminations. Once we corrected the problem, we restored the management software.

We will contact users that lost instances directly by email. At this point, the service is fully functional, and you should be able to launch replacement instances immediately.

While we have corrected the immediate bug, we are also adding checks to prevent this sort of issue from recurring in the future.

We are aware of the following outstanding issues which we are working to resolve now:
1/ Some instances may get stuck in the "shutting down" state until we have completed our clean-up.  These instances will not be billed and will be fully terminated shortly.
2/ Some instances will not show their launch indexes in the describe-instances API.
We will keep you posted as we resolve these remaining issues.

To address a few of the questions posed on this thread:

The availability of the EC2 APIs is very important and it remains our goal to keep them highly available.  We believe disabling the management software was the correct decision because of the risk to running instances.  This is not a decision we take lightly, and we will work to avoid having to make this choice in the future.

There was no correlation between the instance terminations, so users with redundancy built into their instance deployments would have been better able to deal with the terminations.  We also understand that failure isolation is very important and we are hard at work on additional functionality to help with this.

Please let us know if you experience any unexpected behavior.
The Amazon EC2 Team
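
The redundancy point above, together with the earlier question about whether an automatic system could have restarted instances, can be illustrated with a minimal sketch. It assumes a watchdog that compares a desired instance count against the instances currently reported as running and launches the difference. The names `plan_replacements` and `reconcile`, and the injected `list_running` and `launch` callables, are hypothetical and not part of any EC2 tool; a real version would wrap whatever API client is in use. Such a loop can only act while the EC2 APIs are reachable, which was not the case while the management software was disabled.

```python
from typing import Callable, Iterable, List


def plan_replacements(desired: int, running_ids: Iterable[str]) -> int:
    """Return how many instances must be launched to reach the desired count."""
    running = len(set(running_ids))  # de-duplicate IDs before counting
    return max(0, desired - running)


def reconcile(
    desired: int,
    list_running: Callable[[], Iterable[str]],  # hypothetical: returns running instance IDs
    launch: Callable[[], str],                  # hypothetical: launches one instance, returns its ID
) -> List[str]:
    """Launch replacements for any missing instances and return their IDs."""
    missing = plan_replacements(desired, list_running())
    return [launch() for _ in range(missing)]
```

Run periodically, `reconcile(2, list_running, launch)` would start one replacement if only one of two expected instances is still running, and do nothing if both are up.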




"shpadoinkal"

Posts: 3
Registered: 6/7/07
Re: EC2 API outage
Posted: Sep 29, 2007 12:28 PM PDT   in response to: Peter@AWS

We were not notified...


Thorsten von Eicken
RealName(TM)


Posts: 633
Registered: 3/24/06
Re: EC2 API outage
Posted: Sep 29, 2007 12:29 PM PDT   in response to: Peter@AWS

Thanks for the explanation, makes a lot of sense. Best wishes for a calm week-end!
    Thorsten



ameranthw

Posts: 1
Registered: 9/29/07
Re: EC2 API outage
Posted: Sep 29, 2007 12:54 PM PDT   in response to: James@AWS

Peter (and folks at AWS),

I had an instance running for the last couple of weeks or so. Since I started it so long ago, I have forgotten what the instance id was. It was running on 67.202.1.140. I am not sure if that will help you locate the instance id. I need to know what happened to this instance.

1. Is it still running somewhere and ec2-describe-instances isn't displaying it? Can you find the instance id and track down the time that it went down (assuming it is dead)?
2. Is all data lost? I imagine so, which is a real bummer as most of it is unrecoverable.
3. Is there any way for you to cause a hold state on instances when an outage occurs? This is really frightening since we had a big project planned for hosting on the AWS system and cannot at this point be confident that AWS will work for us. Even if we were doing regular backups, recovering from an outage like this would take way too long.

When I first noticed the problem this morning, I didn't realize it was AWS-wide, so I went to my terminal, ran ec2-describe-instances, and got NOTHING reported. I assumed something bad had happened to my server instance, so I attempted to start over again and realized that all my data subsequent to creating the AMI is now gone. I terminated that new instance within minutes of starting it up since that is not what I want running.

Please tell me that this system will be more reliable than this in the future.

Thanks,
Andrew
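
For readers in the same position (an IP address but no instance id), a rough sketch follows. It assumes you have saved the text output of ec2-describe-instances while the instance was still listed, that instance ids look like `i-` plus eight hex digits (as the ids quoted in this thread do), and that the public hostname embeds the IP in the form `ec2-67-202-1-140.…`. The hostname pattern, the helper name `find_instance_ids`, and the sample line in the usage note are assumptions for illustration, not a description of the tool's exact output format.

```python
import re
from typing import List

# Instance ids in this thread look like "i-38cc2f51": "i-" plus 8 hex digits.
# The leading \b keeps "ami-12345678" from matching as "i-12345678".
INSTANCE_ID = re.compile(r"\bi-[0-9a-f]{8}\b")


def _mentions_ip(line: str, ip: str) -> bool:
    """True if the line contains the dotted IP or the assumed ec2-A-B-C-D. hostname form."""
    dns_prefix = "ec2-" + ip.replace(".", "-") + "."
    bare = re.search(r"(?<![\d.])" + re.escape(ip) + r"(?!\d)", line)
    return dns_prefix in line or bare is not None


def find_instance_ids(describe_output: str, ip: str) -> List[str]:
    """Collect instance ids from every output line that mentions the given IP."""
    ids: List[str] = []
    for line in describe_output.splitlines():
        if _mentions_ip(line, ip):
            ids.extend(INSTANCE_ID.findall(line))
    return ids
```

As a usage example, given a saved line such as `INSTANCE i-0123abcd ami-12345678 ec2-67-202-1-140.compute-1.amazonaws.com running` (a made-up record), calling `find_instance_ids(text, "67.202.1.140")` picks out `i-0123abcd`. This only helps if the output was captured before the instance dropped out of the listing.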


Bruno Bornsztein
RealName(TM)

Posts: 4
Registered: 11/28/06
Re: EC2 API outage
Posted: Sep 29, 2007 2:44 PM PDT   in response to: ameranthw

I was notified that my instance (i-f1d93e98) went down. Is there anything I can do to recover it, rather than having to launch a new instance?


ian@aws

Posts: 371
Registered: 7/17/07
Re: EC2 API outage
Posted: Sep 29, 2007 3:01 PM PDT   in response to: Bruno Bornsztein

We're sorry, but the only way to recover is to launch a new instance.

Attila@AWS

Posts: 162
Registered: 11/2/06
Re: EC2 API outage
Posted: Sep 29, 2007 3:04 PM PDT   in response to: Bruno Bornsztein

Hi everyone. If you received an email notification from us about an instance that was erroneously terminated today, unfortunately there is nothing we can do to resurrect the instance or recover its data.



David@AWS

Posts: 177
Registered: 8/2/07
Re: EC2 API outage
Posted: Sep 29, 2007 3:09 PM PDT   in response to: ameranthw

Hi Andrew,

I have private messaged you with the details of your instance.

Regards,
David


stinger67

Posts: 30
Registered: 8/2/07
Re: EC2 API outage
Posted: Sep 29, 2007 3:40 PM PDT   in response to: Attila@AWS

Ok then!


I understand the instance outage issue, along with the data loss
and the lack of recovery for the instances which have been terminated.

As a request, could you please explain, in clear and concise
terminology, the procedure for having another instance available
as a backup but not running?

I understand we are responsible for our own redundant processes.
Just need some clarification on how to make that happen w/o
the cost of run-time.

(I'm a bit miffed as to why an instance state was not written somewhere
in the farm but I'm not going to go there :))

Thanks for the response


