Discussion Forums



Thread: Massive (500) Internal Server Error.outage started 35 minutes ago



Replies: 116 - Pages: 8 - Last Post: Feb 21, 2008 1:43 PM by: sequoyan
mmc41nick

Posts: 3
Registered: 2/13/08
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 2:10 AM PST   in response to: Kathrin@AWS

Thanks for the update. As to your longer-term plans to handle this, your plan to provide a "service health dashboard" is a particularly good idea! However, to provide a truly excellent service health solution, Amazon needs to provide machine-readable management data that we can integrate into our infrastructure so that we in turn can tell our customers what is going on! Besides machine-readable info, aws blog-updates, RSS feeds and email notifications of major service health issues is a must! In addition, customers like us need to be able to set up an error page redirect for when EC2 is down (so that users trying to access a web server hosted on EC2 will get a decent error page if your normal EC2 infrastructure is down).

BTW: Our company's biggest complaint is not that your servers were down but that we had to do quite some detective work to find out why (an error on our own servers? a recent code update we did? or Amazon AWS itself?). In this case the lack of information from Amazon cost us more trouble/money than the actual outage. It is simply NOT good enough that you require your customers to browse through forum threads to find out what is wrong. There was NO info on your front page, no email notifications, no updates on your blog.
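
A minimal sketch of how the machine-readable health data requested above might be consumed on the customer side. The feed format, its field names, and the helper functions are all hypothetical; nothing in this thread describes an actual AWS health feed. The point is only that a structured status document would let a customer tell "our servers" apart from "the provider" without reading forum threads.

```python
import json

# Hypothetical feed: this schema is an assumption made for illustration,
# not a format published by Amazon.
SAMPLE_FEED = """
{
  "services": [
    {"name": "s3", "status": "degraded", "message": "Elevated 500 error rates"},
    {"name": "ec2", "status": "ok", "message": ""}
  ]
}
"""


def unhealthy_services(feed_text):
    """Return {service name: message} for every service not reporting 'ok'."""
    feed = json.loads(feed_text)
    return {
        entry["name"]: entry.get("message", "")
        for entry in feed.get("services", [])
        if entry.get("status") != "ok"
    }


def customer_notice(feed_text, dependencies):
    """Build a notice for our own customers, or None if our dependencies look healthy."""
    problems = unhealthy_services(feed_text)
    affected = sorted(name for name in dependencies if name in problems)
    if not affected:
        return None
    details = "; ".join(f"{name}: {problems[name]}" for name in affected)
    return f"Upstream provider issue ({details}). Our own servers are not the cause."


if __name__ == "__main__":
    # A site that depends on S3 gets a ready-made explanation for its users.
    print(customer_notice(SAMPLE_FEED, ["s3"]))
```

In practice the feed text would be fetched on a schedule from wherever the provider publishes it, and the resulting notice could drive a status banner or an error page.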


brkonthru

Posts: 28
Registered: 1/4/08
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 3:58 AM PST   in response to: mmc41nick

The engineering team has done a wonderful job of creating all of AWS.

But now I think it's time that the AWS team included some marketing/MBA types who think of communication and support with customers as a priority.

We really need:

- A Health Dashboard
- A product roadmap
- Phone support service (even if paid)


starnum

Posts: 7
Registered: 2/3/07
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 5:00 AM PST   in response to: mmc41nick

I couldn't agree more with mmc41nick. As I said before, you need to be transparent with your customers. No service can provide 100% uptime. It's a fact. No matter if you have a redundant anycast network or supercalifragilisticexpialidocious elastic clouds. I just want to get notified and know exactly what's happening. Nothing else. That said, the issue was resolved very fast, so you should be very proud. Hats off to Amazon's IT staff.

mkrigsman1

Posts: 2
Registered: 2/15/08
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 8:25 AM PST   in response to: maxcc

So this was caused by a couple of AWS customers using the system in an unexpected manner?

Michael Krigsman
ZDNet IT Project Failures blog
http://blogs.zdnet.com/projectfailures/?p=602

Allen

Posts: 5,320
Registered: 3/19/07
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 8:44 AM PST   in response to: mkrigsman1

Michael,

Kathrin didn't say that.  The manner of use is both expected and supported.  What caused the problem however was a sudden unexpected surge in a particular type of usage (PUT's and GET's of private files which require cryptographic credentials, rather than GET's of public files that require no credentials).  As I understand what Kathrin said, the surge was caused by several large customers suddenly and unexpectedly increasing their usage.  Perhaps they all decided to go live with a new service at around the same time, although this is not clear.  What is clear however is that S3 was the momentary victim of its own success, but the problem was quickly rectified.

Hopefully in the future Amazon will have a way of throttling unexpected surges so they do not adversely impact existing customers.
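
One common way to throttle surges of this kind is a per-account token bucket. The sketch below is only an illustration of that general technique: the class names, rates, and account IDs are made up, and nothing here describes how S3 actually limits traffic. Each account gets its own bucket, so one customer's sudden burst runs out of tokens without consuming anyone else's allowance.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so normal traffic is not penalized
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Consume one token if available; return whether the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerAccountThrottle:
    """Keeps one bucket per account (hypothetical account IDs as dictionary keys)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, account_id, now=None):
        bucket = self.buckets.get(account_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[account_id] = bucket
        return bucket.allow(now)
```

The `now` parameter exists only so the behavior can be checked with fixed timestamps; real callers would omit it. The check itself still costs something for every request, so where in the request path it runs matters as much as the algorithm.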



tastyeng

Posts: 3
Registered: 5/6/06
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 10:24 AM PST   in response to: Kathrin@AWS

Kathrin,
Thank you for the detailed post-mortem. It sounds like, contrary to Amazon's design goals, the authentication service is vulnerable to being a single point of failure, so I'm glad to hear Amazon will work on shoring that up against a similar scenario happening in the future.

A very big thanks for the upcoming service health dashboard.

Best wishes,
Michael

Allen

Posts: 5,320
Registered: 3/19/07
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 11:53 AM PST   in response to: tastyeng

> It sounds like, contrary to Amazon's design goals, the authentication service is vulnerable to being a single point of failure

The S3 authentication service is a highly distributed system composed of many redundant nodes.  Calling it a "single point of failure" is like calling the global DNS system a single point of failure--if the entire DNS system fails due to overload, the internet will be unusable, but that is not even close to what the term "single point of failure" means.



enomaly

Posts: 444
Registered: 9/3/06
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 2:08 PM PST   in response to: Allen

"the problem however was a sudden unexpected surge in a particular type of usage (PUT's and GET's of private files which require cryptographic credentials"

Gee, I hope we didn't bring down S3 with our elasticdrive software which makes extensive use of cryptographic credentials, we've seen a major uptick in usage lately. (Especially yesterday)

Reuven
http://www.elasticdrive.com




Allen

Posts: 5,320
Registered: 3/19/07
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 2:58 PM PST   in response to: enomaly

> Gee, I hope we didn't bring down S3 with our elasticdrive software which makes extensive use of cryptographic credentials, we've seen a major up tick in usage lately. (Specially yesterday)

Do they not care about their data, or did you sell them on false promises?


enomaly

Posts: 444
Registered: 9/3/06
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 3:13 PM PST   in response to: Allen

Looks like no one lost any data, but it does make a case for why you'd want to use Nirvanix as well as S3.

r/c


Tommy

Posts: 16
Registered: 1/30/08
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 3:39 PM PST   in response to: mmc41nick

"aws blog-updates, RSS feeds and email notifications of major service health issues is a must"
i agree


A. Martin
RealName(TM)

Posts: 2
Registered: 2/16/08
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 3:39 PM PST   in response to: enomaly

way to spam the board
sage!


A. Martin
RealName(TM)

Posts: 2
Registered: 2/16/08
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 3:40 PM PST   in response to: enomaly

sage


Thorsten von Eicken
RealName(TM)


Posts: 633
Registered: 3/24/06
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 7:11 PM PST   in response to: mkrigsman1

Michael, I love how you write "a couple of customers". Sounds like "hey, some dude actually used S3 and immediately brought it down with a few requests from his home machine on a simple DSL". Are you from vulture central?

2-1/2 hours downtime is no good, but we all know 100% uptime doesn't exist. I have some questions about why they couldn't block the account(s) causing the problem such that not everyone would be affected, or otherwise contain the issue. But looking back they did a good job at fixing things.

My #1 complaint is lack of communication. Of course a trust.amazonaws.com site is sorely needed. Barring that, they should have posted an update every 20 minutes. I'm sure they could have stated relatively early on something like "the outage is due to an overload situation, no stored data is at risk, we are bringing additional capacity online". Even if there is no material news, a simple post every 20 minutes would have made everyone feel like things were under control.

Thorsten


bd_

Posts: 376
Registered: 7/17/06
Re: Massive (500) Internal Server Error.outage started 35 minutes ago
Posted: Feb 16, 2008 7:52 PM PST   in response to: Thorsten von Ei...


tve64 wrote:
2-1/2 hours downtime is no good, but we all know 100% uptime doesn't exist. I have some questions about why they couldn't block the account(s) causing the problem such that not everyone would be affected, or otherwise contain the issue. But looking back they did a good job at fixing things.



My guess would be that the only place such a block could be placed would be the authentication service itself, which would mean there would be no change in the load on the service. It also might result in misleading error responses (unauthorized vs. internal service error) to those customers.



