Discussion Forums



Thread: SQS is way too unreliable, what's going on?



Replies: 19 - Last Post: Sep 22, 2008 5:36 PM by: beanie4242
Paul Dowman
RealName(TM)


Posts: 32
Registered: 6/10/07
SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 7:11 AM PDT

I'm still getting a lot of InternalError responses, and the status page says "operating normally". From my perspective SQS has been half broken for weeks now, and it's causing me a lot of problems. (For more details see the following threads: http://developer.amazonwebservices.com/connect/thread.jspa?threadID=24350&tstart=0 and http://developer.amazonwebservices.com/connect/thread.jspa?threadID=24554&tstart=0 )

This is nowhere near the kind of reliability I need from a service that I'm using as part of a production app. Can we get some sort of statement on what's going on? Without some kind of assurance that this will be resolved very soon I can't continue to use it.



Justin@AWS

Posts: 913
Registered: 12/13/06
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 10:55 AM PDT   in response to: Paul Dowman

Hi Paul,

We're not seeing any issues with Amazon SQS right now.  Could you please provide us with some additional information about the issues you are currently seeing?  Are you seeing problems with a specific request type, or to a specific queue?  Are errors consistently high, or are you seeing an occasional 500?  When you do receive a 500, are you retrying with exponential back-off?

Regards,
Justin
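For readers unfamiliar with the retry pattern Justin asks about, here is a minimal Ruby sketch of exponential back-off. It is not taken from RightAws or any AWS library: `TransientError`, `with_backoff`, and the `sleeper` parameter are hypothetical names chosen for illustration, and the simulated failure stands in for a real 500 response.

```ruby
# Hypothetical error class standing in for an HTTP 500 / InternalError response.
class TransientError < StandardError; end

# Retry the block up to max_attempts times, doubling the wait after each failure.
# sleeper is injectable so the example can run without real delays.
def with_backoff(max_attempts: 5, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientError
    # Give up and re-raise once the attempt budget is spent.
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage: fail twice, then succeed; record the delays instead of sleeping.
delays = []
result = with_backoff(sleeper: ->(s) { delays << s }) do |n|
  raise TransientError, "simulated 500" if n < 3
  "sent on attempt #{n}"
end
```

With a 0.5 second base delay, the waits grow as 0.5, 1, 2, 4 seconds, so a caller only sees the error after the whole budget has been spent. Production code would usually add random jitter to the delay as well.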


Ryan Angilly

Posts: 23
Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 5:45 PM PDT   in response to: Justin@AWS

Was this conversation taken offline? After the relatively short response times from AWS admins in the other threads, I'm shocked it's been 5 hours since this has been replied to.


In other news, a series of tests I wrote 4 months ago, and never had problems with, are now failing reliably. The tests get pissed off when comparing queue sizes. I add a message, check the size, size hasn't changed. Later on, it will have changed by 2 or 3.

So I will echo Paul's comments: What is going on, and why doesn't the status page reflect these issues?

JoeJ@AWS

Posts: 158
Registered: 5/30/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 6:38 PM PDT   in response to: Ryan Angilly

Hi Ryan,

We'd like to better understand exactly what you are seeing.  Are you using a particular library?  Can you provide us with a request-id and timestamp that relates to your testing?


Ryan Angilly

Posts: 23
Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 6:57 PM PDT   in response to: JoeJ@AWS

Hi Joe,

I'm using RightAWS's ruby library. I just ran the test and it inserted a msg with this ID:

436f77c7-448b-41eb-a2e6-4d0f1bca100f

at 1221616072 seconds GMT

1 second later the queue size was still returning as 0. I've actually found that if I sleep for a few seconds between insertion and testing the size, the test now passes, but I have never had to do this before.
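Instead of a fixed sleep between inserting and checking the size, a test can poll until the expected size shows up or a deadline passes. The Ruby sketch below illustrates that idea under stated assumptions: `wait_until` and `FakeQueue` are hypothetical names, and `FakeQueue` only imitates a size that lags behind writes. It does not call SQS or RightAws.

```ruby
# Poll the block until it returns a truthy value or the timeout elapses.
# Returns the truthy value, or nil on timeout.
def wait_until(timeout: 30, interval: 1,
               clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
               sleeper: ->(s) { sleep(s) })
  deadline = clock.call + timeout
  loop do
    value = yield
    return value if value
    return nil if clock.call >= deadline
    sleeper.call(interval)
  end
end

# Hypothetical stand-in whose reported size lags behind writes,
# mimicking an eventually consistent size estimate.
class FakeQueue
  def initialize(lag_reads)
    @lag_reads = lag_reads # number of size reads that return stale data
    @pending = 0
    @visible = 0
  end

  def push(_msg)
    @pending += 1
  end

  def size
    if @lag_reads > 0
      @lag_reads -= 1
    else
      @visible += @pending
      @pending = 0
    end
    @visible
  end
end

queue = FakeQueue.new(2)
queue.push("hello")
first_read = queue.size # stale: the new message is not counted yet
settled = wait_until(timeout: 5, interval: 0.01) { queue.size == 1 }
```

The test then waits only as long as the size actually takes to settle, and fails with a clear timeout rather than an arbitrary sleep that is too short on a bad day.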

JoeJ@AWS

Posts: 158
Registered: 5/30/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 7:44 PM PDT   in response to: Ryan Angilly

Hi Ryan,

This is in line with the eventual consistency model that SQS uses.  You can read more about eventual consistency here:

http://developer.amazonwebservices.com/connect/thread.jspa?messageID=74385
http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

Regards,

JoeJ


Ryan Angilly

Posts: 23
Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 17, 2008 7:25 AM PDT   in response to: JoeJ@AWS

Hey Joe,

I understand that SQS is based on eventual consistency, but without any kind of guidelines, how do we properly test this stuff? Is there a published inconsistency window for SQS? From those two links, it seems like the inconsistency window could be 30 seconds or a day. Even the former begins to get difficult to test -- just having a few dozen tests that need to wait that long can make a test suite annoyingly long.

What do you think?

Thanks,
Ryan

Paul Dowman
RealName(TM)


Posts: 32
Registered: 6/10/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 17, 2008 8:10 AM PDT   in response to: Justin@AWS

Hi Justin,

I don't think it's specific to a particular request type or queue. I'm mostly referring to the acknowledged issues in the previously mentioned threads (for example the bug with queueing large messages and the series of issues from September 11 to 14), but at the time I wrote that message I was again experiencing 500 errors for a period of several minutes.

I'm using the RightAws ruby library, which does do the exponential back-off. I have it set to 5 retries, so by the time I see the error it has failed 5 times.

I guess I need to record some metrics about the number of requests that succeed or fail.
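One way to collect such metrics is to wrap each request in a small counter. This is a rough Ruby sketch only: `RequestStats`, `track`, and `failure_rate` are hypothetical names, and the blocks below simulate requests rather than calling SQS or RightAws.

```ruby
# Counts outcomes per operation so an error rate can be reported later.
class RequestStats
  def initialize
    @counts = Hash.new { |h, op| h[op] = { ok: 0, failed: 0 } }
  end

  # Runs the block, records success or failure, and re-raises any error.
  def track(operation)
    result = yield
    @counts[operation][:ok] += 1
    result
  rescue StandardError
    @counts[operation][:failed] += 1
    raise
  end

  # Fraction of tracked calls for this operation that raised an error.
  def failure_rate(operation)
    c = @counts[operation]
    total = c[:ok] + c[:failed]
    total.zero? ? 0.0 : c[:failed].to_f / total
  end
end

stats = RequestStats.new
3.times { stats.track(:send_message) { :ok } }
begin
  stats.track(:send_message) { raise "simulated InternalError" }
rescue RuntimeError
  # A real caller would decide here whether to retry or give up.
end
rate = stats.failure_rate(:send_message)
```

Logging these counts with timestamps would give concrete numbers to post alongside request IDs when reporting a problem.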



Ryan Angilly

Posts: 23
Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 17, 2008 10:12 AM PDT   in response to: Paul Dowman

Hey Paul,

FYI, the RightAWS library does keep track of failures and retries in its logs. Grep the logs for 'request failure count'.

Ronald Reeser

Posts: 15
Registered: 10/9/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 12:44 PM PDT   in response to: Paul Dowman

I'm getting a ton of SQS errors this afternoon, as well as what appeared to be a complete outage from 3-3:30 EDT.


beanie4242

Posts: 21
Registered: 10/4/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 12:54 PM PDT   in response to: Ronald Reeser

Dashboard only says "recovered".  What's going on!!!



Ravi@AWS

Posts: 152
Registered: 3/7/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 12:56 PM PDT   in response to: Ronald Reeser

These times correspond to an SQS update on the AWS Service Health Dashboard. Please refer to: http://status.aws.amazon.com/ .

Regards,
Ravi



beanie42

Posts: 1
Registered: 10/4/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 1:25 PM PDT   in response to: Ravi@AWS

Ravi - that's my point.  I referred to the dashboard, and the information it gives is only "SQS crashed again".  I'd like to know why.  More specifically, I'd like to know why it went down today after repeated problems over the last number of weeks (which I thought would have led to a more stable SQS).  My question is not if SQS was down, but whether it's suitable to continue using in production....

Kenneth Cheung

Posts: 82
Registered: 2/15/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 1:50 PM PDT   in response to: beanie42


To me, all the recent SQS problems look awfully like capacity and/or
scalability issues.

If you notice, all the problems started around late morning Pacific time,
and all recovered by late afternoon/early evening.




Stevie Clifton

Posts: 56
Registered: 9/4/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 3:51 PM PDT   in response to: Kenneth Cheung

+1 for more details.

During the ~20 minutes SQS was having problems today, on our side it looked more like an outage than "elevated error rates".  When error rates are elevated, it should mean that with ample retries (which we have built into our system), subsequent requests should succeed.  This was not the case today, as we couldn't make a single SQS request during that period.

It was encouraging to see that it had been reported on the dashboard, which let me know that you were aware of the problem and were tackling it.  But it did have a huge impact on our systems during the time, and a simple "Error rates have recovered" message is not particularly helpful.

Could we get some more detailed information?  What caused the problem?  What's being done to prevent similar issues in the future?  Did it affect all customers equally, or was it limited to certain customers?  If other customers were seeing what we were seeing, then I think "outage" would be a much more accurate term for what happened today, at least from our POV.

Thanks,

stevie


