Thread: SQS is way too unreliable, what's going on?
Replies: 19 - Pages: 2 - Last Post: Sep 22, 2008 5:36 PM by: beanie4242

Posts: 32, Registered: 6/10/07

Posts: 913, Registered: 12/13/06
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 10:55 AM PDT
in response to: Paul Dowman
Hi Paul,
We're not seeing any issues with Amazon SQS right now. Could you please provide us with some additional information about the issues you are currently seeing? Are you seeing problems with a specific request type, or to a specific queue? Are errors consistently high, or are you seeing an occasional 500? When you do receive a 500, are you retrying with exponential back-off?
Regards,
Justin
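For readers unfamiliar with the retry pattern Justin asks about, here is a minimal sketch of retrying with exponential back-off. It is an illustration only, not the code of any SQS client library: `TransientError` and `call_with_backoff` are hypothetical names, the default delays are arbitrary, and the random jitter is an addition that the thread does not mention.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 500."""


def call_with_backoff(request, max_retries=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call request(); on TransientError, wait and retry with growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            # The upper bound doubles on each attempt and is capped at max_delay.
            # The actual wait is drawn at random below that bound (jitter).
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
```

With `max_retries=5`, a caller only sees the error after the initial attempt and all five retries have failed.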

Posts: 23, Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 5:45 PM PDT
in response to: Justin@AWS
Was this conversation taken offline? After the relatively short response times from AWS admins in the other threads, I'm shocked it's been 5 hours since this has been replied to.
In other news, a series of tests I wrote 4 months ago, and never had problems with, are now failing reliably. The tests get pissed off when comparing queue sizes. I add a message, check the size, size hasn't changed. Later on, it will have changed by 2 or 3.
So I will echo Paul's comments: What is going on, and why doesn't the status page reflect these issues?

Posts: 158, Registered: 5/30/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 6:38 PM PDT
in response to: Ryan Angilly
Hi Ryan,
We'd like to better understand exactly what you are seeing. Are you using a particular library? Can you provide us with a request-id and timestamp that relate to your testing?

Posts: 23, Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 16, 2008 6:57 PM PDT
in response to: JoeJ@AWS
Hi Joe,
I'm using the RightAws Ruby library. I just ran the test and it inserted a msg with this ID:
436f77c7-448b-41eb-a2e6-4d0f1bca100f
at 1221616072 seconds GMT
1 second later the queue size was still returning as 0. I've actually found that if I sleep for a few seconds between insertion and testing the size, the test now passes, but I have never had to do this before.
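One way to avoid a fixed sleep in a test like this is to poll until the expected state appears or a deadline passes. The sketch below is a generic helper under stated assumptions: `wait_until` is a hypothetical name, and the default 30-second timeout is arbitrary, since no inconsistency window for SQS is given in this thread.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it is truthy or timeout seconds have elapsed.

    Returns True as soon as the condition holds, False if the deadline passes.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test could then assert `wait_until(lambda: queue_size() == expected)`, where `queue_size` is a placeholder for whatever call the client library provides. The test returns as soon as the size is visible and only waits the full timeout when something is actually wrong.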

Posts: 158, Registered: 5/30/08

Posts: 23, Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 17, 2008 7:25 AM PDT
in response to: JoeJ@AWS
Hey Joe,
I understand that SQS is based on eventual consistency, but without any kind of guidelines, how do we properly test this stuff? Is there a published inconsistency window for SQS? From those two links, it seems like the inconsistency window could be 30 seconds or a day. Even the former begins to get difficult to test -- just having a few dozen tests that need to wait that long can make a test suite annoyingly long.
What do you think?
Thanks,
Ryan

Posts: 32, Registered: 6/10/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 17, 2008 8:10 AM PDT
in response to: Justin@AWS
Hi Justin,
I don't think it's specific to a particular request type or queue. I'm mostly referring to the acknowledged issues in the previously mentioned threads (for example the bug with queueing large messages and the series of issues from September 11 to 14), but at the time I wrote that message I was again experiencing 500 errors for a period of several minutes.
I'm using the RightAws Ruby library, which does do the exponential back-off. I have it set to 5 retries, so by the time I see the error it has failed 5 times.
I guess I need to record some metrics about the number of requests that succeed or fail.
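A minimal sketch of the kind of success/failure bookkeeping described here might look like the following. `RequestStats` and its methods are hypothetical names for illustration and are not part of RightAws or any other SQS library.

```python
from collections import Counter


class RequestStats:
    """Counts successes and failures per request type."""

    def __init__(self):
        self.counts = Counter()

    def track(self, name, request):
        """Run request(), record the outcome under name, and re-raise errors."""
        try:
            result = request()
        except Exception:
            self.counts[(name, "failure")] += 1
            raise
        self.counts[(name, "success")] += 1
        return result

    def failure_rate(self, name):
        """Return failures / total for name, or 0.0 if nothing was recorded."""
        failures = self.counts[(name, "failure")]
        total = failures + self.counts[(name, "success")]
        return failures / total if total else 0.0
```

Wrapping each call, for example `stats.track("send", lambda: send_message(body))` with `send_message` standing in for the real client call, gives a per-request-type failure rate that can be reported alongside timestamps.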

Posts: 23, Registered: 4/14/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 17, 2008 10:12 AM PDT
in response to: Paul Dowman
Hey Paul,
FYI, the RightAWS library does keep track of failures and retries in its logs. Grep the logs for 'request failure count'.
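If grep is not at hand, the same count can be sketched in a few lines of Python. This only counts lines containing the phrase quoted above; it assumes nothing about the rest of the log format, and both the helper name and the log path in the usage note are placeholders.

```python
def count_matching_lines(lines, needle="request failure count"):
    """Return how many of the given lines contain the needle substring."""
    return sum(1 for line in lines if needle in line)
```

For example, `with open("path/to/app.log") as f: print(count_matching_lines(f))`, where the path is whatever file your application's logger writes to.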

Posts: 15, Registered: 10/9/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 12:44 PM PDT
in response to: Paul Dowman
I'm getting a ton of SQS errors this afternoon, as well as what appeared to be a complete outage from 3-3:30 EDT.

Posts: 21, Registered: 10/4/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 12:54 PM PDT
in response to: Ronald Reeser
Dashboard only says "recovered". What's going on!!!

Posts: 152, Registered: 3/7/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 12:56 PM PDT
in response to: Ronald Reeser
These times correspond to an SQS update on the AWS Service Health Dashboard. Please refer to:
http://status.aws.amazon.com/
Regards,
Ravi

Posts: 1, Registered: 10/4/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 1:25 PM PDT
in response to: Ravi@AWS
Ravi - that's my point. I referred to the dashboard, and the information it gives is only "SQS crashed again". I'd like to know why. More specifically, I'd like to know why it went down today after repeated problems over the last number of weeks (which I thought would have led to a more stable SQS). My question is not if SQS was down, but whether it's suitable to continue using in production....

Posts: 82, Registered: 2/15/08
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 1:50 PM PDT
in response to: beanie42
To me, all the recent SQS problems look awfully like capacity and/or scalability issues.
If you notice, all the problems started around late morning Pacific time, and all recovered by late afternoon/early evening.

Posts: 56, Registered: 9/4/07
Re: SQS is way too unreliable, what's going on?
Posted: Sep 22, 2008 3:51 PM PDT
in response to: Kenneth Cheung
+1 for more details.
During the ~20 minutes SQS was having problems today, on our side it looked more like an outage than "elevated error rates". When error rates are elevated, it should mean that with ample retries (which we have built into our system), subsequent requests should succeed. This was not the case today, as we couldn't make a single SQS request during that period.
It was encouraging to see that it had been reported on the dashboard, which let me know that you were aware of the problem and were tackling it. But it did have a huge impact on our systems during the time, and a simple "Error rates have recovered" message is not particularly helpful.
Could we get some more detailed information? What caused the problem? What's being done to prevent similar issues in the future? Did it affect all customers equally, or was it limited to certain customers? If other customers were seeing what we were seeing, then I think "outage" would be a much more accurate term for what happened today, at least from our POV.
Thanks,
stevie