Thread: Lots of internal errors on SQS over the last few days
Replies:
16
Last Post:
Sep 4, 2008 1:11 PM
by: Paul Dowman
|
|
Posts:
32
Registered:
6/10/07
|
|
|
|
Lots of internal errors on SQS over the last few days
Posted:
Sep 2, 2008 2:28 PM PDT
|
|
|
I've been seeing a *lot* of internal errors (and various other errors) on SQS over the last few days, but there's no mention on the AWS status page (http://status.aws.amazon.com/).
Is anyone else finding the same thing?
|
|
Posts:
86
Registered:
6/2/08
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 3, 2008 4:59 AM PDT
in response to: Paul Dowman
|
|
|
We are seeing the same thing. It has been happening a lot over the last two days.
thanks
the Ylastic team
|
|
Posts:
159
Registered:
5/30/08
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 3, 2008 7:13 AM PDT
in response to: ylastic
|
|
|
Hi,
Can you provide RequestIds for the internal errors so that we can take a look into this?
Thanks,
Michael
|
|
Posts:
32
Registered:
6/10/07
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 3, 2008 12:51 PM PDT
in response to: Michael@AWS
|
|
|
Well, I can't really take the time to dig through the last five days' worth of logs at the moment, but here are two from within the last hour or so:
fd75ad5b-2440-4467-b642-6c89ff9d6ded
e90384c8-fcc1-404d-98e8-bb5089425306
The response is 500 Internal Server Error
|
|
Posts:
32
Registered:
6/10/07
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 3, 2008 3:07 PM PDT
in response to: Michael@AWS
|
|
|
I looked into it further and it seems that only certain types of messages are causing it. I have found a message right now that can cause it every time. I don't have time at the moment but I'll try to narrow down what the issue might be.
What's odd is that this only started a few days ago, maybe Friday if I recall correctly.
|
|
Posts:
913
Registered:
12/13/06
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 3, 2008 3:48 PM PDT
in response to: Paul Dowman
|
|
|
Hi Paul,
Thank you for providing request ids. I just sent you a private message regarding your use case. We'd like to get a little more info from you.
Regards,
Justin
|
|
Posts:
10
Registered:
4/28/08
|
|
|
Posts:
50
Registered:
7/6/07
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 3, 2008 5:33 PM PDT
in response to: tipmobilemirko
|
|
|
Hi Mirko,
Thanks for letting us know - can you provide the date/hour when this request occurred? We can research it in the SQS logs.
Thanks,
Joel
|
|
Posts:
32
Registered:
6/10/07
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 4, 2008 8:08 AM PDT
in response to: Paul Dowman
|
|
|
Well, since AWS support haven't solved this yet and it's causing a lot of failures on my production systems, I've been forced to investigate. The answer is 7891.
The problem seems to be that sending a message with a body of 7891 bytes or larger causes SQS to fail.
Here's a little Ruby script that shows the problem:

    require 'yaml'
    require 'right_aws'

    # Run inside a Rails environment (e.g. script/runner) so that
    # RAILS_ROOT and RAILS_ENV are defined.
    s3_config = YAML::load_file("#{RAILS_ROOT}/config/s3.yml")[RAILS_ENV]
    sqs = RightAws::SqsGen2.new(s3_config['aws_access_key'], s3_config['aws_secret_access_key'])
    q = RightAws::SqsGen2::Queue.create(sqs, "pauldowman-test", true)

    # succeeds every time:
    100.times do
      if (q.send_message("x" * 7890) rescue false)
        puts "succeeded."
      else
        puts "FAIL!"
      end
    end

    # fails every time:
    100.times do
      if (q.send_message("x" * 7891) rescue false)
        puts "succeeded."
      else
        puts "FAIL!"
      end
    end

I'm going to run this failing loop for the rest of the day in the hopes that if I cause enough 500 errors Amazon will update the status on the service health dashboard. ;-)
Seriously though, this looks like a massive SQS failure from my point of view. It's been going on for about a week now, and while I appreciate the quick response from AWS support requesting more info, they haven't solved it yet. My analysis (assuming it's correct) wasn't too difficult and I'd expect them to have come to the same conclusion much sooner.
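
For anyone wanting to reproduce the threshold on their own queue, the cut-off could also be located by bisection instead of by trial and error. What follows is only a sketch: `smallest_failing_size` and `send_ok` are hypothetical names, and it assumes that sends succeed at the lower bound, fail at the upper bound, and that failure depends only on body size. The stub stands in for a real send and simply simulates the behaviour reported in this thread.

```ruby
# Returns the smallest size in (lo, hi] for which send_ok returns false.
# Assumes send_ok succeeds at lo, fails at hi, and is monotonic in size.
def smallest_failing_size(lo, hi, send_ok)
  raise ArgumentError, "expected success at lo" unless send_ok.call(lo)
  raise ArgumentError, "expected failure at hi" if send_ok.call(hi)

  # Invariant: send_ok succeeds at lo and fails at hi.
  while hi - lo > 1
    mid = (lo + hi) / 2
    if send_ok.call(mid)
      lo = mid
    else
      hi = mid
    end
  end
  hi
end

# Stand-in for a real send. Against a live queue this could be something like
#   ->(n) { (q.send_message("x" * n) rescue false) }
# using the queue object q from the script above.
stub = ->(n) { n < 7891 }

puts smallest_failing_size(1, 8191, stub)  # prints 7891
```

With bounds of 1 and 8191 this needs about 13 probes plus the two boundary checks, far fewer than stepping through every size.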
|
|
Posts:
32
Registered:
6/10/07
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 4, 2008 8:21 AM PDT
in response to: Paul Dowman
|
|
|
Oh, and FYI, if I give a message body larger than 8192 bytes I do get an error as expected: "InvalidParameterValue: Value for parameter MessageBody is invalid. Reason: Message body must be shorter than 8192 characters"
|
|
Posts:
913
Registered:
12/13/06
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 4, 2008 8:41 AM PDT
in response to: Paul Dowman
|
|
|
Hi Paul,
We can confirm the problem that you are experiencing. The fix is in the final stage of testing and we will let you know when it has been fully rolled out.
Regards,
Justin
|
|
Posts:
86
Registered:
6/2/08
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 4, 2008 8:52 AM PDT
in response to: Paul Dowman
|
|
|
Thanks Paul for the terrific detective work. Wish the status dashboard was updated the minute AWS is aware that there is an issue, so everyone is aware of it ...
thanks
the Ylastic team
|
|
Posts:
32
Registered:
6/10/07
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 4, 2008 8:53 AM PDT
in response to: Justin@AWS
|
|
|
Great, thanks for the quick response!
I've already adjusted my max message size anyway. :-)
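
A client-side cap like the one mentioned above can be a simple length check before sending. This is a sketch, not the code actually used: `safe_to_send?` and `send_if_safe` are hypothetical helpers, and the default ceiling of 7890 bytes is taken from the observation in this thread, not from any documented limit. Counting bytes and not characters is also an assumption.

```ruby
# True when the body is at or below the assumed safe ceiling (in bytes).
def safe_to_send?(body, limit = 7890)
  body.bytesize <= limit
end

# Yields the body to the caller's send block only when it is small enough.
# Returns true if the block was called, false if the body was held back.
def send_if_safe(body, limit = 7890)
  return false unless safe_to_send?(body, limit)
  yield body
  true
end

# Usage with a stand-in for the real send call:
sent = []
send_if_safe("x" * 7890) { |b| sent << b.bytesize }
send_if_safe("x" * 7891) { |b| sent << b.bytesize }
p sent  # prints [7890]
```

In real use the block would wrap the actual send, for example `q.send_message(b)` with the queue object from the script earlier in the thread.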
|
|
Posts:
32
Registered:
6/10/07
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 4, 2008 8:58 AM PDT
in response to: Paul Dowman
|
|
|
However, the service health dashboard still says "Service is operating normally".
|
|
Posts:
10
Registered:
4/28/08
|
|
|
|
Re: Lots of internal errors on SQS over the last few days
Posted:
Sep 4, 2008 10:16 AM PDT
in response to: Joel@AWS
|
|
|
Hi Joel,
Thanks for looking into this.
This particular request (with request id e573b4b0-656a-4e84-97fb-63d188971b0c) occurred on 9/2 at around 18:11 UTC. See the log excerpt below. The error message was received at 18:12:38, but the request expiration is stated as 18:11:08, so the request must have been submitted some time before.
My first suspicion was that the clock might be off on the instances, but I have confirmed that this is not the case.
Going through my logs, I have not found any errors since 9/2, but we also haven't sent or received any messages since then, so I can't safely conclude that the problem has been fixed yet. I wonder if the problem is somehow related to specific messages. In my case, all messages are short, so I am definitely not bumping into the message length problem mentioned on this thread.
E, [2008-09-02T18:12:38.646990 #17428] ERROR -- : RequestExpired: Request has expired. Expires date is 2008-09-02T18:11:08Z. (RightAws::AwsError)
/usr/lib/ruby/gems/1.8/gems/right_aws-1.7.3/lib/awsbase/right_awsbase.rb:259:in `request_info_impl'
/usr/lib/ruby/gems/1.8/gems/right_aws-1.7.3/lib/sqs/right_sqs_gen2_interface.rb:151:in `request_info'
/usr/lib/ruby/gems/1.8/gems/right_aws-1.7.3/lib/sqs/right_sqs_gen2_interface.rb:243:in `receive_message'
/usr/lib/ruby/gems/1.8/gems/right_aws-1.7.3/lib/sqs/right_sqs_gen2.rb:163:in `receive_messages'
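
The gap between the two timestamps in the log line above can be checked directly. The sketch below assumes the logger's timestamp is in UTC (it carries no zone marker, so a "Z" is appended here), and `seconds_past_expiry` is a hypothetical helper name.

```ruby
require 'time'

# Seconds by which the error's log time trails the request's Expires time.
# Both arguments are ISO 8601 strings.
def seconds_past_expiry(logged_at, expires_at)
  Time.iso8601(logged_at) - Time.iso8601(expires_at)
end

# Values copied from the log excerpt; the "Z" on the first one is an assumption.
gap = seconds_past_expiry("2008-09-02T18:12:38.646990Z", "2008-09-02T18:11:08Z")
puts format("error logged %.1f s after the request expired", gap)
```

The error was logged roughly a minute and a half after the stated expiry. The number alone does not say whether that time was lost on the client, on the network, or inside the service.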