Thread: S3 data corruption?
Replies: 21 - Pages: 2 - Last Post: Jul 3, 2008 1:36 PM by: James Eitzmann
Posts:
6
Registered:
2/8/06
|
|
|
|
S3 data corruption?
Posted:
Jun 22, 2008 5:05 PM PDT
|
|
|
we are having some serious S3 issues.
all data we store on S3 has gone through the same code path for months. starting a couple days ago a small percentage of the objects we are retrieving are not checksumming to the correct values. we hash and store objects by checksum and rehash the objects when we retrieve to ensure there is no data corruption. all the objects we're having issues with were uploaded at approximately the same time period a few days ago.
we've stored tens of millions of objects in S3 and never encountered such problems. please let me know ASAP if you have any idea what could be going on here. thanks.
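The store-by-checksum scheme described in this post can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the hash function (SHA-256 here) and the helper names `content_key` and `verify_retrieved` are assumptions, since the post does not say which checksum is used.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Return the hex digest used as the storage key (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_retrieved(key: str, data: bytes) -> bool:
    """Rehash the retrieved bytes and compare against the key they were stored under."""
    return content_key(data) == key


original = b"example object body"
key = content_key(original)

# Simulate a single flipped bit somewhere in the retrieved object.
corrupted = bytearray(original)
corrupted[3] ^= 0x01

ok_clean = verify_retrieved(key, original)
ok_corrupted = verify_retrieved(key, bytes(corrupted))
```

Because the key is derived from the content, any change to the stored or transmitted bytes shows up as a mismatch on retrieval, which is how the corruption was noticed here.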
|
|
Posts:
515
Registered:
9/20/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 22, 2008 5:11 PM PDT
in response to: Arash Ferdowsi
|
|
|
Hi Arash,
I can take a look at this for you. While I do this, can you please send to aws@amazon.com the Bucket-Name and a few keys that you believe are having issues?
Thanks,
Ramkumar
|
|
Posts:
21
Registered:
10/4/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 22, 2008 6:29 PM PDT
in response to: Arash Ferdowsi
|
|
|
I'm having similar problems. Same code base for months, and we started experiencing problems somewhere around 48 hours ago. I've been investigating our end to find the problem, and it was just suggested that I should check the forums to see if anyone else was having problems.
I will PM some keys as suggested. This is super-high priority for us (both corporately and personally, since lack of sleep dealing with this is killing me...)!
|
|
Posts:
24
Registered:
4/25/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 22, 2008 9:15 PM PDT
in response to: Ramkumar@AWS
|
|
|
Same thing here. I've only noticed it with one object. It was uploaded on 2008-06-21 22:21:11.
In doing a binary compare with the original file, there are only bytes here and there that are different.
I'll PM you further details.
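A binary compare of the kind mentioned above could look something like this. It is only a sketch; the function name `differing_offsets` is made up for illustration, and it assumes both copies fit in memory.

```python
def differing_offsets(a: bytes, b: bytes) -> list:
    """Return the byte offsets at which two buffers differ."""
    # Compare the overlapping prefix byte by byte.
    offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    # If the lengths differ, every offset past the shorter buffer also counts.
    offsets.extend(range(min(len(a), len(b)), max(len(a), len(b))))
    return offsets


original = b"abcdef"
downloaded = b"abXdeY"
diff = differing_offsets(original, downloaded)
```

Isolated single-byte differences at scattered offsets, with the length unchanged, point toward byte-level corruption rather than truncation or a wrong file.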
|
|
Posts:
278
Registered:
4/10/06
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 22, 2008 9:46 PM PDT
in response to: Arash Ferdowsi
|
|
|
Did you specify Content-MD5 when you PUT the data? If not, it's possible that the files were corrupted over the network rather than in storage.
And it's interesting that this comes up at the same time as the HTTPS issues. A flaky network interface or load balancer which caused data corruption could cause both of these.
|
|
Posts:
163
Registered:
2/8/06
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 12:59 AM PDT
in response to: Colin Percival
|
|
|
A quick note to let you know that we’re continuing to investigate these reports. The earliest customer report occurred for an object that was uploaded at 11:54pm PDT on Friday, June 20th, and the latest occurred at 5:12am PDT on Sunday, June 22nd. We’ll post an update when we have further information.
Thanks,
Kathrin
|
|
Posts:
56
Registered:
9/4/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 5:27 AM PDT
in response to: Kathrin@AWS
|
|
|
Just wanted to throw in that we're seeing the same problem as well. We started seeing errors with parsing XML files we were storing on S3 at 2:09am EST on Saturday. In every case, a rogue character would replace another character within the XML, resulting in unparsable XML. I can send an example if that would help.
Also, the last time we can confirm a corrupted XML was on Sunday at 10:54am. We haven't seen any problems since, but can't confirm there hasn't been subsequent data corruption.
Thanks,
stevie
|
|
Posts:
5,320
Registered:
3/19/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 6:28 AM PDT
in response to: Kathrin@AWS
|
|
|
Hello Kathrin,
Actually, the first report of such an anomaly came from me on June 12. I have yet to receive a response to the email I sent.
The part that bothers me (in addition to not getting a reply to my email) is that this sort of undetected error should be impossible. S3 should be storing the object's original MD5 along with the object, and sending this value back when the object is retrieved, so any corruption in storage or transit should be immediately detectable. It appears however that S3 is not storing the object's original MD5, and therefore has no idea if the object has become corrupt; it's just blindly sending back whatever. This type of oversight in what is supposed to be a highly-reliable storage system is in my opinion unforgivable and must be remedied immediately.
Allen
|
|
Posts:
163
Registered:
2/8/06
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 10:29 AM PDT
in response to: Allen
|
|
|
When Amazon S3 receives a PUT request with the Content-MD5 header, Amazon S3 computes the MD5 of the object received and returns a 400 error if it doesn't match the MD5 sent in the header. Looking at our service logs from the period between 6/20 11:54pm PDT and 6/22 5:12am PDT, we do see a modest increase in the number of 400 errors. This may indicate that there were elevated network transmission errors somewhere between the customer and Amazon S3. We are continuing to investigate and will post an update when we have further information.
Thanks,
Kathrin
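To make the Content-MD5 check described above concrete, here is a rough sketch. The header value is the base64 encoding of the raw 16-byte MD5 digest of the body. The function `simulated_put_status` is a hypothetical stand-in for the server-side comparison and is not S3's implementation; it only mirrors the behavior described in the post (error 400 on mismatch).

```python
import base64
import hashlib


def content_md5_header(body: bytes) -> str:
    """Compute a Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def simulated_put_status(body_received: bytes, content_md5: str) -> int:
    """Hypothetical server-side check: 200 if the received bytes match the header, else 400."""
    return 200 if content_md5_header(body_received) == content_md5 else 400


body = b"<doc><item>42</item></doc>"
header = content_md5_header(body)

# Simulate one byte being altered in transit between client and server.
damaged = body.replace(b"4", b"5", 1)

status_clean = simulated_put_status(body, header)
status_damaged = simulated_put_status(damaged, header)
```

With the header present, a body altered in transit is rejected at upload time instead of being stored silently.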
|
|
Posts:
5,320
Registered:
3/19/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 11:12 AM PDT
in response to: Kathrin@AWS
|
|
|
Hello Kathrin,
As I mentioned in the email I sent, the PUT included a Content-MD5 header. Later, when the object was retrieved, S3 sent the object back with a different Content-MD5. Network errors would not account for that. It is not clear if the other users who posted to this thread also sent a Content-MD5 header when they PUT their objects, and if they checked the Content-MD5 that was sent back by S3, but if they did, transmission errors would again not account for these problems.
Thank you,
Allen
|
|
Posts:
5,320
Registered:
3/19/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 12:44 PM PDT
in response to: Kathrin@AWS
|
|
|
Amazon has finally (11 days later) responded to my email and it appears what I observed was an eventual consistency anomaly. That is what I suspected, but Amazon's lack of response to my email (despite stating they would investigate), and these additional reports were certainly cause for concern.
Kathrin, could you please confirm that an object's original MD5 is stored along with the object so that corruption during storage will not go undetected?
Thank you,
Allen
|
|
Posts:
163
Registered:
2/8/06
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 6:10 PM PDT
in response to: Kathrin@AWS
|
|
|
We've isolated this issue to a single load balancer that was brought into service at 10:55pm PDT on Friday, 6/20. It was taken out of service at 11am PDT Sunday, 6/22. While it was in service it handled a small fraction of Amazon S3's total requests in the US. Intermittently, under load, it was corrupting single bytes in the byte stream. When the requests reached Amazon S3, if the Content-MD5 header was specified, Amazon S3 returned an error indicating the object did not match the MD5 supplied. When no MD5 is specified, we are unable to determine if transmission errors occurred, and Amazon S3 must assume that the object has been correctly transmitted. Based on our investigation with both internal and external customers, the small amount of traffic received by this particular load balancer, and the intermittent nature of the above issue on this one load balancer, this appears to have impacted a very small portion of PUTs during this time frame.
One of the things we'll do is improve our logging of requests with MD5s, so that we can look for anomalies in their 400 error rates. Doing this will allow us to provide more proactive notification on potential transmission issues in the future, for customers who use MD5s and those who do not.
In addition to taking the actions noted above, we encourage all of our customers to take advantage of mechanisms designed to protect their applications from incorrect data transmission. For all PUT requests, Amazon S3 computes its own MD5, stores it with the object, and then returns the computed MD5 in the ETag of the PUT response. By validating the ETag returned in the response, customers can verify that Amazon S3 received the correct bytes even if the Content-MD5 header wasn't specified in the PUT request. Because network transmission errors can occur at any point between the customer and Amazon S3, we recommend that all customers use the Content-MD5 header and/or validate the ETag returned on a PUT request to ensure that the object was correctly transmitted. This is a best practice that we'll emphasize more heavily in our documentation to help customers build applications that can handle this situation.
If you have specific questions or concerns about how your application might have been affected, please feel free to e-mail us at aws@amazon.com.
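The ETag validation recommended above might be sketched like this. It assumes, as the post describes, that the ETag is the hex MD5 of the object body wrapped in double quotes; that assumption does not hold for every kind of upload, so treat this as an illustration only. The function name `etag_matches` is hypothetical.

```python
import hashlib


def etag_matches(body: bytes, etag: str) -> bool:
    """Compare a returned ETag against the locally computed MD5 of the body sent.

    Assumes the ETag is the hex MD5 of the body, optionally wrapped in quotes.
    """
    expected = hashlib.md5(body).hexdigest()
    return etag.strip().strip('"').lower() == expected


sent = b"payload uploaded by the client"

# What the service would report if it received the bytes intact.
etag_intact = '"' + hashlib.md5(sent).hexdigest() + '"'
# What it would report if one byte was altered before it arrived.
etag_altered = '"' + hashlib.md5(b"payload uploaded by the cli3nt").hexdigest() + '"'

intact_ok = etag_matches(sent, etag_intact)
altered_ok = etag_matches(sent, etag_altered)
```

A client that sees a mismatch after a PUT can simply retry the upload, which covers the case where no Content-MD5 header was sent.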
|
|
Posts:
278
Registered:
4/10/06
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 6:58 PM PDT
in response to: Kathrin@AWS
|
|
|
Kathrin,
Was the load balancer corrupting data in both directions, or only in the incoming direction?
I ask this for two reasons:
1. If data is being corrupted in the outgoing direction, it would explain the SSL problems which were reported during the same time window.
2. If data is being corrupted in the outgoing direction, S3 GET responses would also have been corrupted; and these are unfortunately not protected by Content-MD5 headers.
|
|
Posts:
5,320
Registered:
3/19/07
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 7:19 PM PDT
in response to: Colin Percival
|
|
|
GETs can be protected by comparing the value of the ETag header to the MD5 of the returned object. If they are equal, there was no transmission error.
|
|
Posts:
278
Registered:
4/10/06
|
|
|
|
Re: S3 data corruption?
Posted:
Jun 23, 2008 7:50 PM PDT
in response to: Allen
|
|
|
Allen,
Complete GETs can, but not partial GETs.
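The distinction between complete and partial GETs can be illustrated with a small sketch. It assumes the ETag is the hex MD5 of the whole object, as discussed earlier in the thread; the byte range and the variable names are made up for the example.

```python
import hashlib

# Stand-in for a stored object and the ETag computed over all of its bytes.
stored = bytes(range(256)) * 4
etag = hashlib.md5(stored).hexdigest()

# Complete GET: the client can hash everything it received and compare.
full_body = stored
full_verifiable = hashlib.md5(full_body).hexdigest() == etag

# Partial GET: the client only holds a slice, so its hash can never
# equal the whole-object MD5, whether or not the slice is intact.
partial_body = stored[100:200]
partial_matches_etag = hashlib.md5(partial_body).hexdigest() == etag

# A corrupted slice is indistinguishable from a clean one by this test,
# because both simply fail to match the ETag.
corrupted_partial = bytearray(partial_body)
corrupted_partial[0] ^= 0xFF
corrupted_matches_etag = hashlib.md5(bytes(corrupted_partial)).hexdigest() == etag
```

In other words, a client doing range requests needs its own per-range or per-chunk checksums if it wants to detect corruption in transit.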