
S3 sync and cp commands should have a flag to show the local (and remote) file hashes #6631

Closed
ITmaze opened this issue Dec 30, 2021 · 10 comments


ITmaze commented Dec 30, 2021

Is your feature request related to a problem? Please describe.
When you use aws s3 sync to copy a local directory to S3, the CLI calculates each object's hash locally before sending it to S3 along with the object, either as a single object or as a multipart upload. After the upload has succeeded, the hash is stored as the ETag on the object. You can retrieve the ETag by adding the --debug flag and manually extracting it from the XML, but you cannot get the CLI to output the hash for the local file.

Describe the solution you'd like
Ultimately it would be extremely helpful if you could compare the hash of a local file with that of the remote object using the same method the AWS CLI itself uses. If the two don't match, you could then remove the object from S3 and try again.

Describe alternatives you've considered
Right now all you can do is attempt to calculate the hash locally. There are a few scripts that purport to calculate the value correctly, for example the one for OS X (with a Linux version below it) at https://gist.github.com/emersonf/7413337, which appears to work for some files but not for others. It's unclear whether this is due to a failed upload or a failed hash calculation. The hashes that differ are for some, but not all, files that are 1.6 MB and smaller.
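For reference, a minimal sketch of what those scripts attempt to compute, in Python using only hashlib; the 8 MB part size is an assumption and has to match whatever part size was actually used for the upload:

```python
import hashlib


def local_etag(path, part_size=8 * 1024 * 1024):
    """Approximate the ETag the CLI's default settings would produce.

    Small files uploaded with a single PutObject get a plain MD5 ETag;
    multipart uploads get MD5(concatenation of per-part MD5 digests)
    plus a "-<part count>" suffix. The 8 MB part size is an assumption
    and must match the part size actually used for the upload.
    """
    part_digests = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(part_size), b""):
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) <= 1:
        # Single-part upload: the ETag is just the hex MD5 of the object.
        only = part_digests[0] if part_digests else hashlib.md5(b"").digest()
        return only.hex()
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


print(local_etag("backup.tar"))  # placeholder file name
```

The value this returns can then be compared, by hand, against the ETag reported for the uploaded object.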

Additional context
I've uploaded 2 TB of data in files as large as 9.5 GB; the upload froze several times over the three days it took. Restarting the process multiple times eventually got it to finish, but I'm left wondering whether the upload is actually complete and correct.

ITmaze added the feature-request and needs-triage labels on Dec 30, 2021
tim-finnigan added the s3 label on Jan 3, 2022
tim-finnigan self-assigned this on Jan 3, 2022
tim-finnigan (Contributor) commented:

Hi @ITmaze, thanks for reaching out. Have you looked into the S3 documentation on using the Content-MD5 header?

This premium support article gives a good high-level summary of using the Content-MD5 header to verify the integrity of an object uploaded to S3: https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/

And this CLI documentation goes into further detail: https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html#cli-aws-help-s3-faq
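As an illustration, for a single PutObject that check amounts to sending the base64-encoded MD5 digest of the payload alongside the body. A minimal boto3 sketch, with placeholder bucket and key names:

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")

with open("backup.tar", "rb") as f:
    body = f.read()

# Content-MD5 is the base64-encoded binary MD5 digest of the payload.
content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

# S3 recomputes the MD5 of the received bytes and rejects the request
# with a BadDigest error if it does not match the supplied value.
s3.put_object(
    Bucket="my-bucket",        # placeholder
    Key="backups/backup.tar",  # placeholder
    Body=body,
    ContentMD5=content_md5,
)
```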

There was also some discussion of this topic in another issue: #2585

tim-finnigan added the response-requested label and removed the needs-triage label on Jan 3, 2022
ITmaze (Author) commented Jan 4, 2022

Hi @tim-finnigan, thank you. I have seen those pages, but even using the information in them gives me spurious results, to the point that I raised a case with AWS support, whose engineers also pointed me at the same documents and essentially told me to RTFM.

I created this feature request when it occurred to me that all of this edge-case detection and multi-step hash construction is unnecessary, because the CLI already does the correct hash calculation for each case; that is how it verifies that the upload was complete.

I'm just asking for a way to surface both sides of that process, the hash of the source file and the hash of the target object, so I can check whether they are the same without needing to upload another 2 TB of data.

The github-actions bot removed the response-requested label on Jan 4, 2022
tim-finnigan (Contributor) commented:

Hi @ITmaze, thanks for following up. I understand your point about wanting to ensure that your upload was successful. But the documentation mentioned earlier notes that the CLI will retry validating uploads up to 5 times and then exit if unsuccessful.

And in regard to the request to provide hash data, this was addressed in this comment from #2585:

The next best option is to do custom hashing using an explicit mechanism and putting that hash in the object metadata. This is off the table for us because we have a policy of never implicitly adding data, especially when it would cost money. While metadata doesn't cost additional money (iirc), you are pretty heavily restricted on how much each object can have (2kb). Implicitly sending up our own metadata would further limit how much can otherwise be provided.
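For context, nothing stops a user from doing that explicit hashing themselves today; a rough boto3 sketch, where the metadata key is purely illustrative rather than any CLI convention:

```python
import hashlib

import boto3

s3 = boto3.client("s3")

with open("backup.tar", "rb") as f:
    body = f.read()

# Store a full-object hash as user-defined metadata so it can be read
# back later with HeadObject, independent of how the ETag was formed.
s3.put_object(
    Bucket="my-bucket",        # placeholder
    Key="backups/backup.tar",  # placeholder
    Body=body,
    Metadata={"sha256": hashlib.sha256(body).hexdigest()},  # illustrative key
)
```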

But there is an older open feature request that mentions these topics: #599

I’m going to close this because of the overlap with #599, but please leave a comment there if you want to mention anything else regarding this request. You could also consider posting in the new re:Post forums to get more input from the S3 community.

github-actions bot commented Jan 4, 2022

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue, feel free to do so.

ITmaze (Author) commented Jan 4, 2022

Hi @tim-finnigan, we seem to have misunderstood each other.

I'm not talking about adding any hashes anywhere. The ETag already contains the information we're looking for. It's visible in the raw XML when you use the --debug flag with an aws s3 sync command.

What I'm asking for is to DISPLAY the hashes for both source and target, since they already exist within the code and are actively used to verify that the upload was completed.

What I'm trying to determine is whether, after the upload has completed, the local file system is the same as the S3 bucket.

You assert that it retries validation up to 5 times and then exits. I'm trying to determine whether the files I'm looking at locally are the same as those that are stored remotely, using the calculation that's already built into the CLI.

We're literally talking about adding a flag and two printf statements.

tim-finnigan (Contributor) commented:

Hi @ITmaze, thanks for clarifying that and sorry if I misunderstood. I’m saying that, based on the documentation, you can assume successfully uploaded files match your local files.

Have you looked into using s3api to get the ETag? Here is an example: https://docs.aws.amazon.com/cli/latest/reference/s3api/head-object.html#examples
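Along those lines, a minimal boto3 sketch of reading the stored ETag back without downloading the object (placeholder bucket and key; the linked page shows the equivalent aws s3api head-object command):

```python
import boto3

s3 = boto3.client("s3")

# HeadObject returns the object's metadata, including the ETag,
# without transferring the object body.
resp = s3.head_object(
    Bucket="my-bucket",        # placeholder
    Key="backups/backup.tar",  # placeholder
)

# Multipart ETags look like "<hex digest>-<part count>".
print(resp["ETag"].strip('"'))
```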

ITmaze (Author) commented Jan 5, 2022

Hi @tim-finnigan, at the time of upload, sure, perhaps.

What about an hour later? How do I check the hash of the local file against that of the uploaded one, without writing a whole process that does exactly what the CLI already does? Not only that, if the CLI behavior changes, for example if the default part size changes, any code I write has to accommodate that.

On top of that, from a resource perspective, I've now wasted a week on this matter. You've spent time on it, the AWS support engineers have spent time on it, and between us we've collectively spent several thousand dollars on a problem that recurs for anyone doing more than casual uploading of objects to S3.

Sorry to be blunt, but given the numerous posts on this matter going back YEARS, this feature request is, in my professional opinion, a no-brainer, and I say that with 40 years of software development experience.

I'm not sure what the push back is being driven by, but it doesn't make any sense to me in any way.

tim-finnigan (Contributor) commented:

Hi @ITmaze, sorry to hear about your frustration. We can discuss this more to try to get on the same page.

I want to highlight this ETag documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html, specifically:

“The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted...“

So the ETag can’t be considered a reliable way to verify the integrity of uploads; that is what the Content-MD5 header is for. I think what you’re asking for may be more closely aligned with this open feature request: aws/aws-sdk#89

ITmaze (Author) commented Jan 5, 2022

Hi @tim-finnigan, that's requesting the exact same thing, but in the API.

I'm pointing out that all this has been done INSIDE the CLI ALREADY!

All that has to happen is to print it out.

tim-finnigan (Contributor) commented:

Hi @ITmaze, just wanted to help clarify a few points. Multipart uploads are generally used for s3 sync and cp. The default chunk size is 8 MB and the minimum is 5 MB. (source)

The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads (standard uploads use the PutObject API and multipart uploads use the UploadPart API). And that is what is recommended in the API documentation:

To ensure that data is not corrupted when traversing the network, specify the Content-MD5 header in the upload part request. Amazon S3 checks the part data against the provided MD5 value. If they do not match, Amazon S3 returns an error.

But the overall validation happens server-side, using a calculation involving the combined per-part hashes; the CLI does not verify the whole, assembled file. Generally speaking, the CLI isn't doing anything special here, just using what S3 provides.

(For more information on the multipart upload process please refer to this documentation.)

And another thing worth highlighting from the ETag description mentioned before is:

Objects created by either the Multipart Upload or Part Copy operation have ETags that are not MD5 digests, regardless of the method of encryption.
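Putting the pieces in this thread together, a rough end-to-end comparison could look like the sketch below. It assumes boto3, the CLI's default 8 MB part size, and objects whose ETag is MD5-based (i.e. not SSE-KMS or SSE-C encrypted); all names are placeholders.

```python
import hashlib

import boto3

PART_SIZE = 8 * 1024 * 1024  # assumption: must match the part size used for the upload


def local_multipart_etag(path, part_size=PART_SIZE):
    """MD5 of the concatenated per-part MD5 digests, plus '-<part count>'."""
    digests = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(part_size), b""):
            digests.append(hashlib.md5(chunk).digest())
    return f"{hashlib.md5(b''.join(digests)).hexdigest()}-{len(digests)}"


s3 = boto3.client("s3")
remote = s3.head_object(Bucket="my-bucket", Key="backups/backup.tar")  # placeholders
remote_etag = remote["ETag"].strip('"')

if "-" in remote_etag:
    # Multipart-style ETag: compare against the locally recomputed value.
    matches = remote_etag == local_multipart_etag("backup.tar")
else:
    # Single-part upload: the ETag is a plain MD5 of the object.
    with open("backup.tar", "rb") as f:
        matches = remote_etag == hashlib.md5(f.read()).hexdigest()

print("match" if matches else "mismatch")
```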
