Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify uploads using ETag #2585

Closed
hyperknot opened this issue May 6, 2017 · 4 comments
Closed

Verify uploads using ETag #2585

hyperknot opened this issue May 6, 2017 · 4 comments
Labels
closing-soon This issue will automatically close in 4 days unless further comments are made. feature-request A feature should be added or improved.

Comments

@hyperknot
Copy link

hyperknot commented May 6, 2017

Based on StackOverflow answers, I wrote a Python implementation which correctly calculates both multi-part and single-part file ETags.

def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []
    
    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))
    
    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())
    
    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))

I've tested it on different file sizes the the calculation is correct. Would it be possible to implement it to aws s3 cp and aws s3 sync functions as a verification step?

@JordonPhillips
Copy link
Member

The problem is that for downloads there's no way of knowing how big each chunk is so we can't reliably reproduce etags.

For uploads we send a Content-MD5 on each chunk that S3 verifies in addition to the protections TLS offers. Optionally, we can send a Content-SHA256 that S3 will also verify, but is less prone to collisions. You can set in the config file by using payload_signing_enabled.

@JordonPhillips JordonPhillips added closing-soon This issue will automatically close in 4 days unless further comments are made. feature-request A feature should be added or improved. labels May 8, 2017
@hyperknot
Copy link
Author

hyperknot commented May 8, 2017

My reasoning was that this way we can checksum the whole file. That'd make sure that the whole splitting and joining was working as intended.

You are right, for download, the only way to validate the file would be to start using some meta tag, like how rclone or s4cmd does it. I'd actually recommend this.

// Just a quick note: do I understand right that there are 3 levels of hash checking for each chunk:

  1. TLS level for HTTPS endpoints
  2. MD5 for each chunk, always present?
  3. SHA256 for each chunk, only present on when payload_signing_enabled is configured so?

The linked docs isn't clear to me about when is and when 2. and 3. is enabled.

@JordonPhillips
Copy link
Member

MD5 is always present unless you remove the handler that adds it. We don't explicitly document this as a thing you can do, but it is possible.

The default for SHA256 is a bit confusing. It is enabled by default for all non-streaming requests (streaming being uploading / downloading). For streaming requests it is disabled by default if all of the following conditions are met:

  • The request includes a content-MD5
  • The request is going over https
  • The client has no explicit configuration for SHA256

The payload_signing_enabled flag is much clearer: False = no SHA256, True = always SHA256.

I understand wanting a whole-file checksum, but I would not want to rely on undocumented behavior for content verification. The s3 docs only say that multipart upload etags are not MD5 digests. If S3 were to come out with an official algorithm I would support adding it in.

The next best option is to do custom hashing using an explicit mechanism and putting that hash in the object metadata. This is off the table for us because we have a policy of never implicitly adding data, especially when it would cost money. While metadata doesn't cost additional money (iirc), you are pretty heavily restricted on how much each object can have (2kb). Implicitly sending up our own metadata would further limit how much can otherwise be provided.

@hyperknot
Copy link
Author

Thanks for the great explanation about MD5 and SHA256, most of the information from your comment would be great in the official help!

For example that MD5 is always verified, chunk-by-chunk, by default in all scenarios.

OK, I agree about not verifying the whole file if the ETag calculation isn't official. I'd be surprised if it was changed after so many years, but I understand your point.

tomwassenberg added a commit to svsticky/sadserver that referenced this issue Aug 3, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Aug 6, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Aug 10, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Aug 10, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Aug 11, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Aug 11, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Sep 11, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Sep 28, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Oct 26, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Oct 31, 2017
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Feb 9, 2018
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a SHA256
checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Feb 9, 2018
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a
SHA256 checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Feb 19, 2018
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a
SHA256 checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Feb 19, 2018
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a
SHA256 checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue May 8, 2018
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a
SHA256 checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
tomwassenberg added a commit to svsticky/sadserver that referenced this issue Jun 14, 2018
Apparently the AWS CLI already validates MD5 checksums when down-/uploading
files. By adding a flag in its config it does this explicitly, using a
SHA256 checksum.
Source: aws/aws-cli#2585 (comment)

This makes our own implementation of hashing superfluous.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
closing-soon This issue will automatically close in 4 days unless further comments are made. feature-request A feature should be added or improved.
Projects
None yet
Development

No branches or pull requests

3 participants