S3SuperSync

Objective

Deduplicate chunks sent to S3 by copying parts that are already available in other objects.

NOTE: This is not an attempt to conserve storage usage on S3. In fact, it is the exact opposite: the more data you already have in S3, the less you have to transfer. This requires a versioned bucket.

How it works

In short, S3SuperSync performs a multipart upload to S3 while sending as few chunks as possible. It splits files into chunks and hashes each chunk with two different hashing algorithms. The parts are tracked in a DynamoDB table where the two hashes form a composite key. A location column in the DynamoDB table tells S3SuperSync where chunks identical to the chunk to be uploaded already exist, and it copies those chunks from S3 to S3 rather than sending them from your local site.

Chunking

S3 multipart uploads consist of parts no smaller than 5 MB, with the exception of the last part, and the number of parts in a multipart upload is limited to 10,000. S3SuperSync checks whether the destination file already exists; if it does, it attempts to reuse that object's part size. Otherwise this application uses an 8 MB default chunk size. If the size of your file divided by 8 MB comes to more than 10,000 parts, the chunk size is doubled until the part count fits within the limit.
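As a rough sketch, the chunk-size selection described above could look like the following (the constant and function names here are illustrative, not the project's actual code):

```python
# Illustrative sketch of the chunk-size selection: start at 8 MB and double
# until the file fits within S3's 10,000-part limit.
DEFAULT_CHUNK_SIZE = 8 * 1024 * 1024   # 8 MB default chunk size
MAX_PARTS = 10_000                     # S3 multipart upload part limit


def choose_chunk_size(file_size: int, chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    """Double the chunk size until the file needs at most 10,000 parts."""
    while file_size / chunk_size > MAX_PARTS:
        chunk_size *= 2
    return chunk_size
```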

Hashing

Each part is hashed with the sha3-512 and blake2b algorithms. Two algorithms are used to lower the possibility of hash collisions.
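Since the project is Python, the per-chunk hashing presumably uses the standard hashlib module; a minimal sketch, assuming hex digests are what get stored as the key values:

```python
# Sketch only: hash one chunk with both algorithms; the two digests together
# form the composite key looked up in DynamoDB.
import hashlib


def hash_chunk(chunk: bytes):
    sha3 = hashlib.sha3_512(chunk).hexdigest()
    blake2 = hashlib.blake2b(chunk).hexdigest()
    return sha3, blake2
```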

Metadata

A DynamoDB table name is provided by the user, and the table is created if it does not exist. If the table does exist, it is checked for the appropriate primary key attributes. The table name defaults to 'supersync'. The partition key is named sha3 and the sort key is named blake2. A third item attribute stores a list of locations that tell the system where a particular chunk can be found. After a chunk is hashed, the table is checked for a match: if a match is found, the multipart upload for that part is a copy from an object in S3; if the hashes are not found, the chunk is uploaded for that part in the multipart upload. Example item (a lookup sketch follows it below):

{
  "sha3": "HASH",
  "blake2": "RANGE",
  "locations": [
    {
      "bucket": "String",
      "key": "String",
      "part": "Int",
      "version": "String",
      "upload_id": "String",
      "content_range": "String"
    }
  ]
}
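The per-part decision can be pictured roughly as follows. This is a hypothetical sketch using boto3, not the project's actual code; it assumes `content_range` is stored in the `bytes=start-end` form that S3's `CopySourceRange` expects:

```python
# Hypothetical sketch: copy the part from an existing S3 object when the hash
# pair is already in the table, otherwise upload the chunk bytes.
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("supersync")  # default table name


def upload_or_copy_part(bucket, key, upload_id, part_number, chunk, sha3, blake2):
    item = table.get_item(Key={"sha3": sha3, "blake2": blake2}).get("Item")
    if item and item.get("locations"):
        loc = item["locations"][0]
        return s3.upload_part_copy(
            Bucket=bucket, Key=key, UploadId=upload_id, PartNumber=part_number,
            CopySource={"Bucket": loc["bucket"], "Key": loc["key"],
                        "VersionId": loc["version"]},
            CopySourceRange=loc["content_range"],  # assumed "bytes=start-end"
        )
    return s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                          PartNumber=part_number, Body=chunk)
```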

Concurrency

S3SuperSync defaults to 10 concurrent processes.
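A sketch of what this might look like with a standard multiprocessing pool (illustrative only; `process_part` stands in for whatever per-part work the tool actually does):

```python
# Illustrative: fan part-level work out to a pool of worker processes.
from multiprocessing import Pool

DEFAULT_CONCURRENCY = 10


def process_parts(parts, process_part, concurrency=DEFAULT_CONCURRENCY):
    with Pool(processes=concurrency) as pool:
        return pool.map(process_part, parts)
```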

Usage

usage: supersync.py [-h] [-d] [-p PROFILE] [-c CONCURRENCY] [-t TABLE_NAME]
                    local dest

Sync file changes to s3

positional arguments:
  local                 Local file to be synced
  dest                  Destination of file to be synced

optional arguments:
  -h, --help            show this help message and exit
  -d, --debug           Turn on log level debug
  -p PROFILE, --profile PROFILE
                        AWS Profile to use.
  -c CONCURRENCY, --concurrency CONCURRENCY
                        Number of processes to use.
  -t TABLE_NAME, --table_name TABLE_NAME
                        DynamoDB table name to use.
  -s {default,fast}, --speed {default,fast}
                        Hash speed option. Warning: fast hash may result in
                        collisions and data corruption. Automatically adds speed
                        type to table name because the attributes differ.
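An example invocation might look like the following (the profile name, file path, and destination are placeholders, and the `s3://bucket/key` form of the destination is an assumption):

```
python supersync.py --profile default --concurrency 10 --table_name supersync /data/disk.vmdk s3://my-bucket/disk.vmdk
```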

Fast Hash Warning

I've implemented what looks like a faster hashing algorithm for cases where hashing is the bottleneck. This hash is xxHash, which is not a cryptographic hash and is therefore more prone to collisions. Also, when using fast hash only one hashing algorithm is used, which further increases the likelihood of collisions. When using this option it is highly advised to use a separate table_name per item uploaded.
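For reference, the fast path presumably boils down to something like this (a sketch using the xxhash package; the specific xxh64 variant is an assumption):

```python
# Sketch only: a single non-cryptographic hash per chunk in fast mode.
import xxhash


def fast_hash_chunk(chunk: bytes) -> str:
    return xxhash.xxh64(chunk).hexdigest()  # assumed variant; not verified
```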

ToDo

  1. File locking for writes
  2. Info output
  3. Abort multipart upload on error
  4. Clean Ctrl-C handling
  5. Walk directory structures
  6. Support for files less than 5/8MB
  7. Encryption Support
  8. Create indexer that will scan S3 and index what chunks already exist
  9. Stats about how many chunks are deduplicated per item
  10. Status
  11. Retry decorator

Stats

31 GB VMDK upload

First Upload

real 192m55.349s user 20m49.460s sys 3m5.912s

No-change upload @ 50 read/write capacity with ~10 read/write used, 10 concurrency

real 17m5.263s user 17m6.136s sys 0m58.500s

Small-change upload @ the same settings as the prior run

real 25m23.252s user 16m7.296s sys 1m6.892s