This is a Go implementation of the FastCDC algorithm for content-defined chunking (CDC). CDC is a technique used in data deduplication and storage systems to break data into variable-sized chunks based on their content rather than into fixed-size blocks. Because chunk boundaries depend on the content itself instead of absolute offsets, identical regions of data produce identical chunks even when surrounding data is inserted or removed, which improves deduplication efficiency.
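As a rough illustration of the idea (and not of this package's actual implementation), the following toy sketch cuts a byte stream with a Gear-style rolling hash: a boundary is declared whenever the hash matches a bit mask, so boundaries depend only on nearby content and survive insertions or deletions elsewhere in the stream. The gear table, mask and size limits are arbitrary example values, and the normalized chunking step that FastCDC adds on top of this scheme is omitted.

```go
package main

import (
	"fmt"
	"math/rand"
)

// chunkBoundaries is a toy content-defined chunker: NOT this package's
// FastCDC implementation, just an illustration of boundary detection.
// It returns the end offsets of the chunks found in data.
func chunkBoundaries(data []byte, minSize, maxSize int, mask uint64) []int {
	// Random per-byte values ("gear table") drive the rolling hash.
	rng := rand.New(rand.NewSource(1))
	var gear [256]uint64
	for i := range gear {
		gear[i] = rng.Uint64()
	}

	var cuts []int
	start := 0
	var hash uint64
	for i, b := range data {
		hash = (hash << 1) + gear[b]
		size := i - start + 1
		// Declare a boundary when the hash matches the mask (and the chunk
		// is not too small), or when the chunk has reached maxSize.
		if (size >= minSize && hash&mask == 0) || size >= maxSize {
			cuts = append(cuts, i+1)
			start = i + 1
			hash = 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data))
	}
	return cuts
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(42)).Read(data)
	// A mask with 13 bits set triggers a cut roughly every 8KB of content
	// (on top of the 2KB minimum), bounded by a 64KB maximum.
	cuts := chunkBoundaries(data, 2048, 65536, 0x1FFF)
	fmt.Println("number of chunks:", len(cuts))
}
```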
```
go get -u codeberg.org/mhofmann/fastcdc
```
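A usage sketch might look like the following. The identifiers `fastcdc.NewChunker`, `fastcdc.Options`, `Next` and the chunk fields are assumptions made for illustration only, as is the input file name; consult the package documentation for the actual API.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"codeberg.org/mhofmann/fastcdc"
)

func main() {
	f, err := os.Open("some-large-file.tar") // example input; any reader works
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Hypothetical API: these names and fields are assumptions for
	// illustration, not the package's documented interface.
	chunker, err := fastcdc.NewChunker(f, fastcdc.Options{
		MinSize: 2 * 1024,  // 2KB minimum chunk size
		AvgSize: 8 * 1024,  // 8KB target average chunk size
		MaxSize: 64 * 1024, // 64KB maximum chunk size
	})
	if err != nil {
		panic(err)
	}

	for {
		chunk, err := chunker.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("offset=%d length=%d\n", chunk.Offset, chunk.Length)
	}
}
```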
The implementation can be used with the same parameters as in the FastCDC paper (as in the sketch above) or with user-provided values for the minimum, average and maximum chunk sizes.

For comparison, the following table shows statistics about the number and size of chunks generated by chunking a set of test files with different parameters. The numbers in the chunker names refer to the parameters used; for example, "2k-8k-64k" is a chunker with a 2KB minSize, an 8KB avgSize and a 64KB maxSize. The test corpus had a total uncompressed size of 8182081670 bytes (~7.6GiB) and consisted of technical manuals and drawings in PDF format as well as tarballs containing the source code of five different versions of the Linux kernel.
Chunker | Num. of Chunks | Avg. chunk size (bytes) | Deduplicated size (bytes) | Deduplication ratio |
---|---|---|---|---|
reference | 480942 | 9831 | 4727992736 | 1.73 |
2k-16k-64k | 271140 | 19136 | 5188457451 | 1.58 |
2k-32k-64k | 150041 | 37254 | 5589600419 | 1.46 |
2k-64k-128k | 80946 | 73123 | 5919028334 | 1.38 |
4k-8k-64k | 471195 | 10107 | 4762223233 | 1.72 |
4k-16k-64k | 266596 | 19503 | 5199438463 | 1.57 |
4k-32k-64k | 148487 | 37669 | 5593355619 | 1.46 |
4k-64k-128k | 80332 | 73701 | 5920577574 | 1.38 |
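The deduplication ratio is the total corpus size divided by the deduplicated size (for the reference row, 8182081670 / 4727992736 ≈ 1.73). A minimal, package-independent sketch of how such statistics can be gathered, assuming the chunks are already available as byte slices, is:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupStats computes the total size, deduplicated size and deduplication
// ratio of a sequence of chunks, identifying duplicate chunks by SHA-256.
func dedupStats(chunks [][]byte) (total, deduped int64, ratio float64) {
	seen := make(map[[32]byte]struct{})
	for _, c := range chunks {
		total += int64(len(c))
		sum := sha256.Sum256(c)
		if _, ok := seen[sum]; !ok {
			seen[sum] = struct{}{}
			deduped += int64(len(c))
		}
	}
	if deduped > 0 {
		ratio = float64(total) / float64(deduped)
	}
	return total, deduped, ratio
}

func main() {
	chunks := [][]byte{[]byte("aaaa"), []byte("bbbb"), []byte("aaaa")}
	total, deduped, ratio := dedupStats(chunks)
	fmt.Printf("total=%d deduplicated=%d ratio=%.2f\n", total, deduped, ratio)
	// Output: total=12 deduplicated=8 ratio=1.50
}
```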
In terms of pure deduplication performance, the reference parameters (2k-8k-64k) yielded the best result on the test dataset. For storage systems where chunks are stored compressed, El-Shimi et al. suggest that using larger chunk sizes for CDC can improve the compression ratio and thereby reduce the effective storage size. Whether and to what extent this applies to the FastCDC algorithm remains to be tested.
BSD-2-Clause. See LICENSE for details.