This is a Go implementation of the FastCDC algorithm for content-defined chunking (CDC). CDC is a technique used in data deduplication and storage systems to break data into variable-sized chunks based on their content rather than into fixed-size blocks. Because chunk boundaries depend on the content itself instead of absolute offsets, identical regions of data produce identical chunks even when surrounding data is inserted or removed, which improves deduplication efficiency.
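As a rough illustration of the idea (and not of this package's actual implementation), the following toy sketch cuts a byte stream with a Gear-style rolling hash: a boundary is declared whenever the hash matches a bit mask, so boundaries depend only on nearby content and survive insertions or deletions elsewhere in the stream. The gear table, mask and size limits are arbitrary example values, and the normalized chunking step that FastCDC adds on top of this scheme is omitted.

```go
package main

import (
	"fmt"
	"math/rand"
)

// chunkBoundaries is a toy content-defined chunker: NOT this package's
// FastCDC implementation, just an illustration of boundary detection.
// It returns the end offsets of the chunks found in data.
func chunkBoundaries(data []byte, minSize, maxSize int, mask uint64) []int {
	// Random per-byte values ("gear table") drive the rolling hash.
	rng := rand.New(rand.NewSource(1))
	var gear [256]uint64
	for i := range gear {
		gear[i] = rng.Uint64()
	}

	var cuts []int
	start := 0
	var hash uint64
	for i, b := range data {
		hash = (hash << 1) + gear[b]
		size := i - start + 1
		// Declare a boundary when the hash matches the mask (and the chunk
		// is not too small), or when the chunk has reached maxSize.
		if (size >= minSize && hash&mask == 0) || size >= maxSize {
			cuts = append(cuts, i+1)
			start = i + 1
			hash = 0
		}
	}
	if start < len(data) {
		cuts = append(cuts, len(data))
	}
	return cuts
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(42)).Read(data)
	// A mask with 13 bits set triggers a cut roughly every 8KB of content
	// (on top of the 2KB minimum), bounded by a 64KB maximum.
	cuts := chunkBoundaries(data, 2048, 65536, 0x1FFF)
	fmt.Println("number of chunks:", len(cuts))
}
```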
```
go get -u codeberg.org/mhofmann/fastcdc
```
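A usage sketch might look like the following. The identifiers `fastcdc.NewChunker`, `fastcdc.Options`, `Next` and the chunk fields are assumptions made for illustration only, as is the input file name; consult the package documentation for the actual API.

```go
package main

import (
	"fmt"
	"io"
	"os"

	"codeberg.org/mhofmann/fastcdc"
)

func main() {
	f, err := os.Open("some-large-file.tar") // example input; any reader works
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Hypothetical API: these names and fields are assumptions for
	// illustration, not the package's documented interface.
	chunker, err := fastcdc.NewChunker(f, fastcdc.Options{
		MinSize: 2 * 1024,  // 2KB minimum chunk size
		AvgSize: 8 * 1024,  // 8KB target average chunk size
		MaxSize: 64 * 1024, // 64KB maximum chunk size
	})
	if err != nil {
		panic(err)
	}

	for {
		chunk, err := chunker.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("offset=%d length=%d\n", chunk.Offset, chunk.Length)
	}
}
```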
The implementation can be used with the same parameters as in the FastCDC paper (as in the sketch above) or with user-provided values for the minimum, average and maximum chunk sizes.

For comparison, the following table shows statistics about the number and size of chunks generated by chunking a set of test files with different parameters. The numbers in the chunker names refer to the parameters used; for example, "2k-8k-64k" is a chunker with a 2KB minSize, an 8KB avgSize and a 64KB maxSize. The test corpus had a total uncompressed size of 8182081670 bytes (~7.6GiB) and consisted of technical manuals and drawings in PDF format as well as tarballs containing the source code of five different versions of the Linux kernel.
Chunker | Num. of Chunks | Avg. chunk size (bytes) | Deduplicated size (bytes) | Deduplication ratio |
---|---|---|---|---|
reference | 480942 | 9831 | 4727992736 | 1.73 |
2k-16k-64k | 271140 | 19136 | 5188457451 | 1.58 |
2k-32k-64k | 150041 | 37254 | 5589600419 | 1.46 |
2k-64k-128k | 80946 | 73123 | 5919028334 | 1.38 |
4k-8k-64k | 471195 | 10107 | 4762223233 | 1.72 |
4k-16k-64k | 266596 | 19503 | 5199438463 | 1.57 |
4k-32k-64k | 148487 | 37669 | 5593355619 | 1.46 |
4k-64k-128k | 80332 | 73701 | 5920577574 | 1.38 |
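The deduplication ratio is the total corpus size divided by the deduplicated size (for the reference row, 8182081670 / 4727992736 ≈ 1.73). A minimal, package-independent sketch of how such statistics can be gathered, assuming the chunks are already available as byte slices, is:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupStats computes the total size, deduplicated size and deduplication
// ratio of a sequence of chunks, identifying duplicate chunks by SHA-256.
func dedupStats(chunks [][]byte) (total, deduped int64, ratio float64) {
	seen := make(map[[32]byte]struct{})
	for _, c := range chunks {
		total += int64(len(c))
		sum := sha256.Sum256(c)
		if _, ok := seen[sum]; !ok {
			seen[sum] = struct{}{}
			deduped += int64(len(c))
		}
	}
	if deduped > 0 {
		ratio = float64(total) / float64(deduped)
	}
	return total, deduped, ratio
}

func main() {
	chunks := [][]byte{[]byte("aaaa"), []byte("bbbb"), []byte("aaaa")}
	total, deduped, ratio := dedupStats(chunks)
	fmt.Printf("total=%d deduplicated=%d ratio=%.2f\n", total, deduped, ratio)
	// Output: total=12 deduplicated=8 ratio=1.50
}
```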
In terms of pure deduplication performance, the reference parameters (2k-8k-64k) yielded the best result on the test dataset. For storage systems where chunks are stored compressed, El-Shimi et al. suggest that using larger chunk sizes for CDC can improve the compression ratio and thereby reduce the effective storage size. Whether and to what extent this applies to the FastCDC algorithm remains to be tested.
BSD-2-Clause. See LICENSE for details.