`find-duplicates`

find-duplicates finds duplicate files quickly based on the xxHashes of their contents.

Installation

$ go install github.com/twpayne/find-duplicates@latest

Example

$ find-duplicates
{
  "cdb8979062cbdf9c169563ccc54704f0": [
    ".git/refs/remotes/origin/main",
    ".git/refs/heads/main",
    ".git/ORIG_HEAD"
  ]
}

Usage

find-duplicates [options] [paths...]

paths are directories to walk recursively. If no paths are given then the current directory is walked.

The output is a JSON object with one property per observed xxHash. Each value is an array of the filenames whose contents have that xxHash.
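Because the output is a plain JSON object of string arrays, it can be consumed from other programs. The following is a minimal sketch of decoding it in Go. The function name `parseDuplicates` is hypothetical and not part of find-duplicates, and the sample input is copied from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// parseDuplicates is a hypothetical helper (not part of find-duplicates).
// It decodes a JSON object that maps each hash to the filenames sharing it.
func parseDuplicates(data []byte) (map[string][]string, error) {
	var groups map[string][]string
	if err := json.Unmarshal(data, &groups); err != nil {
		return nil, err
	}
	return groups, nil
}

func main() {
	// Sample output copied from the example earlier in this README.
	const sample = `{
  "cdb8979062cbdf9c169563ccc54704f0": [
    ".git/refs/remotes/origin/main",
    ".git/refs/heads/main",
    ".git/ORIG_HEAD"
  ]
}`
	groups, err := parseDuplicates([]byte(sample))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Sort the hashes so the report order is stable between runs.
	hashes := make([]string, 0, len(groups))
	for hash := range groups {
		hashes = append(hashes, hash)
	}
	sort.Strings(hashes)
	for _, hash := range hashes {
		fmt.Printf("%s: %d files\n", hash, len(groups[hash]))
	}
}
```

To process real output, the sample constant could be replaced by bytes read from a file written with `--output`.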

Options are:

--keep-going or -k keeps going after errors.

--output=<file> or -o <file> writes output to <file>. The default is stdout.

--threshold=<int> or -t <int> sets the minimum number of files with the same content to be considered duplicates. The default is 2.

--statistics or -s prints statistics to stderr.

How does `find-duplicates` work?

find-duplicates aims to be as fast as possible by doing as little work as possible, using each CPU core efficiently, and using all the CPU cores on your machine.

It consists of multiple components:

Firstly, it walks the filesystem concurrently, spawning one goroutine per subdirectory.
Secondly, with the observation that files can only be duplicates if they are the same size, it only reads file contents once it has found more than one file with the same size. This significantly reduces both the number of syscalls and the amount of data read. Furthermore, as the shortest possible runtime is the time taken to read the largest file, larger files are read earlier.
Thirdly, file contents are hashed with a fast, non-cryptographic hash.

All components run concurrently.

Media

"Finding duplicate files unbelievably fast: a small CLI project using Go's concurrency" talk from Zürich Gophers.

License

MIT
