KwikCluster

This package implements KwikCluster [1] using MinHash [2] as the match function. It clusters text documents so that similar documents share a cluster and dissimilar documents fall in different clusters, where similarity is measured by the Jaccard coefficient. Running time scales linearly with the number of documents, which makes it practical to cluster very large corpora.
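To make the similarity measure concrete, here is a minimal sketch of the exact Jaccard coefficient and a MinHash estimate of it. The function names and the SHA-1 based hash family are assumptions made for illustration; they are not taken from MinHash.py, which may hash and tokenize differently.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard coefficient: |A & B| / |A | B| for two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """Illustrative seeded hash: first 8 bytes of SHA-1 over 'seed:token'."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, number_hash_functions=200):
    """Keep the minimum hash value of a non-empty token set per hash function."""
    tokens = set(tokens)
    return [min(_hash(t, seed) for t in tokens)
            for seed in range(number_hash_functions)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard coefficient."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents would then count as a match when the estimate meets the chosen threshold, for example `estimate_jaccard(sig_a, sig_b) >= 0.9`. More hash functions reduce the variance of the estimate at the cost of more hashing work.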

Basic usage with flat text files:

usage: KwikCluster.py [-h] [--threshold THRESHOLD]
                      [--number-hash-functions NUMBER_HASH_FUNCTIONS]
                      [--number-processes NUMBER_PROCESSES]
                      [--max-lines MAX_LINES]
                      input_file_path output_file_path

positional arguments:
  input_file_path       Path to text file to cluster. One document per line.
  output_file_path      Path to output cluster results. One cluster per line,
                        with space-delimited cluster members referenced
                        according to zero-indexed line number in
                        input-file-path.

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD
                        Jaccard score cutoff threshold for a match between two
                        documents. (default: 0.9)
  --number-hash-functions NUMBER_HASH_FUNCTIONS
                        Number of hash functions used by MinHash.
                        (default: 200)
  --number-processes NUMBER_PROCESSES
                        Number of parallel processes for hashing documents.
                        (default: 1)
  --max-lines MAX_LINES
                        Maximum number of lines to read from input-file-path.
                        (default: inf)
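The output format described above (one cluster per line, space-delimited zero-indexed line numbers) can be mapped back to the original documents with a few lines of Python. This is a sketch only; `parse_clusters` and `group_documents` are hypothetical helper names and are not part of this package.

```python
def parse_clusters(lines):
    """Turn output lines such as '0 3 7' into lists of integer document indices."""
    clusters = []
    for line in lines:
        members = [int(index) for index in line.split()]
        if members:  # skip blank lines
            clusters.append(members)
    return clusters


def group_documents(documents, clusters):
    """Replace each zero-indexed member with the document on that input line."""
    return [[documents[index] for index in members] for members in clusters]
```

With real files, the documents would come from `[line.rstrip("\n") for line in open(input_file_path)]` and the clusters from `parse_clusters(open(output_file_path))`.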

More than basic usage

For custom document feeding and match functions, see the simple tutorial in example.py.
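As background for what a match function does, the following is a minimal sketch of the pivot procedure from [1] with a pluggable match function. It is not the interface used in example.py: the names `kwik_cluster` and `jaccard_match` and the whitespace tokenization are assumptions for illustration. This naive version compares each pivot against every remaining item, so it does not have the linear scaling described above.

```python
import random


def jaccard_match(doc_a, doc_b, threshold=0.9):
    """Illustrative match function: exact Jaccard on whitespace tokens."""
    a, b = set(doc_a.split()), set(doc_b.split())
    if not a and not b:
        return True
    return len(a & b) / len(a | b) >= threshold


def kwik_cluster(items, match, seed=0):
    """Visit items in random order; each unclustered item becomes a pivot
    and absorbs every unclustered item that matches it."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    clustered = set()
    clusters = []
    for pivot in order:
        if pivot in clustered:
            continue
        cluster = [pivot]
        clustered.add(pivot)
        for other in order:
            if other not in clustered and match(items[pivot], items[other]):
                cluster.append(other)
                clustered.add(other)
        clusters.append(sorted(cluster))
    return clusters
```

Any function that takes two items and returns a boolean can stand in for `jaccard_match`, for instance `kwik_cluster(documents, lambda a, b: a.lower() == b.lower())`.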

Consensus clustering

This package also implements consensus clustering, which combines multiple clusterings of the same items into a single clustering according to the objective in [1]. For example usage, see example_consensus.py.
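One way to picture consensus clustering is to run the same pivot procedure with a match rule based on how often the input clusterings agree. The sketch below assumes each clustering is a list of labels (one label per item) and treats two items as matching when more than half of the clusterings place them together. Both the input representation and the strict-majority rule are assumptions for illustration, not the code in example_consensus.py.

```python
import random


def together_fraction(clusterings, i, j):
    """Fraction of input clusterings that give items i and j the same label."""
    together = sum(labels[i] == labels[j] for labels in clusterings)
    return together / len(clusterings)


def consensus_cluster(clusterings, seed=0):
    """Pivot-style consensus: random pivots absorb majority-agreeing items."""
    number_items = len(clusterings[0])
    order = list(range(number_items))
    random.Random(seed).shuffle(order)
    assigned = set()
    consensus = []
    for pivot in order:
        if pivot in assigned:
            continue
        cluster = [pivot]
        assigned.add(pivot)
        for other in order:
            if other in assigned:
                continue
            # Assumed rule: strict majority of clusterings must agree.
            if together_fraction(clusterings, pivot, other) > 0.5:
                cluster.append(other)
                assigned.add(other)
        consensus.append(sorted(cluster))
    return sorted(consensus)
```

For example, `consensus_cluster([[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]])` groups items 0 and 1 together and items 2 and 3 together, because each of those pairs shares a label in at least two of the three clusterings.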

References:

[1] Ailon, N., Charikar, M., & Newman, A. (2008). Aggregating inconsistent information. Journal of the ACM, 55(5), 1–27. http://doi.org/10.1145/1411509.1411513

[2] Broder, A. Z. (1997). On the resemblance and containment of documents. Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), 1–9. http://doi.org/10.1109/SEQUEN.1997.666900
