mbarnes1/KwikCluster

KwikCluster

This package implements the KwikCluster algorithm [1] with MinHash [2] as the match function. It clusters text documents so that similar documents share a cluster and dissimilar documents fall in different clusters, where similarity is measured by the Jaccard coefficient. Running time scales linearly with the number of documents, which makes it practical to cluster massive corpora.
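
To make the idea concrete, here is a minimal sketch of MinHash-based matching combined with pivot-style clustering in the spirit of [1]. It is not this package's implementation: the function names, the whitespace tokenization, and the affine hash family are assumptions chosen for illustration. The sketch also compares each pivot against every remaining document, so it does not reproduce the linear scaling described above.

```python
import random
import zlib

_PRIME = (1 << 61) - 1  # large Mersenne prime for the affine hash family


def jaccard(a, b):
    """Exact Jaccard coefficient |A & B| / |A | B| of two token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(tokens, num_hashes=200, seed=0):
    """One minimum per hash function; the same seed must be used for all documents."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, _PRIME), rng.randrange(_PRIME))
              for _ in range(num_hashes)]
    hashed = [zlib.crc32(t.encode("utf-8")) for t in set(tokens)]
    if not hashed:  # empty document: use a sentinel signature
        return [_PRIME] * num_hashes
    return [min((a * h + b) % _PRIME for h in hashed) for a, b in coeffs]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of the Jaccard score."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def kwik_cluster(signatures, threshold=0.9, seed=0):
    """Pick a random pivot, group every unclustered match with it, and repeat."""
    rng = random.Random(seed)
    order = list(range(len(signatures)))
    rng.shuffle(order)
    unclustered = set(order)
    clusters = []
    for pivot in order:
        if pivot not in unclustered:
            continue  # already absorbed by an earlier pivot
        cluster = sorted(
            i for i in unclustered
            if estimated_jaccard(signatures[pivot], signatures[i]) >= threshold
        )
        unclustered.difference_update(cluster)
        clusters.append(cluster)
    return clusters


docs = ["a b c d", "a b c d", "x y z"]
sigs = [minhash_signature(d.split(), num_hashes=50) for d in docs]
clusters = kwik_cluster(sigs, threshold=0.9)
```

In this toy input the two identical documents end up in one cluster and the third document in its own. Using more hash functions tightens the Jaccard estimate at the cost of more hashing work per document.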

Basic usage with flat text files:

usage: KwikCluster.py [-h] [--threshold THRESHOLD]
                      [--number-hash-functions NUMBER_HASH_FUNCTIONS]
                      [--number-processes NUMBER_PROCESSES]
                      [--max-lines MAX_LINES]
                      input_file_path output_file_path

positional arguments:
  input_file_path       Path to text file to cluster. One document per line.
  output_file_path      Path to output cluster results. One cluster per line,
                        with space-delimited cluster members referenced
                        according to zero-indexed line number in input-file-
                        path.

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD
                        Jaccard score cutoff threshold for a match between two
                        documents. (default: 0.9)
  --number-hash-functions NUMBER_HASH_FUNCTIONS
                        Number of hash functions used to compute MinHash
                        signatures. (default: 200)
  --number-processes NUMBER_PROCESSES
                        Number of parallel processes for hashing documents.
                        (default: 1)
  --max-lines MAX_LINES
                        Maximum number of lines to read from input-file-path.
                        (default: inf)
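
The output format described above (one cluster per line, space-delimited zero-indexed line numbers) can be read back with a few lines of Python. The following is a minimal sketch; the helper names `parse_clusters` and `clusters_to_documents` are hypothetical and not part of this package.

```python
def parse_clusters(lines):
    """Parse cluster lines into lists of zero-indexed document numbers.

    `lines` can be any iterable of strings, including an open file object.
    Blank lines are skipped.
    """
    return [[int(token) for token in line.split()]
            for line in lines if line.strip()]


def clusters_to_documents(documents, clusters):
    """Replace each document number with the corresponding input line."""
    return [[documents[i] for i in cluster] for cluster in clusters]


# Stand-ins for the contents of the input and output files.
input_lines = ["the quick brown fox", "the quick brown fox!", "lorem ipsum"]
output_lines = ["0 1\n", "2\n"]

clusters = parse_clusters(output_lines)
grouped = clusters_to_documents(input_lines, clusters)
```

With real files, pass `open(output_file_path)` to `parse_clusters` and a list of the input file's lines (newline-stripped) to `clusters_to_documents`.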

More than basic usage

For custom document feeding and match functions, see the simple tutorial in example.py.

Consensus clustering

This package also implements consensus clustering, which combines multiple clusterings into a single clustering according to the objective in [1]. For an example usage, see example_consensus.py.
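
As a rough illustration of the idea, the sketch below builds a consensus by running the same pivot procedure with a majority-vote match rule: two items match when more than half of the input clusterings place them together. This is an assumption-laden sketch in the spirit of [1], not the code in example_consensus.py. The label-list input format and the function names are hypothetical.

```python
import random


def co_clustered_fraction(labelings, i, j):
    """Fraction of input clusterings that put items i and j in the same cluster."""
    return sum(labels[i] == labels[j] for labels in labelings) / len(labelings)


def consensus_cluster(labelings, seed=0):
    """Combine clusterings, each given as a list of cluster labels per item.

    A random pivot is chosen, every unclustered item that a strict majority
    of the input clusterings groups with the pivot joins its cluster, and the
    process repeats on the remaining items.
    """
    n_items = len(labelings[0])
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    unclustered = set(order)
    consensus = []
    for pivot in order:
        if pivot not in unclustered:
            continue  # already assigned to an earlier pivot's cluster
        cluster = sorted(
            i for i in unclustered
            if i == pivot or co_clustered_fraction(labelings, pivot, i) > 0.5
        )
        unclustered.difference_update(cluster)
        consensus.append(cluster)
    return consensus


# Three clusterings of four items; the third disagrees about item 1.
labelings = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
consensus = consensus_cluster(labelings)
```

Here items 0 and 1 are grouped by two of the three clusterings and items 2 and 3 by all three, so the consensus keeps both pairs together regardless of which pivot is drawn first.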

References:

  1. Ailon, N., Charikar, M., & Newman, A. (2008). Aggregating inconsistent information. Journal of the ACM, 55(5), 1–27. http://doi.org/10.1145/1411509.1411513
  2. Broder, A. Z. (1997). On the resemblance and containment of documents. Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), 1–9. http://doi.org/10.1109/SEQUEN.1997.666900
