about | usage | input | arguments | benchmarks | resources
Given a database of size N and a query vector of dimensionality d, the problem of identifying exact or nearly exact matches to the query in the database takes time Ω(Nd). When given another database of M such queries, the problem requires Ω(NMd) time. For large databases, this quickly becomes intractable.
To remedy this, we can employ approximation methods. Locality-Sensitive Hashing (LSH) is one such method. MinHashing is an LSH scheme which defines vector similarity using the Jaccard similarity coefficient. This project provides a multithreaded implementation of MinHashing in C using the pthreads library. A great explanation both of LSH and of MinHashing (as well as numerous other fantastic topics) can be found in the amazing text Mining of Massive Datasets[1].
The project is built using the make utility. Navigate into the project directory and run

make

to build the project. The following command identifies all pairs of similar vectors in bench/test.csv using a single thread:

cat bench/test.csv | ./mhash -b 2 -r 10 -t 1 -E
The program takes one or two databases of vectors as input. Consider the following format:
N,D,d
m1,f1,f2,...,fm1
m2,fm1+1,fm1+2,...,fm1+m2
...
mN,fD-mN+1,fD-mN+2,...,fD
where N is the number of vectors in the database, D is the total number of features across all N vectors, and d is the dimensionality, i.e. the number of possible features, of each vector. A single such database can be provided and compared to itself using the -E option. If a second, distinct database is to be provided, it should be passed via standard input immediately after the first.
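For concreteness, a database of N = 3 vectors containing D = 6 features in total, drawn from d = 10 possible features, might look like the following (the feature ids shown are purely illustrative):

```
3,6,10
2,1,5
1,7
3,2,4,9
```

Each row lists a vector's feature count m followed by its m features, so the counts sum to D (2 + 1 + 3 = 6).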
-b <int>     number of bands per signature (n = r*b) (required)
-r <int>     size of each band (# rows) (required)
-s <int>     seed for reproducibility (default: 0)
-t <int>     number of threads to use (default: 1)
-E           enables exclusive comparison mode; prevents observations with the same index from being paired
--verbose    enables verbosity, i.e. a bunch of debugging garbage will be thrown into standard output
--no-print   pairing results will not be sent to stdout
The expected probability of any two vectors with similarity s being paired when using b bands and r rows is 1 - (1 - s^r)^b [1]. A successful implementation must provide this theoretical guarantee. We verify this is the case with our implementation as follows:
- Generate a dataset D by sampling from the space of all possible vectors uniformly at random.
- Run mhash on D with various settings of b and r.
- Compute the proportion of called pairs at each similarity and place pairs into buckets (this is done because the distribution of similarities is discrete).
- Graph both the theoretical curve and this empirical curve. The latter curve should approximate the former.
One would expect a larger dataset with higher-dimensional vectors to yield a better approximation to the theoretical curve for all settings of b and r. The graphs below seem to support this intuition. (T) denotes a theoretical curve computed using the above formula.
[Graphs: F = 96, N = 1000 (left) | F = 480, N = 4000 (right)]
We have also investigated timing requirements and peak memory usage of mhash over various settings and inputs. These are provided below. All memory results are provided in kilobytes (kB) and all timing results are provided in seconds. All benchmarking experiments were run on an Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz with 1.5 TB of RAM. Only one thread was used per mhash instance. Every data point is the average of 100 runs.
We used the --no-print flag when running mhash to suppress outputting identified pairs. The running time is heavily influenced by the number of pairs mhash identifies and subsequently outputs, which can be quadratic in the size of the dataset in the worst case. Output quickly becomes the bottleneck once the number of identified pairs grows beyond linear in the size of the input.
[Benchmark graphs (timing and peak memory usage): F = 96 (left) | F = 480 (right)]
We have provided the necessary tools to repeat this analysis here. Raw data for the above graphs can be found here.
[1] Mining of Massive Datasets, J. Leskovec et al.