Skip to content

Binary Indexes

Matthijs Douze edited this page Mar 23, 2020 · 12 revisions

Overview

Faiss supports indexing binary vectors (with Hamming distance), with the IndexBinaryFlat, IndexBinaryIVF and IndexBinaryHNSW and IndexBinaryHash/IndexBinaryMultiHash indexes (all inheriting from IndexBinary).

Those indexes store the vectors as arrays of bytes, so that a vector of size d takes only d / 8 bytes in memory. Note that at the moment, only vectors with sizes multiple of 8 are supported. Of course, you can round up the vector length if needed.

The input to the add() and search() methods are also byte arrays ("uint8_t" in C++, "uint8" in numpy). The returned indices are 64-bit ints, distances are 32-bit ints.

The Hamming distance computations are optimized using popcount CPU instructions.

IndexBinaryFlat

The "flat" binary index performs an exhaustive search.

The exhaustive search is carefully optimized especially for 256-bit vectors that are quite common.

Batching is applied on the query and the database side to avoid cache misses.

The values of hamming_batch_size and faiss::IndexBinaryFlat#query_batch_size can be customized to adjust the batch sizes but the default values were found to be close to optimal for a large range of settings.

Example

The examples show how to pass in binary data and how to query the index.

In C++
#include <faiss/IndexBinaryFlat.h>

// Dimension of the vectors.
int d = 256;

// Vectors to be indexed, each represented by d / 8 bytes, layed out sequentially,
// i.e. the i-th vector starts at db[i * (d / 8)].
std::vector<uint8_t> db = ...;

// Vectors to be queried from the index.
std::vector<uint8_t> queries = ...;

// Initializing index.
faiss::IndexBinaryFlat index(d);

// Adding the database vectors.
index.add(db.size(), db.data());

// Number of nearest neighbors to retrieve per query vector.
int k = ...;

// Output variables for the queries.
std::vector<int32_t> distances(n * k);
std::vector<faiss::Index::idx_t> labels(n * k);

// Querying the index
index.search(queries.size(), queries.data(), k, distances.data(), labels.data());

// distances[i * k + j] contains the distance from the i-th query vector to its j-th nearest neighbor.
// labels[i * k + j] contains the id of the j-th nearest neighbor of the i-th query vector.
In Python
import faiss

# Dimension of the vectors.
d = 256

# Vectors to be indexed, each represented by d / 8 bytes, layed out sequentially,
# i.e. the i-th vector starts at db[i * (d / 8)].
db = ...

# Vectors to be queried from the index.
queries = ...

# Initializing index.
index = faiss.IndexBinaryFlat(d)

# Adding the database vectors.
index.add(db)

# Number of nearest neighbors to retrieve per query vector.
k = ...;

# Querying the index
D, I = index.search(queries, k)

# D[i, j] contains the distance from the i-th query vector to its j-th nearest neighbor.
# I[i, j] contains the id of the j-th nearest neighbor of the i-th query vector.

IndexBinaryIVF

The "IVF" (Inverse Vector File) flavor speeds up the search by clustering the vectors. This clustering is done using a second (binary) index for quantization (usually a flat index). This is equivalent to the IndexIVFFlat of the floating-point indexes.

Example

In C++
#include <faiss/IndexBinaryIVF.h>

// Dimension of the vectors.
int d = 256;

// Vectors to be indexed, each represented by d / 8 bytes, layed out sequentially,
// i.e. the i-th vector starts at db[i * (d / 8)].
std::vector<uint8_t> db = ...;

// Vectors to train the quantizer.
std::vector<uint8_t> training = ...;

// Vectors to be queried from the index.
std::vector<uint8_t> queries = ...;

// Initializing the quantizer.
faiss::IndexBinaryFlat quantizer(d);

// Number of clusters.
int nlist = ...;

// Initializing index.
faiss::IndexBinaryIVF index(&quantizer, d, nlist);
index.nprobe = 4; // Number of nearest clusters to be searched per query. 

// Training the quantizer.
index.train(training.size(), training.data());

// Adding the database vectors.
index.add(db.size(), db.data());

// Number of nearest neighbors to retrieve per query vector.
int k = ...;

// Output variables for the queries.
std::vector<int32_t> distances(n * k);
std::vector<faiss::idx_t> labels(n * k);

// Querying the index
index.search(queries.size(), queries.data(), k, distances.data(), labels.data());

// distances[i * k + j] contains the distance from the i-th query vector to its j-th nearest neighbor.
// labels[i * k + j] contains the id of the j-th nearest neighbor of the i-th query vector.
In Python
import faiss

# Dimension of the vectors.
d = 256

# Vectors to be indexed, each represented by d / 8 bytes, layed out sequentially,
# i.e. the i-th vector starts at db[i * (d / 8)].
db = ...

# Vectors to train the quantizer.
training = ...

# Vectors to be queried from the index.
queries = ...

# Initializing the quantizer.
quantizer = faiss.IndexBinaryFlat(d)

# Number of clusters.
nlist = ...

# Initializing index.
index = faiss.IndexBinaryIVF(quantizer, d, nlist)
index.nprobe = 4 # Number of nearest clusters to be searched per query. 

# Training the quantizer.
index.train(training)

# Adding the database vectors.
index.add(db)

# Number of nearest neighbors to retrieve per query vector.
k = ...

# Querying the index.
D, I = index.search(queries, k)

# D[i, j] contains the distance from the i-th query vector to its j-th nearest neighbor.
# I[i, j] contains the id of the j-th nearest neighbor of the i-th query vector.

Shorter versions using index factory

The faiss::index_binary_factory() allows for shorter declarations of binary indexes. It is especially useful for IndexBinaryIVF, for which a quantizer needs to be initialized.

How to use index_binary_factory:

In C++

Instead of the above initialization code:

// Initializing the quantizer.
faiss::IndexBinaryFlat quantizer(d);

// Number of clusters.
int nlist = 32;

// Initializing index.
faiss::IndexBinaryIVF index(&quantizer, d, nlist);
index.nprobe = 4; // Number of nearest clusters to be searched per query. 

one could write:

#include <faiss/AutoTune.h>
  
// Initializing the quantizer.
faiss::IndexBinaryIVF *index = dynamic_cast<faiss::IndexBinaryIVF *>(faiss::index_binary_factory(d, "BIVF32"));
index->nprobe = 4; // Number of nearest clusters to be searched per query.
In Python

Instead of the above initialization code:

# Initializing the quantizer.
quantizer = faiss.IndexBinaryFlat(d)

# Number of clusters.
nlist = 32

# Initializing index.
index = faiss.IndexBinaryIVF(quantizer, d, nlist)
index.nprobe = 4 # Number of nearest clusters to be searched per query. 

one could write:

# Initializing the quantizer.
index = faiss.index_binary_factory(d, "BIVF32")
index.nprobe = 4 # Number of nearest clusters to be searched per query.
Clone this wiki locally