-
Notifications
You must be signed in to change notification settings - Fork 3.7k
How to make Faiss run faster
This page contains all the options and tricks to make Faiss run faster, including the speed-vs-accuracy tradeoffs where the speed is more important.
The topics are provided in no particular order of importance.
(todo: add)
It is very common to instantiate an index via faiss.index_factory()
call. But this way of instantiating does not allow to pick many of the available speed-related parameters. Many of Faiss components may utilize:
- iterative algorithms; the more iterations performed, the better the recall rate is.
- heuristics that limit the number of possible candidates considered during the training or the search process, such as beam search. The more candidates are evaluated, the better the recall rate is.
- additional optional versions, such as LUT-based approximations; they allow to trade the speed for accuracy.
- various internal buffers for processing an input data in chunks, mentioned on this page.
- random number generators; these ones are used to ensure that results can be reproduced.
The default index parameters should work reasonably good in the most general cases. But it is highly recommended to study the available parameters in order to find an optimal point in your particular speed-vs-accuracy-vs-RAM case.
The recipe: study the available parameters of a chosen index and its subindices.
K-means clustering is an often used facility inside Faiss. By default, k-means implementation in faiss/Clustering.h uses 25 iterations (niter
variable) and up to 256 samples from the input dataset per cluster needed (max_points_per_centroid
variable). The samples are chosen randomly. For example, the default PQx12
training is ~4x slower than PQx10
training and ~16x slower than PQx8
training because of the proportionally higher number of used samples from the input dataset.
IVF-based indices use 10 iterations of k-means by default (faiss/IndexIVF.h, faiss/IndexBinaryIVF.h, faiss/gpu/GpuIndexIVF.cu).
The recipe: play with parameters of k-means clustering niter
and max_points_per_centroid
.
Faiss is optimized for working with batches of samples, rather than processing samples one by one.
The recipe: try to process your samples in batches, if appropriate.
It is a typical situation that Faiss uses the random access pattern to access the needed data, such as codebooks. As a result, TLB cache operates under a very heavy pressure. It was observed that the use of 2M / 1G huge memory pages helps in certain cases. In particular, the use of 2M / 1G pages allowed to gain up to 20% speed increase in our certain vector codec experiments on x86-64 platform. Unfortunately, Faiss does not contain a built-in support for marking certain allocated memory blocks as huge-pages-are-important-here ones. So, the simplest way to try out huge memory pages in your application is to use a memory allocator that support ones, such as mimalloc.
We did not conduct any experiments with huge memory pages on ARM platform yet.
The recipe: try a memory allocator that supports huge memory pages and try to run your application with ones being enabled. Please be aware that huge memory pages is a scarce resources and it needs to be managed carefully.
The core team is aware that Faiss does not have a NUMA-supported code. This means that Faiss does not have certain internal facilities that, for example, coordinate memory allocations in order to minimize the traffic between the NUMA nodes. As an example, we had an observation where two runs of the same benchmark on a single 12C/24T NUMA node and on four NUMA nodes on the same machine would yield the same running time!
The recipe: try playing with the number of cpu resources that you allocate for your application.
Faiss uses Intel MKL library, if possible. MKL uses OpenMP library under the hood. According to the MKL manual, it runs the best if the number of used threads equals to the number of physical cores (for example), which is in agreement with our benchmarks.
The recipe: try setting the number of used OMP threads to the number of physical CPU cores, unless you're using a version of Faiss without any multithreading. So, on a 12C/24T machine run your application as OMP_NUM_THREADS=12 my-application
.
If a huge dataset is passed as an input, then it might be possible that the dataset will be split into chunks and processed chunk-by-chunk under the hood. It is possible to speedup the overall processing by increasing the size of a chunk, thus increasing the effectiveness of cache and linear algebra operations. On the other hand, it increases the amount of RAM for the allocated chunks and the time that is needed for such allocations. Here is the list of various buffer sizes that one may vary:
- faiss::LocalSearchQuantizer.chunk_size
- faiss::AdditiveQuantizer.max_mem_distances
- faiss::multi_index_quantizer_search_bs
- faiss::index2layer_sa_encode_bs
- faiss::product_quantizer_compute_codes_bs
(todo: add others)
Faiss uses certain custom kernels that involve the usage AVX2 or ARM NEON intrinsics. Currently, Faiss does not provide specialized AVX-512 kernels (except for faiss/utils/distances_fused/avx512.h, which needs to be enabled manually). Historically, there are three reasons for this:
- AVX-512 computations lead to CPU downclocking, which typically result in a lower performance (compared vs AVX2 version).
- AVX-512 kernels need developer time and dedication, and kernels may be tricky to create and benchmark.
- AVX-512 would increase the compilation time on our end, because it would introduce a new set of binaries. Also, Faiss may use Intel MKL library that is capable of using AVX-512 under the hood (if MKL decides to), and it common that GEMM operations take the most of the time.
As of today (Jan 2023):
- AVX-512 downclocking seems to be fixed in new generations of Intel CPUs.
- AMD Zen4 seems to provide support for AVX-512 instructions.
- bfloat16 instructions come to AVX-512, which can be used for certain cases, such as Additive Quantizers. But it needs to be created and benchmarked first.
So, it is possible that Faiss will add some AVX-512 kernels in the future.
Faiss may use Intel MKL library, which allows to enable/disable certain instruction sets via the following, if it fits your case and/or if the frequency downclocking is not beneficial.
(todo: add)
(todo: add)
(todo: add about top-k heap)
Allows to trade some very minor amount of RAM for a speedup of pq.search()
call.
This can be used for IndexPQ and IndexIVFPQ as well, for example:
# train an ivfpq index
# create the tables
index.pq.sync_transposed_centroids()
# do the search
while app_is_running:
D, I = index.search(xq, k)
(todo: add)
Faiss is implemented in C++ and it heavily utilizes template dispatching for various cases. For example, this is the function from faiss/utils/distances_simd.cpp that computes inner products between a single d
-dimenionsal vector x
and ny
d
-dimensional vectors y
. As one may see, there are specialized cases for the dimensionalities of 1, 2, 4, 8 and 12, which reference to custom kernels, otherwise the implementation relies on a general-purposes code and a compiler.
void fvec_inner_products_ny(
float* dis,
const float* x,
const float* y,
size_t d,
size_t ny) {
#define DISPATCH(dval) \
case dval: \
fvec_op_ny_D##dval<ElementOpIP>(dis, x, y, ny); \
return;
switch (d) {
DISPATCH(1)
DISPATCH(2)
DISPATCH(4)
DISPATCH(8)
DISPATCH(12)
default:
fvec_inner_products_ny_ref(dis, x, y, d, ny);
return;
}
#undef DISPATCH
}
Custom specialized kernels tend to show better performance than a general-purpose code, but at the cost of the binary size.
Please feel free to add some custom kernels that fit your case, if appropriate, and send us a pull request with your change! :)
faiss/utils/distances_fused/distances_fused.h contains a fused kernel that combines the L2 distance computation and the closest nearest search neighbor search.
All the feedback regarding the Faiss execution speed is highly appreciated! The most valuable feedback would be a reproducible standalone python script or a C++ small program that allows to detect hot-spots inside Faiss.
Faiss building blocks: clustering, PCA, quantization
Index IO, cloning and hyper parameter tuning
Threads and asynchronous calls
Inverted list objects and scanners
Indexes that do not fit in RAM
Brute force search without an index
Fast accumulation of PQ and AQ codes (FastScan)
Setting search parameters for one query
Binary hashing index benchmark