How to make Faiss run faster
This page collects the options and tricks to make Faiss run faster, including speed-vs-accuracy tradeoffs for cases where speed matters more.
It is typical for Faiss to use a random access pattern to reach the data it needs; the most common example is access to codebooks. As a result, the TLB cache operates under very heavy pressure. It was observed that the use of 2M / 1G huge memory pages helps in certain cases. In particular, the use of 2M / 1G pages yielded up to a 20% speed increase in certain of our vector codec experiments on the x86-64 platform. Unfortunately, Faiss has no built-in support for marking particular allocated memory blocks as huge-page candidates. So, the simplest way to try out huge memory pages in your application is to use a memory allocator that supports them, such as mimalloc.
We have not conducted any experiments with huge memory pages on the ARM platform yet.
The recipe: try a memory allocator that supports huge memory pages and run your application with them enabled. Please be aware that huge memory pages are a scarce resource and need to be managed carefully.
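For example, with mimalloc one may preload the allocator and enable large OS pages via its documented environment variables (the library path below depends on your system): 'LD_PRELOAD=/usr/lib/libmimalloc.so MIMALLOC_LARGE_OS_PAGES=1 my-application'. Additionally, MIMALLOC_RESERVE_HUGE_OS_PAGES=N reserves N 1GiB huge pages at startup.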
The core team is aware that Faiss does not support operation across multiple NUMA nodes. This means that Faiss lacks certain internal facilities that would, for example, coordinate memory allocations in order to minimize the traffic between NUMA nodes. As an example, we observed that running the same benchmark on a single 12C/24T NUMA node and on four NUMA nodes of the same machine yielded the same running time!
The recipe: try playing with the amount of CPU resources that you allocate to your application.
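For example, one may pin the application to a single NUMA node with the standard numactl tool and compare against an unpinned run: 'numactl --cpunodebind=0 --membind=0 my-application'.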
Faiss uses the Intel MKL library if possible. MKL uses the OpenMP library under the hood. According to the MKL manual, it runs best when the number of threads equals the number of physical cores.
The recipe: try setting the number of OpenMP threads to the number of physical CPU cores, unless you're using a version of Faiss without any multithreading. So, on a 12C/24T machine, run your application as 'OMP_NUM_THREADS=12 my-application'.
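The thread count can also be set programmatically via the standard OpenMP API; a minimal sketch:

#include <omp.h>

int main() {
    // Limit OpenMP (and thus Faiss and, transitively, MKL) to the
    // number of physical cores, e.g. 12 on a 12C/24T machine.
    omp_set_num_threads(12);
    // ... run Faiss code here ...
    return 0;
}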
Faiss is optimized for working with batches of samples. So, if one passes a huge dataset, it may well be processed in chunks inside Faiss. It is possible to speed up the overall processing by increasing the chunk sizes, which improves the effectiveness of caches and linear algebra operations at the cost of more RAM for the allocated chunks. Here is the list of various buffer sizes that one can vary:
(todo: add)
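As a starting point (a sketch; the exact set of knobs depends on your Faiss version), faiss/utils/distances.h exposes global variables that control the chunking of the brute-force, BLAS-based distance computations:

#include <faiss/utils/distances.h>

int main() {
    // Brute-force k-NN switches to the BLAS-based chunked code path
    // once the number of query vectors exceeds this threshold.
    faiss::distance_compute_blas_threshold = 20;
    // Chunk sizes (in vectors) for the query and database sides of the
    // BLAS-based path; larger values trade RAM for throughput.
    faiss::distance_compute_blas_query_bs = 8192;
    faiss::distance_compute_blas_database_bs = 2048;
    // ... run searches here ...
    return 0;
}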
Faiss uses certain custom kernels that rely on AVX2 or ARM NEON intrinsics. Currently, Faiss does not provide specialized AVX-512 kernels (except for faiss/utils/distances_fused/avx512.h, which needs to be enabled manually). Historically, there are three reasons for this:
- AVX-512 computations lead to CPU downclocking, which typically results in lower performance compared with the AVX2 version.
- AVX-512 kernels require developer time and dedication, and such kernels may be tricky to create and benchmark (perf, Intel VTune, uica.uops.info, etc.).
- AVX-512 increases the compilation time. Also, Faiss may use the Intel MKL library, which is capable of using AVX-512 under the hood (if MKL decides to), and it is common for GEMM operations to take most of the time.
As of today (Jan 2023):
- AVX-512 downclocking seems to be fixed in new generations of Intel CPUs.
- AMD Zen4 seems to provide support for AVX-512 instructions.
- bfloat16 instructions are coming to AVX-512; these could be used for certain cases, such as additive quantizers, but such kernels need to be created and benchmarked first.
So, it is possible that Faiss will add AVX-512 support in the future.
Faiss may use the Intel MKL library, which allows one to enable or disable certain instruction sets, as shown below. It may be beneficial for your CPU to disable AVX-512 for MKL.
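A sketch of the standard MKL controls (consult the documentation of your MKL version): either set the MKL_ENABLE_INSTRUCTIONS environment variable, for example 'MKL_ENABLE_INSTRUCTIONS=AVX2 my-application', or make the equivalent function call before any other MKL call:

#include <mkl.h>

int main() {
    // Restrict MKL to code paths that use instruction sets up to AVX2.
    // This call must happen before any other MKL function in the process.
    mkl_enable_instructions(MKL_ENABLE_AVX2);
    // ... run Faiss code that calls into MKL here ...
    return 0;
}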
(todo: add)
(todo: add about top-k heap)
By default, the k-means implementation in faiss/Clustering.h uses 25 iterations (niter) and samples at most 256 points per cluster (max_points_per_centroid) from the input dataset. IVF-based indices use 10 iterations (faiss/IndexIVF.h, faiss/IndexBinaryIVF.h, faiss/gpu/GpuIndexIVF.cu).
One may play with these numbers, as the sketch below shows.
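A minimal sketch of overriding these defaults through faiss::ClusteringParameters (the values below are illustrative):

#include <faiss/Clustering.h>
#include <faiss/IndexFlat.h>

int main() {
    int d = 64;   // vector dimensionality
    int k = 1024; // number of clusters

    faiss::ClusteringParameters cp;
    cp.niter = 10;                    // default: 25 iterations
    cp.max_points_per_centroid = 128; // default: 256 samples per cluster

    faiss::Clustering clus(d, k, cp);
    faiss::IndexFlatL2 index(d); // used to assign points to centroids
    // clus.train(n, x, index);  // x points to n * d training floats
    return 0;
}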
Faiss is implemented in C++ and heavily utilizes template dispatching for various cases. For example, here is the function (from faiss/utils/distances_simd.cpp) that computes inner products between a single d-dimensional vector x and ny d-dimensional vectors y. As one may see, there are specialized cases for dimensionalities of 1, 2, 4, 8 and 12 that dispatch to custom kernels; otherwise, the implementation relies on general-purpose code and the compiler.
void fvec_inner_products_ny(
        float* dis,
        const float* x,
        const float* y,
        size_t d,
        size_t ny) {
#define DISPATCH(dval)                                     \
    case dval:                                             \
        fvec_op_ny_D##dval<ElementOpIP>(dis, x, y, ny);    \
        return;

    switch (d) {
        DISPATCH(1)
        DISPATCH(2)
        DISPATCH(4)
        DISPATCH(8)
        DISPATCH(12)
        default:
            fvec_inner_products_ny_ref(dis, x, y, d, ny);
            return;
    }
#undef DISPATCH
}
Custom specialized kernels tend to show better performance than general-purpose code, but at the cost of binary size.
Please feel free to add custom kernels that fit your case, if appropriate, and send us a pull request with your change! :)
faiss/utils/distances_fused/distances_fused.h contains a fused kernel that combines the L2 distance computation and the search for the nearest neighbor.
All feedback regarding Faiss execution speed is highly appreciated! The most valuable feedback is a reproducible standalone Python script or a small C++ program that allows pinpointing hot spots inside Faiss.