Faiss runs very slowly on M1 Mac #2386

Open
2 of 4 tasks
SupreethRao99 opened this issue Jul 15, 2022 · 18 comments

Comments

@SupreethRao99

SupreethRao99 commented Jul 15, 2022

Summary

Running inference on a saved index is painfully slow on an M1 Pro (10-core CPU, 16-core GPU). The index is about 3.4 GB in size. A search takes 1.5 seconds on the CPU backend on Colab but more than 20 minutes on the M1 CPU. What could be the reason for such slow performance?

Platform

OS: macOS 12.4
Faiss version: 1.7.2
Installed from: compiled by self following install.md, and this issue

Faiss compilation options:

LDFLAGS="-L/opt/homebrew/opt/llvm/lib" CPPFLAGS="-I/opt/homebrew/opt/llvm/include" CXX=/opt/homebrew/opt/llvm/bin/clang++ CC=/opt/homebrew/opt/llvm/bin/clang cmake -DFAISS_ENABLE_GPU=OFF -B build .

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

The code that I'm running is as follows

import time
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer

# Training Index
df = pd.read_csv('abcnews-data-text.csv')
data = df.headline_text.to_list()

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
encoded_data = model.encode(data)

index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(data))))
faiss.write_index(index, 'abc_news')

# Inference 

def search(query):
   t=time.time()
   query_vector = model.encode([query])
   k = 5
   top_k = index.search(query_vector, k)
   print('totaltime: {}'.format(time.time()-t))
   return [data[_id] for _id in top_k[1].tolist()[0]]

index = faiss.read_index('abc_news')
query=str(input())
results=search(query)
print('results :')
for result in results:
   print('\t',result)
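Note that `search()` above starts its timer before `model.encode`, so the printed total covers both the sentence encoding and the index search. A minimal sketch for timing the two stages separately follows; the helper names `timed` and `search_timed` are hypothetical, and the only assumption is that `model` exposes `encode` and `index` exposes `search` returning a (scores, ids) pair, as in the script above.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Call fn, print the elapsed wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print('{}: {:.3f} s'.format(label, elapsed))
    return result, elapsed


def search_timed(model, index, query, k=5):
    """Time the encoding step and the index search separately (hypothetical helper)."""
    query_vector, encode_s = timed('encode', model.encode, [query])
    (scores, ids), search_s = timed('search', index.search, query_vector, k)
    return ids, {'encode': encode_s, 'search': search_s}
```

If most of the time shows up under `encode`, the bottleneck is the sentence model rather than Faiss.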
@wx257osn2
Contributor

@SupreethRao99 It seems that you built faiss in Debug mode. Adding -DCMAKE_BUILD_TYPE=Release may help.

@SupreethRao99
Author

@wx257osn2 thank you. I tried the approach that you suggested by adding -DCMAKE_BUILD_TYPE=Release when building FAISS, but inference is still taking >10 mins.

@wx257osn2
Contributor

wx257osn2 commented Jul 19, 2022

@SupreethRao99 Thanks for trying. Hmm... which BLAS library did you install? It seems that IndexFlatIP calls into it.

@mdouze mdouze added the install label Jul 20, 2022
@mdouze
Contributor

mdouze commented Jul 20, 2022

Indeed the speed of flat search is dominated by the BLAS sgemm. Maybe the cmake logs indicate which BLAS version is used.
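To make the point concrete, here is a pure-Python sketch of what an exhaustive inner-product search computes, plus a rough count of the arithmetic involved. It illustrates the technique only and is not Faiss's implementation, which batches this work into BLAS calls. The function names are hypothetical, and the size estimate assumes 4-byte float32 components plus an 8-byte id per vector, ignoring file headers.

```python
import heapq


def flat_ip_search(database, queries, k):
    """Score every (query, vector) pair by inner product and keep the k best."""
    results = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, v)) for v in database]
        top = heapq.nlargest(k, range(len(database)), key=scores.__getitem__)
        results.append([(i, scores[i]) for i in top])
    return results


def estimate_vector_count(index_bytes, dim, bytes_per_component=4, bytes_per_id=8):
    """Rough number of stored vectors (assumes float32 data plus one id each)."""
    return index_bytes // (dim * bytes_per_component + bytes_per_id)


def multiply_adds(n_vectors, dim, n_queries=1):
    """Multiply-add operations needed for an exhaustive search."""
    return n_vectors * dim * n_queries


# The 3.4 GB, 768-dimensional index from the report
n = estimate_vector_count(3_400_000_000, 768)
print(n, multiply_adds(n, 768))
```

Under these assumptions the index holds on the order of a million vectors, so a single query costs somewhat under a billion multiply-adds. That is a small workload for a modern CPU, which suggests the 20-minute figure comes from the build or the BLAS backend rather than from the size of the problem.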

@SupreethRao99
Author

Thank you @mdouze @wx257osn2 , the CMAKE logs are as follows,
CMAKE logs

(SemanticSearch) supreeth@Supreeths-MacBook-Pro faiss % LDFLAGS="-L/opt/homebrew/opt/llvm/lib" CPPFLAGS="-I/opt/homebrew/opt/llvm/include" CXX=/opt/homebrew/opt/llvm/bin/clang++ CC=/opt/homebrew/opt/llvm/bin/clang cmake -DCMAKE_BUILD_TYPE=Release -DFAISS_ENABLE_GPU=OFF -B build .
-- The CXX compiler identification is Clang 14.0.6
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/homebrew/opt/llvm/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Could NOT find MKL (missing: MKL_LIBRARIES) 
-- Looking for sgemm_
-- Looking for sgemm_ - not found
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/Accelerate.framework  
-- Looking for cheev_
-- Looking for cheev_ - found
-- Found LAPACK: /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/Accelerate.framework;-lm;-ldl  
-- Found SWIG: /opt/homebrew/bin/swig (found version "4.0.2") found components: python 
-- Found Python: /Users/supreeth/miniforge3/envs/SemanticSearch/include/python3.8 (found version "3.8.13") found components: Development NumPy Interpreter Development.Module Development.Embed 
CMake Deprecation Warning at build/_deps/googletest-src/CMakeLists.txt:4 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- The C compiler identification is Clang 14.0.6
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/homebrew/opt/llvm/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
CMake Deprecation Warning at build/_deps/googletest-src/googletest/CMakeLists.txt:56 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- Found PythonInterp: /Users/supreeth/miniforge3/envs/SemanticSearch/bin/python (found version "3.8.13") 
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/supreeth/SemanticSearch/faiss/build
(SemanticSearch) supreeth@Supreeths-MacBook-Pro faiss % $ make -C build -j faiss
zsh: command not found: $
(SemanticSearch) supreeth@Supreeths-MacBook-Pro faiss % make -C build -j faiss 
[  0%] Building CXX object faiss/CMakeFiles/faiss.dir/AutoTune.cpp.o
[  0%] Building CXX object faiss/CMakeFiles/faiss.dir/IVFlib.cpp.o
[  2%] Building CXX object faiss/CMakeFiles/faiss.dir/Clustering.cpp.o
[  2%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexBinary.cpp.o
[  5%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexAdditiveQuantizer.cpp.o
[  8%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexBinaryHNSW.cpp.o
[  8%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexBinaryHash.cpp.o
[  8%] Building CXX object faiss/CMakeFiles/faiss.dir/Index2Layer.cpp.o
[ 10%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexBinaryFlat.cpp.o
[ 13%] Building CXX object faiss/CMakeFiles/faiss.dir/Index.cpp.o
[ 13%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexBinaryFromFloat.cpp.o
[ 13%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexHNSW.cpp.o
[ 13%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexFlat.cpp.o
[ 16%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexFlatCodes.cpp.o
[ 16%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFAdditiveQuantizer.cpp.o
[ 18%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVF.cpp.o
[ 18%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFSpectralHash.cpp.o
[ 21%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFPQFastScan.cpp.o
[ 24%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexBinaryIVF.cpp.o
[ 27%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFFlat.cpp.o
[ 27%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFPQ.cpp.o
[ 27%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexNNDescent.cpp.o
[ 27%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFPQR.cpp.o
[ 29%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexLSH.cpp.o
[ 32%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFFastScan.cpp.o
[ 35%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexLattice.cpp.o
[ 35%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexIVFAdditiveQuantizerFastScan.cpp.o
[ 35%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexNSG.cpp.o
[ 35%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexFastScan.cpp.o
[ 37%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexPQ.cpp.o
[ 40%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexAdditiveQuantizerFastScan.cpp.o
[ 40%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexPQFastScan.cpp.o
[ 43%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexShards.cpp.o
[ 43%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexScalarQuantizer.cpp.o
[ 45%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexPreTransform.cpp.o
[ 48%] Building CXX object faiss/CMakeFiles/faiss.dir/clone_index.cpp.o
[ 51%] Building CXX object faiss/CMakeFiles/faiss.dir/MetaIndexes.cpp.o
[ 54%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/HNSW.cpp.o
[ 54%] Building CXX object faiss/CMakeFiles/faiss.dir/index_factory.cpp.o
[ 56%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexReplicas.cpp.o
[ 56%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/AdditiveQuantizer.cpp.o
[ 59%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/ProductQuantizer.cpp.o
[ 59%] Building CXX object faiss/CMakeFiles/faiss.dir/VectorTransform.cpp.o
[ 59%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/ScalarQuantizer.cpp.o
[ 59%] Building CXX object faiss/CMakeFiles/faiss.dir/MatrixStats.cpp.o
[ 62%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/AuxIndexStructures.cpp.o
[ 62%] Building CXX object faiss/CMakeFiles/faiss.dir/IndexRefine.cpp.o
[ 62%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/NSG.cpp.o
[ 62%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/PolysemousTraining.cpp.o
[ 62%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/LocalSearchQuantizer.cpp.o
[ 64%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/FaissException.cpp.o
[ 64%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/ResidualQuantizer.cpp.o
[ 67%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/index_read.cpp.o
[ 70%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/ProductAdditiveQuantizer.cpp.o
[ 70%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/pq4_fast_scan.cpp.o
[ 72%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/kmeans1d.cpp.o
[ 72%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/io.cpp.o
[ 72%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/index_write.cpp.o
[ 75%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/pq4_fast_scan_search_1.cpp.o
[ 78%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/lattice_Zn.cpp.o
[ 78%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/pq4_fast_scan_search_qbs.cpp.o
[ 81%] Building CXX object faiss/CMakeFiles/faiss.dir/impl/NNDescent.cpp.o
[ 81%] Building CXX object faiss/CMakeFiles/faiss.dir/invlists/BlockInvertedLists.cpp.o
[ 83%] Building CXX object faiss/CMakeFiles/faiss.dir/invlists/DirectMap.cpp.o
[ 86%] Building CXX object faiss/CMakeFiles/faiss.dir/invlists/InvertedListsIOHook.cpp.o
[ 89%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/distances.cpp.o
[ 89%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/extra_distances.cpp.o
[ 89%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/hamming.cpp.o
[ 89%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/distances_simd.cpp.o
[ 89%] Building CXX object faiss/CMakeFiles/faiss.dir/invlists/InvertedLists.cpp.o
[ 91%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/WorkerThread.cpp.o
[ 91%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/Heap.cpp.o
[ 91%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/quantize_lut.cpp.o
[ 94%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/partitioning.cpp.o
[ 97%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/random.cpp.o
[ 97%] Building CXX object faiss/CMakeFiles/faiss.dir/utils/utils.cpp.o
[100%] Building CXX object faiss/CMakeFiles/faiss.dir/invlists/OnDiskInvertedLists.cpp.o
[100%] Linking CXX static library libfaiss.a
[100%] Built target faiss
(SemanticSearch) supreeth@Supreeths-MacBook-Pro faiss % 
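In a long configure log like the one above, the line that matters for this issue is `-- Found BLAS: ...`. The sketch below pulls the `Found` lines out of saved CMake output and guesses the BLAS backend from the reported path. The function names and the substring heuristic (`accelerate`, `openblas`, `mkl`) are assumptions made for illustration, not part of CMake or Faiss.

```python
import re


def found_libraries(cmake_output):
    """Map each '-- Found NAME: value' line of CMake output to its value."""
    found = {}
    for line in cmake_output.splitlines():
        match = re.match(r'^-- Found (\w+): (.*)$', line.strip())
        if match:
            found[match.group(1)] = match.group(2).strip()
    return found


def blas_backend(cmake_output):
    """Guess the BLAS backend from the reported path (heuristic, an assumption)."""
    path = found_libraries(cmake_output).get('BLAS', '')
    lowered = path.lower()
    for name in ('accelerate', 'openblas', 'mkl'):
        if name in lowered:
            return name
    return 'unknown' if path else 'not found'


log = """-- Could NOT find MKL (missing: MKL_LIBRARIES)
-- Found BLAS: /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/Accelerate.framework
"""
print(blas_backend(log))
```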

@SupreethRao99
Author

also, is there any way in which I can help build and upload FAISS to conda so that people will not have to build from source. Furthermore, are there plans to support GPU acceleration on M1 processors?

@wx257osn2
Contributor

wx257osn2 commented Jul 20, 2022

@SupreethRao99

-- Found BLAS: /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/Accelerate.framework

Ah, that's it. According to this and this, Apple's Accelerate framework for M1 runs on the AMX coprocessor. It seems that the coprocessor is good at power efficiency but not at run-time speed, especially in multi-threaded execution.
Could you try OpenBLAS built with OpenMP enabled? I'm not sure that OpenBLAS will help, but it may with an appropriate thread count.
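One way to look for an "appropriate thread count" is to time the same search at several settings. The sketch below is a generic harness: `sweep_threads` is a hypothetical helper that takes the thread setter and the workload as callables. With the Python bindings, `faiss.omp_set_num_threads` could serve as the setter. Whether the BLAS library follows that setting or needs its own configuration depends on how it was built, so treat the numbers as indicative only.

```python
import time


def sweep_threads(set_threads, run_search, thread_counts, repeats=3):
    """Return {n_threads: best wall-clock seconds} for each candidate count."""
    timings = {}
    for n in thread_counts:
        set_threads(n)
        best = float('inf')
        for _ in range(repeats):
            start = time.perf_counter()
            run_search()
            best = min(best, time.perf_counter() - start)
        timings[n] = best
    return timings


# Possible usage with the script from the issue (not executed here):
#   timings = sweep_threads(faiss.omp_set_num_threads,
#                           lambda: index.search(query_vector, 5),
#                           [1, 2, 4, 8, 10])
#   print(min(timings, key=timings.get))
```

Taking the best of a few repeats reduces noise from one-off effects such as the first touch of the index memory.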


Furthermore, are there plans to support GPU acceleration on M1 processors?

I'm neither a Meta employee working on faiss nor a contributor to the GPU implementation, so the following is just my guess: I think that supporting the M1 GPU is currently not planned, and that it would require a lot of new code even if it were, because faiss implements its GPGPU code in CUDA and CUDA can't run on the M1 GPU.
That would be a hard road, but the faiss team would probably welcome your contribution if you implemented it.

@SupreethRao99
Author

thank you @wx257osn2 , I will rebuild FAISS with OpenBLAS and give it a try.

@mdouze
Contributor

mdouze commented Jul 21, 2022

also, is there any way in which I can help build and upload FAISS to conda so that people will not have to build from source. Furthermore, are there plans to support GPU acceleration on M1 processors?

Right, as @wx257osn2 says, it is a significant effort to support the GPU version of FAISS in CUDA, so adding support for other types of GPUs like the M1's or Intel or AMD is not planned.

@wx257osn2
Contributor

wx257osn2 commented Jul 22, 2022

Here is a brief note about porting GPGPU code from CUDA to other environments. The cost of porting to AMD GPUs on Linux is actually lower than for M1, Intel, or AMD GPUs on Windows. AMD is developing a GPGPU environment called ROCm, and HIP, which is a wrapper for CUDA and ROCm. HIP's API mostly mirrors CUDA's, so the cost of porting to HIP is not so high, and HIP code can run on both CUDA and ROCm. There are also high-performance libraries and wrappers such as rocBLAS and hipBLAS. CuPy, a GPU-accelerated NumPy/SciPy-compatible Python module, is a well-known project written in HIP. CuPy was originally written in CUDA and was ported to HIP a few years ago. However, the porting cost is only relatively low, not zero. CuPy showed that even porting to HIP is not so easy.
On the other hand, the cost of porting to OpenCL (for AMD on Windows and for Intel) or Metal (for Apple) is much higher. Moreover, supporting those GPUs would increase maintenance costs from then on, because, unlike HIP, such code cannot be straightforwardly integrated with the CUDA implementation. This would be too hard. For these reasons, I think it is reasonable that there is no plan beyond what contributors like us do as needed.

@mdouze
Contributor

mdouze commented Jul 26, 2022

Thanks @wx257osn2 for the overview.

To summarize, the issues for us to support alternative hardware are:

  • we need to be able to test the support with CircleCI to track regressions, i.e. the appropriate hardware must be available in CircleCI

  • the stages for support are (1) compiling, (2) passing tests and (3) optimizing. For step (3), unfortunately, due to hardware and compiler specificities, it is not obvious that the speed of the hardware accelerator is competitive with existing accelerators; sometimes it is even slower than CPU.

  • we prioritize the hardware we work on ourselves, which is currently NVIDIA GPUs.

  • and finally, we already have trouble maintaining the precompiled packages for the set of platforms we support...

So if anyone is willing to take ownership of other hardware accelerators, we'd be very happy to collaborate ;-)

@kaanbursa

On my M1 Mac, searching from a Jupyter notebook kills my session. I've tried Python 3.7 and 3.9, and both kill the session; however, it works when I start a Python app from PyCharm or a terminal. I'm using faiss-cpu.

@wx257osn2
Contributor

@kaanbursa This issue is about performance, not about whether faiss works at all. Please open a new issue for your problem, with more information such as the installation method and any error messages.

@wx257osn2
Contributor

@SupreethRao99 Do you have any update? Has OpenBLAS helped you?

@SupreethRao99
Author

@wx257osn2 Yes, OpenBLAS does give a good speed up. Thank you !

@nullhook

Potentially this should be run on the M1 GPU, but that requires porting the CUDA kernels to Metal.

@ellesharma

@SupreethRao99 How did you rebuild FAISS with OpenBLAS ? Newbie to GenAI stuff.

@yusufsyaifudin

@wx257osn2 Yes, OpenBLAS does give a good speed up. Thank you !

@SupreethRao99 can you share how to rebuild FAISS using OpenBLAS and make it faster on M1?
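The thread does not record the exact command that was used for the OpenBLAS rebuild, so the following is only a sketch of one plausible configure line, not the author's method. It reuses the flags shown earlier in the issue and adds `BLA_VENDOR=OpenBLAS`, the variable CMake's FindBLAS module uses to select a vendor. The helper name `configure_command` and the default prefixes are assumptions; `/opt/homebrew/opt/openblas` is where a Homebrew install on Apple Silicon would typically place OpenBLAS, and your path may differ.

```python
import shlex


def configure_command(openblas_prefix='/opt/homebrew/opt/openblas',
                      llvm_prefix='/opt/homebrew/opt/llvm'):
    """Assemble (but do not run) a CMake configure line that prefers OpenBLAS."""
    # Compiler environment copied from the command in the original report
    env = {
        'LDFLAGS': '-L{}/lib'.format(llvm_prefix),
        'CPPFLAGS': '-I{}/include'.format(llvm_prefix),
        'CXX': '{}/bin/clang++'.format(llvm_prefix),
        'CC': '{}/bin/clang'.format(llvm_prefix),
    }
    args = [
        'cmake',
        '-DCMAKE_BUILD_TYPE=Release',
        '-DFAISS_ENABLE_GPU=OFF',
        '-DBLA_VENDOR=OpenBLAS',                  # ask FindBLAS for OpenBLAS
        '-DCMAKE_PREFIX_PATH=' + openblas_prefix,  # assumed install location
        '-B', 'build', '.',
    ]
    env_part = ' '.join('{}={}'.format(k, shlex.quote(v)) for k, v in env.items())
    return env_part + ' ' + shlex.join(args)


print(configure_command())
```

After configuring, the `-- Found BLAS:` line in the CMake output should point at OpenBLAS rather than `Accelerate.framework`. If it does not, the prefix path is probably wrong for your machine.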
