[RFC] Faiss Scalar Quantization FP16 (SQfp16) and enabling SIMD (AVX2 and NEON) #1138
Comments
Awesome to see recall is not noticeably impacted. This provides a lot of potential for cost reduction. A couple of notes: for consistency with the product quantizer, I think it might make more sense for the interface to look like:
The performance implications are unfortunate. I'm guessing they have to do with repeated conversion between fp16 and fp32. It might be worth investigating whether it is possible to do the floating-point operations directly on fp16 data (this might be compiler/processor specific).
@naveentatikonda thanks for sharing the results. I can see good improvements on memory, but the latency is a big concern for me. 10x extra search latency and 1/10th the indexing throughput for 1M documents are big red flags for me. Can we do the below things:
@jmazanec15 In terms of UX, we will add the parameters that are related to the encoder inside the encoder object, as in your example. But the generic parameters like "ef_construction" that relate to HNSW will be outside the encoder object. Please correct me if I'm wrong.
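For illustration, here is a minimal hypothetical sketch of that parameter layout, written as a Python dict mirroring the mapping JSON (the encoder name "sqfp16" and all values are assumptions, not the final API):

```python
# Hypothetical method definition: generic HNSW parameters stay at the method level,
# while encoder-specific parameters are nested inside the encoder object.
method = {
    "name": "hnsw",
    "engine": "faiss",
    "parameters": {
        "ef_construction": 256,   # generic HNSW parameter, outside the encoder
        "m": 16,
        "encoder": {
            "name": "sqfp16",     # assumed encoder name for fp16 scalar quantization
            "parameters": {},     # encoder-specific parameters, if any, would go here
        },
    },
}
```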
@navneet1v After enabling the Faiss AVX2 feature, the query latencies and indexing throughput improved, with the same reduction in memory and storage utilization. I have shared the results above. Please take a look.
Hi, is there any progress on this? We want to introduce fp16 as a new codec, and we want to keep the protocol consistent with this one.
@luyuncheng We have AVX2 optimization support for the x86 architecture, but we don't have a similar optimization for SQ in Faiss on the ARM architecture. So we are working on adding NEON support for SQFP16 in Faiss. We did a POC for the L2 space type using NEON, which improved the overall performance, and I'm working on adding NEON support for InnerProduct. Can you share more details about adding fp16 as a new codec? Also, do you have the code ready?
@naveentatikonda We have not tested fp16 on the ARM architecture, but we have verified fp16 on x86, so we want to introduce this into our services. Also, I want to keep the protocol the same as yours so it stays backward compatible in the future. So I am wondering:
In #1139 I introduced a memory test. With the sift-128 1M dataset and 8 threads, the benchmark shows:
@naveentatikonda is there any possibility we can add this feature now and optimize for the ARM architecture in a later release?
No @luyuncheng. Can you please explain how you added fp16 support when you said that you will be adding it as a codec (in terms of implementation)? Also, based on the results you have shared, the performance seems to have dropped in terms of query time and indexing time. Besides that, the memory also doesn't look like it is optimized much when compared to the metrics obtained using the AVX2 optimization support for the x86 architecture (results shared by me above). We are planning to complete the NEON optimization for ARM and release this feature in OpenSearch 2.12 (in January 2024). So, even though we add SQfp16 support for x86 now, it is not going to be released right away.
@naveentatikonda The index performance I shared above does not use SIMD. I'll run a benchmark on AVX2 with the fp16 optimization.
Hi, I did more tests with AVX2 on an Intel(R) Xeon(R) architecture with 8 threads and the sift-128 1M dataset. I think HNSWSQfp16 is pretty good in both time and memory.
@jmazanec15 @naveentatikonda
@luyuncheng I gave some thought to all your questions and tried to answer them here. Please take a look and let me know if you have any other questions.
@luyuncheng I'm curious how you got these results. I mean, did you run these tests directly on Faiss or through OpenSearch? Can you share more details?
@naveentatikonda Thanks for your reply!!
LGTM for the protocol, thanks!!
I did it on OpenSearch.
In #1139, I added a memory test, so it can easily be used.
Updated the following results in the description of the issue:
Benchmarking Results on ARM
As we have seen above, adding the Faiss AVX2 optimization helped to improve the overall performance. But AVX2 optimization only supports x86, and as of now we don't have a similar optimization for SQ in Faiss to support ARM. Ran a few tests on ARM instances and shared the perf results below:
Recall and Storage Results
Indexing and Querying Results
Memory Results
Observations
@naveentatikonda can we run the benchmarks with 1 shard? I see that all the experiments were done with 8 or 24 shards. Also, can you provide the number of shards used for the results you have posted?
@navneet1v The test results posted above were run with 24 shards and 8 shards (for some datasets) and 0 replicas. Now, I'm rerunning the benchmarking tests using 1, 8, and 24 shards (to compare with the existing results) with the updated SQ changes in this PR: facebookresearch/faiss#3141
With dimension 65, I'm not sure the code path will be hit. Could you just run one sift test with IP and see what the metric diffs are? Ref:
Also @naveentatikonda, can you add the tracking Faiss issue to the description, and also the branch you are working on your changes in?
Good catch Jack. It won't hit the SIMD logic as it isn't a multiple of 8. Will run using some sift dataset to validate the query latencies. |
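As a small hypothetical sketch of the dispatch being discussed (not the actual Faiss code): the vectorized scalar-quantizer distance computer is only selected when the dimension is a multiple of the SIMD width, so a dimension of 65 falls back to the scalar path.

```python
def uses_simd_path(dim: int, simd_width: int = 8) -> bool:
    """Hypothetical check mirroring the kind of dispatch described above."""
    return dim % simd_width == 0

print(uses_simd_path(128))  # True  -> vectorized distance computer
print(uses_simd_path(65))   # False -> scalar fallback
```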
Benchmarking Results on x86 Architecture with AVX2 Using the New Logic in SQ
These are the benchmarking results on x86 architecture with AVX2 using the new logic in SQ (with inline macros and other changes using SHUFFLE) included in this PR facebookresearch/faiss#3141
Recall and Storage Results
Note - We are seeing a drop in recall below for
Indexing and Querying Results
Also, the old benchmarking results are posted below for comparison:
Indexing and Querying Results (Old Results)
Observations
Thanks @naveentatikonda . Seems like the new changes helped reduce the latency further. |
This looks good to me. Let's go ahead and start trying to get this change into Faiss! Nice work!
Created a draft PR to the Faiss repo. Please take a look.
As mentioned in the
Indexing, Querying and Memory Results of Cohere Dataset
Benchmarking Results Comparison Between Faiss HNSW with AVX2 and Faiss HNSW SQFP16 with AVX2 for the ms_marco-1m and cohere-768-1m-IP Datasets
Recall Results
Indexing and Querying Results
Memory Results
@naveentatikonda those numbers look really good |
Benchmarking Results Comparison Between Faiss IVF with AVX2 and Faiss IVF SQFP16 with AVX2 for the cohere-wiki-simple-embeddings-768, ms_marco-1m, and cohere-768-1m-IP Datasets
Recall Results
Indexing and Querying Results
Memory Results
Problem Statement
In the k-NN plugin we mainly support vectors of type float, where each dimension is 32 bits. This is getting expensive for use cases that require large-scale ingestion, where we need to construct, load, save, and search graphs (for the native engines nmslib and faiss), which becomes even more costly. Even though we have byte vector support, it only works with the Lucene engine, and there is also a considerable reduction in recall when compared to float32.
Adding support for Faiss SQFP16 helps to reduce the memory and storage footprints without compromising recall. When the user provides 32-bit float vectors, the Faiss engine quantizes them into FP16 using its scalar quantizer (users don't need to do any quantization on their end), stores them, and decodes them back to FP32 while returning the results during search operations.
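To make the underlying Faiss behavior concrete, here is a minimal sketch using the Faiss Python bindings directly (outside of OpenSearch); the dimension, graph degree, and data below are arbitrary:

```python
import numpy as np
import faiss

d, M = 128, 16                                   # dimension and HNSW graph degree (arbitrary)
xb = np.random.rand(10000, d).astype("float32")  # fp32 vectors supplied by the caller
xq = np.random.rand(10, d).astype("float32")

# HNSW index whose stored vectors are encoded with the fp16 scalar quantizer
index = faiss.IndexHNSWSQ(d, faiss.ScalarQuantizer.QT_fp16, M)
index.train(xb)   # fp16 encoding needs no learned statistics, but train() is part of the SQ API
index.add(xb)     # vectors are encoded to fp16 internally, halving the vector memory
D, I = index.search(xq, 10)   # distances are computed by decoding the fp16 codes during search
print(I[0])
```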
Faiss GitHub Issue - facebookresearch/faiss#3014
Faiss PR - facebookresearch/faiss#3166
Development Branch - https://github.com/naveentatikonda/k-NN/tree/add_sqfp16
What Does the User Experience Look Like?
In terms of UX, users just need to set the encoder to SQfp16 while creating the index with the Faiss engine (this works only with the Faiss engine); there is no change with respect to ingestion and search queries. An example is shown below:
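The original example from this RFC is not reproduced here; the following is a hypothetical sketch of what index creation could look like, using the opensearch-py client with an assumed encoder name of "sqfp16" (the index name, field name, and parameter values are made up):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "my_vector": {
                "type": "knn_vector",
                "dimension": 768,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",           # SQfp16 works only with the Faiss engine
                    "space_type": "l2",
                    "parameters": {
                        "ef_construction": 256,
                        "m": 16,
                        "encoder": {"name": "sqfp16"},   # assumed encoder name
                    },
                },
            }
        }
    },
}

client.indices.create(index="my-sqfp16-index", body=index_body)
# Ingestion and search requests stay the same: fp32 vectors in, fp32 results out.
```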
Benchmarking Results
For the latest results comparing Faiss HNSW with fp32 vectors against Faiss HNSW SQfp16, see link1 and link2.
Similarly, the latest results for the IVF comparison are mentioned here.
Benchmarking on POC (Old Results and Old Observations)
Setup Configuration
Implemented a POC using the Faiss SQFp16 encoder and ran benchmarks against some of the datasets. The cluster configuration and index mapping are shown in the table below.
Note - These results are without a warmup operation.
Recall and Storage Results
Indexing and Querying Results
Indexing Throughput = Document_Cnt / (total_index_time_s + total_refresh_time_s), i.e. Document_Cnt / (ingest_took_total_s + refresh_index_took_total_s)
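As a small worked example of this metric (the numbers below are hypothetical, not taken from the benchmarks above):

```python
# Indexing throughput: documents indexed per second of total indexing work.
document_cnt = 1_000_000            # hypothetical: 1M documents ingested
ingest_took_total_s = 900.0         # hypothetical total ingest time (seconds)
refresh_index_took_total_s = 100.0  # hypothetical total refresh time (seconds)

throughput = document_cnt / (ingest_took_total_s + refresh_index_took_total_s)
print(f"{throughput:.0f} docs/s")   # -> 1000 docs/s
```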
Memory Results
Observations (Old and Outdated)
- Even after quantizing fp32 vectors to fp16 using the Faiss Scalar Quantizer, we are able to get the same recall.
- We are saving more than 21% in terms of storage.
- In terms of memory, we are seeing a huge reduction, ranging from 38% to 48%, by using SQfp16 compared to plain Faiss HNSW.
- On the downside, there is a huge drop in indexing throughput: with Faiss HNSW SQfp16 we are able to achieve only 1/10th of what we get with Faiss HNSW.
- The query processing times also look very disappointing, taking 10 to 20 times longer with SQfp16. Reducing the number of primary shards from 24 to 8 helped reduce the latencies a little, but it still needs improvement.
- Still working on ways to identify and reduce query latency and increase indexing throughput.
With Warmup and AVX2
These are the metrics after adding the warmup operation and Faiss AVX2 support to k-NN.
Indexing and Querying Results
Observations (with AVX2)
Benchmarking Results on ARM
As we have seen above, adding the Faiss AVX2 optimization helped to improve the overall performance. But AVX2 optimization only supports x86, and as of now we don't have a similar optimization for SQ in Faiss to support ARM. Ran a few tests on ARM instances and shared the perf results below:
Recall and Storage Results
Indexing and Querying Results
Memory Results
Observations