
[RFC] Faiss Scalar Quantization FP16 (SQfp16) and enabling SIMD (AVX2 and NEON) #1138

Closed
naveentatikonda opened this issue Sep 15, 2023 · 30 comments


@naveentatikonda
Member

naveentatikonda commented Sep 15, 2023

Problem Statement

In the k-NN plugin we mainly support vectors of type float, where each dimension is 32 bits. This is expensive for use cases that require large-scale ingestion, where constructing, loading, saving, and searching graphs (for the native engines nmslib and faiss) becomes even more costly. Even though we have byte vector support, it works only with the Lucene engine, and there is a considerable reduction in recall compared to float32.

Adding support for Faiss SQfp16 helps reduce the memory and storage footprints without compromising recall: when the user provides 32-bit float vectors, the Faiss engine quantizes each vector into FP16 using its scalar quantizer (users don't need to do any quantization on their end), stores it, and decodes it back to FP32 while returning results during search operations.
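To illustrate the round trip (a minimal numpy sketch of the concept only; the real encode/decode happens inside Faiss's scalar quantizer, not in user or plugin code):

```python
import numpy as np

# Conceptual sketch of the fp16 round trip performed by the quantizer.
rng = np.random.default_rng(42)
vec_fp32 = rng.random(128, dtype=np.float32)   # user-supplied fp32 vector

encoded = vec_fp32.astype(np.float16)          # stored form: 2 bytes per dimension
decoded = encoded.astype(np.float32)           # decoded back to fp32 at search time

print(vec_fp32.nbytes, encoded.nbytes)         # 512 vs 256 bytes: half the vector storage
print(np.abs(vec_fp32 - decoded).max())        # small round-off error (order 1e-4 for values in [0, 1))
```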

Faiss GitHub Issue - facebookresearch/faiss#3014
Faiss PR - facebookresearch/faiss#3166
Development Branch - https://github.com/naveentatikonda/k-NN/tree/add_sqfp16

What Does the User Experience Look Like?

In terms of UX, users just need to set the encoder to SQfp16 while creating the index with the Faiss engine (this works only with the Faiss engine); ingestion and search queries are unchanged. An example is shown below:

PUT test-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
        "my_vector1": {
          "type": "knn_vector",
          "dimension": 16,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "faiss",
            "parameters": {
              "encoder": {
                 "name": "sq",
                 "parameters":{
                    "type": "fp16",
                    "clip": false
                  }
              },
              "ef_construction": 128,
              "m": 24
            }
          }
        }
    }
  }
} 
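Since ingestion and search are unchanged, requests like the following work exactly as they do without the encoder (a sketch using plain HTTP from Python; the index and field names come from the mapping above, and a local endpoint is assumed):

```python
import requests

HOST = "http://localhost:9200"  # assumed local OpenSearch endpoint

# Ingest plain fp32 floats; no client-side quantization is needed.
vector = [1.5, 2.5] + [0.0] * 14  # dimension 16, matching the mapping above
requests.post(f"{HOST}/test-index/_doc/1", json={"my_vector1": vector})

# Search with the standard k-NN query; vectors are decoded back to fp32 internally.
resp = requests.post(
    f"{HOST}/test-index/_search",
    json={"size": 2, "query": {"knn": {"my_vector1": {"vector": vector, "k": 2}}}},
)
print(resp.json()["hits"]["hits"])
```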

Benchmarking Results

For the latest results comparing Faiss HNSW with fp32 vectors against Faiss HNSW SQfp16, see link1 and link2.

Similarly, the latest comparison results for IVF are mentioned here.

Benchmarking on POC (Old Results and Old Observations)

Setup Configuration

Implemented a POC using the Faiss SQfp16 encoder and ran benchmarks against some of the datasets. The cluster configuration and index mapping are shown in the table below.

| Setting | Value |
|---|---|
| Instance Type | r5.16xlarge |
| Nodes | 1 |
| Primary Shards | 24, 8 |
| Replicas | 0 |
| Disk Space | 1000 GB |
| ef_search | 256 |
| ef_construction | 256 |
| m | 16 |
| engine | faiss |
| method | hnsw |

Note - These results are without a warmup operation.

Recall and Storage Results

| S. No. | Dataset | Dimension | Train Data Size | Query Data Size | Training Data Range | Query Data Range | Space Type | faiss-hnsw Recall | faiss sqfp16 Recall | faiss-hnsw Index Size | faiss sqfp16 Index Size | % Reduction in Storage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | 1,000,000 | 1,000 | [0.0, 1.4772] | [0.0, 0.7285] | L2 | | | | | |
| 2 | sift-128-euclidean | 128 | 1,000,000 | 10,000 | [0.0, 218.0] | [0.0, 184.0] | L2 | 0.99989 | 0.99989 | 1.4 GB | 1.1 GB | 21.4% |
| 3 | mnist-784-euclidean | 784 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 1 | 1 | 400.2 MB | 310.4 MB | 22.43% |
| 4 | fashion-mnist-784-euclidean | 784 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 0.99999 | | 447.6 MB | | |

Indexing and Querying Results

  • Indexing Throughput (docs/s) - the number of documents per second that can be ingested, computed as `document_count / (total_index_time_s + total_refresh_time_s)`, i.e. `document_count / (ingest_took_total_s + refresh_index_took_total_s)` (see the quick example after this list)

  • Query Time (ms) - Time per query, summarized as mean, p50, p90, p99
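For example, the throughput arithmetic works out as follows (hypothetical ingest/refresh split, chosen so the result matches the sift-128 faiss-hnsw row below):

```python
# Hypothetical timing split for 1M documents; only the formula is from the text above.
document_count = 1_000_000
ingest_took_total_s = 60.0
refresh_index_took_total_s = 7.4

throughput = document_count / (ingest_took_total_s + refresh_index_took_total_s)
print(round(throughput))  # ~14837 docs/s
```
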
| S. No. | Dataset | Dimension | Space Type | Primary Shards | faiss hnsw Indexing Throughput (docs/s) | faiss sqfp16 Indexing Throughput (docs/s) | faiss hnsw p50 (ms) | faiss sqfp16 p50 (ms) | faiss hnsw p90 (ms) | faiss sqfp16 p90 (ms) | faiss hnsw p99 (ms) | faiss sqfp16 p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | L2 | 24 | | | | | | | | |
| 2 | sift-128-euclidean | 128 | L2 | 24 | 14837 | 1495 | 10 | 105.6 | 11 | 124.1 | 12.1 | 136 |
| 3 | mnist-784-euclidean | 784 | L2 | 24 | 5657 | 545 | 9.8 | 204.7 | 10.2 | 226.2 | 12.9 | 240.2 |
| 4 | mnist-784-euclidean | 784 | L2 | 8 | 5065 | 243 | 6 | 110 | 7 | 127.9 | 9.7 | 140.6 |
| 5 | fashion-mnist-784-euclidean | 784 | L2 | 24 | 5175 | | 8 | | 9 | | 11.7 | |

Memory Results

| S. No. | Dataset | Dimension | Space Type | faiss hnsw Graph Memory (MB) | faiss sqfp16 Graph Memory (MB) | % Reduction in Native Memory |
|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | L2 | | | |
| 2 | sift-128-euclidean | 128 | L2 | 633.51367 | 389.37109 | 38.5% |
| 3 | mnist-784-euclidean | 784 | L2 | 188.14258 | 98.42188 | 47.7% |
| 4 | fashion-mnist-784-euclidean | 784 | L2 | 188.14258 | | |
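These measurements track the rule-of-thumb native-memory estimate from the k-NN documentation, roughly `1.1 * (4 * dimension + 8 * M)` bytes per vector for fp32 HNSW; with fp16 the vector portion should halve to `2 * dimension` bytes. A sketch under that assumption (it is an estimate, so it will not match the measured numbers exactly):

```python
def hnsw_graph_memory_mb(num_vectors: int, dim: int, m: int, bytes_per_dim: int = 4) -> float:
    # k-NN docs estimate: 1.1 * (bytes_per_dim * dim + 8 * M) bytes per vector.
    return 1.1 * (bytes_per_dim * dim + 8 * m) * num_vectors / (1024 ** 2)

# sift-128-euclidean: 1M vectors, m=16 (setup above)
print(hnsw_graph_memory_mb(1_000_000, 128, 16))                   # ~671 MB fp32 (measured: 633.5 MB)
print(hnsw_graph_memory_mb(1_000_000, 128, 16, bytes_per_dim=2))  # ~403 MB fp16 (measured: 389.4 MB)
```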

Observations (Old and outdated)

  • Even after quantizing fp32 vectors to fp16 with the Faiss scalar quantizer, we get the same recall.
  • We save more than 21% in storage.
  • Memory usage drops substantially, by 38% to 48%, with SQfp16 compared to plain Faiss HNSW.
  • On the downside, indexing throughput drops sharply: Faiss HNSW SQfp16 achieves only about 1/10th of the throughput of Faiss HNSW.
  • Query processing times are also disappointing: SQfp16 takes 10 to 20 times longer. Reducing the number of primary shards from 24 to 8 lowered latencies somewhat, but they still need improvement.

We are still working to identify the causes, reduce query latency, and increase indexing throughput.

With Warmup and AVX2

These are the metrics after adding the warmup operation and Faiss AVX2 support to k-NN.
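The warmup here refers to the k-NN warmup API, which preloads the native graphs into off-heap memory so first-query latency is not dominated by graph loading. A sketch of the call (assuming a local endpoint and the example index above):

```python
import requests

# k-NN warmup API: loads the index's native graphs into memory before benchmarking.
resp = requests.get("http://localhost:9200/_plugins/_knn/warmup/test-index")
print(resp.json())
```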

Indexing and Querying Results

| S. No. | Dataset | Dimension | Space Type | Primary Shards | faiss hnsw AVX2 Indexing Throughput (docs/s) | faiss hnsw sqfp16 AVX2 Indexing Throughput (docs/s) | hnsw AVX2 p50 (ms) | sqfp16 AVX2 p50 (ms) | hnsw AVX2 p90 (ms) | sqfp16 AVX2 p90 (ms) | hnsw AVX2 p99 (ms) | sqfp16 AVX2 p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | L2 | 24 | 1016 | 1059 | 26 | 25.6 | 29.3 | 28.9 | 32 | 31.1 |
| 2 | sift-128-euclidean | 128 | L2 | 24 | 13178 | 13664 | 9.1 | 9 | 10.4 | 10 | 11.1 | 11 |
| 3 | mnist-784-euclidean | 784 | L2 | 24 | 5711 | 5734 | 10 | 9 | 11 | 10 | 11.3 | 10.3 |
| 4 | fashion-mnist-784-euclidean | 784 | L2 | 24 | 5325 | 5296 | 9 | 8 | 9.1 | 9 | 10.1 | 9.6 |

Observations (with AVX2)

  • Before enabling AVX2, we ran a few tests with the warmup operation but still did not see a meaningful improvement in query latencies.
  • With AVX2, there is a significant improvement in query latencies and indexing throughput for Faiss HNSW with the SQfp16 encoder, while preserving the same memory and storage savings seen in our previous test results. The new metrics are on par with faiss-hnsw results; some are even better.
  • However, AVX2 did not have much impact on plain faiss-hnsw performance (little improvement, but no downsides).
  • With AVX2, recall is unchanged.

Benchmarking Results on ARM

As seen above, adding the Faiss AVX2 optimization improved overall performance. However, AVX2 is only supported on x86; Faiss currently has no similar SQ optimization for ARM. We ran a few tests on ARM instances and share the performance results below:

Recall and Storage Results

| S. No. | Dataset | Dimension | Train Data Size | Query Data Size | Training Data Range | Query Data Range | Space Type | faiss-hnsw Recall | faiss sqfp16 Recall | faiss-hnsw Index Size | faiss sqfp16 Index Size | % Reduction in Storage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | 1,000,000 | 1,000 | [0.0, 1.4772] | [0.0, 0.7285] | L2 | 0.99783 | 0.99831 | 16.6 GB | 14.8 GB | 10.84% |
| 2 | sift-128-euclidean | 128 | 1,000,000 | 10,000 | [0.0, 218.0] | [0.0, 184.0] | L2 | 0.99989 | 0.99989 | 1.4 GB | 1.1 GB | 21.4% |
| 3 | mnist-784-euclidean | 784 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 1 | 1 | 400.3 MB | 310.4 MB | 22.43% |
| 4 | last-fm | 65 | 292,385 | 50,000 | | | innerproduct | 0.99897 | 0.99857 | 487.8 MB | 451.5 MB | 7.4% |

Indexing and Querying Results

| S. No. | Dataset | Dimension | Space Type | Primary Shards | faiss hnsw Indexing Throughput (docs/s) | faiss sqfp16 Indexing Throughput (docs/s) | faiss sqfp16 NEON Indexing Throughput (docs/s) | hnsw p50 (ms) | sqfp16 p50 (ms) | sqfp16 NEON p50 (ms) | hnsw p90 (ms) | sqfp16 p90 (ms) | sqfp16 NEON p90 (ms) | hnsw p99 (ms) | sqfp16 p99 (ms) | sqfp16 NEON p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | L2 | 24 | 983 | 952 | 997 | 22 | 73.2 | 34 | 24.1 | 84 | 38.1 | 25.6 | 88.2 | 40.1 |
| 2 | sift-128-euclidean | 128 | L2 | 24 | 12869 | 6169 | 11126 | 8 | 18 | 10 | 8 | 20 | 11 | 9 | 21.9 | 11.1 |
| 3 | sift-128-euclidean | 128 | L2 | 8 | 12182 | 4750 | 8992 | 4 | 10 | 5.4 | 4.9 | 11 | 6 | 5 | 12 | 7 |
| 4 | mnist-784-euclidean | 784 | L2 | 24 | 4962 | 2910 | 4832 | 9 | 25 | 11 | 9.9 | 27 | 11.1 | 10.1 | 28 | 12.1 |
| 5 | last-fm | 65 | innerproduct | 24 | 9759 | 9845 | | 6 | 6 | | 6 | 6 | | 6 | 7 | |
| 6 | last-fm | 65 | innerproduct | 8 | 9515 | 9151 | | 4 | 4 | | 4 | 4 | | 4 | 4 | |

(The NEON columns are left empty for last-fm: its dimension, 65, is not a multiple of 8, so it does not hit the SIMD path; see the discussion below.)

Memory Results

| S. No. | Dataset | Dimension | Space Type | faiss hnsw Graph Memory (MB) | faiss sqfp16 Graph Memory (MB) | % Reduction in Native Memory |
|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | L2 | 3807.34 | 1976.29 | 48% |
| 2 | sift-128-euclidean | 128 | L2 | 633.51 | 389.37 | 38.5% |
| 3 | mnist-784-euclidean | 784 | L2 | 188.14 | 98.42 | 47.7% |
| 4 | last-fm | 65 | innerproduct | 114.97 | 78.72 | 31.53% |

Observations

  • Recall is the same with SQ.
  • Query latencies with SQ are worse, taking 2 to 3.5 times longer than without SQ, but far better than on x86 without the AVX2 optimization.
  • Ingestion time is almost the same, but refresh and force merge are worse with SQ.
  • As on x86, there is a significant drop in memory and storage utilization with SQ.
  • Adding SIMD support using NEON on ARM improved query latencies and indexing throughput to levels close to faiss-hnsw without an encoder (except for large datasets like gist-960-euclidean, where there is still a noticeable increase in query latencies).
@naveentatikonda naveentatikonda self-assigned this Sep 15, 2023
@naveentatikonda naveentatikonda changed the title Add Support for Faiss Scalar Quantization FP16 (SQfp16) [RFC] Add Support for Faiss Scalar Quantization FP16 (SQfp16) Sep 15, 2023
@naveentatikonda naveentatikonda changed the title [RFC] Add Support for Faiss Scalar Quantization FP16 (SQfp16) [RFC] Faiss Scalar Quantization FP16 (SQfp16) Sep 15, 2023
@jmazanec15
Member

Awesome to see recall is not noticeably impacted. This provides a lot of potential for cost reduction. A couple notes:

For consistency with the product quantizer, I think it might make more sense for the interface to look like:

    "encoder":{
      "name":"sq",
      "parameters":{
        "nbits": 8
      }
    }

The performance implications are unfortunate. I'm guessing they have to do with repeated conversion between fp16 and fp32. It might be worth investigating whether it is possible to do flops directly on fp16 data (might be compiler/processor specific).
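As a rough illustration of that idea (numpy standing in for Faiss's C++ kernels; whether this maps to native fp16 instructions is indeed compiler/processor specific):

```python
import numpy as np

# Compute an L2 distance directly on fp16 arrays, without materializing fp32 copies.
a = np.random.rand(128).astype(np.float16)
b = np.random.rand(128).astype(np.float16)

diff = a - b               # stays in float16
dist = np.dot(diff, diff)  # float16 result (numpy may still upcast internally)
print(dist, dist.dtype)
```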

@navneet1v
Collaborator

@naveentatikonda thanks for sharing the results. I can see good improvements in memory, but the latency is a big concern for me. 10x higher search latency and 1/10th the indexing throughput for 1M documents are big red flags.

Can we do the following:

  1. Work with the Faiss maintainers to see why there is so much extra latency; maybe some compiler optimizations are missing.
  2. Test with, say, a 10M dataset of dimension 128 and see whether the latency follows the same trend.

@naveentatikonda
Member Author

> Awesome to see recall is not noticeably impacted. This provides a lot of potential for cost reduction. A couple notes:
>
> For consistency with the product quantizer, I think it might make more sense for the interface to look like:
>
>     "encoder":{
>       "name":"sq",
>       "parameters":{
>         "nbits": 8
>       }
>     }
>
> The performance implications are unfortunate. I'm guessing they have to do with repeated conversion between fp16 and fp32. It might be worth investigating whether it is possible to do flops directly on fp16 data (might be compiler/processor specific).

@jmazanec15 In terms of UX, we will add the encoder-related parameters inside the encoder object, as in your example. But the generic HNSW parameters like "ef_construction" will stay outside the encoder object. Please correct me if I'm wrong.

@naveentatikonda
Member Author

naveentatikonda commented Sep 18, 2023

> @naveentatikonda thanks for sharing the results. I can see good improvements in memory, but the latency is a big concern for me. 10x higher search latency and 1/10th the indexing throughput for 1M documents are big red flags.
>
> Can we do the following:
>
>   1. Work with the Faiss maintainers to see why there is so much extra latency; maybe some compiler optimizations are missing.
>   2. Test with, say, a 10M dataset of dimension 128 and see whether the latency follows the same trend.

@navneet1v After enabling the Faiss AVX2 feature, query latencies and indexing throughput improved while keeping the same reduction in memory and storage utilization. I have shared the results above; please take a look.

@luyuncheng
Collaborator

Hi, is there any progress on this? We want to introduce fp16 as a new codec, and we want to keep the protocol consistent with this.

@naveentatikonda
Member Author

naveentatikonda commented Nov 20, 2023

> Hi, is there any progress on this? We want to introduce fp16 as a new codec, and we want to keep the protocol consistent with this.

@luyuncheng We have AVX2 optimization support for the x86 architecture, but Faiss has no similar SQ optimization for ARM. So we are working on adding NEON support to SQfp16 in Faiss. We did a POC for the L2 space type using NEON, which improved overall performance, and I'm working on adding NEON support for inner product.

Can you share more details about adding fp16 as a new codec? Also, do you have the code ready?

@luyuncheng
Collaborator

luyuncheng commented Nov 20, 2023

@naveentatikonda We did not test fp16 on the ARM architecture, but we verified fp16 on x86, so we want to introduce this into our services. Also, I want to keep the protocol the same as yours, so it stays backward compatible in the future.

So I am wondering whether to:

  1. introduce a method named "HNSWFP16", or
  2. introduce an encoder parameter like you do:
            "parameters": {
              "encoder": {
                 "name": "SQfp16"
              },
              "ef_construction": 128,
              "m": 24
            }

@luyuncheng
Collaborator

luyuncheng commented Nov 20, 2023

In #1139 I introduced a memory test. With the sift-128 1M dataset and 8 threads, the benchmark shows:

| Type | Index Time | Build Index Memory | File Size | Query Index Memory | Query Time |
|---|---|---|---|---|---|
| HNSWSQfp16 | 276939 ms | 1.529 GB | 512 MB (-32%) | 533 MB | 52553 ms (+54%) |
| HNSWFLAT | 110546 ms | 1.768 GB | 756 MB | 776 MB | 33948 ms |

@naveentatikonda is it possible to add this feature now and optimize the ARM architecture in a later version?

@naveentatikonda
Member Author

No @luyuncheng. Can you please explain how you added fp16 support when you said you would be adding it as a codec (in terms of implementation)? Also, based on the results you shared, performance seems to have dropped in terms of query time and indexing time. Besides that, memory does not look as optimized compared to the metrics obtained with AVX2 optimization on x86 (results I shared above).

We plan to complete the NEON optimization for ARM and release this feature in OpenSearch 2.12 (in January 2024). So even if we add SQfp16 support for x86 first, it will not be released right away.

@luyuncheng
Collaborator

> Besides that, memory does not look as optimized compared to the metrics obtained with AVX2 optimization on x86.

@naveentatikonda
I think the memory reduction is almost the same as in your results shared above: 32% (at query time).

The index performance I shared above does not use SIMD. I'll run a benchmark with AVX2 and the fp16 optimization.

@luyuncheng
Collaborator

luyuncheng commented Nov 21, 2023

@naveentatikonda

Hi, I ran more tests with AVX2 on Intel(R) Xeon(R), with 8 threads and the sift-128 1M dataset. I think HNSWSQfp16 is pretty good in both time and memory:

| Type | Index Time | Build Index Memory | File Size | Query Index Memory | Query Time |
|---|---|---|---|---|---|
| HNSWSQfp16 (AVX2) | 28305 ms | 1.535 GB | 512 MB (-32%) | 648 MB (-27%) | 4958 ms (-9.8%) |
| HNSWFLAT (AVX2) | 35543 ms | 1.772 GB | 756 MB | 892 MB | 5498 ms |
| HNSWSQfp16 | 166505 ms | 1.535 GB | 512 MB (-32%) | 647 MB | 20578 ms (+54%) |
| HNSWFLAT | 46688 ms | 1.768 GB | 756 MB | 891 MB | 9012 ms |

@luyuncheng
Collaborator

luyuncheng commented Nov 21, 2023

> For consistency with the product quantizer, I think it might make more sense for the interface to look like:

@jmazanec15 @naveentatikonda
SQfp16 does not need training data. Is it possible to treat it as an HNSW encoder parameter, so we can keep the SQfp16 config protocol consistent?

@naveentatikonda
Member Author

@luyuncheng I gave some thought to all your questions and tried to answer them here. Please take a look and let me know if you have any other questions.

  • Protocol and Index Mapping
    In terms of protocol and index mapping, we will accept input from the user as a new encoder, as shown below, and process the request. You can keep the same convention in your services to avoid any backward incompatibility.
            "parameters": {
              "encoder": {
                 "name": "SQfp16"
              },
              "ef_construction": 128,
              "m": 24
            }
  • Index Description or Implementation
    SQfp16 does not need any training. In terms of implementation, we will set it as an encoder with the index_description HNSW16,SQfp16 (for HNSW) when passing the request to Faiss (see the sketch after this list).

  • Release Timeline and ARM Support
    As maintainers of the k-NN repository, we cannot release partial features. So we also want to add optimized ARM support and release the complete SQfp16 feature (covering both x86 and ARM) in OpenSearch 2.12.
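For reference, that index description is a standard Faiss factory string, so the behavior can be sketched with standalone Faiss in Python (an illustration of the factory string only, not the plugin's actual JNI code):

```python
import faiss
import numpy as np

d = 128
# "HNSW16,SQfp16": HNSW graph with M=16 over fp16 scalar-quantized storage.
index = faiss.index_factory(d, "HNSW16,SQfp16", faiss.METRIC_L2)

xb = np.random.rand(1000, d).astype(np.float32)
index.train(xb)  # effectively a no-op: fp16 SQ has no data-dependent parameters
index.add(xb)    # vectors are encoded to fp16 on add

dists, ids = index.search(xb[:5], 10)
print(ids[:, 0])  # each query vector should be its own nearest neighbor
```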

cc: @vamshin @navneet1v @jmazanec15

@naveentatikonda
Member Author

> @naveentatikonda
>
> Hi, I ran more tests with AVX2 on Intel(R) Xeon(R), with 8 threads and the sift-128 1M dataset. I think HNSWSQfp16 is pretty good in both time and memory:
>
> | Type | Index Time | Build Index Memory | File Size | Query Index Memory | Query Time |
> |---|---|---|---|---|---|
> | HNSWSQfp16 (AVX2) | 28305 ms | 1.535 GB | 512 MB (-32%) | 648 MB (-27%) | 4958 ms (-9.8%) |
> | HNSWFLAT (AVX2) | 35543 ms | 1.772 GB | 756 MB | 892 MB | 5498 ms |
> | HNSWSQfp16 | 166505 ms | 1.535 GB | 512 MB (-32%) | 647 MB | 20578 ms (+54%) |
> | HNSWFLAT | 46688 ms | 1.768 GB | 756 MB | 891 MB | 9012 ms |

@luyuncheng I'm curious how you got these results. Did you run these tests directly on Faiss or through OpenSearch? Can you share more details?

@luyuncheng
Collaborator

@naveentatikonda Thanks for your reply!!

> Protocol and Index Mapping
> Index Description or Implementation
>
>             "parameters": {
>               "encoder": {
>                  "name": "SQfp16"
>               },
>               "ef_construction": 128,
>               "m": 24
>             }

LGTM for the protocol, thanks!!

@luyuncheng
Collaborator

@naveentatikonda

> I'm curious to know how you got these results. Did you run these tests directly on Faiss or through OpenSearch?

I did it on OpenSearch.

> Can you share more details?

In #1139 I added a memory test, so it can easily use faiss_wrapper to create a new Faiss index the way the OpenSearch JNI does. Once all the native tests ran OK with the JNI code, I added some encoder parameters in the OpenSearch Java code.

@naveentatikonda
Member Author

Updated the issue description with the benchmarking results on ARM; see the "Benchmarking Results on ARM" section above.


@navneet1v
Collaborator

@naveentatikonda can we run the benchmarks with 1 shard? I see that all the experiments were done with 8 or 24 shards. Also, can you state the number of shards for the results you posted?

@naveentatikonda
Member Author

> @naveentatikonda can we run the benchmarks with 1 shard? I see that all the experiments were done with 8 or 24 shards. Also, can you state the number of shards for the results you posted?

@navneet1v The test results posted above were run with 24 shards (and 8 shards for some datasets) and 0 replicas. I'm now rerunning the benchmarks with 1, 8, and 24 shards (to compare against the existing results) using the updated SQ changes in PR facebookresearch/faiss#3141.

@jmazanec15
Member

> last-fm

With dimension 65, I'm not sure the SIMD code path will be hit. Could you run one sift test with IP and see what the metric diffs are?

Ref:

  1. https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L1164
  2. https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L1190
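In other words (a hedged reading of the kernels linked above): the specialized SIMD similarity/decode paths are compiled for dimensions that are a multiple of 8, and other dimensions fall back to the scalar implementation. A sketch of the selection rule:

```python
def hits_simd_path(dimension: int) -> bool:
    # Simplified reading of the dispatch in ScalarQuantizer.cpp: specialized
    # SIMD kernels are selected only when the dimension is a multiple of 8.
    return dimension % 8 == 0

print(hits_simd_path(65))   # False -> last-fm (d=65) misses the SIMD path
print(hits_simd_path(128))  # True  -> sift-128 uses it
```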

@jmazanec15
Member

Also @naveentatikonda, can you add the tracking Faiss issue to the description, along with the branch you are developing your changes on?

@naveentatikonda
Member Author

> last-fm
>
> With dimension 65, I'm not sure the SIMD code path will be hit. Could you run one sift test with IP and see what the metric diffs are?
>
> Ref:
>
>   1. https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L1164
>   2. https://github.com/facebookresearch/faiss/blob/main/faiss/impl/ScalarQuantizer.cpp#L1190

Good catch, Jack. It won't hit the SIMD logic since the dimension isn't a multiple of 8. I'll run a sift dataset to validate the query latencies.

@naveentatikonda
Member Author

naveentatikonda commented Dec 6, 2023

Benchmarking results on x86 architecture with AVX2 using the new logic in SQ

These are the benchmarking results on x86 architecture with AVX2 using the new logic in SQ (with inline macros and other changes using SHUFFLE) included in PR facebookresearch/faiss#3141.

Recall and Storage Results

Note - We see lower recall below for faiss-hnsw compared with faiss-hnsw sqfp16 because the hyperparameter efConstruction used for sqfp16 was 512, whereas for faiss-hnsw it was 256. A higher efConstruction value leads to a more accurate graph and better recall.

| S. No. | Dataset | Dimension | Primary Shards | Train Data Size | Query Data Size | Training Data Range | Query Data Range | Space Type | faiss-hnsw Recall | faiss-hnsw sqfp16 Recall | faiss-hnsw Index Size | faiss-hnsw sqfp16 Index Size | % Reduction in Storage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | 1 | 1,000,000 | 1,000 | [0.0, 1.4772] | [0.0, 0.7285] | L2 | 0.93751 | 0.95368 | 16.6 GB | 14.8 GB | 10.84% |
| 2 | gist-960-euclidean | 960 | 24 | 1,000,000 | 1,000 | [0.0, 1.4772] | [0.0, 0.7285] | L2 | 0.99779 | 0.99831 | 16.6 GB | 14.8 GB | 10.84% |
| 3 | sift-128-euclidean | 128 | 1 | 1,000,000 | 10,000 | [0.0, 218.0] | [0.0, 184.0] | L2 | 0.99826 | 0.99867 | 1.4 GB | 1.1 GB | 21.4% |
| 4 | sift-128-euclidean | 128 | 24 | 1,000,000 | 10,000 | [0.0, 218.0] | [0.0, 184.0] | L2 | 0.99989 | 0.99989 | 1.4 GB | 1.1 GB | 21.4% |
| 5 | mnist-784-euclidean | 784 | 1 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 0.99987 | 0.99993 | 400.1 MB | 310.4 MB | 22.43% |
| 6 | mnist-784-euclidean | 784 | 24 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 1 | 1 | 400.1 MB | 310.4 MB | 22.43% |

Indexing and Querying Results

| S. No. | Dataset | Dimension | Primary Shards | Space Type | faiss-hnsw AVX2 Indexing Throughput (docs/s) | faiss sqfp16 AVX2 Indexing Throughput (docs/s) | hnsw AVX2 p50 (ms) | sqfp16 AVX2 p50 (ms) | hnsw AVX2 p90 (ms) | sqfp16 AVX2 p90 (ms) | hnsw AVX2 p99 (ms) | sqfp16 AVX2 p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | 1 | L2 | 683 | 686 | 34 | 35.7 | 35 | 36.7 | 37 | 39.7 |
| 2 | gist-960-euclidean | 960 | 24 | L2 | 1035 | 1007 | 24.5 | 25.1 | 28 | 28.3 | 30.6 | 30.5 |
| 3 | sift-128-euclidean | 128 | 1 | L2 | 4722 | 3269 | 7.1 | 7 | 8 | 8 | 8.5 | 8.1 |
| 4 | sift-128-euclidean | 128 | 24 | L2 | 13128 | 12690 | 8 | 7.2 | 9 | 8.2 | 9.9 | 9.1 |
| 5 | mnist-784-euclidean | 784 | 1 | L2 | 2921 | 2767 | 12.1 | 12 | 13 | 13 | 13.2 | 13.1 |
| 6 | mnist-784-euclidean | 784 | 24 | L2 | 5759 | 5796 | 8 | 8 | 9 | 8 | 10 | 9.1 |

Also, the old benchmarking results are posted below for comparison:

Indexing and Querying Results (Old Results)

| S. No. | Dataset | Dimension | Space Type | Primary Shards | faiss hnsw AVX2 Indexing Throughput (docs/s) | faiss hnsw sqfp16 AVX2 Indexing Throughput (docs/s) | hnsw AVX2 p50 (ms) | sqfp16 AVX2 p50 (ms) | hnsw AVX2 p90 (ms) | sqfp16 AVX2 p90 (ms) | hnsw AVX2 p99 (ms) | sqfp16 AVX2 p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | L2 | 24 | 1016 | 1059 | 26 | 25.6 | 29.3 | 28.9 | 32 | 31.1 |
| 2 | sift-128-euclidean | 128 | L2 | 24 | 13178 | 13664 | 9.1 | 9 | 10.4 | 10 | 11.1 | 11 |
| 3 | mnist-784-euclidean | 784 | L2 | 24 | 5711 | 5734 | 10 | 9 | 11 | 10 | 11.3 | 10.3 |
| 4 | fashion-mnist-784-euclidean | 784 | L2 | 24 | 5325 | 5296 | 9 | 8 | 9.1 | 9 | 10.1 | 9.6 |

Observations

  • Query latencies improved slightly, by 1 to 2 ms, compared with the old benchmarking results.

cc: @navneet1v @jmazanec15

@navneet1v
Collaborator

Thanks @naveentatikonda. Seems like the new changes helped reduce the latency further.

@jmazanec15
Member

This looks good to me. Let's go ahead and start trying to get this change into Faiss! Nice work!

@naveentatikonda
Member Author

Created a draft PR to the Faiss repo; please take a look.
facebookresearch/faiss#3166

cc: @vamshin @navneet1v @jmazanec15

@naveentatikonda naveentatikonda changed the title [RFC] Faiss Scalar Quantization FP16 (SQfp16) [RFC] Faiss Scalar Quantization FP16 (SQfp16) and enabling SIMD (AVX2 and NEON) Feb 5, 2024
@naveentatikonda naveentatikonda moved this from 2.12.0 to 2.13.0 in Vector Search RoadMap Feb 6, 2024
@naveentatikonda
Member Author

> Benchmarking results on x86 architecture with AVX2 using the new logic in SQ
>
> These are the benchmarking results on x86 architecture with AVX2 using the new logic in SQ (with inline macros and other changes using SHUFFLE) included in PR facebookresearch/faiss#3141.
>
> Recall and Storage Results
>
> Note - We see lower recall for faiss-hnsw compared with faiss-hnsw sqfp16 because the hyperparameter efConstruction used for sqfp16 was 512, whereas for faiss-hnsw it was 256. A higher efConstruction value leads to a more accurate graph and better recall.

As mentioned in the note, we saw lower recall for faiss-hnsw due to the efConstruction hyperparameter. The updated recall values below use the same value (512) for both. These results also include a metrics comparison for the cohere-wiki-simple-embeddings-768-l2 dataset.

| S. No. | Dataset | Dimension | Primary Shards | Train Data Size | Query Data Size | Training Data Range | Query Data Range | Space Type | faiss-hnsw Recall | faiss-hnsw sqfp16 Recall | faiss-hnsw Index Size | faiss-hnsw sqfp16 Index Size | % Reduction in Storage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gist-960-euclidean | 960 | 1 | 1,000,000 | 1,000 | [0.0, 1.4772] | [0.0, 0.7285] | L2 | 0.95378 | 0.95368 | 16.6 GB | 14.8 GB | 10.84% |
| 2 | gist-960-euclidean | 960 | 24 | 1,000,000 | 1,000 | [0.0, 1.4772] | [0.0, 0.7285] | L2 | 0.99863 | 0.99831 | 16.6 GB | 14.8 GB | 10.84% |
| 3 | sift-128-euclidean | 128 | 1 | 1,000,000 | 10,000 | [0.0, 218.0] | [0.0, 184.0] | L2 | 0.9987 | 0.99867 | 1.4 GB | 1.1 GB | 21.4% |
| 4 | sift-128-euclidean | 128 | 24 | 1,000,000 | 10,000 | [0.0, 218.0] | [0.0, 184.0] | L2 | 0.99989 | 0.99989 | 1.4 GB | 1.1 GB | 21.4% |
| 5 | mnist-784-euclidean | 784 | 1 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 0.99993 | 0.99993 | 400.1 MB | 310.4 MB | 22.43% |
| 6 | mnist-784-euclidean | 784 | 24 | 60,000 | 10,000 | [0.0, 255.0] | [0.0, 255.0] | L2 | 1 | 1 | 400.1 MB | 310.4 MB | 22.43% |
| 7 | cohere-wiki-simple-embeddings-768 | 768 | 1 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 0.98625 | 0.98616 | 8 GB | 7.4 GB | 7.5% |
| 8 | cohere-wiki-simple-embeddings-768 | 768 | 24 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 0.99641 | 0.99616 | 8 GB | 7.4 GB | 7.5% |

Indexing, Querying and Memory Results of Cohere Dataset

| S. No. | Dataset | Dimension | Space Type | Primary Shards | faiss hnsw AVX2 Indexing Throughput (docs/s) | faiss hnsw sqfp16 AVX2 Indexing Throughput (docs/s) | hnsw AVX2 p50 (ms) | sqfp16 AVX2 p50 (ms) | hnsw AVX2 p90 (ms) | sqfp16 AVX2 p90 (ms) | hnsw AVX2 p99 (ms) | sqfp16 AVX2 p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 1 | 782 | 792 | 33 | 33 | 34 | 34.4 | 35 | 35.8 |
| 2 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 24 | 1175 | 1190 | 21 | 20 | 23 | 22 | 24.2 | 24 |

| S. No. | Dataset | Dimension | Space Type | faiss hnsw Graph Memory (MB) | faiss-hnsw sqfp16 Graph Memory (MB) | % Reduction in Native Memory |
|---|---|---|---|---|---|---|
| 1 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 1463.1992 | 766.1436 | 47.6% |

@naveentatikonda
Member Author

Benchmarking Results Comparison Between Faiss HNSW with AVX2 and Faiss HNSW SQfp16 with AVX2 for the ms_marco-1m and cohere-768-1m-IP datasets

Recall Results

| S. No. | Dataset | Dimension | Primary Shards | Train Data Size | Query Data Size | Training Data Range | Query Data Range | Space Type | faiss-hnsw Recall | faiss-hnsw sqfp16 Recall |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ms_marco | 384 | 1 | 1,000,000 | 50,000 | [-0.31676134, 0.31685343] | [-0.2875412, 0.30342942] | L2 | 0.93597 | 0.93597 |
| 2 | ms_marco | 384 | 24 | 1,000,000 | 50,000 | [-0.31676134, 0.31685343] | [-0.2875412, 0.30342942] | L2 | 0.98998 | 0.98979 |
| 3 | cohere-768-IP | 768 | 1 | 1,000,000 | 10,000 | [-4.1073565, 5.504557] | [-4.109505, 5.4809895] | InnerProduct | 0.76391 | 0.76324 |
| 4 | cohere-768-IP | 768 | 8 | 1,000,000 | 10,000 | [-4.1073565, 5.504557] | [-4.109505, 5.4809895] | InnerProduct | 0.94299 | 0.94214 |
| 5 | cohere-768-IP | 768 | 24 | 1,000,000 | 10,000 | [-4.1073565, 5.504557] | [-4.109505, 5.4809895] | InnerProduct | 0.97981 | 0.97863 |

Indexing and Querying Results

| S. No. | Dataset | Dimension | Space Type | Primary Shards | faiss hnsw AVX2 Indexing Throughput (docs/s) | faiss hnsw sqfp16 AVX2 Indexing Throughput (docs/s) | hnsw AVX2 p50 (ms) | sqfp16 AVX2 p50 (ms) | hnsw AVX2 p90 (ms) | sqfp16 AVX2 p90 (ms) | hnsw AVX2 p99 (ms) | sqfp16 AVX2 p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ms_marco | 384 | L2 | 1 | 1571 | 1556 | 15 | 15 | 15 | 15 | 16.5 | 16 |
| 2 | ms_marco | 384 | L2 | 24 | 2772 | 2618 | 7 | 7 | 7 | 8 | 10.5 | 8 |
| 3 | cohere-768-IP | 768 | InnerProduct | 1 | 844 | 850 | 28 | 28 | 28 | 29 | 30.5 | 30 |
| 4 | cohere-768-IP | 768 | InnerProduct | 8 | 1271 | 1294 | 8 | 8 | 10 | 10 | 12 | 12 |
| 5 | cohere-768-IP | 768 | InnerProduct | 24 | 1437 | 1446 | 10 | 9 | 11 | 11 | 12 | 12 |

Memory Results

| S. No. | Dataset | Dimension | Space Type | faiss hnsw Graph Memory (MB) | faiss-hnsw sqfp16 Graph Memory (MB) | % Reduction in Native Memory |
|---|---|---|---|---|---|---|
| 1 | ms_marco | 384 | L2 | 1610 | 877.65 | 45.5% |
| 2 | cohere-768-IP | 768 | InnerProduct | 3074.98 | 1610 | 47.64% |

@jmazanec15
Member

@naveentatikonda those numbers look really good

@naveentatikonda
Member Author

Benchmarking Results Comparison Between Faiss IVF with AVX2 and Faiss IVF SQfp16 with AVX2 for the cohere-wiki-simple-embeddings-768, ms_marco-1m, and cohere-768-1m-IP datasets

Recall Results

| S. No. | Dataset | Dimension | Primary Shards | Train Data Size | Query Data Size | Training Data Range | Query Data Range | Space Type | nlist | nprobes | faiss-ivf Recall | faiss-ivf sqfp16 Recall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | cohere-wiki-simple-embeddings-768 | 768 | 1 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 4 | 2 | 0.93446 | 0.93847 |
| 2 | cohere-wiki-simple-embeddings-768 | 768 | 1 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 128 | 4 | 0.77699 | 0.7956 |
| 3 | cohere-wiki-simple-embeddings-768 | 768 | 1 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 128 | 8 | 0.88069 | 0.88745 |
| 4 | cohere-wiki-simple-embeddings-768 | 768 | 1 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 512 | 8 | 0.81003 | 0.80722 |
| 5 | cohere-wiki-simple-embeddings-768 | 768 | 8 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 128 | 8 | 0.88069 | 0.88745 |
| 6 | cohere-wiki-simple-embeddings-768 | 768 | 24 | 475,858 | 10,000 | [-4.1561704, 5.5478516] | [-4.065383, 5.4902344] | L2 | 4 | 2 | 0.93588 | 0.91818 |
| 7 | ms_marco | 384 | 1 | 1,000,000 | 50,000 | [-0.31676134, 0.31685343] | [-0.2875412, 0.30342942] | L2 | 64 | 4 | 0.79061 | 0.77586 |
| 8 | ms_marco | 384 | 1 | 1,000,000 | 50,000 | [-0.31676134, 0.31685343] | [-0.2875412, 0.30342942] | L2 | 128 | 4 | 0.74512 | 0.74619 |
| 9 | ms_marco | 384 | 1 | 1,000,000 | 50,000 | [-0.31676134, 0.31685343] | [-0.2875412, 0.30342942] | L2 | 128 | 8 | 0.83593 | 0.83357 |
| 10 | ms_marco | 384 | 24 | 1,000,000 | 50,000 | [-0.31676134, 0.31685343] | [-0.2875412, 0.30342942] | L2 | 128 | 8 | 0.83727 | 0.83758 |
| 11 | cohere-768-IP | 768 | 1 | 1,000,000 | 10,000 | [-4.1073565, 5.504557] | [-4.109505, 5.4809895] | InnerProduct | 128 | 8 | 0.79611 | 0.79028 |
| 12 | cohere-768-IP | 768 | 8 | 1,000,000 | 10,000 | [-4.1073565, 5.504557] | [-4.109505, 5.4809895] | InnerProduct | 128 | 8 | 0.78425 | 0.79671 |
| 13 | cohere-768-IP | 768 | 24 | 1,000,000 | 10,000 | [-4.1073565, 5.504557] | [-4.109505, 5.4809895] | InnerProduct | 128 | 8 | 0.80341 | 0.80161 |

Indexing and Querying Results

| S. No. | Dataset | Dimension | Space Type | Primary Shards | nlist | nprobes | faiss ivf AVX2 Indexing Throughput (docs/s) | faiss ivf sqfp16 AVX2 Indexing Throughput (docs/s) | ivf AVX2 p50 (ms) | sqfp16 AVX2 p50 (ms) | ivf AVX2 p90 (ms) | sqfp16 AVX2 p90 (ms) | ivf AVX2 p99 (ms) | sqfp16 AVX2 p99 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 1 | 4 | 2 | 676 | 701 | 155.5 | 113 | 175.5 | 125 | 188.5 | 131.5 |
| 2 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 1 | 128 | 4 | 692 | 494 | 42 | 36 | 50 | 40 | 54 | 43 |
| 3 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 1 | 128 | 8 | 691 | 493 | 52 | 40 | 62 | 45 | 68.5 | 49 |
| 4 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 1 | 512 | 8 | 684 | 514 | 40 | 33 | 46.5 | 37.5 | 51.5 | 39.5 |
| 5 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 8 | 128 | 8 | 962 | 1081 | 13 | 10 | 16.5 | 12 | 19 | 14 |
| 6 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 24 | 4 | 2 | 1031 | 1048 | 33.5 | 21 | 37 | 22.5 | 40 | 24 |
| 7 | ms_marco | 384 | L2 | 1 | 64 | 4 | 1000 | 1042 | 30 | 23 | 35 | 25.5 | 39 | 28 |
| 8 | ms_marco | 384 | L2 | 1 | 128 | 4 | 1014 | 1045 | 23 | 19 | 26 | 20 | 28 | 21 |
| 9 | ms_marco | 384 | L2 | 1 | 128 | 8 | 1015 | 1041 | 29.5 | 22.5 | 33 | 25 | 36 | 26.5 |
| 10 | ms_marco | 384 | L2 | 24 | 128 | 8 | 1981 | 2266 | 6 | 5 | 7 | 6 | 8 | 6 |
| 11 | cohere-768-IP | 768 | InnerProduct | 1 | 128 | 8 | 599 | 680 | 72 | 52.5 | 97.5 | 64.5 | 113.5 | 74.5 |
| 12 | cohere-768-IP | 768 | InnerProduct | 8 | 128 | 8 | 969 | 1096 | 19 | 13 | 25 | 16 | 29 | 19 |
| 13 | cohere-768-IP | 768 | InnerProduct | 24 | 128 | 8 | 1036 | 1182 | 16 | 11 | 23 | 14 | 26 | 17 |

Memory Results

| S. No. | Dataset | Dimension | Space Type | faiss ivf Graph Memory (MB) | faiss-ivf sqfp16 Graph Memory (MB) | % Reduction in Native Memory |
|---|---|---|---|---|---|---|
| 1 | cohere-wiki-simple-embeddings-768 | 768 | L2 | 1408.02 | 710.96 | 49.51% |
| 2 | ms_marco | 384 | L2 | 1492.27 | 759.85 | 49.08% |
| 3 | cohere-768-IP | 768 | InnerProduct | 2961.62 | 1496.77 | 49.46% |
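For context, a hedged sketch of the method block behind the IVF + SQfp16 runs above (parameter names follow the k-NN method spec; the values mirror row 13: nlist=128, nprobes=8, inner product). Note that, unlike HNSW, Faiss IVF requires a training step through the k-NN model-training APIs before vectors can be indexed, so this fragment alone is not a complete mapping:

```python
# Illustration only: the IVF method definition with the SQfp16 encoder.
ivf_sq_method = {
    "name": "ivf",
    "space_type": "innerproduct",
    "engine": "faiss",
    "parameters": {
        "nlist": 128,
        "nprobes": 8,
        "encoder": {
            "name": "sq",
            "parameters": {"type": "fp16"},
        },
    },
}
```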

@github-project-automation github-project-automation bot moved this from 2.13.0 to ✅ Done in Vector Search RoadMap Mar 20, 2024