Improve int4 compressed comparisons performance #13321
Conversation
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorScorer.java
Overall this is very nice. I left a few initial small comments.
lucene/core/src/java/org/apache/lucene/util/hnsw/RandomAccessVectorValues.java
...core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99ScalarQuantizedVectorScorer.java
LGTM.
Separately, we know whether or not the vectors are packed when creating the scorer, in which case why not just pack the query vector, allowing to do the comparison more easily - the query vector can be coerced to match that of what is in the values. But this requires a little more separation, which can follow if worth it.
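The packing step this comment alludes to can be sketched in plain Java. This is a hypothetical illustration, not Lucene's actual packing code; the nibble layout (element i in the high nibble, element i + half in the low nibble) is an assumption.

```java
// Hypothetical sketch of packing an int4 query vector so it matches a packed
// on-disk layout. Assumed layout: byte i holds element i in the high nibble
// and element i + half in the low nibble. Values must be in [0, 15].
public final class Int4Packer {
  /** Packs an even-length vector of int4 values into half as many bytes. */
  public static byte[] pack(byte[] unpacked) {
    int half = unpacked.length / 2;
    byte[] packed = new byte[half];
    for (int i = 0; i < half; i++) {
      packed[i] = (byte) ((unpacked[i] << 4) | unpacked[i + half]);
    }
    return packed;
  }
}
```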
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorScorer.java
Because this makes things measurably slower. Having to decompress only one of the vectors, and not paying the cost of compressing the query vector, is the reason why this is so fast. And I am not sure how it makes the comparisons easier: we would then have to add LoC to decompress both and then compare both. The iteration code, etc. all remains the same. [apologies - I incorrectly edited this comment when trying to reply to it. It should be restored now.]
Ok. That's a very good reason! ;-)
Right. It's just that the masking and shifting would apply equally to each of the inputs. And we'd load less data, since the number of bytes in the query would be less. I'm perfectly ok with the code as it is; I was mainly ensuring that such was considered (which it was).
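A rough scalar sketch of the trick under discussion: one vector stays packed (two int4 values per byte), the other stays unpacked, and the masking and shifting happens per packed byte. The nibble layout here (element i in the high nibble, element i + half in the low nibble) is an assumption for illustration, not necessarily Lucene's actual layout.

```java
// Scalar sketch of an int4 dot product between a packed vector and an
// unpacked vector, avoiding both decompression of the stored vector and
// compression of the query. Assumed layout: byte i holds element i in the
// high nibble and element i + half in the low nibble.
public final class Int4DotProduct {
  public static int dotPackedUnpacked(byte[] packed, byte[] unpacked) {
    int half = packed.length; // unpacked.length == 2 * packed.length
    int sum = 0;
    for (int i = 0; i < half; i++) {
      int b = packed[i] & 0xFF;
      sum += (b >>> 4) * unpacked[i];         // high nibble pairs with element i
      sum += (b & 0x0F) * unpacked[i + half]; // low nibble pairs with element i + half
    }
    return sum;
  }
}
```

The vectorized (Panama) version applies the same mask/shift to whole lanes at a time; the scalar form just makes the pairing explicit.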
Panama logic looks good to me. Not for this PR, just a question: is it possible to use this same trick for 8-bit integer calculations too? I imagine this trick could probably help there as well, if it is safe.
I went over the Panama code again, and confused myself about the potential to overflow. There is no issue, but it was not obvious, as one has to match the number of potential values (not bytes) to the number of lanes x accumulators. Anyway, I added a test for boundary conditions, which will help when we move to optimising wider bit-widths.
Some napkin math (this would need to be double checked...): a 128-bit width fits 8 shorts.
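The lanes-times-accumulators bound mentioned above can be spelled out. This is just the arithmetic from the thread, not Lucene code: with int4 values the largest per-element product is 15 * 15 = 225, so a 16-bit accumulator lane can absorb floor(32767 / 225) = 145 products before overflow is possible.

```java
// Napkin math for int4 overflow: how many max-value products a short
// accumulator lane can hold, and the resulting dimension limit per lane count.
public final class Int4OverflowMath {
  /** floor(Short.MAX_VALUE / (15 * 15)) = 145 products per 16-bit lane. */
  public static int maxAccumulationsPerShortLane() {
    return Short.MAX_VALUE / (15 * 15);
  }

  /** Maximum safe dimension count for a given number of short lanes. */
  public static int maxDims(int shortLanes) {
    return shortLanes * maxAccumulationsPerShortLane();
  }
}
```

With 16 short lanes (a 256-bit vector) this gives 2320 dimensions, and with 32 lanes (512-bit) 4640, matching the figures quoted later in the thread.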
OK, I ran on Google's ARM machine. Something else funny is that this is almost at the same speed as float: the micro-benchmarks are VERY close to float. This means the reduction in bytes read & parsing will make int4 much faster than float.
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultVectorUtilSupport.java
This is a great optimisation for int4, but won't work for int8. We can get a bit more perf out of int4 on AVX2 and AVX 512 by applying a similar pattern.

AVX 256: a maximum of 2320 dimensions, each lane summing 2320/16 = 145 multiplications. Let's set an inner loop bound of 2048.

before

after

AVX 512: a maximum of 4640 dimensions, each lane summing 4640/32 = 145 multiplications. Let's set an inner loop bound of 4096.

before

after
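A scalar stand-in for the inner-loop-bound idea: accumulate into shorts for a bounded number of products, then flush into a wider int total before overflow is possible. In the real Panama code the short accumulator is a vector, so the safe bound is multiplied by the lane count (giving the 2048/4096 bounds above); a single scalar short lane can only safely absorb 145 max-value products, per the napkin math in this thread.

```java
// Scalar sketch of bounded accumulation: sum products of int4 values into a
// short, flushing to an int total every `bound` products so the short cannot
// overflow. `bound` must be <= 145 for a single scalar lane with values in
// [0, 15]; the vectorized version spreads this across many lanes.
public final class BoundedAccumulator {
  public static int dot(byte[] a, byte[] b, int bound) {
    int total = 0;
    for (int start = 0; start < a.length; start += bound) {
      short acc = 0; // would be a vector of short lanes in the Panama code
      int end = Math.min(a.length, start + bound);
      for (int i = start; i < end; i++) {
        acc += (short) (a[i] * b[i]);
      }
      total += acc; // flush before the next bounded chunk
    }
    return total;
  }
}
```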
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
I double checked @ChrisHegarty's improvements on 512. My previous numbers:
Here are the new numbers (I include float32 and regular binaryVector for comparisons):
@uschindler I addressed your comments.
To me the changes look fine. For discussion: in my opinion, the conditional code falling back to the default code if neither vector is compressed should possibly be moved to the VectorUtil class, and I think then the booleans could be removed and we would have only two variants called directly from VectorUtil. Uwe
This updates the int4 dot-product comparison with an optimized path for when one of the vectors is compressed (the most common search case). This change actually makes compressed search on ARM faster than uncompressed. On AVX512/256 it is still slightly slower than uncompressed, but still much faster with this optimization than before (eagerly decompressing).

This optimization is tied tightly to how the vectors are actually compressed and stored; consequently, I added a new scorer that lives within the lucene99 codec.

So, this gives us an 8x size reduction over float32, well more than 2x faster queries than float32, and no need to rerank, as the recall and accuracy are excellent.
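The 8x figure is simple arithmetic for the 1024-dimension vectors used in this PR. The helper class here is illustrative, not Lucene API:

```java
// Storage per vector: float32 uses 4 bytes per dimension; packed int4 stores
// two 4-bit values per byte, i.e. dims / 2 bytes (for even dimension counts).
public final class StorageMath {
  public static int float32Bytes(int dims) {
    return dims * Float.BYTES;
  }

  public static int int4PackedBytes(int dims) {
    return dims / 2;
  }
}
```

For 1024 dimensions that is 4096 bytes versus 512 bytes per vector, an 8x reduction.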
Here are some lucene-util numbers over CohereV3 at 1024 dimensions:
New compressed numbers on ARM
Compared with uncompressed on ARM:
Here are some JMH numbers as well (note, I am excluding an odd number of indices as these don't support compression).
NOTE: PackedUnpacked is eagerly decompressing the vectors and then using dot-product, which is what occurs now.

ARM:
AVX512: