Refactoring HNSW to use a new internal FlatVectorFormat #12729
Conversation
I like the idea of improving the composability, but I would prefer it to be more an implementation detail and not something that users get exposed to. In practice, this would mean making the following changes to your PR:

- `FlatVectorsFormat` no longer extends `KnnVectorsFormat` (and removes some methods like `getMaxDimension`).
- `FlatVectorsFormat` and its subclasses no longer get registered in SPI.
- Whichever flat format is used in a non-flat format is something that's hardcoded in the `KnnVectorsFormat`, as opposed to something that is configurable and serialized in the metadata so that the right flat format can be loaded at search time.
- `Lucene99HnswVectorsFormat` gets split into `Lucene99HnswVectorsFormat` and `Lucene99HnswScalarQuantizedVectorsFormat`. Users now choose whether to use quantization in `PerFieldKnnVectorsFormat` rather than within `Lucene99HnswVectorsFormat` itself.
I realize that there is a counter argument for allowing everyone to create their own knn vectors format by configuring independently the indexing mechanism (e.g. HNSW vs. IVF), quantization (e.g. raw vs. scalar quantized) and maybe a couple other things, similarly to the DFR similarity. But then maybe we should consider diverging the APIs of the "flat" and indexing knn vectors formats further (thinking out loud).
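The split proposed above can be illustrated with a toy sketch. All names here are simplified, hypothetical stand-ins for the Lucene classes under discussion, not the actual implementation: the point is only that the flat storage is not itself a user-visible vectors format, and each HNSW format hardcodes which flat storage it uses.

```java
// Toy model of the proposed split. FlatStorage stands in for the proposed
// FlatVectorsFormat: it is not a VectorsFormat, so it never appears in SPI
// or in user-facing configuration.
interface FlatStorage {
  String storageName();
}

class RawFlatStorage implements FlatStorage {
  public String storageName() { return "flat-raw"; }
}

class ScalarQuantizedFlatStorage implements FlatStorage {
  public String storageName() { return "flat-int8"; }
}

// Stands in for KnnVectorsFormat: only indexing formats are user-visible.
abstract class VectorsFormat {
  abstract String describe();
}

// Hardcodes raw flat storage, mirroring the proposed Lucene99HnswVectorsFormat.
class HnswVectorsFormat extends VectorsFormat {
  private final FlatStorage flat = new RawFlatStorage();
  String describe() { return "hnsw+" + flat.storageName(); }
}

// Hardcodes quantized storage, mirroring the proposed
// Lucene99HnswScalarQuantizedVectorsFormat. Users pick quantization by
// choosing this format per field, not by configuring the HNSW format.
class HnswScalarQuantizedVectorsFormat extends VectorsFormat {
  private final FlatStorage flat = new ScalarQuantizedFlatStorage();
  String describe() { return "hnsw+" + flat.storageName(); }
}
```

In this shape, the choice of flat storage is baked into each concrete format, so nothing about it needs to be serialized into index metadata.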
```java
public QuantizedByteVectorValues getQuantizedVectorValues(String field) throws IOException {
  if (flatVectorsReader instanceof QuantizedVectorsReader) {
    return ((QuantizedVectorsReader) flatVectorsReader).getQuantizedVectorValues(field);
  }
  return null;
}
```
Should we have that method on the flat vectors reader?
Bump: this sounds like a good addition to the flat vectors reader?
@jpountz I honestly don't know. It is sort of like having `getHnswGraph` as part of the `KnnVectorsReader`. It assumes that this is a thing that all readers will need. The only thing that needs this `getQuantizedVectorValues` is something that is merging `int8` quantized values.
OK, fair enough.
@jpountz the goal of this change is not just making code reusable. But:
I suppose your suggestion around separate HNSW formats with flat storage hardcoded can still satisfy my second point, as long as the classes are usable. But I would really like to not lose the first goal: making quantization available to users who want flat vector storage.
OK, @jpountz thinking about it more. To do what you are suggesting, I think the following would work:
In a separate PR I guess we can talk about a "flat" vector codec and whether that is desired or not. The "*Flat" things now get users very far; they just need a custom codec that provides a couple lines of code to forward the searching, etc.
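The "couple lines of code to forward the searching" idea can be sketched minimally. The types and names below are hypothetical simplifications, not the real Lucene reader API: a user codec would wrap a flat reader and forward search to a brute-force scan over all stored vectors.

```java
// Hypothetical, simplified flat reader: just stored vectors by doc ordinal.
class FlatReader {
  final float[][] vectors;
  FlatReader(float[][] vectors) { this.vectors = vectors; }
}

// What a user's flat-only codec would forward searching to: score every
// stored vector against the query (dot product here) and keep the best doc.
class FlatOnlySearcher {
  private final FlatReader reader;
  FlatOnlySearcher(FlatReader reader) { this.reader = reader; }

  int searchTop1(float[] query) {
    int best = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    for (int doc = 0; doc < reader.vectors.length; doc++) {
      float score = 0;
      for (int i = 0; i < query.length; i++) {
        score += query[i] * reader.vectors[doc][i];
      }
      if (score > bestScore) {
        bestScore = score;
        best = doc;
      }
    }
    return best;
  }
}
```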
Thanks, splitting the way you describe would make me happy. I had not understood that the flat codec was a goal. Now that I think more about it, I wonder if we should better separate storage and indexing of vectors in FieldType/FieldInfo and file formats. E.g. it's a bit awkward that a
@jpountz updated. Flat is no longer pluggable, two HNSW formats are exposed.
Thanks, it's a big change so it's hard to review in detail, but the approach looks good to me.
I agree with Adrien that hardcoded formats with a clear strategy are better.
@jimczi what do you mean by "existing format as implementation detail"? The flat format is an implementation detail. Folks using the quantized HNSW do not have to supply a flat format. And by "register format", do you mean we should register the flat formats? Right now they are just classes that encapsulate vector storage and merging.
We can register the flat formats for direct usage (outside of HNSW) and use them in the HNSW formats as an implementation detail? My point is that we don't need the
@jimczi the HNSW writers and readers need the passed flat vector readers and writers to provide specific functions, like `mergeOneField` that returns closeable scorers. I am not sure all KnnVectorsReaders/Writers should have that method. I think we should expose the flat formats in the codec. But the required new functions for indexing the vectors seem to justify a new abstraction. Or do you think they are just KnnReader/Writers and we update that base abstraction?
Can we add the abstraction as an extension of the existing KnnVectorsFormat/Readers/Writers? So instead of adding the functions to the base class, we introduce the flat format as an abstract class that extends the base one.
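The extension idea above can be sketched with hypothetical, simplified stand-ins (not the real Lucene classes): the flat writer subclasses the base writer, and the extra merge hook that HNSW needs lives only on the subclass, never on the base abstraction.

```java
import java.util.ArrayList;
import java.util.List;

// Stands in for the base KnnVectorsWriter abstraction: no flat-specific hooks.
abstract class KnnWriter {
  abstract void writeField(String field);
}

// The proposed flat abstraction extends the base class. Only flat writers
// expose the merge hook returning a closeable scorer, so readers/writers
// that are not flat never carry an unimplementable method.
abstract class FlatWriter extends KnnWriter {
  abstract AutoCloseable mergeOneFieldToScorer(String field);
}

// Minimal concrete flat writer for illustration.
class SimpleFlatWriter extends FlatWriter {
  final List<String> written = new ArrayList<>();

  void writeField(String field) {
    written.add(field);
  }

  AutoCloseable mergeOneFieldToScorer(String field) {
    // Placeholder for a scorer over the merged flat vectors; closing would
    // release any temporary merge state.
    return () -> {};
  }
}
```

An indexing format like HNSW would then accept a `FlatWriter` specifically, while plain consumers keep working against `KnnWriter`.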
@jimczi thinking about it more, it seems a Flat* index format for vector search will require a different API. Right now, kNN search assumes the user provides pre-filters. However, when doing a flat vector search, there is no such thing as a "pre-filter". In fact, a flat vector search should behave like a typical scorer, iterating over matching documents and scoring them.

The question then becomes, could we just use the

We might be able to, but this could be a bigger change and might not buy a big improvement. It seems counter-intuitive to de-quantize every vector back into a

It almost seems like these flat formats should return a

Since this is a larger API discussion, do we think we can move forward with the way it is now (quantization for HNSW and other vector indices) and iterate on exposing flat vectors in a separate line of work?

@jimczi @jpountz if y'all have ideas around how the API could look, please let me know!
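The "flat search behaves like a typical scorer" idea can be sketched as follows. The class and method names are hypothetical simplifications (not the actual Lucene scorer API): instead of taking a pre-filter, the scorer iterates candidate doc ordinals and scores each one on demand.

```java
// Hypothetical scorer over flat vector storage: advance through docs like a
// regular query scorer, computing the similarity lazily per document.
class FlatVectorScorer {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final float[][] vectors; // flat storage, indexed by doc ordinal
  private final float[] query;
  private int doc = -1;

  FlatVectorScorer(float[][] vectors, float[] query) {
    this.vectors = vectors;
    this.query = query;
  }

  // Advance to the next matching document, or NO_MORE_DOCS when exhausted.
  int nextDoc() {
    doc++;
    return doc < vectors.length ? doc : NO_MORE_DOCS;
  }

  // Score the current document (dot product as the similarity here).
  float score() {
    float s = 0;
    for (int i = 0; i < query.length; i++) {
      s += query[i] * vectors[doc][i];
    }
    return s;
  }
}
```

In this shape the flat format composes with ordinary query execution (conjunctions, collectors) rather than with the kNN pre-filter machinery.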
Sorry for the late reply.
Sure, that's fine with me. Can you update the title of this PR then to reflect what it's really doing? If we're not adding the flat format in this PR, it's more like a cleanup to make code more reusable that doesn't affect the format, right?
@jimczi updated the title.
Currently the HNSW codec does too many things: it not only indexes vectors, but also stores them and determines how to store them given the vector type. This PR extracts the vector storage into a new format, `Lucene99FlatVectorsFormat`, and adds a new base class called `FlatVectorsFormat`. This allows for some additional helper functions that let an indexing codec (like HNSW) take advantage of the flat formats. Additionally, this PR refactors the new `Lucene99ScalarQuantizedVectorsFormat` to be a `FlatVectorsFormat`. Now, `Lucene99HnswVectorsFormat` is constructed with a `Lucene99FlatVectorsFormat`, and a new `Lucene99HnswScalarQuantizedVectorsFormat` uses `Lucene99ScalarQuantizedVectorsFormat`.
A change in Lucene99HnswVectorsFormat requires that we adapt our code, see apache/lucene#12729
* Resolve compile error in DenseVectorFieldMapper A change in Lucene99HnswVectorsFormat requires that we adapt our code, see apache/lucene#12729 * Add new Lucene file extensions * Fixing format name check --------- Co-authored-by: Benjamin Trent <4357155+benwtrent@users.noreply.github.com>