
Refactoring HNSW to use a new internal FlatVectorFormat #12729

Merged — 14 commits merged into apache:main on Nov 10, 2023

Conversation

@benwtrent (Member) commented Oct 27, 2023

Currently the HNSW codec does too many things: it not only indexes vectors, it also stores them and determines how to store them given the vector type.

This PR extracts the vector storage into a new format, Lucene99FlatVectorsFormat, and adds a new base class called FlatVectorsFormat. This allows for additional helper functions that let an indexing codec (like HNSW) take advantage of the flat formats.

Additionally, this PR refactors the new Lucene99ScalarQuantizedVectorsFormat to be a FlatVectorsFormat.

Now, Lucene99HnswVectorsFormat is constructed with a Lucene99FlatVectorsFormat, and there is a new Lucene99HnswScalarQuantizedVectorsFormat that uses Lucene99ScalarQuantizedVectorsFormat.
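
To make the layering concrete, here is a rough, self-contained sketch (the tiny interfaces and method signatures below are simplified placeholders that mirror the PR's naming, not the actual Lucene API):

```java
// Simplified placeholder types standing in for the real FlatVectors* classes.
interface FlatVectorsWriter {
  void addVector(int docId, float[] vector); // persists the raw (or quantized) vector
}

interface FlatVectorsFormat {
  FlatVectorsWriter fieldsWriter(); // decides how vectors are encoded on disk
}

// The HNSW side no longer stores vectors itself: it builds the graph and delegates
// storage to whichever flat format it was constructed with.
final class HnswVectorsFormatSketch {
  private final FlatVectorsWriter flatWriter;

  HnswVectorsFormatSketch(FlatVectorsFormat flatFormat) {
    this.flatWriter = flatFormat.fieldsWriter();
  }

  void addVector(int docId, float[] vector) {
    flatWriter.addVector(docId, vector); // storage is handled by the flat format
    // ...the vector's ordinal is then inserted into the HNSW graph...
  }
}
```

Swapping the raw flat format for the scalar-quantized one in this picture is what yields the quantized HNSW variant.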

@jpountz (Contributor) left a comment

I like the idea of improving the composability, but I would prefer it to be more of an implementation detail and not something that users get exposed to. In practice, this would mean making the following changes to your PR:

  • FlatVectorsFormat no longer extends KnnVectorsFormat (and removes some methods like getMaxDimension).
  • FlatVectorsFormat and its subclasses no longer get registered in SPI.
  • Whichever flat format a non-flat format uses is hardcoded in that KnnVectorsFormat, as opposed to being configurable and serialized in the metadata so that the right flat format can be loaded at search time.
  • Lucene99HnswVectorsFormat gets split into Lucene99HnswVectorsFormat and Lucene99HnswScalarQuantizedVectorsFormat. Users now choose whether to use quantization in PerFieldKnnVectorsFormat rather than within Lucene99HnswVectorsFormat itself (see the sketch at the end of this comment).

I realize that there is a counter-argument for allowing everyone to create their own knn vectors format by independently configuring the indexing mechanism (e.g. HNSW vs. IVF), the quantization (e.g. raw vs. scalar quantized), and maybe a couple of other things, similarly to the DFR similarity. But then maybe we should consider diverging the APIs of the "flat" and indexing knn vectors formats further (thinking out loud).
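
As a hedged illustration of that last bullet (the codec-subclassing hook is standard Lucene; the field name and the use of default HNSW parameters are assumptions made for this example):

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswScalarQuantizedVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

// Users pick quantized vs. raw HNSW per field via the codec, not via a format argument.
public class PerFieldVectorCodec extends Lucene99Codec {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    if ("large_embedding".equals(field)) {
      return new Lucene99HnswScalarQuantizedVectorsFormat(); // int8-quantized HNSW
    }
    return new Lucene99HnswVectorsFormat(); // raw float32 HNSW
  }
}
```

Set the codec via IndexWriterConfig#setCodec; each field's format name is recorded in the segment, so no flat format ever needs to be user-visible.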

public QuantizedByteVectorValues getQuantizedVectorValues(String field) throws IOException {
  if (flatVectorsReader instanceof QuantizedVectorsReader) {
    return ((QuantizedVectorsReader) flatVectorsReader).getQuantizedVectorValues(field);
  }
  return null;
}
Contributor (inline review comment):

Should we have that method on the flat vectors reader?

Contributor:
Bump: this sounds like a good addition to the flat vectors reader?

Member Author:
@jpountz I honestly don't know. It is sort of like having getHnswGraph as part of the KnnVectorsReader: it assumes that this is a thing all readers will need. The only thing that needs getQuantizedVectorValues is something that is merging int8-quantized values.

Contributor:
OK, fair enough.

@benwtrent (Member Author)

@jpountz the goal of this change is not just making code reusable, but also:

  • Allowing folks who don't want HNSW to take advantage of the per-segment quantization logic. Paging in float32 values, even when not using HNSW, can take time and memory; reducing the size and just using byte values is really attractive (see the quick sizing example below).
  • Being able to plug in different flat storage regardless of the vector indexer is useful, as I expect we will want to add other quantizations in the future (e.g. bfloat16).
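
A back-of-the-envelope sizing example for the first point (corpus size and dimension count are made-up illustrative numbers):

```java
public class VectorSizing {
  public static void main(String[] args) {
    long numVectors = 1_000_000L; // assumed corpus size
    int dims = 768;               // assumed vector dimension
    long float32Bytes = numVectors * dims * Float.BYTES; // raw float32 flat storage
    long int8Bytes = numVectors * dims;                  // int8 scalar-quantized storage
    // roughly 3072 MB vs 768 MB that may have to be paged in for a full flat scan
    System.out.printf("float32: %d MB, int8: %d MB%n",
        float32Bytes / 1_000_000, int8Bytes / 1_000_000);
  }
}
```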

I suppose your suggestion around separate HNSW formats with flat storage hardcoded can still satisfy my second point, as long as the classes are usable.

But, I would really like to not lose the first goal, making quantization available to users who want flat vector storage.

@benwtrent (Member Author)

OK @jpountz, thinking about it more: to do what you are suggesting, I think the following would work:

  • Force Lucene99HnswVectorsReader & Lucene99HnswVectorsWriter to take a FlatVectors[Reader|Writer]. This way the hardcoded *Format classes can reuse all the common code.
  • Remove the SPI registration; the formats are now required to indicate how the vectors are stored and read.
  • Remove the inheritance tree for the FlatVector* classes so that they are no longer knn formats (see the sketch below).
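
For reference, the standalone abstraction could be as small as the following (a sketch of the proposed shape, assuming the same fieldsWriter/fieldsReader naming as KnnVectorsFormat; not necessarily the exact merged class):

```java
import java.io.IOException;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// No longer extends KnnVectorsFormat and is never registered via SPI; the exposed
// HNSW formats instantiate a concrete subclass (raw or scalar-quantized) directly.
// FlatVectorsWriter / FlatVectorsReader are the sibling abstractions from this PR.
public abstract class FlatVectorsFormat {

  public abstract FlatVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException;

  public abstract FlatVectorsReader fieldsReader(SegmentReadState state) throws IOException;
}
```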

In a separate PR I guess we can talk about a "flat" vector codec and if that is desired or not.

The "*Flat" things now get users very far, they just need to have a custom codec that provides a couple lines of code to forward the searching, etc.

@jpountz (Contributor) commented Oct 30, 2023

Thanks, splitting the way you describe would make me happy.

I had not understood that a flat codec was a goal. Now that I think about it more, I wonder if we should better separate storage and indexing of vectors in FieldType/FieldInfo and in the file formats. E.g. it's a bit awkward that a KnnVectorQuery would try to reason about whether it should use the index or the flat storage when in fact it only has flat storage. Maybe it could look more like keywords, where you decide separately whether to index them (index options) and/or store them (doc values).

@benwtrent (Member Author)

@jpountz updated. Flat is no longer pluggable; two HNSW formats are exposed.

@jpountz (Contributor) left a comment

Thanks, it's a big change so it's hard to review in detail, but the approach looks good to me.


@jimczi (Contributor) commented Nov 3, 2023

I agree with Adrien that hardcoded formats with a clear strategy are better.
We want to avoid exposing a knn format that takes another abstract format; that would be cryptic and difficult to navigate for users.
However, I think we can reuse existing formats as an implementation detail. The Lucene99HnswScalarQuantizedVectorsFormat could be hardcoded to use the Lucene99ScalarQuantizedVectorsFormat internally. The fact that we register a format does not mean that we expose the composability to users. In its current form, isn't the FlatVector... lineage redundant if it only exists to avoid registering the format explicitly?

@benwtrent (Member Author)

@jimczi what do you mean by "existing formats as an implementation detail"?

The flat format is an implementation detail. Folks using the quantized hnsw do not have to supply a flat format.

And by "register format" do you mean we should register the flat formats? Right now they are just classes that encapsulate vector storage and merging.

@jimczi (Contributor) commented Nov 3, 2023

The flat format is an implementation detail. Folks using the quantized hnsw do not have to supply a flat format.

We can register the flat formats for direct usage (outside of HNSW) and use them in the HNSW formats as an implementation detail. My point is that we don't need the FlatVector... classes. We can expose the quantized hnsw format with the quantiles option and use the scalar flat format internally. Users won't have to supply the flat format explicitly, but it will be used internally. For instance, if we introduce a half-float format at a later stage as Float16FlatVectorsFormat, we would need to expose an HNSWFloat16VectorsFormat to make it available with HNSW. In other words, we don't allow composable formats explicitly, but each high-level format can reuse other formats internally (as an implementation detail).

@benwtrent (Member Author)

@jimczi the HNSW writers and readers need the passed flat vector readers and writers to provide specific functions, like mergeOneField returning closeable scorers.

I am not sure all knnVectorReaders/Writers should have that method.

I think we should expose the flat formats in the codec. But the required new functions for indexing the vectors seem to justify a new abstraction.

Or do you think they are just KnnReader/Writers and we update that base abstraction?
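
For context, the kind of hook being discussed looks roughly like this on the merge path (a hedged sketch: type and method names follow the PR discussion, but the exact signatures and the buildGraph/writeGraph helpers are illustrative):

```java
// Inside a hypothetical HNSW writer's merge path: the flat writer merges the field's
// vector storage and hands back a closeable scorer supplier over the merged vectors,
// which the HNSW writer uses to rebuild the graph without materializing float[]s.
void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
  CloseableRandomVectorScorerSupplier scorerSupplier =
      flatVectorsWriter.mergeOneFieldToIndex(fieldInfo, mergeState);
  try {
    OnHeapHnswGraph graph = buildGraph(scorerSupplier); // illustrative helper
    writeGraph(fieldInfo, graph);                       // illustrative helper
  } finally {
    scorerSupplier.close();
  }
}
```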

@jimczi (Contributor) commented Nov 3, 2023

I think we should expose the flat formats in the codec. But the required new functions for indexing the vectors seem to justify a new abstraction.

Can we add the abstraction as an extension of the existing KnnVectorsFormat/Readers/Writers? So instead of adding the functions to the base class, we introduce the flat format as an abstract class that extends the base one.
That would allow referencing it on the HNSW format directly, and all flat formats would inherit from it.

@benwtrent (Member Author)

@jimczi thinking about it more, it seems a Flat* index format for vector search will require a different API. Right now, kNN search assumes the user provides pre-filters. However, when doing a flat vector search, there is no such thing as a "pre-filter".

In fact, a flat vector search should behave like a typical scorer, iterating over matching documents and scoring them.

The question then becomes, could we just use the LeafReader#getFloatVectorValues?

We might be able to, but this could be a bigger change and might not buy a big improvement. It seems counterintuitive to de-quantize every vector back into a float[] instead of just quantizing the query vector into byte[].

It almost seems like these flat formats should return a RandomVectorScorer instead of satisfying the typical searchNearestVectors interface.
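
A rough illustration of that idea (not an actual Lucene query implementation: the getRandomVectorScorer accessor, the field name, and the collect callback are assumptions, and doc ids are assumed to map 1:1 to vector ordinals):

```java
void scoreFlat(FlatVectorsReader flatReader, float[] queryVector, DocIdSetIterator matching)
    throws IOException {
  // The scorer compares the (quantized) query against stored vectors directly;
  // nothing is de-quantized back into float[] per document.
  RandomVectorScorer scorer = flatReader.getRandomVectorScorer("vec", queryVector);
  for (int doc = matching.nextDoc();
      doc != DocIdSetIterator.NO_MORE_DOCS;
      doc = matching.nextDoc()) {
    collect(doc, scorer.score(doc)); // assumes doc id == vector ordinal for simplicity
  }
}
```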

Since this is a larger API discussion, do we think we can move forward with the way it is now (quantization for HNSW and other vector indices) and iterate on exposing flat vectors in a separate line of work?

@jimczi @jpountz if y'all have ideas around how the API could look, please let me know!

@jimczi (Contributor) commented Nov 9, 2023

Sorry for the late reply.

Since this is a larger API discussion, do we think we can move forward with the way it is now (quantization for HNSW and other vector indices) and iterate on exposing flat vectors in a separate line of work?

Sure, that's fine with me. Can you update the title of this PR then to reflect what it's really doing? If we're not adding the flat format in this PR, it's more of a cleanup that makes code more reusable without affecting the format, right?
Regarding the API discussion, I think that exposing the RandomVectorScorer should be enough for what we need, but I am surely missing something.

@benwtrent benwtrent changed the title Adding new flat vector format and refactoring HNSW Refactoring HNSW to use a new internal FlatVectorFormat Nov 9, 2023
@benwtrent (Member Author)

@jimczi updated the title.

benwtrent merged commit a47ba33 into apache:main on Nov 10, 2023
4 checks passed
benwtrent deleted the feature/refactor-flat-codecs branch on November 10, 2023 at 19:05
benwtrent added a commit that referenced this pull request Nov 10, 2023
javanna added a commit to javanna/elasticsearch that referenced this pull request Nov 13, 2023
A change in Lucene99HnswVectorsFormat requires that we adapt our code, see
apache/lucene#12729
javanna added a commit to elastic/elasticsearch that referenced this pull request Nov 13, 2023
* Resolve compile error in DenseVectorFieldMapper

A change in Lucene99HnswVectorsFormat requires that we adapt our code, see
apache/lucene#12729

* Add new Lucene file extensions

* Fixing format name check

---------

Co-authored-by: Benjamin Trent <4357155+benwtrent@users.noreply.github.com>
3 participants