
Refactoring HNSW to use a new internal FlatVectorFormat #12729

Merged — 14 commits merged into apache:main on Nov 10, 2023

Conversation

@benwtrent (Member) commented Oct 27, 2023

Currently the HNSW codec does too many things: it not only indexes vectors, it also stores them and determines how to store them given the vector type.

This PR extracts the vector storage into a new format, Lucene99FlatVectorsFormat, and adds a new base class called FlatVectorsFormat. This allows for additional helper functions that let an indexing codec (like HNSW) take advantage of the flat formats.

Additionally, this PR refactors the new Lucene99ScalarQuantizedVectorsFormat to be a FlatVectorsFormat.

Now, Lucene99HnswVectorsFormat is constructed with a Lucene99FlatVectorsFormat, and there is a new Lucene99HnswScalarQuantizedVectorsFormat that uses Lucene99ScalarQuantizedVectorsFormat.
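
To make the layering concrete, here is a rough, self-contained sketch (the tiny interfaces and method signatures below are simplified placeholders that mirror the PR's naming, not the actual Lucene API):

```java
// Simplified placeholder types standing in for the real FlatVectors* classes.
interface FlatVectorsWriter {
  void addVector(int docId, float[] vector); // persists the raw (or quantized) vector
}

interface FlatVectorsFormat {
  FlatVectorsWriter fieldsWriter(); // decides how vectors are encoded on disk
}

// The HNSW side no longer stores vectors itself: it builds the graph and delegates
// storage to whichever flat format it was constructed with.
final class HnswVectorsFormatSketch {
  private final FlatVectorsWriter flatWriter;

  HnswVectorsFormatSketch(FlatVectorsFormat flatFormat) {
    this.flatWriter = flatFormat.fieldsWriter();
  }

  void addVector(int docId, float[] vector) {
    flatWriter.addVector(docId, vector); // storage is handled by the flat format
    // ...the vector's ordinal is then inserted into the HNSW graph...
  }
}
```

Swapping the raw flat format for the scalar-quantized one in this picture is what yields the quantized HNSW variant.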

@jpountz (Contributor) left a comment

I like the idea of improving the composability, but I would prefer it to be more of an implementation detail and not something that users get exposed to. In practice, this would mean making the following changes to your PR:

  • FlatVectorsFormat no longer extends KnnVectorsFormat (and removes some methods like getMaxDimension).
  • FlatVectorsFormat and its subclasses no longer get registered in SPI.
  • Whichever flat format a non-flat format uses is hardcoded in that KnnVectorsFormat, as opposed to being configurable and serialized in the metadata so that the right flat format can be loaded at search time.
  • Lucene99HnswVectorsFormat gets split into Lucene99HnswVectorsFormat and Lucene99HnswScalarQuantizedVectorsFormat. Users now choose whether to use quantization in PerFieldKnnVectorsFormat rather than within Lucene99HnswVectorsFormat itself (see the sketch at the end of this comment).

I realize that there is a counter-argument for allowing everyone to create their own knn vectors format by independently configuring the indexing mechanism (e.g. HNSW vs. IVF), the quantization (e.g. raw vs. scalar quantized), and maybe a couple of other things, similarly to the DFR similarity. But then maybe we should consider diverging the APIs of the "flat" and indexing knn vectors formats further (thinking out loud).
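
As a hedged illustration of that last bullet (the codec-subclassing hook is standard Lucene; the field name and the use of default HNSW parameters are assumptions made for this example):

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswScalarQuantizedVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

// Users pick quantized vs. raw HNSW per field via the codec, not via a format argument.
public class PerFieldVectorCodec extends Lucene99Codec {
  @Override
  public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
    if ("large_embedding".equals(field)) {
      return new Lucene99HnswScalarQuantizedVectorsFormat(); // int8-quantized HNSW
    }
    return new Lucene99HnswVectorsFormat(); // raw float32 HNSW
  }
}
```

Set the codec via IndexWriterConfig#setCodec; each field's format name is recorded in the segment, so no flat format ever needs to be user-visible.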

public QuantizedByteVectorValues getQuantizedVectorValues(String field) throws IOException {
  if (flatVectorsReader instanceof QuantizedVectorsReader) {
    return ((QuantizedVectorsReader) flatVectorsReader).getQuantizedVectorValues(field);
  }
  return null;
}
Contributor (inline review comment):

Should we have that method on the flat vectors reader?

Contributor:
Bump: this sounds like a good addition to the flat vectors reader?

Member Author:
@jpountz I honestly don't know. It is sort of like having getHnswGraph as part of the KnnVectorsReader: it assumes that this is a thing all readers will need. The only thing that needs getQuantizedVectorValues is something that is merging int8-quantized values.

Contributor:
OK, fair enough.

@benwtrent (Member Author)

@jpountz the goal of this change is not just making code reusable, but also:

  • Allowing folks who don't want HNSW to take advantage of the per-segment quantization logic. Paging in float32 values, even when not using HNSW, can take time and memory; reducing the size and just using byte values is really attractive (see the quick sizing example below).
  • Being able to plug in different flat storage regardless of the vector indexer is useful, as I expect we will want to add other quantizations in the future (e.g. bfloat16).
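
A back-of-the-envelope sizing example for the first point (corpus size and dimension count are made-up illustrative numbers):

```java
public class VectorSizing {
  public static void main(String[] args) {
    long numVectors = 1_000_000L; // assumed corpus size
    int dims = 768;               // assumed vector dimension
    long float32Bytes = numVectors * dims * Float.BYTES; // raw float32 flat storage
    long int8Bytes = numVectors * dims;                  // int8 scalar-quantized storage
    // roughly 3072 MB vs 768 MB that may have to be paged in for a full flat scan
    System.out.printf("float32: %d MB, int8: %d MB%n",
        float32Bytes / 1_000_000, int8Bytes / 1_000_000);
  }
}
```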

I suppose your suggestion around separate HNSW formats with flat storage hardcoded can still satisfy my second point, as long as the classes are usable.

But, I would really like to not lose the first goal, making quantization available to users who want flat vector storage.

@benwtrent (Member Author)

OK @jpountz, thinking about it more: to do what you are suggesting, I think the following would work:

  • Force Lucene99HnswVectorsReader & Lucene99HnswVectorsWriter to take a FlatVectors[Reader|Writer]. This way the hardcoded *Format classes can reuse all the common code.
  • Remove the SPI registration; the formats are now required to indicate how the vectors are stored and read.
  • Remove the inheritance tree for the FlatVector* classes so that they are no longer knn formats (see the sketch below).
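
For reference, the standalone abstraction could be as small as the following (a sketch of the proposed shape, assuming the same fieldsWriter/fieldsReader naming as KnnVectorsFormat; not necessarily the exact merged class):

```java
import java.io.IOException;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// No longer extends KnnVectorsFormat and is never registered via SPI; the exposed
// HNSW formats instantiate a concrete subclass (raw or scalar-quantized) directly.
// FlatVectorsWriter / FlatVectorsReader are the sibling abstractions from this PR.
public abstract class FlatVectorsFormat {

  public abstract FlatVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException;

  public abstract FlatVectorsReader fieldsReader(SegmentReadState state) throws IOException;
}
```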

In a separate PR I guess we can talk about a "flat" vector codec and if that is desired or not.

The "*Flat" things now get users very far, they just need to have a custom codec that provides a couple lines of code to forward the searching, etc.

@jpountz (Contributor) commented Oct 30, 2023

Thanks, splitting the way you describe would make me happy.

I had not understood that a flat codec was a goal. Now that I think about it more, I wonder if we should better separate storage and indexing of vectors in FieldType/FieldInfo and in the file formats. E.g. it's a bit awkward that a KnnVectorQuery would try to reason about whether it should use the index or the flat storage when in fact it only has flat storage. Maybe it could look more like keywords, where you decide separately whether to index them (index options) and/or store them (doc values).

@benwtrent (Member Author)

@jpountz updated. Flat is no longer pluggable; two HNSW formats are exposed.

@jpountz (Contributor) left a comment

Thanks, it's a big change so it's hard to review in detail, but the approach looks good to me.


@jimczi (Contributor) commented Nov 3, 2023

I agree with Adrien that hardcoded formats with a clear strategy are better.
We want to avoid exposing a knn format that takes another abstract format; that would be cryptic and difficult to navigate for users.
However, I think we can reuse existing formats as an implementation detail. The Lucene99HnswScalarQuantizedVectorsFormat could be hardcoded to use the Lucene99ScalarQuantizedVectorsFormat internally. The fact that we register a format does not mean that we expose the composability to users. In its current form, isn't the FlatVector... lineage redundant if it only exists to avoid registering the format explicitly?

@benwtrent (Member Author)

@jimczi what do you mean by "existing formats as an implementation detail"?

The flat format is an implementation detail. Folks using the quantized hnsw do not have to supply a flat format.

And by "register format" do you mean we should register the flat formats? Right now they are just classes that encapsulate vector storage and merging.

@jimczi (Contributor) commented Nov 3, 2023

The flat format is an implementation detail. Folks using the quantized hnsw do not have to supply a flat format.

We can register the flat formats for direct usage (outside of HNSW) and use them in the HNSW formats as an implementation detail. My point is that we don't need the FlatVector... classes. We can expose the quantized hnsw format with the quantiles option and use the scalar flat format internally. Users won't have to supply the flat format explicitly, but it will be used internally. For instance, if we introduce a half-float format at a later stage as Float16FlatVectorsFormat, we would need to expose an HNSWFloat16VectorsFormat to make it available with HNSW. In other words, we don't allow composable formats explicitly, but each high-level format can reuse other formats internally (as an implementation detail).

@benwtrent (Member Author)

@jimczi the HNSW writers and readers need the passed flat vector readers and writers to provide specific functions, like mergeOneField returning closeable scorers.

I am not sure all knnVectorReaders/Writers should have that method.

I think we should expose the flat formats in the codec. But the required new functions for indexing the vectors seem to justify a new abstraction.

Or do you think they are just KnnReader/Writers and we update that base abstraction?
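
For context, the kind of hook being discussed looks roughly like this on the merge path (a hedged sketch: type and method names follow the PR discussion, but the exact signatures and the buildGraph/writeGraph helpers are illustrative):

```java
// Inside a hypothetical HNSW writer's merge path: the flat writer merges the field's
// vector storage and hands back a closeable scorer supplier over the merged vectors,
// which the HNSW writer uses to rebuild the graph without materializing float[]s.
void mergeOneField(FieldInfo fieldInfo, MergeState mergeState) throws IOException {
  CloseableRandomVectorScorerSupplier scorerSupplier =
      flatVectorsWriter.mergeOneFieldToIndex(fieldInfo, mergeState);
  try {
    OnHeapHnswGraph graph = buildGraph(scorerSupplier); // illustrative helper
    writeGraph(fieldInfo, graph);                       // illustrative helper
  } finally {
    scorerSupplier.close();
  }
}
```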

@jimczi (Contributor) commented Nov 3, 2023

I think we should expose the flat formats in the codec. But the required new functions for indexing the vectors seem to justify a new abstraction.

Can we add the abstraction as an extension of the existing KnnVectorsFormat/Readers/Writers? So instead of adding the functions to the base class, we introduce the flat format as an abstract class that extends the base one.
That would allow referencing it on the HNSW format directly, and all flat formats would inherit from it.

@benwtrent (Member Author)

@jimczi thinking about it more, it seems a Flat* index format for vector search will require a different API. Right now, kNN search assumes the user provides pre-filters. However, when doing a flat vector search, there is no such thing as a "pre-filter".

In fact, a flat vector search should behave like a typical scorer, iterating over matching documents and scoring them.

The question then becomes, could we just use the LeafReader#getFloatVectorValues?

We might be able to, but this could be a bigger change and might not buy a big improvement. It seems counterintuitive to de-quantize every vector back into a float[] instead of just quantizing the query vector into byte[].

It almost seems like these flat formats should return a RandomVectorScorer instead of satisfying the typical searchNearestVectors interface.
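
A rough illustration of that idea (not an actual Lucene query implementation: the getRandomVectorScorer accessor, the field name, and the collect callback are assumptions, and doc ids are assumed to map 1:1 to vector ordinals):

```java
void scoreFlat(FlatVectorsReader flatReader, float[] queryVector, DocIdSetIterator matching)
    throws IOException {
  // The scorer compares the (quantized) query against stored vectors directly;
  // nothing is de-quantized back into float[] per document.
  RandomVectorScorer scorer = flatReader.getRandomVectorScorer("vec", queryVector);
  for (int doc = matching.nextDoc();
      doc != DocIdSetIterator.NO_MORE_DOCS;
      doc = matching.nextDoc()) {
    collect(doc, scorer.score(doc)); // assumes doc id == vector ordinal for simplicity
  }
}
```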

Since this is a larger API discussion, do we think we can move forward with the way it is now (quantization for HNSW and other vector indices) and iterate on exposing flat vectors in a separate line of work?

@jimczi @jpountz if y'all have ideas around how the API could look, please let me know!

@jimczi (Contributor) commented Nov 9, 2023

Sorry for the late reply.

Since this is a larger API discussion, do we think we can move forward with the way it is now (quantization for HNSW and other vector indices) and iterate on exposing flat vectors in a separate line of work?

Sure, that's fine with me. Can you update the title of this PR then to reflect what it's really doing? If we're not adding the flat format in this PR, it's more of a cleanup that makes code more reusable without affecting the format, right?
Regarding the API discussion, I think that exposing the RandomVectorScorer should be enough for what we need, but I am surely missing something.

@benwtrent benwtrent changed the title Adding new flat vector format and refactoring HNSW Refactoring HNSW to use a new internal FlatVectorFormat Nov 9, 2023
@benwtrent (Member Author)

@jimczi updated the title.

benwtrent merged commit a47ba33 into apache:main on Nov 10, 2023
4 checks passed
benwtrent deleted the feature/refactor-flat-codecs branch on November 10, 2023 at 19:05
benwtrent added a commit that referenced this pull request Nov 10, 2023
javanna added a commit to javanna/elasticsearch that referenced this pull request Nov 13, 2023
A change in Lucene99HnswVectorsFormat requires that we adapt our code, see
apache/lucene#12729
javanna added a commit to elastic/elasticsearch that referenced this pull request Nov 13, 2023
* Resolve compile error in DenseVectorFieldMapper

A change in Lucene99HnswVectorsFormat requires that we adapt our code, see
apache/lucene#12729

* Add new Lucene file extensions

* Fixing format name check

---------

Co-authored-by: Benjamin Trent <4357155+benwtrent@users.noreply.github.com>
3 participants