
Remove vector values copy() methods, moving IndexInput.clone() and temp storage into lower-level interfaces #13872

Open: msokolov wants to merge 22 commits into main
Conversation

@msokolov (Contributor) commented Oct 7, 2024:

Addresses #13831.

The basic idea is to move the scratch arrays and cloned IndexInputs (generally, any stateful data) into objects returned by `KnnVectorValues`, so that the class itself no longer needs to be cloned to obtain independent sources of vectors (or scorers). `ByteVectorValues` and `FloatVectorValues` get a new `vectors()` method (returning the "dictionary") that supports random access. Also, `RandomVectorScorer` now receives cloned inputs and scratch data when it is created, rather than relying on getting these from its enclosing values instance.

Naming notes:

The issue calls for a "dictionary" interface, but I found the name a bit confusing, so I undertook the following renaming. The "dictionary" interface is represented by `FloatVectorValues.Floats` and `ByteVectorValues.Bytes` (hearkening back to the `RandomAccessVectorValues` classes), and these new objects are returned by the new `*VectorValues.vectors()` methods. Where these methods are called, I've renamed the variables storing them to `vectors`. Instances of `KnnVectorValues` are now mostly stored in variables called `vectorValues`; these were called various things before, including `vectors`, `values`, and `vectorValues`. I left some called `values` since I didn't want to touch any more files.

I've renamed the method `vectorValue(int ord)` to `get(int ord)`, since there were entirely too many vectors, values, and vectorValues running around.

I also ensured that `KnnVectorValues.iterator()` always returns a unique instance. Previously we had been caching in a few places and returning a shared instance, which seems like a bug, although I don't think it caused any problems given our usage.
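
Putting these pieces together, usage now looks roughly like this (a sketch based on the descriptions above; the reader variable and field name are made up, and the exact signatures live in the patch):

```java
// Sketch: iterate a float vector field and fetch vectors by ordinal.
// Assumes `leafReader` is an open LeafReader and "vec" is a hypothetical field.
FloatVectorValues vectorValues = leafReader.getFloatVectorValues("vec");
FloatVectorValues.Floats vectors = vectorValues.vectors();     // the random-access "dictionary"
KnnVectorValues.DocIndexIterator it = vectorValues.iterator(); // now always a fresh instance
for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
  float[] vector = vectors.get(it.index()); // get(int ord) replaces vectorValue(int ord)
  // ... use vector; the returned array may be scratch reused across get() calls ...
}
```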

All in all, it's a lot of fussy non-functional change, but I do think the clarity makes it worth doing now, after ~5 years of evolution of these APIs.

@msokolov (Contributor, Author) commented Oct 8, 2024:

I'll merge to main soon and let tests noodle on this for a few days before backporting to 11.x. It seems benign, but it's easy to make an accidental slip in the code hurricane.

@benwtrent (Member) left a comment:


The API does look cleaner, but I am concerned about heap and performance during graph building.

`addAndEnsureDiversity` will create many copies. I would expect this PR to create many more `float[dim]` arrays than we did before.

Have you done any benchmarking or profiling on this?

Comment on:

```java
return new Bytes() {
  IndexInput input = slice.clone();
  ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize);
  ;
```

Suggested change (Member): delete the stray `;`.

Comment on lines +425 to +431:

```java
Floats rawVectors = rawVectorValues.vectors();
return new Floats() {
  @Override
  public float[] get(int ord) throws IOException {
    return rawVectors.get(ord);
  }
};
```

Suggested change (Member): the anonymous wrapper is redundant and can be replaced with

```java
return rawVectorValues.vectors();
```

Comment on lines +126 to +129:

```java
ByteBuffer byteBuffer = ByteBuffer.allocate(dimension);
byte[] binaryValue = byteBuffer.array();
IndexInput input = slice.clone();
float[] scoreCorrectionConstant = new float[1];
```

A Member commented: All of these should be private and final. There are other instances where you do something similar; let's make final whatever can be final, and private whatever can be private.

msokolov (Author) replied:

Personally, I don't care about making these final: the compiler already ensures that they are (effectively final), or it wouldn't let you use them in a closure like this. As for private, I don't think you can make local variables private, but maybe I am missing something.
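
(For illustration, a tiny standalone example of what the compiler enforces here; nothing Lucene-specific:)

```java
// Locals captured by a lambda or anonymous class must be *effectively final*,
// so an explicit `final` modifier would add no safety, only noise.
class CaptureDemo {
  Runnable demo() {
    int x = 42;                                // effectively final: captured below
    Runnable r = () -> System.out.println(x);  // compiles
    // x = 43;  // uncommenting this makes the capture above a compile error
    return r;
  }
}
```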

Comment on lines +100 to +101:

```java
byte[] scratch1 = new byte[vectorByteSize];
byte[] scratch2 = new byte[vectorByteSize];
```

A Member commented: Now we allocate scratch even if we don't need it; maybe this isn't that big of a deal? The same goes for all the other memory-segment scorers: we don't really need the scratch unless a memory segment isn't available.

msokolov (Author) replied:

Yeah, this just seemed cleaner than trying to make that conditional, and my assumption is these scorers are not created that often? Once per search? Although I guess when indexing that could be a lot (once per doc). The challenge here is that `getSegment()` is a member of the Supplier, while the Scorers are the ones that should be supplying the scratch data, so we can't easily create scratch lazily. I guess we could create some new abstraction in here to handle that, but it seems kind of messy.

Is there some way to know "up front" whether a MemorySegment is going to be produced? If we knew that, we could allocate scratch space or not based on that knowledge. I have to say I'm a little lost in this Java 21 MemorySegment code -- maybe @ChrisHegarty will weigh in and explain what the conditions are that lead to `segmentSliceOrNull` returning null?

@ChrisHegarty (Contributor) commented Oct 25, 2024:

We don't know during construction whether or not access to the vector data in the backing segment will always be available. The main reason is that a vector may span multiple memory segments (one MSIndexInput can be made up of several memory segments).

This change is not right: the scratch buffers were created per supplier, since we know from the threading model that that is safe. Creating scratch buffers per scorer will be too expensive.

@ChrisHegarty (Contributor) added:

I have another idea: maybe we just delegate the null cases to the other, on-heap scorer. That might be simpler; we do something similar in the native scorer we have in Elasticsearch. I can see how this looks in the branch, if you like?
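
(Roughly, the idea might look like this hypothetical sketch; `memorySegmentScorer` and `onHeapScorer` are made-up helpers, and the `segmentSliceOrNull` arguments are guessed:)

```java
// Hypothetical: use the memory-segment path when the whole vector lies in one
// segment, otherwise fall back to an on-heap scorer that owns scratch buffers.
public RandomVectorScorer scorer(int ord) throws IOException {
  MemorySegment segment = segmentSliceOrNull((long) ord * vectorByteSize, vectorByteSize);
  if (segment != null) {
    return memorySegmentScorer(segment); // zero-copy, no scratch needed
  }
  return onHeapScorer(ord); // scratch is allocated only on this fallback path
}
```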

msokolov (Author) replied:

I'm not sure I understand your idea, Chris, but if you want to have a go at it, by all means please do, and maybe I'll understand then :)

Comment on lines +102 to +105:

```java
public RandomVectorScorer scorer(int ord) throws IOException {
  ByteVectorValues.Bytes vectors1 = vectorValues.vectors();
  ByteVectorValues.Bytes vectors2 = vectorValues.vectors();
  return new RandomVectorScorer.AbstractRandomVectorScorer(vectorValues) {
```

A Member commented:

I would expect this to create way more garbage during HNSW graph building. The `RandomVectorScorerSupplier` is passed around to the diversity checking, which will now allocate new scratch space for each scorer that is created (and there will likely be many of them for every node we add). Before, we had a single set of scratch space created just once in the `RandomVectorScorerSupplier`.

I worry this will have a measurable performance impact and hurt heap usage.

msokolov (Author) replied:

Yeah, this seems like a bad consequence. Maybe we could switch from a supplier/scorer to a mutable scorer that can be "set" to a new vector as needed?

@msokolov (Contributor, Author) commented Oct 8, 2024:

Thanks for the insightful feedback -- yeah, I had been intending to do perf testing, and then got distracted by fascinating talks and kind of forgot about these concerns! Going through the code adding all these allocations, I was thinking most of them would be infrequent, but I agree that if we are creating scorers per node, that isn't going to be acceptable, so we need to find a way of sharing just enough but not too much. Anyway, there's no rush to get this in; I'll take some time to dig in.

@msokolov (Contributor, Author) commented Oct 9, 2024:

Hm, there is some functional problem with the change that yields terrible recall for quantized vectors. I'll dig in and fix it, and see if I can beef up the unit test coverage as well.

@benwtrent (Member) replied:

> Hm, there is some functional problem with the change that yields terrible recall for quantized vectors. I'll dig in and fix it, and see if I can beef up the unit test coverage as well.

This likely means that somewhere the scratch space isn't being handled appropriately :/

@jpountz (Contributor) left a comment:

I find it much cleaner this way too.

Unrelated: we should add support for absolute reads of `float[]` arrays and implement these dictionaries on top of a `RandomAccessInput` instead of an `IndexInput` (for a follow-up PR).

Comment on the migration-guide diff:

```
@@ -892,3 +892,7 @@ segments are rewritten either via `IndexWriter.forceMerge` or
### Vector values APIs switched to primarily random-access

`{Byte/Float}VectorValues` no longer inherit from `DocIdSetIterator`. Rather they extend a common class, `KnnVectorValues`, that provides a random access API (previously provided by `RandomAccessVectorValues`, now removed), and an `iterator()` method for retrieving a `DocIndexIterator`: an iterator which is a DISI that also provides an `index()` method. Therefore, any iteration over vector values must now be performed using the values' `iterator()`. Random access works as before, but does not require casting to `RandomAccessVectorValues`.

## Migration from Lucene 10.0 to Lucene 10.1
```
A Contributor commented:

Should it be at the top? This file has most recent versions at the top.

@msokolov (Contributor, Author) commented:
With the most recent commit I saw these luceneutil/knnPerfTest.py results:

1. baseline

```
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.816         0.294  1500000    10       6       32         50         no   341.37         110.92             1          1534.03
 0.811         0.308  1500000    10       6       32         50     7 bits   346.68          93.22             1          1906.16
 0.786         0.288  1500000    10       6       32         50     4 bits   346.28          89.15             1          1906.10
```

2. this change with defaults (no command-line flags)

```
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.817         0.304  1500000    10       6       32         50         no   344.11         111.70             1          1533.94
 0.812         0.231  1500000    10       6       32         50     7 bits   354.29          89.76             1          1906.16
 0.785         0.239  1500000    10       6       32         50     4 bits   352.37          89.01             1          1906.12
```

3. this change with the vector API enabled

```
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.817         0.247  1500000    10       6       32         50         no     0.00           0.17             1          1533.94
 0.812         0.282  1500000    10       6       32         50     7 bits     0.00           0.17             1          1906.16
 0.785         0.207  1500000    10       6       32         50     4 bits     0.00           0.17             1          1906.12
```

4. this change with the vector API and enable-native-access

```
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.817         0.246  1500000    10       6       32         50         no     0.00           0.17             1          1533.94
 0.812         0.290  1500000    10       6       32         50     7 bits     0.00           0.17             1          1906.16
 0.785         0.206  1500000    10       6       32         50     4 bits     0.00           0.18             1          1906.12
```
So I think there is some slowdown in the quantized indexing. I think we need to find a solution for the over-allocations due to having moved this logic from ScorerSupplier to Scorer. The best idea I have is to make Scorers mutable and supply them with new target vectors as needed. WDYT?

@jpountz (Contributor) commented Oct 25, 2024:

Can you clarify which allocation is the problematic one, and where it's done on the indexing path?

@msokolov (Contributor, Author) replied:

> Can you clarify which allocation is the problematic one, and where it's done on the indexing path?

See Ben's comments from ~2 weeks ago where he calls out the problem of over-allocation. During indexing we call `HnswGraphBuilder.diversityCheck()` multiple times for each document (graph node) we insert, and in each of those calls we create scorers multiple times; this is an n^2 algorithm (with n ~ the number of neighbors). I'm proposing that instead of calling `scorer()` and creating a new scorer each time (which may in turn create a MemorySegment or a scratch array of some sort), we instead have a mutable Scorer that can accept a new target vector.

@ChrisHegarty (Contributor) commented Oct 25, 2024:

> that we instead have a mutable Scorer that can accept a new target vector.

Yes, that is something I've noodled on for a while now too: a scorer that accepts two ords and returns the score. This would save gigabytes of garbage, which can be seen in the blunders.io output of the nightly luceneutil runs, e.g. https://blunders.io/jfr-demo/indexing-1kb-vectors-2024.10.24.18.04.28/top_allocators. Though you don't have to do it all in this PR.

(screenshot: top allocators from the nightly indexing profile)

@benwtrent (Member) commented:
I think a "merging scorer" would be good. The only place the "scorer supplier" is used is during graph building.

My initial concern with a "mutable scorer" is that it would also make the single scorer mutable, which seems weird to me. But I am happy to revisit this, especially since it's blocking a nice refactor.

Given that all this random scorer stuff is internal API, we can do whatever is best with what we have.

@msokolov (Contributor, Author) commented:

Yes, OK, I now see quite a bit of this is a "preexisting condition" and maybe not exacerbated by this change. We are still creating more scratch arrays than we did before, though, I think: previously we would `copy()` the VectorValues in a caller and allocate a new scratch array there, whereas now we have pushed the "create new scratch array" step down into Scorer creation, which happens many more times than we would previously have copied the VectorValues, so we are creating and destroying many more of these scratch arrays. Maybe this is acceptable and we can iterate in a further cleanup? Let me try a few more benchmarking runs and be a little clearer about the impact on query and indexing times. I'd also like to report allocations, but I'm not sure how to do that with luceneutil.

@msokolov (Contributor, Author) commented:

Maybe we could add a RandomVectorScorer.setTarget(int node) method that would only be implemented by the Scorers returned from ScorerSuppliers?
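
One hypothetical shape for this, for discussion (none of these names exist in the PR; the two-ord variant is the alternative Chris mentioned above):

```java
// Hypothetical sketch: a retargetable scorer, so graph building can reuse one
// scorer (and its scratch arrays) per thread instead of allocating a new one
// for every diversity check.
interface MutableRandomVectorScorer {
  void setTarget(int node) throws IOException; // rebind to a new target ordinal
  float score(int ord) throws IOException;     // score current target vs. ord

  // Alternative shape: score two ordinals directly, with scratch owned by
  // this (per-thread) instance:
  // float score(int ord1, int ord2) throws IOException;
}
```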
