Introduce derived vector source via stored fields #2449

jmazanec15 · 2025-01-27T19:08:45Z

Description

This PR introduces derived source for flat vector field mapper via the approach outlined in #2377.

First, some quick background for reviewers: In OpenSearch, the source refers to a per-document StoredField (key = "_source") that stores the JSON representation of the document. This field is not searchable; instead, it is used during the fetch phase to return fields of documents that matched the search request to end users. For instance, a user may search an index using a k-NN search over a field called passage_embedding, but they really only want to get back the field passage_text. During fetch, the _source field of each of the k nearest neighbors would be fetched, and the passage_text field parsed out and sent back. In addition, _source may be used to reindex documents and/or to run update-by-query and delete-by-query operations.

The goal of this PR/feature is to remove the vectors from this _source field on disk, but inject them back from other data formats into the _source when needed.

To implement this, from a high level, we introduce a custom StoredFieldsFormat. On write, this StoredFieldsFormat removes the vector fields from source. On read, it puts them back if necessary.
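As a rough illustration of the write-side removal and read-side injection described above, the sketch below models _source as a plain map using only the JDK. The names DerivedSourceSketch, stripVectors, and injectVectors are hypothetical, and the in-memory vectorStore map is an assumption standing in for whatever vector format actually holds the data. It is not the StoredFieldsFormat implementation from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the write and read halves of a derived-source format.
final class DerivedSourceSketch {

    // Write path: return a copy of the source without the vector fields.
    static Map<String, Object> stripVectors(Map<String, Object> source, Set<String> vectorFields) {
        Map<String, Object> stored = new HashMap<>(source);
        stored.keySet().removeAll(vectorFields);
        return stored;
    }

    // Read path: put each vector back from the separate (assumed) vector store.
    static Map<String, Object> injectVectors(Map<String, Object> stored, Map<String, float[]> vectorStore) {
        Map<String, Object> rebuilt = new HashMap<>(stored);
        for (Map.Entry<String, float[]> entry : vectorStore.entrySet()) {
            // Copy the array so the rebuilt source does not alias the store.
            rebuilt.put(entry.getKey(), entry.getValue().clone());
        }
        return rebuilt;
    }
}
```

In this toy model the stored map is what would land on disk, and the rebuilt map is what a fetch would see.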

In its current state, this PR supports:

Flat (non object) vector fields
Single level nested vector fields (still cleaning up a bit, but it passed tests in DerivedSourceIT)
Arbitrarily nested object fields

Here are the remaining todo items:

Nested support. I initially had some support for nesting but this needs to be rethought
Integrate optimization for skipping source reconstruction (see Propagate includes and excludes from fetchSourceContext to FieldsVisitor OpenSearch#17080)
Introduce feature flag settings
Generic object field types
Confirm flat mappings work as expected.

Some leftovers for after this PR. Leaving these out for now because the feature is experimental for 2.19 and properly isolated behind an index setting:

Increase test coverage - need to add UTs
Performance testing - need to show performance of reindexing and updates.
Look into optimization

Related Issues

#2377

Check List

New functionality includes testing.
New functionality has been documented.
Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Generates the vector source in the source field from the KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to our existing custom codec. Work is still WIP but rootobject is working okay. Signed-off-by: John Mazanec <jmazane@amazon.com>

Vikasht34

Took a first pass. A shoutout for:

1. Clean interfaces.
2. Good use of abstraction on the Visitor, and clean injection.
3. Good code modularity.
4. Functional interface and supplier.

Vikasht34 · 2025-01-28T23:50:49Z

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java

+    @Override
+    public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo segmentInfo, FieldInfos fieldInfos, IOContext ioContext)
+        throws IOException {
+        List<FieldInfo> derivedVectorFields = new ArrayList<>();


Can we short-circuit the code to return early if the setting is disabled?

In the reader format, due to SPI, we do not have access to KNNSettings, so we can't check whether the setting is set. I added a short circuit for the case where no fields have the attribute: https://github.com/opensearch-project/k-NN/pull/2449/files#diff-f8a9ebad33a21a479b30eb0dfa0bcc6aa7ddfcb6c464eca0371b60d3c3a38e77R49

Vikasht34 · 2025-01-28T23:51:50Z

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java

+    @Override
+    public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo segmentInfo, FieldInfos fieldInfos, IOContext ioContext)
+        throws IOException {
+        List<FieldInfo> derivedVectorFields = new ArrayList<>();


Should we lazily create derivedVectorFields? Otherwise we will allocate an empty list even though the setting is disabled.

Sure, I can add that in the for loop below.
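The short circuit and the lazy allocation discussed in the two threads above could look roughly like the following. This is a minimal sketch under stated assumptions: FieldMeta and DerivedFieldScan are hypothetical stand-ins for the per-field metadata and the format class, and only the attribute key string is taken from KNNConstants in this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for per-field metadata with string attributes.
final class FieldMeta {
    final String name;
    final Map<String, String> attributes;

    FieldMeta(String name, Map<String, String> attributes) {
        this.name = name;
        this.attributes = attributes;
    }
}

final class DerivedFieldScan {
    // Attribute key as defined in KNNConstants in this PR.
    static final String DERIVED_KEY = "knn-derived-source-enabled";

    // Returns the derived vector fields, allocating the list only on the first match.
    static List<FieldMeta> findDerivedFields(List<FieldMeta> fields) {
        List<FieldMeta> derived = null;
        for (FieldMeta field : fields) {
            if ("true".equals(field.attributes.get(DERIVED_KEY))) {
                if (derived == null) {
                    derived = new ArrayList<>();
                }
                derived.add(field);
            }
        }
        return derived == null ? Collections.<FieldMeta>emptyList() : derived;
    }

    // Short circuit: when nothing is derived, the caller can return the plain delegate reader.
    static boolean canUseDelegateDirectly(List<FieldMeta> fields) {
        return findDerivedFields(fields).isEmpty();
    }
}
```

Because the check relies only on field attributes, it needs no access to index settings, which matches the SPI constraint mentioned in the reply.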

Vikasht34 · 2025-01-28T23:56:14Z

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java

+    private boolean shouldInject = true;
+
+    @Override
+    public void document(int docId, StoredFieldVisitor storedFieldVisitor) throws IOException {


The document method directly couples vector injection with DerivedSourceVectorInjector. This makes it harder to extend or modify the injection logic. Could we abstract this or introduce some looser coupling here?

Not sure I understand completely. We do need to create a custom stored fields visitor in order to get access to the source, so I needed to pass it in there. I could add the logic around the FieldsVisitor casting to the DerivedSourceStoredFieldVisitor.

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java

shatejas

Reviewed the reader

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java

src/main/java/org/opensearch/knn/index/codec/derivedsource/DerivedSourceVectorInjector.java

shatejas · 2025-01-29T00:15:44Z

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java

+
+    @Override
+    public StoredFieldsReader clone() {
+        return new DerivedSourceStoredFieldsReader(delegate.clone(), derivedSourceVectorInjector);


Will clone affect the refcounts for delegate in any way? Are we sure it will be closed when it's supposed to be?

let me double check this.

This class is supposed to heavily delegate. So, when a certain method is called, we want to call the delegate's method. Hence, I believe this is correct

similar example: https://github.com/apache/lucene/blob/branch_9_12/lucene/core/src/java/org/apache/lucene/index/SortingCodecReader.java#L442-L464
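The delegation pattern described in this thread can be sketched without Lucene as below. SimpleReader, BaseReader, and WrappingReader are hypothetical types, and the method is named copy() rather than clone() only to avoid Java's Object.clone() rules. The sketch shows the wrapper cloning and closing its delegate; it does not model Lucene's actual reference counting.

```java
// Hypothetical minimal reader interface; Lucene's StoredFieldsReader is richer.
interface SimpleReader extends AutoCloseable {
    String document(int docId);

    SimpleReader copy();

    @Override
    void close();
}

// Hypothetical underlying reader that records whether it was closed.
final class BaseReader implements SimpleReader {
    boolean closed = false;

    public String document(int docId) {
        return "doc-" + docId;
    }

    public SimpleReader copy() {
        return new BaseReader();
    }

    public void close() {
        closed = true;
    }
}

// Wrapper that forwards every call, including copy() and close(), to its delegate.
final class WrappingReader implements SimpleReader {
    final SimpleReader delegate;

    WrappingReader(SimpleReader delegate) {
        this.delegate = delegate;
    }

    public String document(int docId) {
        return delegate.document(docId);
    }

    public SimpleReader copy() {
        // The copy wraps a copy of the delegate, so each wrapper owns exactly one delegate.
        return new WrappingReader(delegate.copy());
    }

    public void close() {
        delegate.close();
    }
}
```

In this model, closing a copy closes only the copied delegate and leaves the original wrapper's delegate open.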

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java

src/main/java/org/opensearch/knn/index/codec/derivedsource/DerivedSourceVectorInjector.java

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java

shatejas · 2025-01-29T01:42:35Z

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/KNN9120Codec.java

+    @Override
+    public StoredFieldsFormat storedFieldsFormat() {
+        DerivedSourceReadersSupplier derivedSourceReadersSupplier = new DerivedSourceReadersSupplier(
+            (segmentReadState) -> knnVectorsFormat().fieldsReader(segmentReadState),


nit: If this is for the Lucene engine, you will also open index inputs for graph files, which might be a bit unnecessary. You might be able to do:

() -> new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer()).fieldsReader

The only caveat is that when changing Lucene versions it might be hard to keep track.

Will there be any penalty incurred for opening an IndexInput but not reading from it? My hesitancy to use Lucene99FlatVectorsFormat is that I worry we don't explicitly mention that's how we are storing vectors. So I'm leaning towards leaving it how it is.

I tried this change and ended up getting an error - so might hold off:

[2025-01-28T20:09:15,944][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-0] [original-enable-testnestedmultidocbasiczcfrqz][0] marking and sending shard failed due to [shard failure, reason [refresh failed source[schedule]]]
org.apache.lucene.index.CorruptIndexException: Problem reading index from store(ByteSizeCachingDirectory(HybridDirectory@/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/GfG-oBsOSoy75aVhmKoErg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@47495223)) (resource=store(ByteSizeCachingDirectory(HybridDirectory@/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/GfG-oBsOSoy75aVhmKoErg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@47495223)))
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:165) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:96) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:545) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:607) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:72) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:52) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:433) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:413) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1865) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1844) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4705) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:1300) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1444) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955) [opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.io.FileNotFoundException: No sub-file with id .vemf found in compound file "_0.cfs" (fileName=_0.vemf files: [_Lucene912_0.tmd, _Lucene912_0.psm, .fnm, _Lucene90_0.dvd, .kdd, _Lucene912_0.tip, .kdm, _Lucene90_0.dvm, _Lucene912_0.tim, _Lucene912_0.doc, .kdi, _NativeEngines990KnnVectorsFormat_0.vec, .fdm, .fdx, _NativeEngines990KnnVectorsFormat_0.vemf, .fdt])
    at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.openInput(Lucene90CompoundReader.java:170) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.opensearch.knn.index.codec.KNN80Codec.KNN80CompoundDirectory.openInput(KNN80CompoundDirectory.java:50) ~[?:?]
    at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsReader.readMetadata(Lucene99FlatVectorsReader.java:90) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsReader.<init>(Lucene99FlatVectorsReader.java:65) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat.fieldsReader(Lucene99FlatVectorsFormat.java:95) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
    at org.opensearch.knn.index.codec.KNN9120Codec.KNN9120Codec.lambda$storedFieldsFormat$0(KNN9120Codec.java:74) ~[?:?]
    at org.opensearch.knn.index.codec.derivedsource.DerivedSourceReadersSupplier.getReaders(DerivedSourceReadersSupplier.java:36) ~[?:?]
    at org.opensearch.knn.index.codec.derivedsource.DerivedSourceVectorInjector.<init>(DerivedSourceVectorInjector.java:54) ~[?:?]
    at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.createDerivedSourceVectorInjector(DerivedSourceStoredFieldsReader.java:63) ~[?:?]
    at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.<init>(DerivedSourceStoredFieldsReader.java:59) ~[?:?]
    at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.<init>(DerivedSourceStoredFieldsReader.java:44) ~[?:?]
    at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsFormat.fieldsReader(DerivedSourceStoredFieldsFormat.java:57) ~[?:?]

Makes sense, thanks for trying. Maybe add a TODO for now.

Will there be any penalty incurred for opening indexInput but not reading from it?

If it's already open, then no, since it's mmapped. Otherwise it will. I was just trying to avoid it.

Signed-off-by: John Mazanec <jmazane@amazon.com>

shatejas · 2025-01-29T17:22:07Z

...ain/java/org/opensearch/knn/index/codec/derivedsource/RootPerFieldDerivedVectorInjector.java

+    public void inject(int docId, Map<String, Object> sourceAsMap) throws IOException {
+        KNNVectorValues<?> vectorValues = vectorValuesSupplier.get();
+        if (vectorValues.docId() == docId || vectorValues.advance(docId) == docId) {
+            sourceAsMap.put(fieldInfo.name, vectorValues.getVector());


Do we need a conditional clone of the vector here?

I think in general I shouldn't need to clone the vector because it gets serialized (and copied) later. I'm going to remove it from above.

@navneet1v can you double check my reasoning here?

@shatejas and @jmazanec15 yes, we will need a conditional clone of the vector here, because the vector values will always return the vector with the same reference. And for a Map<String, Object> sourceAsMap, the map will just store the reference to the vector and won't do any serialization during map insertion.

Discussed with @navneet1v - need to clone
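The aliasing problem behind this decision can be reproduced with plain JDK types. In the sketch below, ReusingVectorValues is a hypothetical iterator that overwrites one shared array on every advance, which is the behavior described in the comment above; it is an assumption-level model, not the KNNVectorValues class itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical iterator that reuses one backing array for every document.
final class ReusingVectorValues {
    private final float[][] data;
    private final float[] buffer;

    ReusingVectorValues(float[][] data) {
        this.data = data;
        this.buffer = new float[data[0].length];
    }

    // Moves to the target doc and overwrites the shared buffer in place.
    int advance(int target) {
        System.arraycopy(data[target], 0, buffer, 0, buffer.length);
        return target;
    }

    // Always returns the same array instance.
    float[] getVector() {
        return buffer;
    }
}

final class CloneDemo {
    // Collects one vector per doc; with clone=false every map entry aliases the same array.
    static Map<Integer, float[]> collect(ReusingVectorValues values, int docCount, boolean clone) {
        Map<Integer, float[]> out = new HashMap<>();
        for (int doc = 0; doc < docCount; doc++) {
            values.advance(doc);
            float[] vector = values.getVector();
            out.put(doc, clone ? vector.clone() : vector);
        }
        return out;
    }
}
```

Without the clone, the entry stored for doc 0 silently changes once the iterator advances to doc 1, because the map holds only a reference.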

Signed-off-by: John Mazanec <jmazane@amazon.com>

src/main/java/org/opensearch/knn/index/KNNSettings.java

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java

navneet1v · 2025-01-29T17:49:09Z

src/main/java/org/opensearch/knn/index/vectorvalues/KNNVectorValuesFactory.java

@@ -72,6 +74,37 @@ public static <T> KNNVectorValues<T> getVectorValues(final FieldInfo fieldInfo,
        return getVectorValues(FieldInfoExtractor.extractVectorDataType(fieldInfo), vectorValuesIterator);
    }

+    /**
+     * Returns a {@link KNNVectorValues} for the given {@link FieldInfo} and {@link LeafReader}


[nit-pick] The Javadoc is not correct.

navneet1v · 2025-01-29T17:49:53Z

src/main/java/org/opensearch/knn/index/vectorvalues/KNNVectorValuesFactory.java

+     */
+    public static <T> KNNVectorValues<T> getVectorValues(
+        final FieldInfo fieldInfo,
+        final DocValuesProducer docValuesProducer,


Do we need docValuesProducer here, since this feature will only be applicable to newer indices?

Did we completely get rid of the ability to use docValuesProducer for vectors? I thought it could still be possible.

Since doc values are used only for older indices, and this setting is never going to be enabled for those older indices, we don't need this.

navneet1v · 2025-01-29T17:51:55Z

src/main/java/org/opensearch/knn/index/mapper/LuceneFieldMapper.java

@@ -117,6 +128,12 @@ private LuceneFieldMapper(
            this.vectorFieldType = null;
        }

+        if (isDerivedSourceEnabled) {
+            this.fieldType = new FieldType(this.fieldType);


Why do we need to copy fieldType for LuceneFieldMapper?

It gets frozen in Lucene:

k-NN/src/main/java/org/opensearch/knn/index/VectorDataType.java

Line 45 in 135f7eb

return KnnByteVectorField.createFieldType(dimension / Byte.SIZE, VectorSimilarityFunction.EUCLIDEAN);

So we need to copy it and then add the attribute. I'll add a comment.

Oh yeah, my bad. For Lucene we freeze the field early.
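The copy-then-modify step can be illustrated with a small stand-in class. FreezableType below is hypothetical and only mimics the general idea of a type that rejects changes once frozen while its copy constructor yields an unfrozen copy; the attribute key is the one defined in KNNConstants in this PR, and the exact behavior of Lucene's FieldType should be checked against its Javadoc.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a field type that can be frozen against further changes.
final class FreezableType {
    private final Map<String, String> attributes = new HashMap<>();
    private boolean frozen = false;

    FreezableType() {}

    // Copy constructor: copies the attributes, but the copy starts out unfrozen.
    FreezableType(FreezableType other) {
        this.attributes.putAll(other.attributes);
    }

    void freeze() {
        frozen = true;
    }

    void putAttribute(String key, String value) {
        if (frozen) {
            throw new IllegalStateException("this type is frozen and cannot be changed");
        }
        attributes.put(key, value);
    }

    String getAttribute(String key) {
        return attributes.get(key);
    }
}
```

A mapper that receives an already-frozen instance would therefore copy it first and set the derived-source attribute on the copy.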

navneet1v · 2025-01-29T17:54:55Z

src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java

@@ -363,7 +370,8 @@ public Mapper.Builder<?> parse(String name, Map<String, Object> node, ParserCont
                modelDaoSupplier.get(),
                parserContext.indexVersionCreated(),
                null,
-                null
+                null,
+                KNNSettings.isKNNDerivedSourceEnabled(parserContext.getSettings())


[Question]
Should we enable this check only for specific versions, or will this setting be false for all the older versions?

I believe this will be false for all older versions. Will validate manually by de-registering the setting, creating an index, and then checking this value.

navneet1v · 2025-01-29T17:59:14Z

...ain/java/org/opensearch/knn/index/codec/derivedsource/RootPerFieldDerivedVectorInjector.java

+    public void inject(int docId, Map<String, Object> sourceAsMap) throws IOException {
+        KNNVectorValues<?> vectorValues = vectorValuesSupplier.get();
+        if (vectorValues.docId() == docId || vectorValues.advance(docId) == docId) {
+            sourceAsMap.put(fieldInfo.name, vectorValues.getVector());


@shatejas and @jmazanec15 yes, we will need a conditional clone of the vector here, because the vector values will always return the vector with the same reference. And for a Map<String, Object> sourceAsMap, the map will just store the reference to the vector and won't do any serialization during map insertion.

navneet1v · 2025-01-29T18:37:51Z

.../java/org/opensearch/knn/index/codec/derivedsource/PerFieldDerivedVectorInjectorFactory.java

+/**
+ * Factory for creating {@link PerFieldDerivedVectorInjector} instances.
+ */
+public class PerFieldDerivedVectorInjectorFactory {


Can we make this package-private?

will change

navneet1v · 2025-01-29T18:57:59Z

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java

+     */
+    @Override
+    public StoredFieldsReader getMergeInstance() {
+        try {


Can we just return this instance? Why are we creating a new instance?

The issue is that the reader is used for merging. So, if we return this, it will add the vectors back into the source.

Let's add this as a Javadoc here. This seems like a good case where someone needs to know why we need to create a merge instance.

I think this is covered in the above comment.
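To make the reasoning in this thread concrete, here is a minimal sketch of a reader whose merge instance skips injection. InjectingReaderSketch is a hypothetical class, the two maps are assumptions standing in for the on-disk source and the vector storage, and only the shouldInject flag name is borrowed from the diff above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader that can rebuild the full source or return it exactly as stored.
final class InjectingReaderSketch {
    private final Map<String, Object> storedSource; // source as written, vectors removed
    private final Map<String, float[]> vectors;     // vectors held outside the source
    private final boolean shouldInject;

    InjectingReaderSketch(Map<String, Object> storedSource, Map<String, float[]> vectors, boolean shouldInject) {
        this.storedSource = storedSource;
        this.vectors = vectors;
        this.shouldInject = shouldInject;
    }

    // Fetch path: returns the source, with vectors added back only when injection is on.
    Map<String, Object> document() {
        Map<String, Object> result = new HashMap<>(storedSource);
        if (shouldInject) {
            result.putAll(vectors);
        }
        return result;
    }

    // Merge path: a separate instance that never injects, so merged segments stay vector-free.
    InjectingReaderSketch getMergeInstance() {
        return new InjectingReaderSketch(storedSource, vectors, false);
    }
}
```

Returning this from getMergeInstance() would keep shouldInject set to true, which is the behavior the author wanted to avoid during merges.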

src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsWriter.java

src/main/java/org/opensearch/knn/index/codec/derivedsource/DerivedSourceVectorInjector.java

...n/java/org/opensearch/knn/index/codec/derivedsource/NestedPerFieldDerivedVectorInjector.java

navneet1v · 2025-01-29T19:31:13Z

...n/java/org/opensearch/knn/index/codec/derivedsource/NestedPerFieldDerivedVectorInjector.java

+        if (fieldInfo.hasNorms() && derivedSourceReaders.getNormsProducer() != null) { // the field indexes norms
+            iterator = derivedSourceReaders.getNormsProducer().getNorms(fieldInfo);
+        } else if (fieldInfo.getVectorDimension() != 0 && derivedSourceReaders.getKnnVectorsReader() != null) { // the field indexes vectors
+            switch (fieldInfo.getVectorEncoding()) {
+                case FLOAT32:
+                    iterator = derivedSourceReaders.getKnnVectorsReader().getFloatVectorValues(fieldInfo.name);
+                    break;
+                case BYTE:
+                    iterator = derivedSourceReaders.getKnnVectorsReader().getByteVectorValues(fieldInfo.name);
+                    break;
+            }
+        } else if (fieldInfo.getDocValuesType() != DocValuesType.NONE && derivedSourceReaders.getDocValuesProducer() != null) { // the field


Why would we have 3 types of field indexes when this is just a vector field?

This is not necessarily the vector field. This is for a field in one child before adding back the vector. We are trying to figure out which docId that child maps to. To do it, I get the first doc on/after the offset that contains that field.

So, this is basically a field-exists check - https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java
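The lookup described above, finding the first doc at or after an offset that has a given field, can be sketched over a sorted array of doc IDs. FirstDocWithField is a hypothetical helper, the array is an assumption standing in for an iterator over norms, vectors, or doc values, and the sentinel value follows the convention of Lucene's DocIdSetIterator.NO_MORE_DOCS.

```java
import java.util.Arrays;

// Hypothetical helper: locate the first doc at or after an offset that contains a field.
final class FirstDocWithField {
    // Sentinel for "no such doc", same value as DocIdSetIterator.NO_MORE_DOCS in Lucene.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // docsWithField must be sorted ascending and contain no duplicates.
    static int firstDocAtOrAfter(int[] docsWithField, int offset) {
        int index = Arrays.binarySearch(docsWithField, offset);
        if (index < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1.
            index = -index - 1;
        }
        return index < docsWithField.length ? docsWithField[index] : NO_MORE_DOCS;
    }
}
```

Whichever index structure the field happens to use, the question being answered is the same, which is why the injector checks several structures in turn.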

Signed-off-by: John Mazanec <jmazane@amazon.com>

build.gradle

src/test/java/org/opensearch/knn/integ/DerivedSourceIT.java

Signed-off-by: John Mazanec <jmazane@amazon.com>

navneet1v · 2025-01-29T22:23:05Z

src/main/java/org/opensearch/knn/common/KNNConstants.java

+
+    public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_KEY = "knn-derived-source-enabled";
+    public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_TRUE_VALUE = "true";
+    public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_FALSE_VALUE = "false";


[nit-pick] this constant is not used.

@VijayanB mentioned this. Will remove in another review

VijayanB · 2025-01-29T23:16:51Z

src/main/java/org/opensearch/knn/index/mapper/FlatVectorFieldMapper.java

        );
        // setting it explicitly false here to ensure that when flatmapper is used Lucene based Vector field is not created.
        this.useLuceneBasedVectorField = false;
        this.perDimensionValidator = selectPerDimensionValidator(vectorDataType);
        this.fieldType = new FieldType(KNNVectorFieldMapper.Defaults.FIELD_TYPE);
        this.fieldType.setDocValuesType(DocValuesType.BINARY);
+        if (isDerivedSourceEnabled) {


Are we adding it only if it is true? Is that intentional? If so, why do we have DERIVED_VECTOR_FIELD_ATTRIBUTE_FALSE_VALUE?

Good point. I'm going to remove FALSE in another PR.

VijayanB · 2025-01-29T23:19:52Z

src/main/java/org/opensearch/knn/index/vectorvalues/KNNVectorValuesFactory.java

+        } else if (docValuesProducer != null) {
+            docIdSetIterator = docValuesProducer.getBinary(fieldInfo);
+        } else {
+            throw new IllegalArgumentException("Field does not have vector values and DocValues");


nit: Field should have either vector values or DocValues

Will update in another PR

shatejas · 2025-01-29T23:43:25Z

Looks good overall

shatejas · 2025-01-29T19:51:12Z

...n/java/org/opensearch/knn/index/codec/derivedsource/NestedPerFieldDerivedVectorInjector.java

+        // Setup the iterator. Return if no parent
+        String childFieldName = ParentChildHelper.getChildField(childFieldInfo.name);
+        String parentFieldName = ParentChildHelper.getParentField(childFieldInfo.name);
+        if (parentFieldName == null) {


nit: out of curiosity, is this just being defensive or are we handling a case here?

being defensive

opensearch-trigger-bot · 2025-01-29T23:51:30Z

The backport to main failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-main main
# Navigate to the new working tree
cd .worktrees/backport-main
# Create a new branch
git switch --create backport/backport-2449-to-main
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 59b8e6bb24124c4f624e4711cf11f89948ffc594
# Push it to GitHub
git push --set-upstream origin backport/backport-2449-to-main
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-main

Then, create a pull request where the base branch is main and the compare/head branch is backport/backport-2449-to-main.

Generates the vector source in the source field from the KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to our existing custom codec. Currently, feature is experimental and behind a feature flag via index setting. In the future, we need to iterate to improve performance and stability for nested/object portions. Signed-off-by: John Mazanec <jmazane@amazon.com> (cherry picked from commit 59b8e6b)

jmazanec15 requested review from heemin32, navneet1v, VijayanB, vamshin, naveentatikonda, junqiu-lei, martin-gaievski, ryanbogan, luyuncheng, shatejas and 0ctopus13prime as code owners January 27, 2025 19:08

jmazanec15 added the backport main and Features labels Jan 27, 2025

jmazanec15 added 2 commits January 28, 2025 07:42

Initial attempt at first level nesting

31de672

Signed-off-by: John Mazanec <jmazane@amazon.com>

jmazanec15 force-pushed the derived-source-vectors-2.x branch from c61bd14 to c7db83e Compare January 28, 2025 15:43

Add the include/exclude optimization

927b6de

Signed-off-by: John Mazanec <jmazane@amazon.com>

jmazanec15 force-pushed the derived-source-vectors-2.x branch from c7db83e to 927b6de Compare January 28, 2025 16:01

jmazanec15 added 6 commits January 28, 2025 13:31

Fix bugs and improve tests

58164bb

Signed-off-by: John Mazanec <jmazane@amazon.com>

Remove old test

9c8394e

Signed-off-by: John Mazanec <jmazane@amazon.com>

Cleanup

c8ff878

Signed-off-by: John Mazanec <jmazane@amazon.com>

Move derived to default as false

85a4c09

Signed-off-by: John Mazanec <jmazane@amazon.com>

Fix up feature flag

939cec1

Signed-off-by: John Mazanec <jmazane@amazon.com>

Fix issue with fieldtype for lucene

4bc2828

Signed-off-by: John Mazanec <jmazane@amazon.com>

Vikasht34 reviewed Jan 29, 2025

shatejas reviewed Jan 29, 2025

Address initial comments

74387f9

Signed-off-by: John Mazanec <jmazane@amazon.com>

jmazanec15 force-pushed the derived-source-vectors-2.x branch from a6fc86e to 74387f9 Compare January 29, 2025 04:22

Add partial support for object mappings

d449151

Signed-off-by: John Mazanec <jmazane@amazon.com>

jmazanec15 added 2 commits January 29, 2025 05:23

Ignore its for now

c3483df

Signed-off-by: John Mazanec <jmazane@amazon.com>

Fix issues with object type

399d281

Signed-off-by: John Mazanec <jmazane@amazon.com>

jmazanec15 force-pushed the derived-source-vectors-2.x branch from 6e104f2 to 399d281 Compare January 29, 2025 17:14

shatejas reviewed Jan 29, 2025

jmazanec15 added 2 commits January 29, 2025 09:31

Remove conditional clone vector

14d3ead

Signed-off-by: John Mazanec <jmazane@amazon.com>

Conditionally clone vectors

1fe7302

Signed-off-by: John Mazanec <jmazane@amazon.com>

navneet1v reviewed Jan 29, 2025

jmazanec15 added 2 commits January 29, 2025 11:48

Make setting completely unmodifiable

b3aba03

Signed-off-by: John Mazanec <jmazane@amazon.com>

Update based on feedback

fd32a12

Signed-off-by: John Mazanec <jmazane@amazon.com>

jmazanec15 commented Jan 29, 2025

build.gradle

navneet1v reviewed Jan 29, 2025

src/test/java/org/opensearch/knn/integ/DerivedSourceIT.java

Minor updates

6462c4b

Signed-off-by: John Mazanec <jmazane@amazon.com>

navneet1v reviewed Jan 29, 2025

VijayanB reviewed Jan 29, 2025

navneet1v approved these changes Jan 29, 2025

shatejas approved these changes Jan 29, 2025

jmazanec15 merged commit 59b8e6b into opensearch-project:2.x Jan 29, 2025
100 checks passed

jmazanec15 added the backport 2.19 label Jan 30, 2025

opensearch-trigger-bot bot mentioned this pull request Jan 30, 2025

[Backport 2.19] Introduce derived vector source via stored fields #2467

Merged

jmazanec15 mentioned this pull request Feb 6, 2025

Backport derived source feature to main #2500

Merged

5 tasks
