Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce derived vector source via stored fields #2449

Merged

Conversation

jmazanec15
Copy link
Member

@jmazanec15 jmazanec15 commented Jan 27, 2025

Description

This PR introduces derived source for flat vector field mapper via the approach outlined in #2377.

First, for some quick background for reviewers: In OpenSearch, the source refers to a per-document StoredField (key = "_source") that stores the json representation of the document. This field is not searchable - instead, it used during the fetch phase in order to return fields of documents that matched the search request to end users. For instance, a user may search an index using a knn search over a field called passage_embedding, but they really only want to get back the field passage_text. During fetch, the source field of the k-Nearest Neighbors would be fetched and the passage_text field parsed out and sent back. In addition to this, _source may used to reindex documents and or update and delete by queries.

The goal of this PR/feature is to remove the vectors from this _source field on disk, but inject them back from other data formats into the _source when needed.

To implement this, from a high level, we introduce a custom StoredFieldsFormat. On write, this StoredFieldsFormat removes the vector fields from source. On read, it puts them back if necessary.

In its current state, this PR supports:

  1. Flat (non object) vector fields
  2. Single level nested vector fields (still cleaning up a bit but it passed tests in DerivedSourceIT
  3. Arbitrarily nested object fields

Here are the following todo items:

Some leftovers after this PR. Leaving out now because feature is experimental for 2.19 and properly isolated behind index setting

  • Increase test coverage - need to add uTs
  • Performance testing - need to show performance of reindexing and updates.
  • Look into optimization

Related Issues

#2377

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Generates the vector source in the source field from the
KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to
our existing custom codec.

Work is still WIP but rootobject is working okay.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 force-pushed the derived-source-vectors-2.x branch from c61bd14 to c7db83e Compare January 28, 2025 15:43
Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 force-pushed the derived-source-vectors-2.x branch from c7db83e to 927b6de Compare January 28, 2025 16:01
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Copy link
Collaborator

@Vikasht34 Vikasht34 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took the first Pass , A really good shoutout to

  1. Clean Interfaces .
  2. Good use of abstraction on Visitor and clean injection.
    3.Code has good modularity .
  3. Function interface and supplier.

@Override
public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo segmentInfo, FieldInfos fieldInfos, IOContext ioContext)
throws IOException {
List<FieldInfo> derivedVectorFields = new ArrayList<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we short circuit the code to return early in case of setting is disabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the reader format, due to SPI, we do not have access to KNNSettings. So we cant check if the setting is set or not. I add a shortcircuit if no fields have the attribute: https://github.com/opensearch-project/k-NN/pull/2449/files#diff-f8a9ebad33a21a479b30eb0dfa0bcc6aa7ddfcb6c464eca0371b60d3c3a38e77R49

@Override
public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo segmentInfo, FieldInfos fieldInfos, IOContext ioContext)
throws IOException {
List<FieldInfo> derivedVectorFields = new ArrayList<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we lazy create derivedVectorFields else we will have empty list of array even though setting is disabled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure I can add that in the below for loop.

private boolean shouldInject = true;

@Override
public void document(int docId, StoredFieldVisitor storedFieldVisitor) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The document method directly couples vector injection with DerivedSourceVectorInjector. This makes it harder to extend or modify the injection logic. Could we abstracts or implement some loose coupling here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand completely - we do need to create a custom stored fields visitor in order to get access to the source, so I needed to pass it in there. I could add the logics around the fieldsvisitor casting into the DerivedSourceStoredFieldVisitor.

Copy link
Collaborator

@shatejas shatejas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the reader


@Override
public StoredFieldsReader clone() {
return new DerivedSourceStoredFieldsReader(delegate.clone(), derivedSourceVectorInjector);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will clone affect the refcounts for delegate in any way? Are we sure it will be closed when its supposed to

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me double check this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is supposed to heavily delegate. So, when a certain method is called, we want to call the delegate's method. Hence, I believe this is correct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Override
public StoredFieldsFormat storedFieldsFormat() {
DerivedSourceReadersSupplier derivedSourceReadersSupplier = new DerivedSourceReadersSupplier(
(segmentReadState) -> knnVectorsFormat().fieldsReader(segmentReadState),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If this is for Lucene engine, you will also open index inputs for graph files, might be a bit unnecessary. you might be able to do

() -> new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer()).fieldsReader

The only caveat is that will changing lucene versions it might be hard to keep a trace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be any penalty incurred for opening indexInput but not reading from it? My hesitancy to use Lucene99FlatVectorsFormat is that I worry we dont explicitly mention thats how we are storing vectors. So Im leaning towards leaving how it is.

Copy link
Member Author

@jmazanec15 jmazanec15 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this change and ended up getting an error - so might hold off:

[2025-01-28T20:09:15,944][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-0] [original-enable-testnestedmultidocbasiczcfrqz][0] marking and sending shard failed due to [shard failure, reason [refresh failed source[schedule]]]
org.apache.lucene.index.CorruptIndexException: Problem reading index from store(ByteSizeCachingDirectory(HybridDirectory@/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/GfG-oBsOSoy75aVhmKoErg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@47495223)) (resource=store(ByteSizeCachingDirectory(HybridDirectory@/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/GfG-oBsOSoy75aVhmKoErg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@47495223)))
        at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:165) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:96) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:545) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:607) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:72) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:52) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:433) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:413) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1865) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1844) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4705) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:1300) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1444) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955) [opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.io.FileNotFoundException: No sub-file with id .vemf found in compound file "_0.cfs" (fileName=_0.vemf files: [_Lucene912_0.tmd, _Lucene912_0.psm, .fnm, _Lucene90_0.dvd, .kdd, _Lucene912_0.tip, .kdm, _Lucene90_0.dvm, _Lucene912_0.tim, _Lucene912_0.doc, .kdi, _NativeEngines990KnnVectorsFormat_0.vec, .fdm, .fdx, _NativeEngines990KnnVectorsFormat_0.vemf, .fdt])
        at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.openInput(Lucene90CompoundReader.java:170) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.opensearch.knn.index.codec.KNN80Codec.KNN80CompoundDirectory.openInput(KNN80CompoundDirectory.java:50) ~[?:?]
        at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsReader.readMetadata(Lucene99FlatVectorsReader.java:90) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsReader.<init>(Lucene99FlatVectorsReader.java:65) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat.fieldsReader(Lucene99FlatVectorsFormat.java:95) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
        at org.opensearch.knn.index.codec.KNN9120Codec.KNN9120Codec.lambda$storedFieldsFormat$0(KNN9120Codec.java:74) ~[?:?]
        at org.opensearch.knn.index.codec.derivedsource.DerivedSourceReadersSupplier.getReaders(DerivedSourceReadersSupplier.java:36) ~[?:?]
        at org.opensearch.knn.index.codec.derivedsource.DerivedSourceVectorInjector.<init>(DerivedSourceVectorInjector.java:54) ~[?:?]
        at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.createDerivedSourceVectorInjector(DerivedSourceStoredFieldsReader.java:63) ~[?:?]
        at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.<init>(DerivedSourceStoredFieldsReader.java:59) ~[?:?]
        at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.<init>(DerivedSourceStoredFieldsReader.java:44) ~[?:?]
        at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsFormat.fieldsReader(DerivedSourceStoredFieldsFormat.java:57) ~[?:?]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, Thanks for trying. Maybe added a TODO for now

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be any penalty incurred for opening indexInput but not reading from it?

If its already open, then no since its mmapped. But otherwise it will. Was just trying to avoid it

Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 force-pushed the derived-source-vectors-2.x branch from a6fc86e to 74387f9 Compare January 29, 2025 04:22
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@jmazanec15 jmazanec15 force-pushed the derived-source-vectors-2.x branch from 6e104f2 to 399d281 Compare January 29, 2025 17:14
public void inject(int docId, Map<String, Object> sourceAsMap) throws IOException {
KNNVectorValues<?> vectorValues = vectorValuesSupplier.get();
if (vectorValues.docId() == docId || vectorValues.advance(docId) == docId) {
sourceAsMap.put(fieldInfo.name, vectorValues.getVector());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need conditional clone vector here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in general, I shouldnt need to clone the vector because it get serialized (and copied) later. Im going to remove it from above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navneet1v can you double check my reasoning here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shatejas and @jmazanec15 yes we will need a conditional clone here for the vector because vector values will give vector with same reference always. and For a Map<String, Object> sourceAsMap the map will just store the reference of the vector and won't do any serialization during map insertions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @navneet1v - need to clone

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
@@ -72,6 +74,37 @@ public static <T> KNNVectorValues<T> getVectorValues(final FieldInfo fieldInfo,
return getVectorValues(FieldInfoExtractor.extractVectorDataType(fieldInfo), vectorValuesIterator);
}

/**
* Returns a {@link KNNVectorValues} for the given {@link FieldInfo} and {@link LeafReader}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit-pick] the java doc is not correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

*/
public static <T> KNNVectorValues<T> getVectorValues(
final FieldInfo fieldInfo,
final DocValuesProducer docValuesProducer,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need docValuesProducer here? Since this feature will be applicable on newer indexes right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we completely get rid of ability to use docValuesProducer for vectors? I thought it could still be possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the DV is used for older indices and this setting is never going to be enabled for new indices so we don't need this.

@@ -117,6 +128,12 @@ private LuceneFieldMapper(
this.vectorFieldType = null;
}

if (isDerivedSourceEnabled) {
this.fieldType = new FieldType(this.fieldType);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need a copying of fieldType for LuceneFieldMapper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets frozen in Lucene -

return KnnByteVectorField.createFieldType(dimension / Byte.SIZE, VectorSimilarityFunction.EUCLIDEAN);
. So, we need to copy and add. Ill add a comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah my bad. For lucene we freeze the field early.

@@ -363,7 +370,8 @@ public Mapper.Builder<?> parse(String name, Map<String, Object> node, ParserCont
modelDaoSupplier.get(),
parserContext.indexVersionCreated(),
null,
null
null,
KNNSettings.isKNNDerivedSourceEnabled(parserContext.getSettings())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question]
should enable this check for specific versions or for all the older versions this setting will be false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will be false for all older versions. Will validate manually by de-registering the setting, creating an index, and then checking this value.

public void inject(int docId, Map<String, Object> sourceAsMap) throws IOException {
KNNVectorValues<?> vectorValues = vectorValuesSupplier.get();
if (vectorValues.docId() == docId || vectorValues.advance(docId) == docId) {
sourceAsMap.put(fieldInfo.name, vectorValues.getVector());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shatejas and @jmazanec15 yes we will need a conditional clone here for the vector because vector values will give vector with same reference always. and For a Map<String, Object> sourceAsMap the map will just store the reference of the vector and won't do any serialization during map insertions

/**
* Factory for creating {@link PerFieldDerivedVectorInjector} instances.
*/
public class PerFieldDerivedVectorInjectorFactory {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this as package private?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change

*/
@Override
public StoredFieldsReader getMergeInstance() {
try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just give the this instance why we are creating a new instance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that for merging, the reader is used. So, if we use this, it will add the vectors back into source

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add this as a java doc here. This seems like a good case where someone needs to know why we need to create merge instance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think covered in abvove comment

Comment on lines +213 to +224
if (fieldInfo.hasNorms() && derivedSourceReaders.getNormsProducer() != null) { // the field indexes norms
iterator = derivedSourceReaders.getNormsProducer().getNorms(fieldInfo);
} else if (fieldInfo.getVectorDimension() != 0 && derivedSourceReaders.getKnnVectorsReader() != null) { // the field indexes vectors
switch (fieldInfo.getVectorEncoding()) {
case FLOAT32:
iterator = derivedSourceReaders.getKnnVectorsReader().getFloatVectorValues(fieldInfo.name);
break;
case BYTE:
iterator = derivedSourceReaders.getKnnVectorsReader().getByteVectorValues(fieldInfo.name);
break;
}
} else if (fieldInfo.getDocValuesType() != DocValuesType.NONE && derivedSourceReaders.getDocValuesProducer() != null) { // the field
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we would have 3 types of field indexes? when this is just vector field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily the vector field. This is for a field in one child before adding back the vector. We are trying to figure out which docId that child maps to. To do it, I get the first field on/after the offset that contains that field.

So, this is basically field exists - https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java

Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>

public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_KEY = "knn-derived-source-enabled";
public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_TRUE_VALUE = "true";
public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_FALSE_VALUE = "false";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit-pick] this constant is not used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VijayanB mentioned this. Will remove in another review

);
// setting it explicitly false here to ensure that when flatmapper is used Lucene based Vector field is not created.
this.useLuceneBasedVectorField = false;
this.perDimensionValidator = selectPerDimensionValidator(vectorDataType);
this.fieldType = new FieldType(KNNVectorFieldMapper.Defaults.FIELD_TYPE);
this.fieldType.setDocValuesType(DocValuesType.BINARY);
if (isDerivedSourceEnabled) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we adding only if it is true? is that intentional? if so why do we have DERIVED_VECTOR_FIELD_ATTRIBUTE_FALSE_VALUE ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Im going to remove FALSE in another PR.

} else if (docValuesProducer != null) {
docIdSetIterator = docValuesProducer.getBinary(fieldInfo);
} else {
throw new IllegalArgumentException("Field does not have vector values and DocValues");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Field should have either vector values or DocValues

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update in another PR

@shatejas
Copy link
Collaborator

Looks good overall

// Setup the iterator. Return if no parent
String childFieldName = ParentChildHelper.getChildField(childFieldInfo.name);
String parentFieldName = ParentChildHelper.getParentField(childFieldInfo.name);
if (parentFieldName == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: out of curiosity, is this just being defensive or are we handling a case here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

being defensive

@jmazanec15 jmazanec15 merged commit 59b8e6b into opensearch-project:2.x Jan 29, 2025
100 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to main failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-main main
# Navigate to the new working tree
cd .worktrees/backport-main
# Create a new branch
git switch --create backport/backport-2449-to-main
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 59b8e6bb24124c4f624e4711cf11f89948ffc594
# Push it to GitHub
git push --set-upstream origin backport/backport-2449-to-main
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-main

Then, create a pull request where the base branch is main and the compare/head branch is backport/backport-2449-to-main.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 30, 2025
Generates the vector source in the source field from the
KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to
our existing custom codec.

Currently, feature is experimental and behind a feature flag via index
setting. In the future, we need to iterate to improve performance
and stability for nested/object portions.

Signed-off-by: John Mazanec <jmazane@amazon.com>
(cherry picked from commit 59b8e6b)
jmazanec15 added a commit to jmazanec15/k-NN-1 that referenced this pull request Feb 6, 2025
…#2449)

Generates the vector source in the source field from the
KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to
our existing custom codec.

Currently, feature is experimental and behind a feature flag via index
setting. In the future, we need to iterate to improve performance
and stability for nested/object portions.

Signed-off-by: John Mazanec <jmazane@amazon.com>
jmazanec15 added a commit that referenced this pull request Feb 6, 2025
Generates the vector source in the source field from the
KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to
our existing custom codec.

Currently, feature is experimental and behind a feature flag via index
setting. In the future, we need to iterate to improve performance
and stability for nested/object portions.

Signed-off-by: John Mazanec <jmazane@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport main backport 2.19 Features Introduces a new unit of functionality that satisfies a requirement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants