
Move max vector dims limit to Codec #12436

Merged: 3 commits into apache:main, Jul 27, 2023

Conversation

mayya-sharipova (Contributor)

Move vector max dimension limit enforcement into the default Codec's KnnVectorsFormat implementation. This allows different implementations of kNN search algorithms to define their own limits on the maximum vector dimensions that they can handle.

Closes #12309

pf.fieldName,
s.vectorDimension,
indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());
}
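
The snippet shows only the trailing arguments of the new check in IndexingChain. A plausible shape for the full call site, sketched here with validateMaxVectorDimension as a hypothetical helper name and an error message modeled on the one quoted later in this review:

// Inside initializeFieldInfo: reject the field before its FieldInfo is created.
validateMaxVectorDimension(
    pf.fieldName,
    s.vectorDimension,
    indexWriterConfig.getCodec().knnVectorsFormat().getMaxDimensions());

private static void validateMaxVectorDimension(
    String fieldName, int vectorDim, int maxVectorDim) {
  if (vectorDim > maxVectorDim) {
    throw new IllegalArgumentException(
        "Field [" + fieldName + "] vector's dimensions must be <= ["
            + maxVectorDim + "]; got " + vectorDim);
  }
}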
jpountz (Contributor)

This is probably not going to do what we want when a PerFieldKnnVectorsFormat is used, as this would check the limit on PerFieldKnnVectorsFormat rather than on the actual format that is used for pf.fieldName. Maybe getMaxDimensions should be on KnnVectorsWriter, and we could forward to pf.knnVectorsWriter here for checking?

jpountz (Contributor)

Actually my suggestion wouldn't work, as the writer would already be created with the number of dimensions of the field type when we run the check. So I guess we either need to add the field name to getMaxDimensions or make the codec responsible for performing the check rather than IndexingChain.

mayya-sharipova (Contributor, Author) commented Jul 21, 2023

@jpountz Thanks for your feedback; you are completely right, I had missed PerFieldKnnVectorsFormat.

In 21ebd51, I've added a field name to the getMaxDimensions method.

jpountz (Contributor)

I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it. What about making the codec responsible for checking the limit? Something like below:

diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
index cb3e5ef8b10..6c365e53528 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsFormat.java
@@ -108,6 +108,9 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
   public static final int VERSION_START = 0;
   public static final int VERSION_CURRENT = VERSION_START;
 
+  /** A maximum number of vector dimensions supported by this codec */
+  public static final int MAX_DIMENSIONS = 1024;
+
   /**
    * A maximum configurable maximum max conn.
    *
@@ -177,7 +180,7 @@ public final class Lucene95HnswVectorsFormat extends KnnVectorsFormat {
 
   @Override
   public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
-    return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth);
+    return new Lucene95HnswVectorsWriter(state, maxConn, beamWidth, MAX_DIMENSIONS);
   }
 
   @Override
diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
index 5358d66f16e..196f12a21ad 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsWriter.java
@@ -60,13 +60,15 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
   private final IndexOutput meta, vectorData, vectorIndex;
   private final int M;
   private final int beamWidth;
+  private final int maxDimension;
 
   private final List<FieldWriter<?>> fields = new ArrayList<>();
   private boolean finished;
 
-  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth) throws IOException {
+  Lucene95HnswVectorsWriter(SegmentWriteState state, int M, int beamWidth, int maxDimension) throws IOException {
     this.M = M;
     this.beamWidth = beamWidth;
+    this.maxDimension = maxDimension;
     segmentWriteState = state;
     String metaFileName =
         IndexFileNames.segmentFileName(
@@ -117,6 +119,9 @@ public final class Lucene95HnswVectorsWriter extends KnnVectorsWriter {
 
   @Override
   public KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo) throws IOException {
+    if (fieldInfo.getVectorDimension() > maxDimension) {
+      throw new IllegalArgumentException("Number of dimensions " + fieldInfo.getVectorDimension() + " for field " + fieldInfo.name + " exceeds the limit of " + maxDimension);
+    }
     FieldWriter<?> newField =
         FieldWriter.create(fieldInfo, M, beamWidth, segmentWriteState.infoStream);
     fields.add(newField);

mayya-sharipova (Contributor, Author) commented Jul 26, 2023

@jpountz Thank you for the additional feedback.

I worry that this adds a hashtable lookup on a hot code path. Maybe it's not that bad for vectors, which are slow to index anyway, but I'd rather avoid it.

This is not really a hot code path: we only ask for getCodec().knnVectorsFormat().getMaxDimensions in the initializeFieldInfo function, which happens only once per new field per segment.

What about making the codec responsible for checking the limit?

Thanks for the suggestion, I experimented with this idea, and encountered the following difficulty with it:

  • We need to create a new FieldInfo before passing it to KnnFieldVectorsWriter<?> addField(FieldInfo fieldInfo).
  • The way we create it is FieldInfo fi = fieldInfos.add(...), i.e. by adding to the global fieldInfos. This means that if the FieldInfo contains an incorrect number of dimensions, it will be stored like that in the global fieldInfos, and we can't change it afterwards (for example, with a second document that has the correct number of dims).

Maybe, as an alternative, we can do the validation as a separate method of KnnVectorsWriter:

public void validateFieldDims(int dims)

What do you think?

jpountz (Contributor)

Ohhh thanks for explaining, I had not fully understood how your change worked. I like that we're retaining the property that the field info doesn't even get created if its number of dimensions is above the limit.

uschindler (Contributor)

Yes, this looks fine. The hashtable lookup is no issue at all.

As getMaxDimensions() is a public codec method, it lets us do the check here. A separate validateFieldDims() is not needed.

I only don't like the long call chain to actually get the vectors format from the indexWriterConfig, but I can live with that.

Add a field name for getMaxDimensions function
jpountz (Contributor) left a comment

Thanks @mayya-sharipova, it took me time to understand how your change works, but I think it's good, in particular the fact that it's transactional: a field will not make it into the IndexWriter's field infos if it has a number of dimensions above the limit.

@uschindler I'm curious what you think of this change since you seemed to support the idea of moving the limit to the codec in the past.

* @param fieldName the field name
* @return the maximum number of vector dimensions.
*/
public int getMaxDimensions(String fieldName) {
uschindler (Contributor)

In the main branch this should be abstract; only in 9.x should we have a default impl.
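
A sketch of what that split could look like in KnnVectorsFormat, assuming the default stays at the historical 1024 limit (the constant name DEFAULT_MAX_DIMENSIONS is an assumption):

// main branch: every format must declare its own limit
public abstract int getMaxDimensions(String fieldName);

// 9.x branch: keep a backwards-compatible default implementation
public static final int DEFAULT_MAX_DIMENSIONS = 1024; // assumed constant name

public int getMaxDimensions(String fieldName) {
  return DEFAULT_MAX_DIMENSIONS;
}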

@@ -80,6 +80,11 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException
return new FieldsReader(state);
}

@Override
uschindler (Contributor)

Is this the only subclass of KnnVectorsFormat that we have?

In addition, we should also add an explicit number of dimensions in the backwards codecs, because at the time they were implemented, 1024 was their default. In the backwards codecs the method should be final, IMHO.

Contributor

We have SimpleTextKnnVectorsFormat too.
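
The truncated hunk above ends at the @Override; for PerFieldKnnVectorsFormat, a delegating override along these lines would make the wrapper report the limit of the format actually configured for the field (a sketch, assuming the existing getKnnVectorsFormatForField hook):

@Override
public int getMaxDimensions(String fieldName) {
  // Ask the per-field format, not the wrapper, so each field's real limit applies.
  return getKnnVectorsFormatForField(fieldName).getMaxDimensions(fieldName);
}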

"f",
new float[getVectorsMaxDimensions("f") + 1],
VectorSimilarityFunction.DOT_PRODUCT));
Exception exc = expectThrows(IllegalArgumentException.class, () -> w.addDocument(doc));
uschindler (Contributor)

So basically the check is now delayed to addDocument()? Great!

I was afraid that the check would come later, while indexing or flushing is happening. So to me this looks good.
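
For reference, a self-contained sketch of such a test; the field name "f" matches the snippet above, and the newDirectory and expectThrows helpers come from LuceneTestCase:

// Inside a LuceneTestCase subclass; uses org.apache.lucene.document.*,
// org.apache.lucene.index.*, and org.apache.lucene.store.Directory.
try (Directory dir = newDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
  int maxDims = w.getConfig().getCodec().knnVectorsFormat().getMaxDimensions("f");
  Document doc = new Document();
  doc.add(new KnnFloatVectorField(
      "f", new float[maxDims + 1], VectorSimilarityFunction.DOT_PRODUCT));
  // The oversized vector is rejected at addDocument() time,
  // before any FieldInfo is created or published.
  expectThrows(IllegalArgumentException.class, () -> w.addDocument(doc));
}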


uschindler (Contributor)
This looks fine to me. I am happy that we do not have stupid system properties.

Somebody who wants to raise the limit (or lower it to 32, like in the test) can simply implement their own codec; see the sketch below.

One question I have: what happens if you open an index whose field infos contain a higher limit, and you use the default codec? I think this is unsupported, but in that case the implementor of the codec should probably use their own codec name.
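
A minimal sketch of that custom-codec route, assuming the Lucene 9.x APIs as of this PR; HighDimCodec, its format name, and the 4096 limit are made up for illustration, and a custom codec name also needs SPI registration so the index stays readable:

import java.io.IOException;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public class HighDimCodec extends FilterCodec {
  public HighDimCodec() {
    super("HighDimCodec", new Lucene95Codec());
  }

  @Override
  public KnnVectorsFormat knnVectorsFormat() {
    return new KnnVectorsFormat("HighDimVectorsFormat") {
      private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();

      @Override
      public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
        return delegate.fieldsWriter(state); // unchanged HNSW writing
      }

      @Override
      public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
        return delegate.fieldsReader(state); // unchanged HNSW reading
      }

      @Override
      public int getMaxDimensions(String fieldName) {
        return 4096; // raised write-time limit; the underlying format must still cope
      }
    };
  }
}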

mayya-sharipova (Contributor, Author)

Thanks @jpountz and @uschindler for the reviews. I will do the following:

  • merge this PR in its current form to main and branch_9x;
  • create a separate PR on top of it, for the main branch only, to make KnnVectorsFormat#getMaxDimensions abstract, with all codecs providing their own implementation of it.

@mayya-sharipova mayya-sharipova merged commit 98320d7 into apache:main Jul 27, 2023
4 checks passed
@mayya-sharipova mayya-sharipova deleted the max-vector-dims-to-codec branch July 27, 2023 18:50
mayya-sharipova added a commit that referenced this pull request Jul 27, 2023
Move vector max dimension limit enforcement into the default Codec's KnnVectorsFormat implementation. This allows different implementations of kNN search algorithms to define their own limits on the maximum vector dimensions that they can handle.

Closes #12309
mayya-sharipova added a commit to mayya-sharipova/lucene that referenced this pull request Jul 28, 2023
…ensionTooLarge

Depending on whether a document with dimensions > maxDims is created in a new segment or in an already existing segment, we may get different error messages. This fix adds another possible error message we may get.

Relates to apache#12436
mayya-sharipova added a commit that referenced this pull request Jul 28, 2023
…ensionTooLarge (#12467)

Depending on whether a document with dimensions > maxDims is created in a new segment or in an already existing segment, we may get different error messages. This fix adds another possible error message we may get.

Relates to #12436
jpountz (Contributor) commented Jul 28, 2023

One question I have: what happens if you open an index whose field infos contain a higher limit, and you use the default codec? I think this is unsupported, but in that case the implementor of the codec should probably use their own codec name.

This change doesn't touch the read logic; it only adds write-time validation. So if you first index vectors above the default limit using a custom codec that reuses the same codec name as the default codec, and then open this index with the default codec, you will be able to perform reads, but writes will fail. Using a different codec name would certainly make things simpler, as it would force Lucene to use the same codec at read time instead of the default codec.

uschindler (Contributor)

That's what I expected. Thanks.

Comment on lines +846 to +847
+ "]"
+ "vector's dimensions must be <= ["
Contributor

Minor: #12605 proposes to have a space before the "vector's dimensions" words here.

mkleen mentioned this pull request Oct 2, 2023
Successfully merging this pull request may close these issues:

Move aKNN limits enforcement into the default Codec's KnnVectorsFormat implementation (#12309)