
Expose FlatVectorsFormat #13469

Merged: 7 commits into apache:main, Jun 13, 2024

Conversation

@msokolov (Contributor) commented Jun 7, 2024

Just posting this here for discussion purposes; the API details are up for discussion.

@@ -217,6 +220,18 @@ public ByteVectorValues getByteVectorValues(String field) throws IOException {
vectorData);
}

@Override
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs) throws IOException {
msokolov (Contributor, author):

Hmm, as I work with this I am realizing this search interface is not really what I'm after... Basically I would like to be able to combine a DocIdSet from a Query with a FlatVectorScorer so that I can score documents using a quantized dot-product. So I think what I need is a RandomVectorScorer returned from the FlatVectorsReader, and some way to go from docid to ord for my vector field. But it seems we only support ordToDoc now and not the other way?

Member:

@msokolov getByteVectorValues and getFloatVectorValues already do this. They provide an iterator that will return a VectorScorer, which can score and forward-iterate via doc ids (it supports advance, next, etc.).

Also, with the "flat codecs" it doesn't make much sense to gather up the accepted docs, place the results in a collector, etc. You want to score while you iterate. That would likely be way cheaper than doing it all up front the way the kNN search does.
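For illustration, the iterate-and-score pattern described above might look roughly like this. It is only a sketch of the API as described in this thread; leafReader, queryVector, acceptDocs, and collect(...) are placeholder names, not code from this PR.

import java.io.IOException;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.VectorScorer;
import org.apache.lucene.util.Bits;

// Sketch: score documents while iterating, instead of collecting results up front.
static void scoreWhileIterating(LeafReader leafReader, float[] queryVector, Bits acceptDocs)
    throws IOException {
  FloatVectorValues values = leafReader.getFloatVectorValues("vector_field");
  if (values == null) {
    return; // no float vectors for this field in this segment
  }
  VectorScorer scorer = values.scorer(queryVector); // scores against the flat (possibly quantized) vectors
  DocIdSetIterator it = scorer.iterator();          // forward iteration in doc id order
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    if (acceptDocs == null || acceptDocs.get(doc)) {
      collect(doc, scorer.score()); // collect(...) is a placeholder for whatever consumes the scores
    }
  }
}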

msokolov (Contributor, author):

Yes, of course, OK. So I think this will work like this: if I leave the search() methods throwing UnsupportedOperationException, I can use PerFieldKnnVectorsFormat to tie a field to a Lucene99ScalarQuantizedVectorsFormat, at which point I can call getFloatVectorValues and use the returned scorer's score() method. Or perhaps call getQuantizedVectorValues directly and use its score method. I'll give it a try.

I'm OK with all that. If this becomes popular we might want to find a way that doesn't require using codec-specific classes, but we can let that sit for now I think. Thank you!
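For readers following along, the per-field wiring described above might look roughly like the sketch below. It assumes that, with this change, the flat formats can be returned from PerFieldKnnVectorsFormat like any other KnnVectorsFormat; the field name and the choice of default format are only illustrative.

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat;
import org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat;

// Sketch: tie one field to a flat (exact-scoring) format and others to the default HNSW format.
KnnVectorsFormat perField =
    new PerFieldKnnVectorsFormat() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        if ("quantized_vec".equals(field)) {
          return new Lucene99ScalarQuantizedVectorsFormat(); // flat scalar-quantized vectors, no HNSW graph
        }
        return new Lucene99HnswVectorsFormat();
      }
    };

This per-field format would then be returned from a custom Codec's knnVectorsFormat() override.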

Member:

@msokolov I suppose that is OK for folks who understand the depths of codec formats. But we are seriously complicating life for somebody who wants to use kNN with a scalar-quantized flat codec. Now they cannot use the kNN query, but instead have to use some specialized query they come up with themselves.

I think we should at least have a "flat vector query" or something? Something that always does exact search, so that typical users don't have to do this iterating & casting stuff?

Contributor:

> @msokolov I suppose that is OK for folks who understand the depths of codec formats. But we are seriously complicating life for somebody who wants to use kNN with a scalar-quantized flat codec. Now they cannot use the kNN query, but instead have to use some specialized query they come up with themselves.

+1 on this comment.

> I think we should at least have a "flat vector query" or something? Something that always does exact search, so that typical users don't have to do this iterating & casting stuff?

Rather than having a new query, can we just reuse the existing query, where for a field we get the right KnnVectorsReader and hit the reader's search interface, which does exact search or HNSW-based search depending on which implementation the KnnVectorsReader uses?

msokolov (Contributor, author):

I hear you all; this seems a little weird. One thing is, you cannot get access to this stuff at all unless you are willing to create a custom Codec, so by definition we have weeded out people who don't want to mess around with Codecs. Another thing is, we don't really want to create a trap for people who are trying to find the top N vector-distance documents matching some constraint. By blindly following the API trail they might think they should create a BitSet representing their constraint and pass that in to a Knn*VectorQuery, when instead they would be better off executing their constraint as the primary Query and using the VectorValues API to do the scoring. I'm struggling to see the use case for this brute-force query; it just seems like a trap to me.

msokolov (Contributor, author):

Maybe we can make a KnnVectorScoreQuery that wraps another Query, supplying the vector score. Actually I think you can already do this with FunctionScoreQuery + a DoubleValuesSource wrapping a KnnVectorScorer? Perhaps we should provide that as a single Query?
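A very rough sketch of that composition follows. FunctionScoreQuery and DoubleValuesSource are existing Lucene classes, but VectorScorerValuesSource is a hypothetical values source (not part of this PR or of Lucene) standing in for something that exposes the per-document vector similarity; queryVector is likewise assumed.

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: run the constraint as the primary query, and supply the vector similarity as the score.
Query constraint = new TermQuery(new Term("color", "blue"));
DoubleValuesSource vectorScore =
    new VectorScorerValuesSource("vector_field", queryVector); // hypothetical wrapper around a VectorScorer
Query scored = new FunctionScoreQuery(constraint, vectorScore);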

@msokolov added this to the 9.12.0 milestone on Jun 10, 2024
@msokolov changed the title from "WIP expose FlatVectorsFormat" to "Expose FlatVectorsFormat" on Jun 10, 2024
@@ -189,6 +191,18 @@ public ByteVectorValues getByteVectorValues(String field) throws IOException {
return rawVectorsReader.getByteVectorValues(field);
}

@Override
Member:

So, doing this means the KnnFloatVectorQuery will throw an error if it's used against a field with a codec using this reader.

Traditionally, queries were fairly ignorant of the underlying codec used to store the field.

It does seem weird to have a codec that the kNN queries could be executed against throw an exception.

I think either:

  • The knn queries should automatically do the correct things (e.g. not eagerly rewrite themselves, etc.)
  • We don't throw on search like this.

Member:

> The knn queries should automatically do the correct things (e.g. not eagerly rewrite themselves, etc.)

I am not sure how to do this other than having something on the leaf readers that says "supports approximate search". Or a try {} catch () in the query, but then we are using exceptions for query control flow, which seems trappy and bad.

msokolov (Contributor, author):

I did toy with adding a "supportsSearch" flag when this problem surfaced due to BaseKnnVectorsFormatTestCase. Potentially we could return no results. I feel like this is the moral equivalent of a stored field with IndexOptions.NONE (like StoredField.TYPE)? But we have tended not to want to make such vector-representation choices a part of the Fields API, so it looks different.

@msokolov (Contributor, author):

I changed this to return zero results from search() rather than throwing UnsupportedOperationException, to be more in line with how we treat searches over other kinds of fields that only have stored data and are not indexed.

I also proposed a query (VectorScoreQuery?) that could be used to wrap another Query and provide the score using a vector scorer, but I think we can handle that as a separate issue.

@benwtrent (Member) left a comment:

Some concerns. Given the desire for a better brute-force query, I am coming around to the idea :)

Comment on lines 222 to 226
@Override
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
throws IOException {
// don't scan stored field data. If we didn't index it, produce no search results
}
Member:

Since this is the case for all FlatVectorsReader objects, do you think we should put this override there?

Comment on lines 195 to 204
public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs)
throws IOException {
// don't scan stored field data. If we didn't index it, produce no search results
}

@Override
public void search(String field, byte[] target, KnnCollector knnCollector, Bits acceptDocs)
throws IOException {
// don't scan stored field data. If we didn't index it, produce no search results
}
Member:

Maybe this should be in FlatVectorsReader

Comment on lines 82 to 83
public Lucene99FlatVectorsFormat(String name, FlatVectorsScorer vectorsScorer) {
super(name);
Member:

I wonder why for Lucene99FlatVectorsFormat we allow an external name to be provided. Is this because it isn't really ever loaded via the SPI?

It seems like Lucene99FlatVectorsFormat shouldn't accept a name as a parameter and should simply pass the correct name to super.
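Something along these lines, as a sketch of the suggestion (the field name and super argument are assumptions, not code from this PR):

// Fix the format name instead of taking it as a constructor parameter.
public Lucene99FlatVectorsFormat(FlatVectorsScorer vectorsScorer) {
  super("Lucene99FlatVectorsFormat"); // the fixed name the format is known by
  this.vectorsScorer = vectorsScorer; // assumed field; see the actual class for details
}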

msokolov (Contributor, author):

Makes sense. TBH I kind of just filled in the blanks with this one without thinking much about it, but I agree the name is really only useful for the SPI interface.

@@ -128,7 +129,7 @@ public HnswBitVectorsFormat(
} else {
this.mergeExec = null;
}
- this.flatVectorsFormat = new Lucene99FlatVectorsFormat(new FlatBitVectorsScorer());
+ this.flatVectorsFormat = new Lucene99FlatVectorsFormat(NAME_FLAT, new FlatBitVectorsScorer());
Member:

I wonder why a unique name here is required.

The top-level format name is HnswBitVectorsFormat, and it doesn't really do anything to the inner format; it simply overrides the scoring mechanism.

msokolov (Contributor, author):

On the subject of the brute-force query: I started to look at adding one and found we already have o.a.l.queries.function.valuesource.VectorSimilarityFunction. I think maybe all we need here is a static convenience method that produces a FunctionQuery wrapping one of these, although I'll confess I'm not entirely conversant with this API. It seems like it might want to be rewritten in terms of VectorScorer.

@msokolov (Contributor, author):

For the failed test here I'll add a test assumption that the randomly selected vector format is one that supports search (i.e. not a FlatVectorsFormat), since the test requires that.

@msokolov merged commit 487d24a into apache:main on Jun 13, 2024 (3 checks passed)
@msokolov deleted the flat-vectors-format branch on Jun 15, 2024
@msokolov (Contributor, author):

I will add a CHANGES entry and backport to 9.x

msokolov added a commit to msokolov/lucene that referenced this pull request Jun 17, 2024
@navneet1v (Contributor):

@msokolov and @benwtrent I see that we exposed the FlatVectorsFormat, but I am not seeing an entry for this format in this file: https://github.com/apache/lucene/blob/main/lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat, which is used by SPI to load the format. Am I missing something here?

@msokolov (Contributor, author):

Hi @navneet1v, I think you may be right, although TBH I find the way Lucene handles formats with SPI to be somewhat confusing. We're using this in our own service with a custom Codec that seems not to require the META-INF/SPI lookup, but maybe there is some shortcoming to that approach.

@navneet1v (Contributor):

Hi @msokolov thanks for sharing your thoughts.

> I find the way Lucene handles formats with SPI to be somewhat confusing.

+1. But my experience with SPI is limited, so maybe that's just me.

> We're using this in our own service with a custom Codec that seems not to require the META-INF/SPI lookup, but maybe there is some shortcoming to that approach.

I would like to know how you are doing it, if there is some reference, because I am also using a custom codec. The one way I figured out to use this is to create my own KnnVectorsFormat and then use the flat writers and readers directly. Let me know if there is a better way, but this is the best I can think of.
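For what it's worth, the delegation approach described above might look roughly like the sketch below. The class name, the scorer choice, and the exact Lucene99FlatVectorsFormat constructor are assumptions that vary across Lucene versions; treat it as a shape, not a drop-in implementation.

import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer;
import org.apache.lucene.codecs.hnsw.FlatVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

// Sketch: a custom KnnVectorsFormat that delegates writing/reading to the flat format.
public class MyFlatVectorsFormat extends KnnVectorsFormat {
  private final FlatVectorsFormat delegate =
      new Lucene99FlatVectorsFormat("Lucene99FlatVectorsFormat", new DefaultFlatVectorScorer());

  public MyFlatVectorsFormat() {
    super("MyFlatVectorsFormat"); // the name SPI would resolve when reading the index back
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return delegate.fieldsWriter(state); // FlatVectorsWriter is a KnnVectorsWriter
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return delegate.fieldsReader(state); // FlatVectorsReader is a KnnVectorsReader
  }

  // Depending on the Lucene version, getMaxDimensions(String) may also need an override.
}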

@msokolov (Contributor, author):

> I would like to know how you are doing it, if there is some reference, because I am also using a custom codec. The one way I figured out to use this is to create my own KnnVectorsFormat and then use the flat writers and readers directly. Let me know if there is a better way, but this is the best I can think of.

I think we need to add the format to SPI. That way we will be able to evolve it going forward and let the index reading code pick the proper implementation corresponding to what was written to the index. Making the Codec figure this out is probably just a short-term hack that won't scale well.
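Concretely, registering a format via SPI just means adding its fully qualified class name to the provider-configuration file navneet1v linked (META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat). The existing entries look like the lines below, and a flat format (or a custom delegating format) would need its own line there; the commented line is only illustrative, not something this PR added:

org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat
org.apache.lucene.codecs.lucene99.Lucene99HnswScalarQuantizedVectorsFormat
# a flat or custom delegating format would be added as another line, for example:
# org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat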
