Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BaseVectorSimilarityQueryTestCase assumes connected hnsw graph #13260

Merged
merged 4 commits into from
Apr 4, 2024

Conversation

msokolov
Copy link
Contributor

@msokolov msokolov commented Apr 3, 2024

…rityQueryTestCase

Description

@msokolov msokolov changed the title BaseVectorSimilarityQueryTestCase assumes connected hnsw graph to BaseVectorSimilarityQueryTestCase assumes connected hnsw graph Apr 3, 2024
@@ -165,6 +171,9 @@ public void testRandomFilter() throws IOException {

try (Directory indexStore = getIndexStore(getRandomVectors(numDocs, dim));
IndexReader reader = DirectoryReader.open(indexStore)) {
if (graphIsDisconnected(reader)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use assumeFalse(graphIsDisconnected(reader)) as this will also log a short notice that the test was not executed at all.

@@ -522,4 +534,20 @@ final Directory getIndexStore(V... vectors) throws IOException {
}
return dir;
}

private boolean graphIsDisconnected(IndexReader reader) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit crazy.... Maybe move it to the test-util, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe we want to use in some other test class too. Ill refactor and pull out the field name

Copy link
Contributor

@shubhamvishu shubhamvishu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @msokolov for fixing this!.....I left some comments

Comment on lines 92 to 115
private static int nextClearBit(FixedBitSet bits, int index) {
// Depends on the ghost bits being clear!
long[] barray = bits.getBits();
assert index >= 0 && index < bits.length() : "index=" + index + ", numBits=" + bits.length();
int i = index >> 6;
long word = ~(barray[i] >> index); // skip all the bits to the right of index

if (word != 0) {
return index + Long.numberOfTrailingZeros(word);
}

while (++i < barray.length) {
word = ~barray[i];
if (word != 0) {
int next = (i << 6) + Long.numberOfTrailingZeros(word);
if (next >= bits.length()) {
return NO_MORE_DOCS;
} else {
return next;
}
}
}
return NO_MORE_DOCS;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems to be motivated from FixedBitSet#nextSetBit and would be more suitable for FixedBitSet class?....I wonder why it does not have it already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it could go there - it's a near clone of FixedBitSet.nextSetBit but I guess the motivation is to limit the surface area of supported functions. Perhaps if this gets wider use we could move it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one would also want to discourage its use in performance-sensitive code because it requires extra work beyond what nextSetBit does

* Returns the sizes of the distinct graph components on level 0. If the graph is fully-connected
* there will only be a single component. If the graph is empty, the returned list will be empty.
*/
public static List<Integer> componentSizes(HnswGraph hnsw) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the purpose of this function is to be a utility for future use cases(if any) because if we only need to know the graph is fully connected we could just pick a node and traverse only once to check if its size is equal to graph size i.e. early terminate instead of going through all components. Alternatively, I think we should keep componentSizes as-is but implement isFullyConnected like below?

public static boolean isFullyConnected(HnswGraph hnsw) throws IOException {
    FixedBitSet connectedNodes = new FixedBitSet(hnsw.size());
    assert hnsw.size() == hnsw.getNodesOnLevel(0).size();
    System.out.println("size=" + hnsw.size());
    return connectedNodes.length() == traverseConnectedNodes(hnsw, connectedNodes);
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is a tiny bit if lookahead. I would like to be able to use this to identify distinct components so we can patch them up while merging. TBH I look at this whole class as somewhat provisional and expect it to eventually move into the normal util package, and get tested better. In the mean time I agree we can reduce the cost of isFullyConnected in the best case; I can do that. But I'd like to hold on to the componentSizes method even though it will then be unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason to keep it this way (counting components) is if we want to extend it to work with liveDocs or some other Bits filter then we won't necessarily know the expected size in advance, but the topological invariant will still hold.

@uschindler
Copy link
Contributor

I think the Unwrappable for FilterReader should be added in a separate PR. For now this is fine.

I remember vaguely that I left our FilterReader because of the static method conflicting with the interface. This may be a good job for cleanup 10.x (remove static and just implement interface).

@msokolov msokolov merged commit 342c814 into apache:main Apr 4, 2024
3 checks passed
@msokolov
Copy link
Contributor Author

msokolov commented Apr 4, 2024

thanks for the reviews!

@msokolov
Copy link
Contributor Author

msokolov commented Apr 4, 2024

hm a failure popped up when I pushed:

org.apache.lucene.search.TestKnnByteVectorQuery > testTimeout FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([AE2970724EFAD848:FE11FBC92B12FA54]:0)

I don't think it's related but I'll check

@benwtrent
Copy link
Member

@msokolov I have had something similar fail in other code, so it isn't related to your change here. I think its a flaky test. It doesn't reliably reproduce for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants