Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft] Code to reproduce the disconnectedness in HNSW using open source datasets #236

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nitirajrathore
Copy link
Contributor

Adding this code for someone to checkout and comment but not merge yet. This change picks up changes from earlier pending PR #234.
Added CheckHNSWConnectedness for finding the connectedness of HNSW graph at each level of the graph and also overall disconnected nodes of the graph.

Refactored KnnIndexer out of KnnGraphTester to be able to use standalone and created KnnIndexerMain for it.

##Tests

  1. Use existing ant script or new Gradle task to create the vector files for wiki data using the gloVe data.
    ant vectors300-docs
    ant vectors100-docs
  2. Create the index for 100 dimension and 300 dimension vectors using the KnnIndexerMain
    ./gradlew :src:main:run -PmainClass=knn.KnnIndexerMain --args=" -docvectorspath lucene/benchmarks/data/enwiki-20120502-lines-1k-100d.vec -indexpath lucene/benchmarks/indices/vector_index -maxconn 16 -beamwidth 100 -vectorencoding FLOAT32 -similarityfunction DOT_PRODUCT -numdocs 1000000 -dimension 100"
  3. Check for HNSW connectedness
    ./gradlew :src:main:checkHnswConnected -Pindex="lucene/benchmarks/indices/vector300_index" -Pknn-field="knn"

The script currently only generated single segment indexes. I will try to create multiple segments and check again. But this proves that the there is disconnectedness in graph even when there is less dynamism in creating and updating documents.

Results

100-Dim 1M-vectors

~ 0.4 % disconnectedness

For index /home/nitirar/mywork/lucene/benchmarks/indices/vector100_index field : knn
Level = 0       Total Nodes = 1000000   Reachable Nodes = 996035        Unreachable Nodes = 3965        %Disconnectedness = 0.3965
Level = 1       Total Nodes = 62820     Reachable Nodes = 61442 Unreachable Nodes = 1378        %Disconnectedness = 2.193569
Level = 2       Total Nodes = 3879      Reachable Nodes = 3820  Unreachable Nodes = 59  %Disconnectedness = 1.5210105
Level = 3       Total Nodes = 233       Reachable Nodes = 232   Unreachable Nodes = 1   %Disconnectedness = 0.42918456
Level = 4       Total Nodes = 12        Reachable Nodes = 12    Unreachable Nodes = 0   %Disconnectedness = 0.0
Overall graph :         Total Nodes = 1000000   Reachable Nodes = 996183        Unreachable Nodes = 3817        %Disconnectedness = 0.3817

300-Dim 1M-vectors

~ 1 % disconnectedness

For index /home/nitirar/mywork/lucene/benchmarks/indices/vector300_index field : knn
Level = 0       Total Nodes = 1000000   Reachable Nodes = 990377        Unreachable Nodes = 9623        %Disconnectedness = 0.9623
Level = 1       Total Nodes = 62820     Reachable Nodes = 60100 Unreachable Nodes = 2720        %Disconnectedness = 4.329831
Level = 2       Total Nodes = 3879      Reachable Nodes = 3756  Unreachable Nodes = 123 %Disconnectedness = 3.1709204
Level = 3       Total Nodes = 233       Reachable Nodes = 233   Unreachable Nodes = 0   %Disconnectedness = 0.0
Level = 4       Total Nodes = 12        Reachable Nodes = 12    Unreachable Nodes = 0   %Disconnectedness = 0.0
Overall graph :         Total Nodes = 1000000   Reachable Nodes = 990729        Unreachable Nodes = 9271        %Disconnectedness = 0.9271

Nitiraj Rathore added 2 commits October 16, 2023 11:14
…arate folder. From now the project should be compiled using ./gradlew build

added 2 methods to StringFieldDocSelector to allow compilation with lucene-9.8 and current main branch.
changes for creating knn index independently
renamed the python file with .py extension

import java.io.IOException;
import java.nio.file.Paths;
import java.util.*;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's avoid *-import


public class CheckHNSWConnectedness {
private static int getReachableNodes(HnswGraph graph, int level) throws IOException {
Set<Integer> visited = new HashSet<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have this nice IntIntHashMap that stores primitives


visited.add(node);
// emulate collapsing all the edges
for (int level = 0; level < numLevels; level++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for levels > 0 can we iterate over the nodes in the level rather than all the nodes?

@mikemccand
Copy link
Owner

Should we commit what we have here even though it's still draft?

@nitirajrathore
Copy link
Contributor Author

Should we commit what we have here even though it's still draft?

Hi @mikemccand. I will do it in parts. First part is to refactor the existing code for which I have raised the PR. Its mostly refactoring so we can immediately commit with someones help. If you find time please review it. : #254
Once thats done I will prepare second PR.

CC : @msokolov

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants