Index Scan only returning 51 rows to surrounding filtered query with num_neighbors default of 50. #110
Comments
@mgrosso sorry for the delay in responding. My initial impression is that this is a very weird, non-representative dataset: first, because the number of dimensions is only 3, and second, because the rows repeat a lot. The latter is especially problematic because there will be a lot of rows with a distance of 0. The way the graph algorithm works is that each node stores its closest neighboring nodes first, and the further nodes after that. Given that there would be 4096 nodes at a distance of 0 competing for a num_neighbors of 50, you can see how that may be a problem. What I am guessing is happening here is that the graph is becoming disconnected. Before we debug further, may I ask if this is the actual dataset you plan to use, and if so, why? I am guessing you simply won't see these issues if you use a real embedding dataset.
Hi @cevian, this is definitely a test dataset, intended to prove that we can filter past a large number of results to get to the (approximately) best results even when a large number of closer results exist that don't match the filter, since we do expect that scenario to happen sometimes for our use case. I'll reconstruct the test data using some random noise so we don't have large numbers of dupes but still challenge the ability to filter past a sizable percentage of closer vectors. If that reproduces, I can then see if the vector dimension makes a difference, since I expect to use more typical embedding dimensions; using a dimension of three just made the test cases easier to construct.

Side note: I wouldn't expect to see this issue with the benchmark datasets I've found so far, but I also haven't found any that I feel do a good job of representing the relevance-correlated filtering weirdness of enterprise data at scale, and I was specifically worried about the ability to filter past a large number of rows in a diskann index scan when necessary, having been burnt by HNSW.
@cevian I've replicated the issue with a dimension of 128, keeping the data in 8 clusters so we can still replicate the main issue, where the query has to find rows from the second most relevant cluster, requiring a significant number of rows to be returned by the index scan. To prevent any distances of 0, I've added jitter to every element. I'll add download links and detailed SQL steps and output shortly, as well as code to regenerate the dataset.
@cevian I've provided steps to replicate, test data, and code to regenerate the test data all here: https://github.com/mgrosso/issue110-pgvectorscale |
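The actual generator lives in the linked repo; as a rough illustration of the dataset shape described above (8 clusters, dimension 128, small per-element jitter so no two rows coincide), here is a hedged pure-Python sketch. All constants and names are illustrative assumptions, not the repo's actual code:

```python
import random

random.seed(42)

DIM = 128                # embedding dimension used in the replication
N_CLUSTERS = 8           # distinct clusters, mirroring the 8 base rows
ROWS_PER_CLUSTER = 4096  # 8 * 4096 = 32768 total rows

# One well-separated center per cluster.
centers = [[random.uniform(-10.0, 10.0) for _ in range(DIM)]
           for _ in range(N_CLUSTERS)]

# Repeat each center, adding tiny jitter so no pairwise distance is
# exactly 0 (avoiding the pathological duplicates of the first dataset).
vectors = []
for center in centers:
    for _ in range(ROWS_PER_CLUSTER):
        vectors.append([x + random.gauss(0.0, 1e-3) for x in center])

assert len(vectors) == 32768
```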
@mgrosso thanks for that thorough replication. This line
actually gives me a good clue as to what is happening, because it returns only 51 rows. I think the first node in the graph has a cluster that's cut off from the rest of the graph. While the algorithm in general has no guarantees of connectedness, I think there should be something I can do to fix this issue (I suspect it's an edge case somewhere). Let me take a look and think about this a bit.
Just a note: I am still working on this |
The 51-rows-returned issue happened to me as well.
@raulcarlomagno is that on the release, or after PR #111?
I can add more information.
@cevian I'll definitely give this a test this weekend just to be sure, but it sounds like you were able to replicate the problem with more or less the same test data and repair it, so I'm sure I'll get the same positive results.
Please feel free to close this when that gets merged, or whenever it makes sense for your current process.
The fix to this has been released in 0.3.0 |
(Note: This bug was discovered while documenting #109 and most of my first 13 comments there relate to this.)
index scan terminates early
summary
With the default `num_neighbors` of 50, an index scan of a diskann index will terminate after 51 rows; a query in which that index scan is a child of a filter that filters out the 51 most relevant rows will not return any rows at all. With `num_neighbors` build settings from 102 through 108, the maximum number of rows returned by the index scan increases in a non-linear and non-monotonic fashion, detailed below, at least with the replication steps I followed. Setting `num_neighbors` to 109 fixes the issue for the replication steps I went through, even when the number of rows in the table was increased to 2^20.
replication steps
tl;dr: we insert 8 distinct rows, then double them until we have 32768, giving us 4k relevant rows to filter before we get to the less relevant rows that match our filter.
test data set
Those initial rows will look like this:
I created them more or less like this:
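A rough Python sketch of the doubling process described in the tl;dr (the base vectors here are hypothetical placeholders, not the actual test rows, which are 3-dimensional in the original test):

```python
# Start from 8 distinct base rows; these values are illustrative only.
rows = [[float(i), float(i), float(i)] for i in range(8)]

# Double the row set until we reach 32768 rows (8 * 2^12) --
# the SQL equivalent would be repeated INSERT ... SELECT statements.
while len(rows) < 32768:
    rows = rows + rows

assert len(rows) == 32768
# Each base row now appears 4096 times, so a query vector landing exactly
# on a base row sees 4096 neighbors at distance 0.
assert rows.count(rows[0]) == 4096
```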
replicate the problem
here we create the index, then run two different queries, each with and without EXPLAIN ANALYZE, to confirm the diskann index is being used. The EXPLAIN ANALYZE output also lets us confirm that when we get 0 rows back instead of 10, or when the count is wrong, it's because of the number of rows returned by the index scan.
create the index.
replicate the problem using a simple outer count(*) query with no filters:
replicate the problem using an outer query that checks filters, intended to duplicate the enterprise use case in which a filter is highly correlated with relevance but the most relevant documents are filtered for permissions or other reasons:
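The failure mode in this second query can be simulated outside Postgres: if the index scan only ever yields its first 51 candidates, and the 4096 nearest rows all fail the filter, the filtered query returns nothing. A minimal sketch under those assumptions (all names hypothetical):

```python
def filtered_search(candidates, predicate, limit, scan_cap):
    """Simulate a filter node consuming an index scan that terminates
    after scan_cap rows (the buggy behavior reported in this issue)."""
    results = []
    for row in candidates[:scan_cap]:
        if predicate(row):
            results.append(row)
            if len(results) == limit:
                break
    return results

# 32768 rows ordered by distance; the 4096 nearest (cluster 0) are all
# filtered out, mimicking a filter highly correlated with relevance.
candidates = [{"id": i, "cluster": i // 4096} for i in range(32768)]
wanted = lambda row: row["cluster"] == 1

# Buggy: the scan terminates after 51 rows, so nothing passes the filter.
assert filtered_search(candidates, wanted, limit=10, scan_cap=51) == []

# Expected: a scan that can continue past the filtered cluster finds 10 rows.
assert len(filtered_search(candidates, wanted, limit=10, scan_cap=32768)) == 10
```

This is only a model of the symptom, not of the diskann graph traversal itself; it shows why a hard 51-row cap on the scan turns "10 matching rows exist" into an empty result set.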
exploring the parameter space
if you repeatedly drop the index and rebuild it using `WITH (num_neighbors = $X)` for different values of X, here is what you should get, using the dataset and steps above: see #109 (comment)
mitigation
setting `num_neighbors` to 109 fixed the issue for my test datasets so far, eg:
`CREATE INDEX test_embedding_diskann_109 ON test_embedding USING diskann (embedding) WITH (num_neighbors=109);`