Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add approximate filters #607

Merged

Conversation

finnroblin
Copy link
Contributor

Description

This change adds support for ingesting non-vector fields from an HDF5 dataset into OpenSearch k-NN. It also adds support for writing post-filtering queries. These changes aid the transition from k-NN's perf tool to using OSB for release benchmarking.

Please see OSB Workloads PR 364 for usage examples.

Testing

  • New functionality includes testing

Unit tests were added that cover the new param and runner methods. Additionally all of the functionality has been tested via the parameter files in OSB Workloads PR 364.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -22,7 +22,7 @@ def parse_string_parameter(key: str, params: dict, default: str = None) -> str:

def parse_int_parameter(key: str, params: dict, default: int = None) -> int:
if key not in params:
if default:
if default is not None:
Copy link
Contributor Author

@finnroblin finnroblin Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is technically not related to this PR, but the previous if statement throws a ConfigurationError if default == 0. All unit tests pass in both configurations.

@@ -1254,6 +1254,17 @@ def _get_field_value(content, field_name):
return _get_field_value(content["_source"], field_name)
return None

def binary_search_for_last_negative_1(neighbors):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use a Python utility here but I couldn't find one in the standard library. Any suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not aware of anything. Also, given that the list may not be sorted, it might be harder to do. That being said, I think this is good.

@@ -1270,7 +1281,20 @@ def calculate_topk_search_recall(predictions, neighbors, top_k):
self.logger.info("No neighbors are provided for recall calculation")
return 0.0
min_num_of_results = min(top_k, len(neighbors))
last_neighbor_is_negative_1 = int(neighbors[min_num_of_results-1]) == -1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is necessary when the dataset is first generated and then neighbors are filtered out. This occurs in some of the perf tools datasets like here. These datasets first generate true neighbors for the test vector, such as [5, 2, 3], and then post-filter the neighbors with attributes that do not pass the filter. The documents without the attributes are replaced by -1. However looking more closely at the dataset generation code I'm not sure that all of the -1 entries are at the end of the array. @martin-gaievski I see you wrote the filtering code, could you please clarify? Thanks

Copy link
Member

@martin-gaievski martin-gaievski Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they should be at the end of collection of nearest neighbors for each query, although I'm not clearly remember sorting them. Can you share example where -1 are in the row, but they are mixed with some other ids?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any in testing, but wanted to ask since I couldn't find a sort. I took another look at the code for filtering neighbors and it looks like the initial array with all -1 entries is overwritten from left to right, so any remaining -1 ids are at the end. I don't think we'll use these filter datasets anyways (since Heemin's nested fields script generates k true neighbors after filtering instead of generating k true neighbors and then post-filtering to have <k neighbors), but I'll call out the assumption that all -1 are at the end in a comment.

"script_score": {
"query": {"bool": {"filter": filter_body}},
"script": {
"source": "knn_score",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use knn script score here?

Copy link
Contributor Author

@finnroblin finnroblin Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it in OSB since it's mentioned in the filters documentation but I haven't written a workload yet since Vijay said it wasn't a priority. Is Knn script score something that we would like to benchmark?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke with @VijayanB and it seems like @navneet1v is adding a feature in 2.17 where exact search can be performed via a knn query (as opposed to the script score query). This change should be much easier to benchmark than exact search via a script_score query since it would just require a new workload without any OSB runner changes. I'll remove this script score code in the next revision.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@finnroblin but we will still need script score based benchmarks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmazanec15, @finnroblin do we not want to add script score based benchmarks?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably shoud have this as well. But I was thinking that exact search should use new method for implementing it

truth_set = neighbors[:min_num_of_results]
if last_neighbor_is_negative_1:
self.logger.info("Last neighbor is -1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this logging, if you want please make it debug

@VijayanB
Copy link
Member

@finnroblin Can you rebase this PR?

Signed-off-by: Finn Roblin <finnrobl@amazon.com>
@VijayanB VijayanB merged commit d039711 into opensearch-project:main Aug 30, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants