Add Support for Nested Function Use In WHERE Clause #267

Conversation

forestmvey
@forestmvey forestmvey commented May 17, 2023

Description

Syntax: nested( [field] | [field,path] )

Adds support for using the nested function in the WHERE clause as a predicate expression. The nested function must appear on the left of the operator, with a literal on the right. When the nested function is used only in the WHERE clause, the inner hits of the query are not added to the nested query DSL. To include inner hits, also call the nested function in the SELECT clause. See documentation for this nested implementation HERE (WIP).

Example Queries

SELECT message.info FROM nested_objects WHERE nested(message.info, message) = 'a';
SELECT nested(message.info) FROM nested_objects WHERE nested(message.info, message) = 'a';
SELECT message.info FROM nested_objects WHERE nested(message.info) = 'a' OR nested(comment.data) = 'b' AND nested(message.dayOfWeek) = 4;

Issues Resolved

Issue: 1111

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@forestmvey forestmvey requested a review from a team May 17, 2023 22:36
@forestmvey forestmvey force-pushed the dev-nested-where-clause-predicate-expression branch from 0beef98 to 14b8bba Compare May 17, 2023 22:36
@forestmvey forestmvey requested a review from acarbonetto as a code owner May 17, 2023 22:36
@codecov
codecov bot commented May 17, 2023

Codecov Report

Merging #267 (6881b3a) into integ-nested-where-clause-predicate-expression (8e5d766) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@                                 Coverage Diff                                  @@
##             integ-nested-where-clause-predicate-expression     #267      +/-   ##
====================================================================================
+ Coverage                                             97.16%   97.18%   +0.01%     
- Complexity                                             4120     4150      +30     
====================================================================================
  Files                                                   371      372       +1     
  Lines                                                 10373    10429      +56     
  Branches                                                704      716      +12     
====================================================================================
+ Hits                                                  10079    10135      +56     
  Misses                                                  287      287              
  Partials                                                  7        7              
Flag Coverage Δ
sql-engine 97.18% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ain/java/org/opensearch/sql/analysis/Analyzer.java 100.00% <ø> (ø)
...l/opensearch/request/OpenSearchRequestBuilder.java 100.00% <100.00%> (ø)
...arch/storage/script/filter/FilterQueryBuilder.java 100.00% <100.00%> (ø)
...search/storage/script/filter/lucene/LikeQuery.java 100.00% <100.00%> (ø)
...arch/storage/script/filter/lucene/LuceneQuery.java 100.00% <100.00%> (ø)
...arch/storage/script/filter/lucene/NestedQuery.java 100.00% <100.00%> (ø)
...java/org/opensearch/sql/sql/parser/AstBuilder.java 100.00% <100.00%> (ø)

Comment on lines +360 to +365
private Supplier<NestedQueryBuilder> createEmptyNestedQuery(String path) {
return () -> {
NestedQueryBuilder nestedQuery = nestedQuery(path, matchAllQuery(), ScoreMode.None);
((BoolQueryBuilder) query().filter().get(0)).must(nestedQuery);
return nestedQuery;

What does this lambda function do? How is it different from a version without the lambda around it, such as:

NestedQueryBuilder nestedQuery = nestedQuery(path, matchAllQuery(), ScoreMode.None);
      ((BoolQueryBuilder) query().filter().get(0)).must(nestedQuery);
      return nestedQuery;

Author

Good question. Supplier is the functional interface that the orElseGet() call in findNestedQueryWithSamePath() accepts. We don't want to use orElse() without the Supplier, because then the createEmptyNestedQuery() code would always execute. Instead we use orElseGet(), which takes a Supplier and skips that code when findAny() returns a non-empty Optional.

You can read more about this here:
https://www.baeldung.com/java-optional-or-else-vs-or-else-get
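
A minimal, self-contained sketch of the difference described above. The class and method names (LazyFallbackDemo, createFallback) are made up for illustration and are not part of this PR; createFallback stands in for any fallback with a side effect, such as createEmptyNestedQuery().

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyFallbackDemo {
  // Counts how many times the fallback body actually runs.
  static final AtomicInteger CREATED = new AtomicInteger();

  // Hypothetical stand-in for a fallback that has a visible side effect.
  static String createFallback(String path) {
    CREATED.incrementAndGet();
    return "new:" + path;
  }

  // Eager: the argument to orElse() is evaluated before orElse() is called,
  // so the fallback runs even when the Optional already holds a value.
  static String eager(Optional<String> existing, String path) {
    return existing.orElse(createFallback(path));
  }

  // Lazy: the Supplier is invoked only when the Optional is empty.
  static String lazy(Optional<String> existing, String path) {
    Supplier<String> fallback = () -> createFallback(path);
    return existing.orElseGet(fallback);
  }

  public static void main(String[] args) {
    Optional<String> found = Optional.of("existing:message");
    System.out.println(eager(found, "message") + " created=" + CREATED.get());
    System.out.println(lazy(found, "message") + " created=" + CREATED.get());
    System.out.println(lazy(Optional.empty(), "message") + " created=" + CREATED.get());
  }
}
```

With a non-empty Optional, only the eager variant increments the counter. In the PR's case the side effect is attaching a new nested query to the filter, which is why the lazy form matters.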

@MaxKsyunz MaxKsyunz left a comment

Have you considered implementing nested similarly to relevance search functions?

In particular it's unexpected that generic classes like LuceneQuery and OpenSearchRequestBuilder needed changes.

I'd like to understand this before approving this PR.

@GumpacG GumpacG left a comment

If nested is supported in functions in the WHERE clause, could you also add integration tests (IT) for these cases? For example, SELECT * FROM nested WHERE nested(message.info) LIKE 'a'.

Comment on lines 113 to 114
} else if (query != null && query.isNestedFunction(func)) {
return handleNestedPredicateExpression(func, query);

@dai-chen dai-chen May 24, 2023

I'm also trying to figure out the best way. As I understand it, what we want to do is:

if isNestedFunc(left) {
  // convert left from nested(A) to A and keep others the same?
  return new NestedQuery(
    visitFunction(
      new Function(func.name(), left.argument(0), right)));
}

Is it a little better to move all logic to a NestedQuery class?
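
A rough, self-contained sketch of the unwrapping idea in the pseudocode above. Everything here is hypothetical: Ref, Literal, Func, buildPlain and the class name are simplified stand-ins, not the plugin's Expression or LuceneQuery classes, and mapping "=" to a term query is an assumption made only to keep the example small. The nested/path/query layout follows the general shape of the OpenSearch nested query DSL.

```java
import java.util.List;
import java.util.Map;

public class NestedPredicateSketch {
  // Hypothetical, simplified expression types (requires Java 16+ for records).
  interface Expr {}
  record Ref(String field) implements Expr {}
  record Literal(Object value) implements Expr {}
  record Func(String name, List<Expr> args) implements Expr {}

  static boolean isNested(Expr e) {
    return e instanceof Func && ((Func) e).name().equalsIgnoreCase("nested");
  }

  // Path is the explicit second argument if present,
  // otherwise the field name without its last segment.
  static String pathOf(Func nested) {
    if (nested.args().size() > 1) {
      return ((Ref) nested.args().get(1)).field();
    }
    String field = ((Ref) nested.args().get(0)).field();
    int dot = field.lastIndexOf('.');
    return dot < 0 ? field : field.substring(0, dot);
  }

  // Stand-in for the ordinary predicate builder; only "=" is handled (assumption).
  static Map<String, Object> buildPlain(Func predicate) {
    if (!predicate.name().equals("=")) {
      throw new UnsupportedOperationException(predicate.name());
    }
    Ref field = (Ref) predicate.args().get(0);
    Literal value = (Literal) predicate.args().get(1);
    Map<String, Object> term = Map.of(field.field(), value.value());
    return Map.of("term", term);
  }

  // If the left operand is nested(A[, path]), rewrite the predicate to use A
  // directly and wrap the resulting query in a nested query on the path.
  static Map<String, Object> build(Func predicate) {
    Expr left = predicate.args().get(0);
    if (!isNested(left)) {
      return buildPlain(predicate);
    }
    Func nested = (Func) left;
    Func unwrapped =
        new Func(predicate.name(), List.of(nested.args().get(0), predicate.args().get(1)));
    Map<String, Object> body = Map.of("path", pathOf(nested), "query", buildPlain(unwrapped));
    return Map.of("nested", body);
  }
}
```

Keeping the unwrap-and-wrap step in one place is the point of the suggestion: the generic builder stays unaware of nested, and only the wrapper knows about paths.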

Author

This is close to a solution. We could swap around the arguments in ExpressionAnalyzer:visitFunction to turn this function:
Function(=) : { nested(...), literal(...) }
into this:
Function(nested) : { Function(=) : { nested(...), literal(...) } }

If we do something like this, we can move most of the logic to a class that extends LuceneQuery and overrides a build() function. It's a tradeoff: ExpressionAnalyzer would need logic that handles only this case, and the overriding class with the nested-specific logic would need to use the luceneQueries map to build the query function. The code in pushDownFilter isn't very extensible when it comes to changing the left parameter of a predicate expression to something other than a ReferenceExpression.

Both implementations will have their drawbacks but without a refactor I can't see a clean implementation for this syntax. At least this isn't a large change that would require much work to refactor.

Author

I think I prefer the current solution in this PR, as it may be easier to refactor later if needed. Swapping the expression tree feels clunkier to me, and its only benefit is being able to move the logic to a NestedQuery class. Do you have an opinion @dai-chen, or any other comments?

Sorry for the confusion. I'm not suggesting we do this in ExpressionAnalyzer. The pseudocode above is just for this class at this line. I'm wondering whether we can move handleNestedPredicateExpression, getNestedPathString, and buildNested to a NestedQuery class that extends LuceneQuery.

Author

Thanks for the clarification. I found a way to move all the logic to a NestedQuery class that extends LuceneQuery. Thanks for your input.

Comment on lines 64 to 69
public boolean isNestedFunction(FunctionExpression func) {
return ((func.getArguments().get(0) instanceof FunctionExpression
&& ((FunctionExpression)func.getArguments().get(0))
.getFunctionName().getFunctionName().equalsIgnoreCase(BuiltinFunctionName.NESTED.name())));
}

This looks more like a utility method; it doesn't seem like it should be a new API?

@forestmvey forestmvey force-pushed the dev-nested-where-clause-predicate-expression branch from e2b7896 to 6d0c249 Compare May 25, 2023 15:18
Signed-off-by: forestmvey <forestv@bitquilltech.com>
@forestmvey forestmvey force-pushed the dev-nested-where-clause-predicate-expression branch from 6d0c249 to 4d933c5 Compare May 25, 2023 15:20
*/
private String getNestedPathString(ReferenceExpression field) {
String ret = "";
for (int i = 0; i < field.getPaths().size() - 1; i++) {

nit: could do something like

List<String> tmpPaths = field.getPaths();
tmpPaths.remove(tmpPaths.size()-1);
return String.join(".", tmpPaths);

nit: This needs to be wrapped in an if statement to make sure there's at least one item in the list (for Guian's suggestion).
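
One way to combine the two nits above, as a sketch only. It takes a plain List<String> rather than a ReferenceExpression, and the class and method names are made up for illustration. It uses subList instead of remove(), so it neither mutates the list it is given nor fails if that list is unmodifiable; whether getPaths() returns a mutable list is not stated in this thread. Returning an empty string when there is no parent segment is an assumption that mirrors the original loop starting from "".

```java
import java.util.List;

public class NestedPathSketch {
  // Joins every path segment except the last one with dots.
  // Returns "" when the list has fewer than two segments (assumed behavior).
  static String parentPath(List<String> paths) {
    if (paths.size() < 2) {
      return "";
    }
    // subList is a read-only use of a view here, so the caller's list is untouched.
    return String.join(".", paths.subList(0, paths.size() - 1));
  }

  public static void main(String[] args) {
    System.out.println(parentPath(List.of("message", "info")));
    System.out.println(parentPath(List.of("message", "author", "name")));
  }
}
```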

Signed-off-by: forestmvey <forestv@bitquilltech.com>
);
}

for (var arg : nestedFunc.getArguments()) {

Alternatively, use .stream().filter(check for instance).findFirst().ifPresent(throw)
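
A small sketch of that stream-based check. The rule used here (every argument must be a String) and the names are placeholders, not the validation this PR performs. Note that Optional.isPresent() takes no arguments; ifPresent() is the method that accepts an action.

```java
import java.util.List;

public class ArgumentCheckSketch {
  // Hypothetical rule: reject the first argument that is not a String.
  static void requireAllStrings(List<Object> args) {
    args.stream()
        .filter(arg -> !(arg instanceof String))
        .findFirst()
        .ifPresent(arg -> {
          throw new IllegalArgumentException("Unsupported argument: " + arg);
        });
  }

  public static void main(String[] args) {
    requireAllStrings(List.of("message.info", "message")); // passes silently
    try {
      requireAllStrings(List.of("message.info", 4));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Compared with the for loop, this reads as a single expression, at the cost of throwing from inside a lambda.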

@forestmvey forestmvey merged commit adaaf49 into integ-nested-where-clause-predicate-expression May 25, 2023
@forestmvey forestmvey deleted the dev-nested-where-clause-predicate-expression branch May 25, 2023 18:12
andy-k-improving pushed a commit that referenced this pull request Nov 16, 2024
Signed-off-by: Heemin Kim <heemin@amazon.com>
andy-k-improving pushed a commit that referenced this pull request Nov 16, 2024
* Implement creation of ip2geo feature (#257)

* Update gradle version to 7.6 (#265)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Implement creation of ip2geo feature

* Implementation of ip2geo datasource creation
* Implementation of ip2geo processor creation

Signed-off-by: Heemin Kim <heemin@amazon.com>
---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Co-authored-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Added unit tests with some refactoring of codes (#271)

* Add Unit tests
* Set cache true for search query
* Remove in memory cache implementation (Two way door decision)
 * Relying on search cache without custom cache
* Renamed datasource state from FAILED to CREATE_FAILED
* Renamed class name from *Helper to *Facade
* Changed updateIntervalInDays to updateInterval
* Changed value type of default update_interval from TimeValue to Long
* Read setting value from cluster settings directly

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Sync from main (#280)

* Update gradle version to 7.6 (#265)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Exclude lombok generated code from jacoco coverage report (#268)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Make jacoco report to be generated faster in local (#267)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update dependency org.json:json to v20230227 (#273)

Co-authored-by: mend-for-github-com[bot] <50673670+mend-for-github-com[bot]@users.noreply.github.com>

* Baseline owners and maintainers (#275)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Co-authored-by: Vijayan Balasubramanian <balasvij@amazon.com>
Co-authored-by: mend-for-github-com[bot] <50673670+mend-for-github-com[bot]@users.noreply.github.com>

* Add datasource name validation (#281)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Refactoring of code (#282)

1. Change variable name from datasourceName to name
2. Change variable name from id to name
3. Added helper methods in test code

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Change field name from md5 to sha256 (#285)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Implement get datasource api (#279)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update index option (#284)

1. Make geodata index as hidden
2. Make geodata index as read only allow delete after creation is done
3. Refresh datasource index immediately after update

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Make some fields in manifest file as mandatory (#289)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Create datasource index explicitly (#283)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add wrapper class of job scheduler lock service (#290)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Remove all unused client attributes (#293)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update copyright header (#298)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Run system index handling code with stashed thread context (#297)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Reduce lock duration and renew the lock during update (#299)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Implements delete datasource API (#291)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Set User-Agent in http request (#300)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Implement datasource update API (#292)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Refactoring test code (#302)

Make buildGeoJSONFeatureProcessorConfig method to be more general

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add ip2geo processor integ test for failure case (#303)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Bug fix and refactoring of code (#305)

1. Bugfix: Ingest metadata can be null if there is no processor created
2. Refactoring: Moved private method to another class for better testing support
3. Refactoring: Set some private static final variable as public so that unit test can use it
4. Refactoring: Changed string value to static variable

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add integration test for Ip2GeoProcessor (#306)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add ConcurrentModificationException (#308)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add integration test for UpdateDatasource API (#307)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Bug fix on lock management and few performance improvements (#310)

* Release lock before response back to caller for update/delete API
* Release lock in background task for creation API
* Change index settings to improve indexing performance

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Change index setting from read_only_allow_delete to write (#311)

read_only_allow_delete does not block write to an index.
The disk-based shard allocator may add and remove this block automatically.
Therefore, use index.blocks.write instead.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Fix bug in get datasource API and improve memory usage (#313)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Change package for Strings.hasText (#314) (#317)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Remove jitter and move index setting from DatasourceFacade to DatasourceExtension (#319)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Do not index blank value and do not enrich null property (#320)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Move index setting keys to constants (#321)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Return null index name for expired data (#322)

Return null index name for expired data so that it can be deleted
by the clean up process. The clean up process excludes the current index from deletion.
Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add new fields in datasource (#325)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Delete index once it is expired (#326)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add restoring event listener (#328)

In the listener, we trigger a geoip data update

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Reverse forcemerge and refresh order (#331)

Otherwise, opensearch does not clear old segment files

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Removed parameter and settings (#332)

* Removed first_only parameter
* Removed max_concurrency and batch_size setting

The first_only parameter was added because the current geoip processor has it.
However, the parameter has no benefit for the ip2geo processor, as we don't do a sequential search for array data but use multi search.

The max_concurrency and batch_size settings are removed as they only reveal internal implementation and could become a blocker to improving performance later.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add a field in datasource for current index name (#333)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Delete GeoIP data indices after restoring complete (#334)

We don't want to use restored GeoIP data indices. Therefore we
delete the indices once restoring process complete.

When GeoIP metadata index is restored, we create a new GeoIP data index instead.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Use bool query for array form of IPs (#335)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Run update/delete request in a new thread (#337)

This is not to block transport thread

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Remove IP2Geo processor validation (#336)

Cannot query index to get data to validate IP2Geo processor.
Will add validation when we decide to store some of data in cluster state metadata.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Acquire lock synchronously (#339)

By acquiring the lock asynchronously, the remaining part of the code
is run by the transport thread, which does not allow blocking code.
We want only a single update to happen in a node, using a single thread. However,
that cannot be achieved if I acquire the lock asynchronously and pass the listener.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Added a cache to store datasource metadata (#338)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Changed class name and package (#341)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Refactoring of code (#342)

1. Changed class name from Ip2GeoCache to Ip2GeoCachedDao
2. Moved the Ip2GeoCachedDao from cache to dao package

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add geo data cache (#340)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add cache layer to reduce GeoIp data retrieval latency (#343)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Use _primary in query preference and few changes (#347)

1. Use _primary preference to get datasource metadata so that it can read the latest data. RefreshPolicy.IMMEDIATE won't refresh replica shards immediately according to #346
2. Update datasource metadata index mapping
3. Move batch size from static value to setting

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Wait until GeoIP data to be replicated to all data nodes (#348)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update packages according to a change in OpenSearch core (opensearch-project#354)

* Update packages according to a change in OpenSearch core

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update packages according to a change in OpenSearch core (opensearch-project#353)

Signed-off-by: Heemin Kim <heemin@amazon.com>

---------

Signed-off-by: Heemin Kim <heemin@amazon.com>

---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Co-authored-by: Vijayan Balasubramanian <balasvij@amazon.com>
Co-authored-by: mend-for-github-com[bot] <50673670+mend-for-github-com[bot]@users.noreply.github.com>