-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable rewriting certain inner joins as filters. #11068
Conversation
The main logic for doing the rewrite is in JoinableFactoryWrapper's segmentMapFn method. The requirements are: - It must be an inner equi-join. - The right-hand columns referenced by the condition must not contain any duplicate values. (If they did, the inner join would not be guaranteed to return at most one row for each left-hand-side row.) - No columns from the right-hand side can be used by anything other than the join condition itself. HashJoinSegmentStorageAdapter is also modified to pass through to the base adapter (even allowing vectorization!) in the case where 100% of join clauses could be rewritten as filters. In support of this goal: - Add Query getRequiredColumns() method to help us figure out whether the right-hand side of a join datasource is being used or not. - Add JoinConditionAnalysis getRequiredColumns() method to help us figure out if the right-hand side of a join is being used by later join clauses acting on the same base. - Add Joinable getNonNullColumnValuesIfAllUnique method to enable retrieving the set of values that will form the "in" filter. - Add LookupExtractor canGetKeySet() and keySet() methods to support LookupJoinable in its efforts to implement the new Joinable method. - Add "enableRewriteJoinToFilter" feature flag to JoinFilterRewriteConfig. The default is disabled.
return parseBoolean( | ||
query, | ||
REWRITE_JOIN_TO_FILTER_ENABLE_KEY, | ||
DEFAULT_ENABLE_JOIN_FILTER_REWRITE_VALUE_COLUMN_FILTERS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEFAULT_ENABLE_JOIN_FILTER_REWRITE_VALUE_COLUMN_FILTERS | |
DEFAULT_ENABLE_REWRITE_JOIN_TO_FILTER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😱
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Out of curiosity, given that we exclude null values, how are null values in the left table matched?
/** | ||
* Returns a Set of all keys in this lookup extractor. The returned Set will not change. | ||
* | ||
* @throws UnsupportedOperationException if {@link #canIterate()} returns false. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @throws UnsupportedOperationException if {@link #canIterate()} returns false. | |
* @throws UnsupportedOperationException if {@link #canGetKeySet()} returns false. |
for (int i = 0; i < table.numRows(); i++) { | ||
final String s = DimensionHandlerUtils.convertObjectToString(reader.read(i)); | ||
|
||
if (s != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need not exclude empty strings here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think we should. I replaced it with NullHandling.isNullOrEquivalent(s)
.
final Pair<List<Filter>, List<JoinableClause>> conversionResult = convertJoinsToFilters( | ||
joinableClauses.getJoinableClauses(), | ||
requiredColumns, | ||
Ints.checkedCast(Math.min(filterRewriteConfig.getFilterRewriteMaxSize(), Integer.MAX_VALUE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not here but should we cap the max size so that it can't result in an OOM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we should rely on the user setting this parameter "correctly" sort of like the subquery limit. I also think most people won't change it from the default, which is 10,000 and should be pretty safe. Unless the values are gigantic it's only going to be a few MB per query.
I thought a bit about measuring these limits in terms of bytes instead of rows, which has pros/cons:
- Pro of bytes: less likely to be misconfigured & cause OOME, more likely to use memory efficiently & maximally
- Con of bytes: harder for users to understand the limit. "10,000 rows" is easy to communicate & understand; "5MB" is harder because people won't be able to easily figure out if a particular data set fits in 5MB or not.
I was thinking that because it's an inner join, null values from the left table are supposed to be dropped anyway. (That's why I didn't allow this optimization to trigger for left joins.) |
Got it. Thanks for clarifying. |
* @param additionalColumns additional columns to include. Each of these will be added to the returned set, unless it | ||
* refers to a virtual column, in which case the virtual column inputs will be added instead. | ||
*/ | ||
public static Set<String> computeRequiredColumns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@abhishekagarwal87 any other thoughts? |
Normally, InDimFilters that come from JSON have HashSets for "values". However, programmatically-generated filters (like the ones from apache#11068) may use other set types. Some set types, like TreeSets with natural ordering, will throw NPE on "contains(null)", which causes the InDimFilter's ValueMatcher to throw NPE if it encounters a null value. This patch adds code to detect if the values set can support contains(null), and if not, wrap that in a null-checking lambda. Also included: - Remove unneeded NullHandling.needsEmptyToNull method. - Update IndexedTableJoinable to generate a TreeSet that does not require lambda-wrapping. (This particular TreeSet is how I noticed the bug in the first place.)
* InDimFilter: Fix NPE involving certain Set types. Normally, InDimFilters that come from JSON have HashSets for "values". However, programmatically-generated filters (like the ones from #11068) may use other set types. Some set types, like TreeSets with natural ordering, will throw NPE on "contains(null)", which causes the InDimFilter's ValueMatcher to throw NPE if it encounters a null value. This patch adds code to detect if the values set can support contains(null), and if not, wrap that in a null-checking lambda. Also included: - Remove unneeded NullHandling.needsEmptyToNull method. - Update IndexedTableJoinable to generate a TreeSet that does not require lambda-wrapping. (This particular TreeSet is how I noticed the bug in the first place.) * Test fixes. * Improve test coverage
* Update security overview with additional recommendations (apache#11016) * updatee security overview with additional recommendations for improved security * address first set of review questions * Update docs/operations/security-overview.md * Update docs/operations/security-overview.md * apply changes from review * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update security-overview.md fix additional comments & typos cc: @suneet-s, @jihoonsoon Co-authored-by: Suneet Saldanha <suneet@apache.org> * Enable rewriting certain inner joins as filters. (apache#11068) * Enable rewriting certain inner joins as filters. The main logic for doing the rewrite is in JoinableFactoryWrapper's segmentMapFn method. The requirements are: - It must be an inner equi-join. - The right-hand columns referenced by the condition must not contain any duplicate values. (If they did, the inner join would not be guaranteed to return at most one row for each left-hand-side row.) - No columns from the right-hand side can be used by anything other than the join condition itself. HashJoinSegmentStorageAdapter is also modified to pass through to the base adapter (even allowing vectorization!) in the case where 100% of join clauses could be rewritten as filters. In support of this goal: - Add Query getRequiredColumns() method to help us figure out whether the right-hand side of a join datasource is being used or not. - Add JoinConditionAnalysis getRequiredColumns() method to help us figure out if the right-hand side of a join is being used by later join clauses acting on the same base. - Add Joinable getNonNullColumnValuesIfAllUnique method to enable retrieving the set of values that will form the "in" filter. - Add LookupExtractor canGetKeySet() and keySet() methods to support LookupJoinable in its efforts to implement the new Joinable method. - Add "enableRewriteJoinToFilter" feature flag to JoinFilterRewriteConfig. The default is disabled. * Test improvements. * Test fixes. * Avoid slow size() call. * Remove invalid test. * Fix style. * Fix mistaken default. * Small fixes. * Fix logic error. Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com> Co-authored-by: Suneet Saldanha <suneet@apache.org> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
* Update security overview with additional recommendations (apache#11016) * updatee security overview with additional recommendations for improved security * address first set of review questions * Update docs/operations/security-overview.md * Update docs/operations/security-overview.md * apply changes from review * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update security-overview.md fix additional comments & typos cc: @suneet-s, @jihoonsoon Co-authored-by: Suneet Saldanha <suneet@apache.org> * Enable rewriting certain inner joins as filters. (apache#11068) * Enable rewriting certain inner joins as filters. The main logic for doing the rewrite is in JoinableFactoryWrapper's segmentMapFn method. The requirements are: - It must be an inner equi-join. - The right-hand columns referenced by the condition must not contain any duplicate values. (If they did, the inner join would not be guaranteed to return at most one row for each left-hand-side row.) - No columns from the right-hand side can be used by anything other than the join condition itself. HashJoinSegmentStorageAdapter is also modified to pass through to the base adapter (even allowing vectorization!) in the case where 100% of join clauses could be rewritten as filters. In support of this goal: - Add Query getRequiredColumns() method to help us figure out whether the right-hand side of a join datasource is being used or not. - Add JoinConditionAnalysis getRequiredColumns() method to help us figure out if the right-hand side of a join is being used by later join clauses acting on the same base. - Add Joinable getNonNullColumnValuesIfAllUnique method to enable retrieving the set of values that will form the "in" filter. - Add LookupExtractor canGetKeySet() and keySet() methods to support LookupJoinable in its efforts to implement the new Joinable method. - Add "enableRewriteJoinToFilter" feature flag to JoinFilterRewriteConfig. The default is disabled. * Test improvements. * Test fixes. * Avoid slow size() call. * Remove invalid test. * Fix style. * Fix mistaken default. * Small fixes. * Fix logic error. * Doc updates for union datasources. (apache#11103) The main one is updating datasources.md to talk about SQL. (It still said that table unions are not supported in SQL.) Also, this doc update adds some clarifying details on limitations. * [Security] Bump netty4.version from 4.1.48.Final to 4.1.63.Final (apache#11117) * Vectorized versions of HllSketch aggregators. (apache#11115) * Vectorized versions of HllSketch aggregators. The patch uses the same "helper" approach as apache#10767 and apache#10304, and extends the tests to run in both vectorized and non-vectorized modes. Also includes some minor changes to the theta sketch vector aggregator: - Cosmetic changes to make the hll and theta implementations look more similar. - Extends the theta SQL tests to run in vectorized mode. * Updates post-code-review. * Fix javadoc. * Web console: update dev dependencies (apache#11119) * Update some dev dependencies, prettify, tslint-fix * Sort tsconfig keys for easy comparison * Set noImplicitThis * Slightly more accurate types * Bump Jest and related * Bump react to latest on v16 * Bump node-sass, sass-loader for node14 support * Remove node-sass-chokidar (unused) * More unused dependencies * Fix blueprint imports * Webpack 5 * Update webpack config for 'process' usage * Update playwright-chromium * Emit esnext modules for tree shaking * Enable source maps in development * Dedupe * Bump babel and things * npm audit fix * Add .editorconfig file to match prettier settings * Update licenses (tslib is 0BSD as of 1.11.2) microsoft/tslib#96 * Require node >= 10 * Use Node 10 to run e2e tests * Use 'ws' transport mode for dev server (will be default in next version) * Remove an 'any' * No sourcemaps in prod * Exclude .editorconfig from license checks * Try nvm for setting node version Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com> Co-authored-by: Suneet Saldanha <suneet@apache.org> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Sandeep <isandeep41@gmail.com> Co-authored-by: John Gozde <john@gozde.ca>
The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:
duplicate values. (If they did, the inner join would not be guaranteed
to return at most one row for each left-hand-side row.)
the join condition itself.
HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.
In support of this goal:
the right-hand side of a join datasource is being used or not.
figure out if the right-hand side of a join is being used by later
join clauses acting on the same base.
retrieving the set of values that will form the "in" filter.
LookupJoinable in its efforts to implement the new Joinable method.
JoinFilterRewriteConfig. The default is disabled.
Testing strategy:
provide query contexts that enable this rewrite.
this rewrite. (And some existing tests did, too.)