-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. #10767
Conversation
Also a refactoring of BufferAggregator and VectorAggregator such that they share a common interface, BaseBufferAggregator. This allows implementing both in the same file with an abstract + dual subclass structure.
By the way, I did a similar thing in #10304 to reduce some boilerplate code but used composition instead of inheritance. Added package-private |
Ah, thanks for pointing that out. I haven't had a chance to look at your patch yet. Do you think I should rework this patch to do the same thing you did? I suppose we don't want to have too many ways of doing the same thing in the codebase. |
Right. I was being subtle about it :) It would be nice to use the same style everywhere. If that pattern needs some modification, I can do that as well. |
Which pattern do you think is better? I haven't had a chance to take a look at yours yet, so I'm wondering if I should make yours match mine, or mine match yours. |
I usually avoid abstract classes so there is no tight coupling. It can make unit testing easier as well. I can pass a mock helper into a vector aggregator and unit test in different ways easily. so my preference is toward the way ApproximateHistogram is implemented. |
I've pushed up a new patch that undoes the refactoring and instead implements things using the same technique as #10304. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
regular types. Involved finally moving makeVectorProcessor from DimensionHandlerUtils into ColumnProcessors and harmonizing the two things.
When debugging a test failure, I realized that some additional work is needed to make the aggregator able to handle both complex inputs (existing theta sketches) and regular inputs (strings, primitives). There's some debt that needed to be paid down:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall refactor lgtm, its nice to untangle vector stuff from DimensionHandlerUtils
, i think you've got some style issues failing in CI
/** | ||
* Called when {@link ColumnCapabilities#getType()} is COMPLEX. | ||
*/ | ||
T makeComplexProcessor(ColumnCapabilities capabilities, VectorObjectSelector selector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this isn't actually specific to complex columns, can this be called makeObjectProcessor
? I have the same method signature except with that name added to this class in #10613, there to support using object selectors for string expression virtual columns which are not dictionary encoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, although I'll keep the javadoc the same for now, since as of this patch it's still only called when the type is COMPLEX. It seems that one of us will need to rework implementations a bit based on which patch gets merged first.
Pushed up a new patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 after CI (still some style/dependency/inspection issues)
The patch uses the same "helper" approach as apache#10767 and apache#10304, and extends the tests to run in both vectorized and non-vectorized modes. Also includes some minor changes to the theta sketch vector aggregator: - Cosmetic changes to make the hll and theta implementations look more similar. - Extends the theta SQL tests to run in vectorized mode.
* Vectorized versions of HllSketch aggregators. The patch uses the same "helper" approach as #10767 and #10304, and extends the tests to run in both vectorized and non-vectorized modes. Also includes some minor changes to the theta sketch vector aggregator: - Cosmetic changes to make the hll and theta implementations look more similar. - Extends the theta SQL tests to run in vectorized mode. * Updates post-code-review. * Fix javadoc.
* Update security overview with additional recommendations (apache#11016) * updatee security overview with additional recommendations for improved security * address first set of review questions * Update docs/operations/security-overview.md * Update docs/operations/security-overview.md * apply changes from review * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update security-overview.md fix additional comments & typos cc: @suneet-s, @jihoonsoon Co-authored-by: Suneet Saldanha <suneet@apache.org> * Enable rewriting certain inner joins as filters. (apache#11068) * Enable rewriting certain inner joins as filters. The main logic for doing the rewrite is in JoinableFactoryWrapper's segmentMapFn method. The requirements are: - It must be an inner equi-join. - The right-hand columns referenced by the condition must not contain any duplicate values. (If they did, the inner join would not be guaranteed to return at most one row for each left-hand-side row.) - No columns from the right-hand side can be used by anything other than the join condition itself. HashJoinSegmentStorageAdapter is also modified to pass through to the base adapter (even allowing vectorization!) in the case where 100% of join clauses could be rewritten as filters. In support of this goal: - Add Query getRequiredColumns() method to help us figure out whether the right-hand side of a join datasource is being used or not. - Add JoinConditionAnalysis getRequiredColumns() method to help us figure out if the right-hand side of a join is being used by later join clauses acting on the same base. - Add Joinable getNonNullColumnValuesIfAllUnique method to enable retrieving the set of values that will form the "in" filter. - Add LookupExtractor canGetKeySet() and keySet() methods to support LookupJoinable in its efforts to implement the new Joinable method. - Add "enableRewriteJoinToFilter" feature flag to JoinFilterRewriteConfig. The default is disabled. * Test improvements. * Test fixes. * Avoid slow size() call. * Remove invalid test. * Fix style. * Fix mistaken default. * Small fixes. * Fix logic error. * Doc updates for union datasources. (apache#11103) The main one is updating datasources.md to talk about SQL. (It still said that table unions are not supported in SQL.) Also, this doc update adds some clarifying details on limitations. * [Security] Bump netty4.version from 4.1.48.Final to 4.1.63.Final (apache#11117) * Vectorized versions of HllSketch aggregators. (apache#11115) * Vectorized versions of HllSketch aggregators. The patch uses the same "helper" approach as apache#10767 and apache#10304, and extends the tests to run in both vectorized and non-vectorized modes. Also includes some minor changes to the theta sketch vector aggregator: - Cosmetic changes to make the hll and theta implementations look more similar. - Extends the theta SQL tests to run in vectorized mode. * Updates post-code-review. * Fix javadoc. * Web console: update dev dependencies (apache#11119) * Update some dev dependencies, prettify, tslint-fix * Sort tsconfig keys for easy comparison * Set noImplicitThis * Slightly more accurate types * Bump Jest and related * Bump react to latest on v16 * Bump node-sass, sass-loader for node14 support * Remove node-sass-chokidar (unused) * More unused dependencies * Fix blueprint imports * Webpack 5 * Update webpack config for 'process' usage * Update playwright-chromium * Emit esnext modules for tree shaking * Enable source maps in development * Dedupe * Bump babel and things * npm audit fix * Add .editorconfig file to match prettier settings * Update licenses (tslib is 0BSD as of 1.11.2) microsoft/tslib#96 * Require node >= 10 * Use Node 10 to run e2e tests * Use 'ws' transport mode for dev server (will be default in next version) * Remove an 'any' * No sourcemaps in prod * Exclude .editorconfig from license checks * Try nvm for setting node version Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com> Co-authored-by: Suneet Saldanha <suneet@apache.org> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Sandeep <isandeep41@gmail.com> Co-authored-by: John Gozde <john@gozde.ca>
Update 2: the patch has been expanded to pay down some debt related to how vector processors handle complex columns, in order to ensure that the theta agg behaves properly in all situations. (See comment below.)
Update: the patch has been reworked to use composition instead of inheritance, using the same technique from #10304 mentioned by @abhishekagarwal87.
Also a refactoring of BufferAggregator and VectorAggregator such thatthey share a common interface, BaseBufferAggregator. This allows
implementing both in the same file with an abstract + dual subclass
structure.