Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. #10767

Merged
merged 10 commits into from
Jan 29, 2021

Conversation

gianm
Copy link
Contributor

@gianm gianm commented Jan 15, 2021

Update 2: the patch has been expanded to pay down some debt related to how vector processors handle complex columns, in order to ensure that the theta agg behaves properly in all situations. (See comment below.)

Update: the patch has been reworked to use composition instead of inheritance, using the same technique from #10304 mentioned by @abhishekagarwal87.

Also a refactoring of BufferAggregator and VectorAggregator such that
they share a common interface, BaseBufferAggregator. This allows
implementing both in the same file with an abstract + dual subclass
structure.

Also a refactoring of BufferAggregator and VectorAggregator such that
they share a common interface, BaseBufferAggregator. This allows
implementing both in the same file with an abstract + dual subclass
structure.
@abhishekagarwal87
Copy link
Contributor

By the way, I did a similar thing in #10304 to reduce some boilerplate code but used composition instead of inheritance. Added package-private*BufferAggregatorHelper classes which in turn are used by vector and buffer aggregator. The helper classes know how to work with buffers. The logic for getting an object out of selector and vectorization stuff was left out for Buffer and Vector classes.

@gianm
Copy link
Contributor Author

gianm commented Jan 15, 2021

By the way, I did a similar thing in #10304 to reduce some boilerplate code but used composition instead of inheritance. Added package-private*BufferAggregatorHelper classes which in turn are used by vector and buffer aggregator. The helper classes know how to work with buffers. The logic for getting an object out of selector and vectorization stuff was left out for Buffer and Vector classes.

Ah, thanks for pointing that out. I haven't had a chance to look at your patch yet. Do you think I should rework this patch to do the same thing you did? I suppose we don't want to have too many ways of doing the same thing in the codebase.

@abhishekagarwal87
Copy link
Contributor

Right. I was being subtle about it :) It would be nice to use the same style everywhere. If that pattern needs some modification, I can do that as well.

@gianm
Copy link
Contributor Author

gianm commented Jan 15, 2021

Which pattern do you think is better? I haven't had a chance to take a look at yours yet, so I'm wondering if I should make yours match mine, or mine match yours.

@abhishekagarwal87
Copy link
Contributor

I usually avoid abstract classes so there is no tight coupling. It can make unit testing easier as well. I can pass a mock helper into a vector aggregator and unit test in different ways easily. so my preference is toward the way ApproximateHistogram is implemented.

@gianm
Copy link
Contributor Author

gianm commented Jan 15, 2021

I've pushed up a new patch that undoes the refactoring and instead implements things using the same technique as #10304.

Copy link
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

regular types.

Involved finally moving makeVectorProcessor from DimensionHandlerUtils
into ColumnProcessors and harmonizing the two things.
@gianm
Copy link
Contributor Author

gianm commented Jan 20, 2021

When debugging a test failure, I realized that some additional work is needed to make the aggregator able to handle both complex inputs (existing theta sketches) and regular inputs (strings, primitives). There's some debt that needed to be paid down:

  • VectorColumnProcessorFactory had to be taught about complex columns, instead of handling them all as all-null string columns
  • Existing users of VectorColumnProcessorFactory had to be taught about complex columns too
  • To support the above, makeVectorProcessor needed to be moved from DimensionHandlerUtils to ColumnProcessors and harmonized with the ColumnProcessor stuff

@gianm gianm changed the title Vectorized theta sketch aggregator. Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory. Jan 20, 2021
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall refactor lgtm, its nice to untangle vector stuff from DimensionHandlerUtils, i think you've got some style issues failing in CI

/**
* Called when {@link ColumnCapabilities#getType()} is COMPLEX.
*/
T makeComplexProcessor(ColumnCapabilities capabilities, VectorObjectSelector selector);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this isn't actually specific to complex columns, can this be called makeObjectProcessor? I have the same method signature except with that name added to this class in #10613, there to support using object selectors for string expression virtual columns which are not dictionary encoded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, although I'll keep the javadoc the same for now, since as of this patch it's still only called when the type is COMPLEX. It seems that one of us will need to rework implementations a bit based on which patch gets merged first.

@gianm
Copy link
Contributor Author

gianm commented Jan 20, 2021

Pushed up a new patch.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 after CI (still some style/dependency/inspection issues)

@gianm gianm merged commit 6c0c6e6 into apache:master Jan 29, 2021
@gianm gianm deleted the theta-vectorize branch January 29, 2021 17:30
gianm added a commit to gianm/druid that referenced this pull request Apr 14, 2021
The patch uses the same "helper" approach as apache#10767 and apache#10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

- Cosmetic changes to make the hll and theta implementations look
  more similar.
- Extends the theta SQL tests to run in vectorized mode.
gianm added a commit that referenced this pull request Apr 17, 2021
* Vectorized versions of HllSketch aggregators.

The patch uses the same "helper" approach as #10767 and #10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

- Cosmetic changes to make the hll and theta implementations look
  more similar.
- Extends the theta SQL tests to run in vectorized mode.

* Updates post-code-review.

* Fix javadoc.
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 22, 2021
* Update security overview with additional recommendations (apache#11016)

* updatee security overview with additional recommendations for improved security

* address first set of review questions

* Update docs/operations/security-overview.md

* Update docs/operations/security-overview.md

* apply changes from review

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update security-overview.md

fix additional comments & typos cc: @suneet-s, @jihoonsoon

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Enable rewriting certain inner joins as filters. (apache#11068)

* Enable rewriting certain inner joins as filters.

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.

* Test improvements.

* Test fixes.

* Avoid slow size() call.

* Remove invalid test.

* Fix style.

* Fix mistaken default.

* Small fixes.

* Fix logic error.

* Doc updates for union datasources. (apache#11103)

The main one is updating datasources.md to talk about SQL. (It still said
that table unions are not supported in SQL.) Also, this doc update adds
some clarifying details on limitations.

* [Security] Bump netty4.version from 4.1.48.Final to 4.1.63.Final (apache#11117)

* Vectorized versions of HllSketch aggregators. (apache#11115)

* Vectorized versions of HllSketch aggregators.

The patch uses the same "helper" approach as apache#10767 and apache#10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

- Cosmetic changes to make the hll and theta implementations look
  more similar.
- Extends the theta SQL tests to run in vectorized mode.

* Updates post-code-review.

* Fix javadoc.

* Web console: update dev dependencies (apache#11119)

* Update some dev dependencies, prettify, tslint-fix

* Sort tsconfig keys for easy comparison

* Set noImplicitThis

* Slightly more accurate types

* Bump Jest and related

* Bump react to latest on v16

* Bump node-sass, sass-loader for node14 support

* Remove node-sass-chokidar (unused)

* More unused dependencies

* Fix blueprint imports

* Webpack 5

* Update webpack config for 'process' usage

* Update playwright-chromium

* Emit esnext modules for tree shaking

* Enable source maps in development

* Dedupe

* Bump babel and things

* npm audit fix

* Add .editorconfig file to match prettier settings

* Update licenses (tslib is 0BSD as of 1.11.2)

microsoft/tslib#96

* Require node >= 10

* Use Node 10 to run e2e tests

* Use 'ws' transport mode for dev server (will be default in next version)

* Remove an 'any'

* No sourcemaps in prod

* Exclude .editorconfig from license checks

* Try nvm for setting node version

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Co-authored-by: Sandeep <isandeep41@gmail.com>
Co-authored-by: John Gozde <john@gozde.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants