Initial projection pushdown optimization #113
Merged
comment out playfair mock test which uses a named mark as a dataset.
jonmmease changed the title from "[WIP] Initial projection pushdown optimization" to "Initial projection pushdown optimization" on May 26, 2022
This was referenced May 26, 2022
Overview
This PR introduces a framework for identifying the usage of columns within datasets and uses that to add a "projection pushdown" optimization pass to the planner.
Column Usage
A key construct introduced by this PR is that of the "Column Usage" of a dataset. The column usage of a dataset can either be a known set of columns, or it can be unknown. This is represented in Rust by the ColumnUsage enum. When a dataset is used in multiple contexts (e.g. multiple encoding channels), the usages for each context can be combined with a union operation.
The column usage must be maintained for every dataset in the specification individually.
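As a rough illustration of the union operation described above, here is a minimal sketch. The variant names (Unknown, Known), the known helper, and the rule that anything combined with an unknown usage stays unknown are assumptions made for this example, not a copy of the code in the PR.

```rust
use std::collections::HashSet;

/// Column usage of a dataset: either a known set of column names or unknown.
/// The variant names are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ColumnUsage {
    Unknown,
    Known(HashSet<String>),
}

impl ColumnUsage {
    /// Hypothetical helper: build a known usage from a list of column names.
    pub fn known(cols: &[&str]) -> Self {
        ColumnUsage::Known(cols.iter().map(|c| c.to_string()).collect())
    }

    /// Combine the usages from two contexts.
    /// Assumption: if either side is unknown, the result is unknown.
    pub fn union(&self, other: &ColumnUsage) -> ColumnUsage {
        match (self, other) {
            (ColumnUsage::Known(a), ColumnUsage::Known(b)) => {
                ColumnUsage::Known(a.union(b).cloned().collect())
            }
            _ => ColumnUsage::Unknown,
        }
    }
}
```

Under these assumptions, combining a usage of columns "a" and "b" from one encoding channel with a usage of "b" and "c" from another gives a known usage of "a", "b", and "c", while combining either with an unknown usage gives unknown.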
Projection pushdown
Here is the outline of the projection pushdown optimization: for each dataset with known column usage, add a project transform to the dataset's transform array which will downselect the columns to include only those that are used elsewhere in the specification. No change is made to datasets with unknown column usage.
Support and Limitations
Encoding
This PR includes fairly precise determination of column usage within marks. In particular, it correctly identifies the usage of columns in various forms of encoding channels. For example, it will identify the usage of columns "one", "two", "three", and "four" in the following encoding specification:
Scales
It will also identify the precise use of columns in scale domains that are computed from a dataset field.
Transforms
This PR does not include support for identifying the precise usage of columns within transform pipelines. So if a dataset is used as the "source" of a derived dataset, then its column usage will be unknown and no projection transform will be added.
Most of the infrastructure is in place to add this support in the future.
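To make the interaction between this limitation and the pushdown pass concrete, here is a simplified sketch. The Dataset struct, the string representation of transforms, the pushdown function, and the choice to leave datasets without a usage entry untouched are all hypothetical and only for illustration; the planner in this PR operates on the full specification types.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical, simplified dataset: a name, an optional source dataset,
/// and a transform array represented as plain strings.
#[derive(Debug, Clone)]
pub struct Dataset {
    pub name: String,
    pub source: Option<String>,
    pub transforms: Vec<String>,
}

/// Column usage: a known (sorted) set of columns, or unknown.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ColumnUsage {
    Unknown,
    Known(BTreeSet<String>),
}

/// Add a project transform to every dataset whose column usage is known.
/// `mark_usage` holds the usage gathered from marks and scales.
pub fn pushdown(datasets: &mut [Dataset], mark_usage: &HashMap<String, ColumnUsage>) {
    // A dataset that is the source of another dataset is treated as having
    // unknown usage, because transform pipelines are not analyzed.
    let sources: Vec<String> = datasets.iter().filter_map(|d| d.source.clone()).collect();

    for ds in datasets.iter_mut() {
        if sources.contains(&ds.name) {
            continue; // unknown usage: leave the dataset unchanged
        }
        // Assumption: a dataset with no usage entry is also left unchanged.
        if let Some(ColumnUsage::Known(cols)) = mark_usage.get(&ds.name) {
            let fields: Vec<&str> = cols.iter().map(|c| c.as_str()).collect();
            ds.transforms.push(format!("project({})", fields.join(", ")));
        }
    }
}
```

In this sketch a dataset named "raw" that feeds a derived dataset receives no project transform even if marks reference it directly, while the derived dataset is projected down to the columns its marks and scales use.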
vlSelectionTest
When selections are used, Vega-Lite generates expressions that use the special vlSelectionTest('store', datum) function. Determining the column usage for this expression is complex because the columns used are determined by the contents of a secondary "store" dataset. If the fields contained in the secondary store dataset are known, the logic in this PR will correctly make use of them. But the PR does not contain any logic to determine the contents of secondary store datasets. Currently, the use of vlSelectionTest will result in unknown column usage.
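The behavior described above can be sketched roughly as follows. The selection_test_usage function and its Option-based signature are hypothetical, chosen only to show the two cases; they are not the API introduced by this PR.

```rust
use std::collections::HashSet;

/// Column usage: a known set of columns, or unknown.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ColumnUsage {
    Unknown,
    Known(HashSet<String>),
}

/// Column usage contributed by a `vlSelectionTest('store', datum)` call.
/// `store_fields` is Some only when the fields held by the secondary store
/// dataset are known; nothing in this sketch tries to discover them.
pub fn selection_test_usage(store_fields: Option<&[&str]>) -> ColumnUsage {
    match store_fields {
        Some(fields) => ColumnUsage::Known(fields.iter().map(|f| f.to_string()).collect()),
        None => ColumnUsage::Unknown,
    }
}
```

Since the store contents are not determined today, the None branch is the one that applies in practice, which is why any use of vlSelectionTest currently yields unknown column usage.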