-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve transform ordering using explicit ordering column #222
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
instead of relying on the row_number window function in each transform
Two tests failing due to apache/datafusion#4873 |
Updated the custom DataFusion branch to include apache/datafusion#4878 |
Merged
jonmmease
added a commit
that referenced
this pull request
Jan 20, 2023
* Add VegaFusionTable::with_ordering method * Use explicit ordering column instead of relying on the row_number window function in each transform * Support impute null * Add impute tests and fix serialization of null value * Add ordering to inline tables * Remove test case workaround to get consistent ordering * Remove order column when inline dataset has no transforms * Update DataFusion branch to include fix (apache/datafusion#4878) * Add area streamgraph spec that used to trigger error * Remove row number from joinaggregate transform
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Background
Unlike SQL, The row order of the result of Vega transforms is deterministic based on the transform specification and the order of the input rows. For example, the rows that result from the
aggregate
transform are ordered according to first appearance of a row in each group.Prior to this PR, we attempted to match Vega's ordering by having individual transforms use a ROW_NUMBER() window function on their input, and the resort their output based on this ROW_NUMBER(). This worked most of the time, but it was a bit of a hack since creating a new column using a ROW_NUMBER window function, and no ordering specification, isn't guaranteed to be sorted with the input rows.
Here is one example where this approach goes wrong: apache/datafusion#4673.
Overview
This PR introduces an explicit ordering column named
_vf_order
. This column is added to every input dataset, and every transforms inputs and outputs the column. The final dataset is sorted by_vf_order
after all of the transforms in the pipeline have been applied.