Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve transform ordering using explicit ordering column #222

Merged
merged 16 commits into from
Jan 12, 2023

Conversation

jonmmease
Copy link
Collaborator

@jonmmease jonmmease commented Jan 10, 2023

Background

Unlike SQL, The row order of the result of Vega transforms is deterministic based on the transform specification and the order of the input rows. For example, the rows that result from the aggregate transform are ordered according to first appearance of a row in each group.

Prior to this PR, we attempted to match Vega's ordering by having individual transforms use a ROW_NUMBER() window function on their input, and the resort their output based on this ROW_NUMBER(). This worked most of the time, but it was a bit of a hack since creating a new column using a ROW_NUMBER window function, and no ordering specification, isn't guaranteed to be sorted with the input rows.

Here is one example where this approach goes wrong: apache/datafusion#4673.

Overview

This PR introduces an explicit ordering column named _vf_order. This column is added to every input dataset, and every transforms inputs and outputs the column. The final dataset is sorted by _vf_order after all of the transforms in the pipeline have been applied.

@jonmmease
Copy link
Collaborator Author

Two tests failing due to apache/datafusion#4873

@jonmmease
Copy link
Collaborator Author

jonmmease commented Jan 11, 2023

Updated the custom DataFusion branch to include apache/datafusion#4878

@jonmmease jonmmease merged commit d6b88e1 into develop_v1.0 Jan 12, 2023
@jonmmease jonmmease mentioned this pull request Jan 12, 2023
jonmmease added a commit that referenced this pull request Jan 20, 2023
* Add VegaFusionTable::with_ordering method

* Use explicit ordering column instead of relying on the row_number window function in each transform

* Support impute null

* Add impute tests and fix serialization of null value

* Add ordering to inline tables

* Remove test case workaround to get consistent ordering

* Remove order column when inline dataset has no transforms

* Update DataFusion branch to include fix (apache/datafusion#4878)

* Add area streamgraph spec that used to trigger error

* Remove row number from joinaggregate transform
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant