Improve fidelity of distinct aggregate and pivot transform #234
This PR includes a few fixes to improve the fidelity (the match with stock Vega) of the `distinct` aggregate function and the `pivot` transform. It also works around the DataFusion error reported in apache/datafusion#5034.

Changes:
- Vega's implementation of `distinct` considers NULL to be a distinct value, whereas SQL ignores NULL values before counting distinct values. To align our implementation with Vega's, 4009840 updates the `distinct` aggregation function to add one if the column contains any NULL values.
- d876d0b and e13402c update DataFusion to 16.1.0 with backports, as described in jonmmease/arrow-datafusion#139.
- 455201f adds a custom spec that previously triggered the error reported in apache/datafusion#5034.
- 3fd34c6 makes two updates to the pivot transform:
  - Use `CASE` statements inside the aggregation functions to filter down to the rows that match each pivot value. This avoids apache/datafusion#5034 ("Error during physical planning when joining to subquery with count distinct aggregate") and should improve performance for pivots that introduce many columns, since we no longer perform a join per added column.
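
To make the two SQL-level ideas above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the SQL that this PR generates for DataFusion. The table `sales`, its columns, and the helper names `vega_style_distinct` and `pivot_with_case` are hypothetical and exist only for illustration.

```python
import sqlite3


def vega_style_distinct(conn, table, column):
    """Count distinct values, counting NULL as one extra distinct value."""
    # COUNT(DISTINCT col) skips NULLs in SQL. MAX(col IS NULL) is 1 when at
    # least one NULL is present and 0 otherwise; COALESCE covers empty tables.
    sql = (
        f'SELECT COUNT(DISTINCT "{column}") + '
        f'COALESCE(MAX("{column}" IS NULL), 0) FROM "{table}"'
    )
    return conn.execute(sql).fetchone()[0]


def pivot_with_case(conn, table, group_col, pivot_col, value_col, pivot_values):
    """Pivot in a single GROUP BY query, one CASE-filtered SUM per pivot value."""
    # Rows that do not match the pivot value yield NULL from the CASE and are
    # ignored by SUM, so no join per output column is needed.
    # Assumption: pivot values are simple strings that are safe as column names.
    cols = ", ".join(
        f'SUM(CASE WHEN "{pivot_col}" = ? THEN "{value_col}" END) AS "{v}"'
        for v in pivot_values
    )
    sql = (
        f'SELECT "{group_col}", {cols} FROM "{table}" '
        f'GROUP BY "{group_col}" ORDER BY "{group_col}"'
    )
    return conn.execute(sql, list(pivot_values)).fetchall()


# Hypothetical sample data, including one NULL in the pivot column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "a", 1.0),
        ("east", "b", 2.0),
        ("west", "a", 3.0),
        ("west", None, 4.0),
        ("west", "a", 5.0),
    ],
)

plain = conn.execute("SELECT COUNT(DISTINCT product) FROM sales").fetchone()[0]
print(plain)                                         # 2: NULL is ignored
print(vega_style_distinct(conn, "sales", "product"))  # 3: NULL counted once
print(pivot_with_case(conn, "sales", "region", "product", "amount", ["a", "b"]))
```

In this sketch a group with no rows for a given pivot value gets NULL (`None` in Python) in that column. Whether that should be mapped to another value to match Vega is outside what this example shows.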