Improve fidelity of distinct aggregate and pivot transform #234

Merged
merged 7 commits into from
Jan 25, 2023

Conversation

jonmmease
Collaborator

This PR includes a few fixes to improve the fidelity (the match with stock Vega) of the distinct aggregate function and the pivot transform. It also works around the DataFusion error reported in apache/datafusion#5034.

Changes:
Vega's implementation of distinct considers NULL to be a distinct value whereas SQL ignores NULL values before counting distinct values. To align our implementation with Vega's, 4009840 updates the distinct aggregation function to add one if the column contains any NULL values.
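To make the difference concrete, here is a minimal sketch of the two counting rules. It uses Python's built-in `sqlite3` module rather than DataFusion, and the table `t` and column `x` are hypothetical names chosen for illustration. The SQL shown is one way to express "add one if any NULL is present" and is not necessarily the expression used in 4009840.

```python
import sqlite3

# Hypothetical data: two distinct non-NULL values ("a", "b") plus NULLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("a",), ("b",), ("a",), (None,), (None,)],
)

# Standard SQL behavior: NULLs are ignored by COUNT(DISTINCT ...).
sql_distinct = conn.execute("SELECT COUNT(DISTINCT x) FROM t").fetchone()[0]

# Vega-style behavior: NULL counts as one more distinct value.
# The CASE flag is 1 for NULL rows, so MAX(...) is 1 when any NULL exists.
# COALESCE covers the empty-table case, where MAX returns NULL.
vega_distinct = conn.execute(
    """
    SELECT COUNT(DISTINCT x)
         + COALESCE(MAX(CASE WHEN x IS NULL THEN 1 ELSE 0 END), 0)
    FROM t
    """
).fetchone()[0]

print(sql_distinct, vega_distinct)
```

With this data the plain SQL count sees only "a" and "b", while the adjusted count also includes the NULL group.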

d876d0b and e13402c update DataFusion to 16.1.0 with backports as described in jonmmease/arrow-datafusion#139.

455201f adds a custom spec that previously triggered the error reported in apache/datafusion#5034.

3fd34c6 makes two updates to the pivot transform:

  1. The implementation no longer performs a join per pivoted value. Instead, it uses SQL CASE expressions inside the aggregation functions to restrict each aggregate to the rows that match its pivot value. This avoids the "Error during physical planning when joining to subquery with count distinct aggregate" failure reported in apache/datafusion#5034, and should improve performance for pivots that introduce many columns, since a join per added column is no longer needed.
  2. If the pivoted column includes NULL values, then a column named "null" is generated, which matches Vega. Some additional work was required to make sure the NULL column sorts as the first column for the purpose of limiting.
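The CASE-based approach described above can be sketched as follows. This is an illustration only, run against Python's built-in `sqlite3` rather than DataFusion. The table `t`, its columns `category`, `key`, and `value`, and the helper `case_column` are all hypothetical names, and the generated SQL is not taken from 3fd34c6. The sketch relies on SQLite sorting NULL before other values in ascending order to place the "null" column first, and it leaves empty cells as NULL; how the actual implementation fills empty cells is not stated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (category TEXT, key TEXT, value REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("x", "a", 1.0), ("x", "b", 2.0), ("y", "a", 3.0),
     ("y", None, 4.0), ("x", "a", 5.0)],
)

# Collect the pivot values. SQLite orders NULL first in ascending sorts.
keys = [row[0] for row in conn.execute("SELECT DISTINCT key FROM t ORDER BY key")]


def case_column(key):
    """Build one aggregate that only sees rows matching this pivot value."""
    if key is None:
        return 'SUM(CASE WHEN key IS NULL THEN value END) AS "null"'
    literal = key.replace("'", "''")      # escape for a SQL string literal
    identifier = key.replace('"', '""')   # escape for a quoted identifier
    return f"SUM(CASE WHEN key = '{literal}' THEN value END) AS \"{identifier}\""


# One pass over the table, one CASE-filtered aggregate per output column,
# and no join per pivoted value.
query = (
    "SELECT category, "
    + ", ".join(case_column(k) for k in keys)
    + " FROM t GROUP BY category ORDER BY category"
)
cursor = conn.execute(query)
columns = [d[0] for d in cursor.description]
result = cursor.fetchall()

print(columns)
print(result)
```

Because every pivoted column is computed in the same grouped scan, adding more pivot values only adds more aggregate expressions rather than more joins.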

@jonmmease jonmmease merged commit bb892a4 into main Jan 25, 2023
@jonmmease jonmmease mentioned this pull request Jan 25, 2023