Improve fidelity of distinct aggregate and pivot transform #234

jonmmease · 2023-01-25T14:23:35Z

This PR includes a few fixes to improve the fidelity (the match with stock Vega) of the distinct aggregate function and the pivot transform. It also works around the DataFusion error reported in apache/datafusion#5034.

Changes:
Vega's implementation of distinct considers NULL to be a distinct value whereas SQL ignores NULL values before counting distinct values. To align our implementation with Vega's, 4009840 updates the distinct aggregation function to add one if the column contains any NULL values.
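
To illustrate the difference described above, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the code from 4009840 (which targets DataFusion); the table `t`, the column `v`, and the exact SQL expression are assumptions chosen only to show the "add one if any NULL is present" idea.

```python
import sqlite3

# Hypothetical in-memory table with two distinct non-NULL values and some NULLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("a",), ("b",), ("a",), (None,), (None,)],
)

# Standard SQL: NULLs are ignored before distinct values are counted.
sql_distinct = conn.execute("SELECT COUNT(DISTINCT v) FROM t").fetchone()[0]

# Vega-style: NULL counts as one additional distinct value when present.
# COALESCE guards against MAX returning NULL on an empty table.
vega_distinct = conn.execute(
    """
    SELECT COUNT(DISTINCT v)
         + COALESCE(MAX(CASE WHEN v IS NULL THEN 1 ELSE 0 END), 0)
    FROM t
    """
).fetchone()[0]

print(sql_distinct, vega_distinct)
```

With this data the plain SQL count sees only `a` and `b`, while the Vega-style count also includes the NULL group.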

d876d0b and e13402c update DataFusion to 16.1.0 with backports as described in jonmmease/arrow-datafusion#139.

455201f adds a custom spec that previously triggered the error reported in apache/datafusion#5034.

3fd34c6 makes two updates to the pivot transform:

The implementation no longer performs a join per pivoted value and instead uses SQL CASE expressions inside the aggregation functions to filter down to the rows that match each pivot value. This avoids apache/datafusion#5034 ("Error during physical planning when joining to subquery with count distinct aggregate") and should improve performance for pivots that introduce many columns, as we're no longer performing a join per added column.
If the pivoted column includes NULL values, then a column named "null" will be generated. This matches Vega. There was some additional work required to make sure the NULL column sorts as the first column for the purpose of limiting.
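
The CASE-based approach can be sketched roughly as follows. This is an illustration using Python's `sqlite3`, not the implementation in 3fd34c6; the helper `pivot_sum`, the table `t`, and its columns `grp`, `key`, and `val` are all hypothetical, and only a `SUM` aggregate is shown.

```python
import sqlite3


def pivot_sum(conn, table, group_col, pivot_col, value_col):
    """Pivot with one CASE-filtered SUM per pivot value (no joins)."""
    keys = [
        row[0]
        for row in conn.execute(f'SELECT DISTINCT "{pivot_col}" FROM "{table}"')
    ]
    # Sort so that NULL comes first, then the remaining values as strings.
    keys.sort(key=lambda k: (k is not None, "" if k is None else str(k)))

    select_parts, params = [], []
    for k in keys:
        if k is None:
            cond = f'"{pivot_col}" IS NULL'
            alias = "null"  # NULL pivot values produce a column named "null"
        else:
            cond = f'"{pivot_col}" = ?'
            params.append(k)
            alias = str(k).replace('"', '""')
        select_parts.append(
            f'SUM(CASE WHEN {cond} THEN "{value_col}" END) AS "{alias}"'
        )

    sql = (
        f'SELECT "{group_col}", {", ".join(select_parts)} '
        f'FROM "{table}" GROUP BY "{group_col}" ORDER BY "{group_col}"'
    )
    cur = conn.execute(sql, params)
    return [d[0] for d in cur.description], cur.fetchall()


# Hypothetical example data: the pivot column contains a NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, key TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("x", "a", 1), ("x", "b", 2), ("x", None, 3), ("y", "a", 4), ("y", "a", 5)],
)

columns, rows = pivot_sum(conn, "t", "grp", "key", "val")
print(columns)
print(rows)
```

Each pivoted column is computed in a single grouped scan, so adding more pivot values adds more CASE expressions rather than more joins. Note that in this sketch a group with no matching rows yields NULL for that column, which may differ from how Vega fills empty cells.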

jonmmease added 7 commits January 25, 2023 07:28

Match Vega's distinct behavior by adding 1 if there are nulls

4009840

Update conversion of DataFrame to TableProvider

d876d0b

Add pivot distinct custom spec

455201f

Test pivot transform with distinct aggregate and null values

eef8623

Rewrite pivot to use CASE-based implementation rather than a join per…

3fd34c6

… unique value This avoids apache/datafusion#5034 while better matching Vega's results in the presence of null values.

Update DataFusion to 16.1.0 with backports

e13402c

Fix warnings / fmt

cc24253

jonmmease merged commit bb892a4 into main Jan 25, 2023

jonmmease mentioned this pull request Jan 25, 2023

Release version 1.0.2 #235

Merged
