Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix count on all null VALUES clause #13029

Merged
merged 2 commits into from
Oct 21, 2024
Merged

Conversation

findepi
Copy link
Member

@findepi findepi commented Oct 21, 2024

Before the change, the ValuesExec containing NullArray would
incorrectly report column statistics as being non-null, which would
misinform AggregateStatistics optimizer and fold count(always_null)
into row count instead of 0.

This commit fixes the column statistics derivation for values with
NullArray and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the VALUES (NULL) doesn't have type
DataType:Null (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of Array:null_count should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for null type).

@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate functions labels Oct 21, 2024
@@ -156,7 +156,11 @@ pub fn compute_record_batch_statistics(
for partition in batches.iter() {
for batch in partition {
for (stat_index, col_index) in projection.iter().enumerate() {
null_counts[stat_index] += batch.column(*col_index).null_count();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of this line was apparently changed in apache/arrow-rs#4691
At least it seems so from this diff line apache/arrow-rs@979a070#diff-0cfddb6ef017ce20c5e7f528095956b2433cf2ea2d17e28ba4b515b7c1bd2e57L179

@findepi findepi force-pushed the findepi/count-null-only branch 3 times, most recently from 9d23575 to dae2610 Compare October 21, 2024 10:56
Before the change, the `ValuesExec` containing `NullArray` would
incorrectly report column statistics as being non-null, which would
misinform `AggregateStatistics` optimizer and fold `count(always_null)`
into row count instead of 0.

This commit fixes the column statistics derivation for values with
`NullArray` and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the `VALUES (NULL)` doesn't have type
`DataType:Null` (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of `Array:null_count` should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for null type).
null_counts[stat_index] += batch
.column(*col_index)
.logical_nulls()
.map(|nulls| nulls.null_count())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified back if we had something like apache/arrow-rs#6608

@findepi
Copy link
Member Author

findepi commented Oct 21, 2024

@alamb @joroKr21 please take a look

Copy link
Contributor

@joroKr21 joroKr21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh yeah, I don't know why arrow-rs made that choice. To me the "physical number of nulls" seems kinda useless. I care about the semantics, not the implementation details 😄

@alamb
Copy link
Contributor

alamb commented Oct 21, 2024

Ugh yeah, I don't know why arrow-rs made that choice. To me the "physical number of nulls" seems kinda useless. I care about the semantics, not the implementation details 😄

The reason is performance -- I agree it is quite confusing. THe rationale is that arrow-rs I think tries to let the user have maximal control, but that does make the API harder to reason about sometime.

I tried to clarify this in the docs, but I am also still not happy with how easy it is to get confused:
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.nulls

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi and @joroKr21 -- this makes sense to me

@alamb alamb changed the title Fix count on null values Fix count on all null VALUES clause Oct 21, 2024
@joroKr21
Copy link
Contributor

I tried to clarify this in the docs, but I am also still not happy with how easy it is to get confused:
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.nulls

The physical representation is efficient, but is sometimes non intuitive for certain array types such as those with nullable child arrays like DictionaryArray::values, RunArray::values or UnionArray, or without a null buffer, such as NullArray.

Yeah, I haven't read the Arrow spec so I'm not sure why you would do that (encode nulls in the child arrays)

@alamb alamb merged commit 34fbe8e into apache:main Oct 21, 2024
26 checks passed
@findepi findepi deleted the findepi/count-null-only branch October 22, 2024 06:41
findepi added a commit to findepi/datafusion that referenced this pull request Oct 22, 2024
* Test Count accumulator with all-nulls

* Fix count on null values

Before the change, the `ValuesExec` containing `NullArray` would
incorrectly report column statistics as being non-null, which would
misinform `AggregateStatistics` optimizer and fold `count(always_null)`
into row count instead of 0.

This commit fixes the column statistics derivation for values with
`NullArray` and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the `VALUES (NULL)` doesn't have type
`DataType:Null` (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of `Array:null_count` should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for null type).
findepi added a commit to findepi/datafusion that referenced this pull request Oct 23, 2024
* Test Count accumulator with all-nulls

* Fix count on null values

Before the change, the `ValuesExec` containing `NullArray` would
incorrectly report column statistics as being non-null, which would
misinform `AggregateStatistics` optimizer and fold `count(always_null)`
into row count instead of 0.

This commit fixes the column statistics derivation for values with
`NullArray` and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the `VALUES (NULL)` doesn't have type
`DataType:Null` (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of `Array:null_count` should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for null type).
findepi added a commit to findepi/datafusion that referenced this pull request Oct 24, 2024
* Test Count accumulator with all-nulls

* Fix count on null values

Before the change, the `ValuesExec` containing `NullArray` would
incorrectly report column statistics as being non-null, which would
misinform `AggregateStatistics` optimizer and fold `count(always_null)`
into row count instead of 0.

This commit fixes the column statistics derivation for values with
`NullArray` and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the `VALUES (NULL)` doesn't have type
`DataType:Null` (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of `Array:null_count` should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for null type).
findepi added a commit to sdf-labs/arrow-datafusion that referenced this pull request Oct 25, 2024
* Test Count accumulator with all-nulls

* Fix count on null values

Before the change, the `ValuesExec` containing `NullArray` would
incorrectly report column statistics as being non-null, which would
misinform `AggregateStatistics` optimizer and fold `count(always_null)`
into row count instead of 0.

This commit fixes the column statistics derivation for values with
`NullArray` and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the `VALUES (NULL)` doesn't have type
`DataType:Null` (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of `Array:null_count` should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for null type).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate functions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants