Add Array::logical_null_count for inspecting number of null values #6608

findepi · 2024-10-21T11:09:12Z

Add a counterpart of Array::null_count that counts logical null values. This will be useful in DataFusion. The current alternative is to compute the null mask (via Array::logical_nulls()) and count the nulls in it. Since this can be expensive and verbose, callers may naturally feel steered towards Array::null_count, which may or may not be applicable depending on the context.

follows Improve Array Logical Nullability #4691

Which issue does this PR close?

relates to Avoid forced copy in Array::logical_nulls #5208

Rationale for this change

#4691 changed the semantics of Array::null_count for e.g. NullArray. The DataFusion upgrade to an Arrow version with this change introduced a subtle bug, which is being fixed in apache/datafusion#13029. While working on the fix, it seemed that many usages of Array::null_count should be redirected to count logical nulls (not only the one being updated in that PR). Having a function to count logical nulls would be useful, as the alternative is computationally more expensive (it may involve creating or copying a null mask).
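
To make the distinction concrete, here is a minimal sketch that uses a toy enum rather than the arrow-rs types. `ToyArray`, `Masked`, and `AllNull` are hypothetical names invented for illustration. The sketch assumes only what is described above: an all-null array stores no validity mask, so its physical null count is zero even though every slot is logically null.

```rust
/// Toy model of two array kinds; these are not the arrow-rs types.
enum ToyArray {
    /// Slots with an optional stored validity mask (`true` = valid).
    Masked { validity: Option<Vec<bool>> },
    /// Every slot is null, yet no validity mask is stored.
    AllNull { len: usize },
}

impl ToyArray {
    /// Physical null count: only what the stored validity mask reports.
    fn null_count(&self) -> usize {
        match self {
            ToyArray::Masked { validity } => validity
                .as_ref()
                .map_or(0, |mask| mask.iter().filter(|valid| !**valid).count()),
            // No mask is stored, so there are no physical nulls.
            ToyArray::AllNull { .. } => 0,
        }
    }

    /// Logical null count: how many slots a reader would observe as null.
    fn logical_null_count(&self) -> usize {
        match self {
            // Physical and logical nulls coincide for masked arrays.
            ToyArray::Masked { .. } => self.null_count(),
            ToyArray::AllNull { len } => *len,
        }
    }
}
```

A caller that uses `null_count` on the all-null variant would conclude that the array contains no nulls, which is the kind of subtle bug described above.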

What changes are included in this PR?

New Array::logical_null_count function.

Are there any user-facing changes?

No

findepi · 2024-10-21T11:14:22Z

@alamb @tustvold please take a look

tustvold · 2024-10-21T11:54:39Z

I'm not sure about this, in all cases where there can be logical nulls, apart from NullArray, this will involve computing the full logical null mask only to throw it away. This feels like it could be surprising for users, especially given null_count is precomputed and therefore very cheap.

Perhaps we could discuss making is_nullable precise as opposed to best-effort, as IIUC this is what DF is using this method for.

findepi · 2024-10-21T12:01:17Z

@tustvold thank you for taking time to review this PR!

> I'm not sure about this, in all cases where there can be logical nulls, apart from NullArray, this will involve computing the full logical null mask only to throw it away.

Good point. This is what callers that need the number of logical nulls have to do today.
Examples: https://github.com/apache/datafusion/blob/e9584bc46ffc574cd65044d4199966402def1d15/datafusion/functions-aggregate/src/count.rs#L605-L607, apache/datafusion#13029

Having this function on the Array itself allows us to provide a better implementation.
This PR does this for all primitive types, the boolean array and the null array.

tustvold · 2024-10-21T12:21:50Z

Right, my point is that an accurate logical null count can be very expensive to compute, whereas it is much cheaper to instead determine the existence of any nulls. Whilst this won't serve every use-case, my question is whether DF actually needs accurate null counts all the time, or whether most of the time it is just using them as a proxy for nullability. This in turn determines what we optimise for.
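
The cost difference between the two questions can be sketched with plain slices. This is an illustration only: `any_null` and `count_null` are hypothetical helpers, not arrow-rs functions, and `Option<i64>` stands in for an array slot. Each helper also returns how many slots it inspected so the amount of work is visible.

```rust
/// Existence check: returns (found a null?, slots inspected).
/// `Iterator::any` stops at the first null it sees.
fn any_null(slots: &[Option<i64>]) -> (bool, usize) {
    let mut inspected = 0;
    let found = slots.iter().any(|slot| {
        inspected += 1;
        slot.is_none()
    });
    (found, inspected)
}

/// Exact count: returns (number of nulls, slots inspected).
/// Every slot has to be visited.
fn count_null(slots: &[Option<i64>]) -> (usize, usize) {
    let mut inspected = 0;
    let count = slots
        .iter()
        .filter(|slot| {
            inspected += 1;
            slot.is_none()
        })
        .count();
    (count, inspected)
}
```

With slots `[Some(1), None, Some(3), None, Some(5)]`, the existence check inspects two slots while the exact count inspects all five.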

findepi · 2024-10-21T12:37:18Z

> Right, my point is that an accurate logical null count can be very expensive to compute, whereas it is much cheaper to instead determine the existence of any nulls. Whilst this won't serve every use-case, my question is whether DF actually needs accurate null counts all the time

Not all the time, but often enough.

findepi · 2024-10-21T13:10:17Z

cc @joroKr21

westonpace

This seems like a good idea. Especially since we have a default impl that should work in most cases. Just one question which is probably my misunderstanding around why you chose to overload the default impl in a few spots.

westonpace · 2024-10-21T20:08:21Z

arrow-array/src/array/boolean_array.rs

+    fn logical_null_count(&self) -> usize {
+        self.null_count()
+    }


Why overload here? Is this more efficient somehow?

same reasoning as for primitive arrays -- #6608 (comment)

westonpace · 2024-10-21T20:08:39Z

arrow-array/src/array/primitive_array.rs

+    fn logical_null_count(&self) -> usize {
+        self.null_count()
+    }


Why overload here?

To make logical_null_count as performant as null_count for primitive types (where they happen to be equivalent), so that logical_null_count can be used without, or with fewer, performance drawbacks.
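
The shape of this design can be sketched as a trait with a default method plus a cheap override. Everything below is a toy model: `ToyArray`, `ToyPrimitive`, `ToyNull`, and `count_nulls` are hypothetical names, not the arrow-rs definitions. `ToyNull` deliberately keeps the default here to exercise that path, whereas the PR itself also specializes the null array.

```rust
/// Toy validity mask: `true` means the slot is valid.
type Validity = Vec<bool>;

/// Counts the null (`false`) slots in an optional mask.
fn count_nulls(mask: Option<&Validity>) -> usize {
    mask.map_or(0, |m| m.iter().filter(|valid| !**valid).count())
}

/// Hypothetical stand-in for the `Array` trait; not the arrow-rs definition.
trait ToyArray {
    /// Physical null count, cached by the implementor (cheap).
    fn null_count(&self) -> usize;

    /// Builds the logical null mask; may allocate and scan the whole array.
    fn logical_nulls(&self) -> Option<Validity>;

    /// Default: materialize the logical mask, count it, then drop it.
    fn logical_null_count(&self) -> usize {
        count_nulls(self.logical_nulls().as_ref())
    }
}

struct ToyPrimitive {
    validity: Option<Validity>,
    null_count: usize, // cached at construction
}

impl ToyPrimitive {
    fn new(validity: Option<Validity>) -> Self {
        let null_count = count_nulls(validity.as_ref());
        Self { validity, null_count }
    }
}

impl ToyArray for ToyPrimitive {
    fn null_count(&self) -> usize {
        self.null_count
    }

    fn logical_nulls(&self) -> Option<Validity> {
        self.validity.clone()
    }

    /// Override: physical and logical nulls coincide, so skip the mask copy.
    fn logical_null_count(&self) -> usize {
        self.null_count()
    }
}

struct ToyNull {
    len: usize,
}

impl ToyArray for ToyNull {
    fn null_count(&self) -> usize {
        0 // no validity mask is stored
    }

    fn logical_nulls(&self) -> Option<Validity> {
        Some(vec![false; self.len]) // every slot is logically null
    }
    // Uses the default `logical_null_count`, which goes through the mask.
}
```

The override returns the same answer as the default path for the primitive case; it only avoids cloning and rescanning the mask.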

arrow-array/src/array/null_array.rs

tustvold

I feel fairly strongly that we should not merge this, people will likely use this blindly without appreciating the severe performance penalty it entails. I think we should instead make is_nullable accurate, and the places that need an accurate null count should compute the logical null mask explicitly.

findepi · 2024-10-21T20:19:26Z

Thank you @alamb @westonpace @tustvold for your time reviewing this!

> I feel fairly strongly that we should not merge this, people will likely use this blindly without appreciating the severe performance penalty it entails.

@tustvold Can you elaborate on where the severe performance penalty comes from and what it would take to fix it?

For DataFusion at least, the alternative is to call logical_nulls().map(|n| n.null_count()), which is strictly less performant than the alternative offered here. DataFusion will continue to use this slower path until a faster path exists.

tustvold · 2024-10-21T20:29:25Z

> For DataFusion at least the alternative is to call logical_nulls().map(|n| n.null_count()) which is strictly less performant than the alternative offered here

The problem is that for RunArray, DictionaryArray and UnionArray, computing logical_nulls is potentially very expensive. Now I accept that there might be a marginal performance win from a specialized logical_null_count implementation, but other than perhaps for NullArray I would expect the difference to largely be a wash. For most types it is a couple of additional atomics, or will be completely dominated by the cost of computing the logical nulls.

The problem with exposing a logical_null_count method is it makes the fact this is effectively computing a fresh null mask implicit, hiding this problem. In fact this PR as written actually regresses is_nullable performance, demonstrating this 😅

Taking a step back, apache/datafusion#13033 is a prime example of a use-case that doesn't actually care what the logical null count is, just whether there are any nulls. With some minor adjustments we could make is_nullable accurate, and this method could just use that.

EDIT: TBC I really dislike the concept of logical nulls, I really wish the arrow specification didn't make the choices it did, UnionArray in particular is extremely perverse, but our hands are somewhat tied by the specification.
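
As an illustration of why the logical mask is costly for a dictionary-encoded array, here is a minimal sketch with a toy type. `ToyDictionary` is a hypothetical model, not arrow-rs's `DictionaryArray`: a slot is logically null when its key is null or when the key points at a null dictionary value, so building the mask means resolving every key.

```rust
/// Toy dictionary-encoded array: each key indexes into `values`.
/// `None` models a null key or a null dictionary value.
struct ToyDictionary {
    keys: Vec<Option<usize>>,
    values: Vec<Option<&'static str>>,
}

impl ToyDictionary {
    /// Physical null count: null keys only. A real array would cache this;
    /// the toy recomputes it for brevity.
    fn null_count(&self) -> usize {
        self.keys.iter().filter(|key| key.is_none()).count()
    }

    /// Logical null mask (`true` = null): one lookup per key plus a fresh
    /// allocation proportional to the array length.
    fn logical_nulls(&self) -> Vec<bool> {
        self.keys
            .iter()
            .map(|key| match key {
                None => true,
                Some(index) => self.values[*index].is_none(),
            })
            .collect()
    }

    /// Counting logical nulls by building the mask and discarding it.
    fn logical_null_count(&self) -> usize {
        self.logical_nulls()
            .iter()
            .filter(|is_null| **is_null)
            .count()
    }
}
```

In this toy the physical and logical counts differ whenever a non-null key references a null dictionary value, and the logical count cannot be obtained without touching every key.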

findepi · 2024-10-21T20:35:24Z

> The problem is for RunArray, DictionaryArray and UnionArray computing logical_nulls is potentially very expensive.

I see your point, thanks for explaining this to me.

Let's turn the question around. What should the caller do, if they want exactly this: know how many (logical) null values are in the array?

tustvold · 2024-10-21T20:38:56Z

> know how many (logical) null values are in the array?

If this is what you need, which it very often isn't, then you have to call logical_nulls() and get the null count from it. The friction is a feature, not a bug 😅

westonpace · 2024-10-21T21:36:59Z

> If this is what you need, which it very often isn't, then you have to call logical_nulls() and get the null count from it. The friction is a feature, not a bug 😅

For my part, this is fine. I recently found myself needing the logical null count (for array statistics), and using logical_nulls wasn't much of a headache.

findepi · 2024-10-22T06:41:23Z

> needing the logical null count recently (for array statistics)

@westonpace Good point! This was exactly the case in apache/datafusion#13029 too.

But that's not the only place -- DataFusion aggregation accumulators often call null_count(), and this works only because for quite a few array types null_count() happens to be the "logical null count". But contract-wise, this is the wrong function to call, and it doesn't generalize to non-primitive types.

> If this is what you need, which it very often isn't, then you have to call logical_nulls() and get the null count from it. The friction is a feature, not a bug 😅

@tustvold I don't mind writing more code (friction), but is this efficient at runtime?

tustvold · 2024-10-22T08:06:40Z

> but is this efficient at runtime?

As written in this PR, it will be largely equivalent.

Having slept on it, let's just proceed with this. I don't like it, but then I don't like logical nulls in general, and aside from forking the arrow format we're stuck with them. The types it impacts are relatively niche, and if people care to optimise them, they can.

findepi · 2024-10-22T09:50:08Z

thank you, that makes sense!
and thank you for the PR scrutiny, this is a good thing.

github-actions bot added the parquet and arrow labels Oct 21, 2024

findepi mentioned this pull request Oct 21, 2024

Fix count on all null VALUES clause apache/datafusion#13029

Merged

findepi force-pushed the findepi/logical-null-count branch from bf12f4e to 20c1de2 Compare October 21, 2024 11:12

findepi force-pushed the findepi/logical-null-count branch from 20c1de2 to 8147182 Compare October 21, 2024 11:54

findepi mentioned this pull request Oct 21, 2024

Fix check_not_null_constraints null detection apache/datafusion#13033

Merged

westonpace approved these changes Oct 21, 2024

tustvold requested changes Oct 21, 2024

findepi requested a review from tustvold October 21, 2024 20:21

findepi force-pushed the findepi/logical-null-count branch from 8147182 to 5c22898 Compare October 21, 2024 20:34

tustvold merged commit 7e51d40 into apache:master Oct 22, 2024
28 checks passed

findepi deleted the findepi/logical-null-count branch October 22, 2024 09:49

findepi mentioned this pull request Oct 22, 2024

Improve Array::is_nullable documentation #6615

Merged
