
Add batch operations to stddev #1547

Merged
merged 29 commits on Jan 11, 2022

Conversation

realno
Contributor

@realno realno commented Jan 11, 2022

Which issue does this PR close?

Closes #1546.

Rationale for this change

To make the calculation more efficient, we want to implement the batch methods.

What changes are included in this PR?

Are there any user-facing changes?

No user-facing changes.

No breaking changes.
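For context, the accumulator state discussed in this PR is a running (count, mean, m2) triple, updated per value with Welford's online algorithm. A minimal, self-contained sketch of the per-value update (the `VarState` struct here is hypothetical and simplified, not DataFusion's actual `VarianceAccumulator`):

```rust
// Hypothetical, simplified accumulator illustrating the per-value
// (Welford) update that the batch methods apply to each non-null element.
#[derive(Debug, Default)]
struct VarState {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl VarState {
    fn update(&mut self, value: f64) {
        let new_count = self.count + 1;
        let delta = value - self.mean;
        let new_mean = self.mean + delta / new_count as f64;
        let delta2 = value - new_mean;
        self.count = new_count;
        self.mean = new_mean;
        self.m2 += delta * delta2;
    }

    // Population variance; sample variance would divide by count - 1.
    fn variance_pop(&self) -> f64 {
        self.m2 / self.count as f64
    }
}

fn main() {
    let mut s = VarState::default();
    for v in [1.0, 2.0, 3.0, 4.0] {
        s.update(v);
    }
    println!("{}", s.variance_pop()); // prints 1.25 (mean 2.5, m2 = 5.0)
}
```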

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Jan 11, 2022
@realno realno mentioned this pull request Jan 11, 2022
@realno
Contributor Author

realno commented Jan 11, 2022

@alamb per our discussion, this is to add batch operations for stddev and var.

@alamb
Contributor

alamb commented Jan 11, 2022

Thank you @realno -- I will try and review this later today

Contributor

@alamb alamb left a comment

Looks good to me -- thanks @realno

This PR makes it very clear that the Aggregate API is very confusing.

For aggregates, only update_batch and merge_batch are needed. The extra implementations of update and merge in terms of ScalarValue are never called (I actually deleted the implementations of update and merge locally and all the tests still passed)

Thus, my plan is to:

  1. Make a follow on PR that removes the implementation of Variance in terms of ScalarValue (as well as the supporting math functions)
  2. Create a proposed PR that removes update and merge completely in favor of some adapter functions and documentation

Thanks again for this very high quality work 🏅

Comment on lines +260 to +266
for i in 0..arr.len() {
    let value = arr.value(i);

    if value == 0_f64 && values.is_null(i) {
        continue;
    }
    let new_count = self.count + 1;
Contributor

Here is a more idiomatic way to iterate over the array and skip nulls (and also faster, as it doesn't check the bounds on each access to arr.value(i)):

Suggested change
-for i in 0..arr.len() {
-    let value = arr.value(i);
-    if value == 0_f64 && values.is_null(i) {
-        continue;
-    }
-    let new_count = self.count + 1;
+// NB: filter_map skips `None` (null) values
+for value in arr.iter().filter_map(|v| v) {
+    let new_count = self.count + 1;

Contributor Author

Great suggestion!

Contributor Author

@realno realno Jan 11, 2022

After some investigation, this approach doesn't work as expected. The reason for the null check is that downcast_ref replaces the None values with 0_f64, so we need to check the original array whenever a 0 is observed. The proposed code checks the array after the type cast, so it can't catch the nulls. I tried to find a good way to do something similar on the original array but haven't had any luck yet. I will dig a bit deeper later; please let me know if you know a way to achieve this.

Contributor

After some investigation, this approach doesn't work as expected.

FWIW I tried making these changes locally and all the tests passed for me

The reason for the null check is because downcast_ref replace the None values into 0_f64 so we need to check in the original array when a 0 is observed.

I am not sure this is accurate. The way arrow works is that the values and "validity" are tracked in separate structures.

Thus for elements that are NULL there is some arbitrary value in the array (which will likely be 0.0f, though that is not guaranteed by the arrow spec).

The construct of arr.iter() returns an iterator of Option<f64> that is None if the element is NULL, and Some(f64_value) if the element is non-NULL.

The use of filter_map then filters out the None elements, somewhat non-obviously.

This

        for value in arr.iter().filter_map(|v| v) {

is effectively the same as

        for value in arr.iter() {
            let value = match value {
                Some(v) => v,
                None => continue,
            };
            // ...
        }

So I actually think there is a bug in this code as written with nulls -- the check should be

            if values.is_null(i) {

Rather than

            if value == 0_f64 && values.is_null(i) {

(as null values are not guaranteed to be 0.0f)
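To see the two forms side by side in runnable code, here is a self-contained sketch in which a plain Vec<Option<f64>> stands in for arrow's Float64Array (whose iterator likewise yields Option<f64>, with None for NULL slots). The helpers sum_non_null and sum_non_null_explicit are hypothetical, purely for illustration:

```rust
// A Vec<Option<f64>> stands in for arrow's Float64Array; both functions
// skip the null (None) entries while keeping a genuine 0.0 value.
fn sum_non_null(values: &[Option<f64>]) -> f64 {
    // filter_map(|v| v) drops the None entries and unwraps the Some
    values.iter().copied().filter_map(|v| v).sum()
}

fn sum_non_null_explicit(values: &[Option<f64>]) -> f64 {
    let mut total = 0.0;
    for value in values {
        let value = match value {
            Some(v) => *v,
            None => continue,
        };
        total += value;
    }
    total
}

fn main() {
    let data = vec![Some(1.0), None, Some(0.0), Some(2.5)];
    // Both skip the NULL but keep the real 0.0 value.
    assert_eq!(sum_non_null(&data), 3.5);
    assert_eq!(sum_non_null_explicit(&data), 3.5);
}
```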

Contributor Author

Thanks for clarifying, I will do some more testing locally and follow up with a PR (or more questions :D).

Comment on lines +285 to +290
for i in 0..counts.len() {
    let c = counts.value(i);
    if c == 0_u64 {
        continue;
    }
    let new_count = self.count + c;
Contributor

Suggested change
-for i in 0..counts.len() {
-    let c = counts.value(i);
-    if c == 0_u64 {
-        continue;
-    }
-    let new_count = self.count + c;
+let non_null_counts = counts
+    .iter()
+    .enumerate()
+    .filter_map(|(i, c)| c.map(|c| (i, c)));
+for (i, c) in non_null_counts {
+    let new_count = self.count + c;

By the same logic as above, this also skips the bounds check on each row. Though for sure I would say this is less readable :(
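The suggested pattern can be sketched in a self-contained form, with a Vec<Option<u64>> standing in for the arrow UInt64Array (non_null_counts here is a hypothetical helper, not DataFusion code):

```rust
// Iterate a nullable column together with its indices, skipping null slots.
// c.map(|c| (i, c)) turns (i, Some(c)) into Some((i, c)) and (i, None)
// into None, which filter_map then drops.
fn non_null_counts(counts: &[Option<u64>]) -> Vec<(usize, u64)> {
    counts
        .iter()
        .copied()
        .enumerate()
        .filter_map(|(i, c)| c.map(|c| (i, c)))
        .collect()
}

fn main() {
    let counts = vec![Some(3), None, Some(0), Some(5)];
    // The null slot at index 1 is dropped; the zero at index 2 is kept.
    assert_eq!(non_null_counts(&counts), vec![(0, 3), (2, 0), (3, 5)]);
}
```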

Contributor Author

Great suggestion! For this part of the code the length of the array is pretty small (the number of partitions to merge), so maybe we can opt for readability here.

Contributor

seems reasonable to me
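For reference, merging two partial (count, mean, m2) states as discussed in this thread amounts to the standard parallel-variance combination (Chan et al.). A sketch with a hypothetical tuple-based state, not DataFusion's actual merge_batch:

```rust
// Merge two partial variance states (count, mean, m2) into one,
// using the parallel combination formula of Chan et al.
fn merge(a: (u64, f64, f64), b: (u64, f64, f64)) -> (u64, f64, f64) {
    let (n_a, mean_a, m2_a) = a;
    let (n_b, mean_b, m2_b) = b;
    let n = n_a + n_b;
    if n == 0 {
        return (0, 0.0, 0.0);
    }
    let delta = mean_b - mean_a;
    let mean = mean_a + delta * n_b as f64 / n as f64;
    let m2 = m2_a + m2_b + delta * delta * (n_a as f64 * n_b as f64) / n as f64;
    (n, mean, m2)
}

fn main() {
    // Partition A holds [1.0, 2.0]; partition B holds [3.0, 4.0].
    let a = (2, 1.5, 0.5); // count, mean, sum of squared deviations
    let b = (2, 3.5, 0.5);
    let (n, mean, m2) = merge(a, b);
    // Same state as processing [1.0, 2.0, 3.0, 4.0] in a single pass.
    assert_eq!(n, 4);
    assert!((mean - 2.5).abs() < 1e-12);
    assert!((m2 - 5.0).abs() < 1e-12);
}
```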

@@ -209,8 +214,8 @@ impl AggregateExpr for VariancePop {

#[derive(Debug)]
pub struct VarianceAccumulator {
-    m2: ScalarValue,
-    mean: ScalarValue,
+    m2: f64,
Contributor

👍

@alamb
Contributor

alamb commented Jan 11, 2022

See #1550 for a cleanup of some of this code.

@realno realno deleted the add-batch-operations-to-stddev branch February 9, 2022 20:09
Labels
datafusion Changes in the datafusion crate

Successfully merging this pull request may close these issues.

Implement batch_update and batch_merge for var and stddev operator
3 participants