Make consistent behavior on zeros equality on floating point types #3510

Merged
merged 4 commits into from
Jan 13, 2023

Member

@viirya viirya commented Jan 11, 2023

Which issue does this PR close?

Closes #3509.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 11, 2023
@tustvold
Contributor

I'm not sure about this, as it means we no longer are comparing with respect to a standard predicate but one of our own devising. Why special case zero, and not other values like NaNs?

I'd also be interested to know what impact this has on benchmarks.

FWIW If we do make this change, we will need to make changes to normalise within the row format, along with potentially in other places also. Nothing insurmountable, just noting it

 {
     let left: PrimitiveArray<T> = PrimitiveArray::from(left.data().clone());
     let right: PrimitiveArray<T> = PrimitiveArray::from(right.data().clone());
-    Box::new(move |i, j| left.value(i).cmp(&right.value(j)))
+    Box::new(move |i, j| left.value(i).compare(right.value(j)))
Contributor

👍 regardless this is a good change

@viirya
Member Author

viirya commented Jan 11, 2023

NaNs are treated as equal by total ordering. I guess total ordering needs to give a comprehensive ordering over all possible floating point values. But in practical computation, we don't actually distinguish positive and negative zeros.

@tustvold
Contributor

NaNs are treated as equal by total ordering

Not the ordering we use currently, they're ordered based on their constituent bits, NaNs with different byte representations will not compare equal

@viirya
Member Author

viirya commented Jan 11, 2023

Not the ordering we use currently, they're ordered based on their constituent bits, NaNs with different byte representations will not compare equal

We have NaN equality test to verify that they are equal. I also did a quick verification in rust playground:

fn main() {
    let a = f32::NAN;
    let b = f32::NAN;
    
    println!("a == b: {}", a.to_bits() == b.to_bits());
}

Output:

a == b: true

@tustvold
Contributor

tustvold commented Jan 12, 2023

f32::NAN always yields the same NaN bit pattern. If you get NaNs by other means, such that they have different bit representations, you will see the difference

Edit: in fact comparing NaN with -NaN will probably show this
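
To make the point about bit patterns concrete, here is a minimal sketch in plain Rust (no arrow types involved). The three bit patterns are chosen for illustration and do not come from the PR, and `same_bits` and `sample_nans` are hypothetical helpers. It shows that values which are all NaN can still differ bitwise, and that `total_cmp` orders them rather than reporting them as equal.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: true when two f32 values share the exact same bit pattern.
fn same_bits(a: f32, b: f32) -> bool {
    a.to_bits() == b.to_bits()
}

/// Builds three quiet NaNs from explicit bit patterns (illustrative values).
fn sample_nans() -> (f32, f32, f32) {
    let nan = f32::from_bits(0x7fc0_0000); // positive quiet NaN
    let nan_payload = f32::from_bits(0x7fc0_0001); // same sign, different payload
    let neg_nan = f32::from_bits(0xffc0_0000); // sign bit set
    (nan, nan_payload, neg_nan)
}

fn main() {
    let (nan, nan_payload, neg_nan) = sample_nans();

    // All three are NaN as far as `is_nan` is concerned.
    assert!(nan.is_nan() && nan_payload.is_nan() && neg_nan.is_nan());

    // Bitwise, they are all distinct.
    println!("nan vs nan_payload, same bits: {}", same_bits(nan, nan_payload));
    println!("nan vs neg_nan, same bits: {}", same_bits(nan, neg_nan));

    // `total_cmp` only reports `Equal` for identical bit patterns.
    assert_eq!(nan.total_cmp(&nan), Ordering::Equal);
    println!("nan vs nan_payload: {:?}", nan.total_cmp(&nan_payload));
    println!("neg_nan vs nan: {:?}", neg_nan.total_cmp(&nan));
}
```

This is also why the playground check in the earlier comment passes: both operands there are the same constant, so their bits match trivially.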

@viirya
Member Author

viirya commented Jan 12, 2023

I see. That explains why these NaNs are equal. I roughly remember from JVM experience that NaN values can have different bits, so I was a bit surprised to see them compare equal in the playground test above. If other bit patterns are also treated as NaN in Rust, then equality is not guaranteed.

NaNs should be treated as equal in computation too, like zeros.

Either we add a NaN-specific condition like the one for zero, or we avoid such special cases here and require users to handle them before calling arrow kernels, for example by replacing negative zeros with positive zeros and normalizing NaNs to the standard f32::NAN (and likewise for f64 and f16).
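
The second option, normalizing before calling a kernel, could look roughly like the sketch below. It works on plain `f32` values rather than arrow arrays, and `normalize` and `normalized_cmp` are hypothetical names used only for illustration. They are not part of the arrow API.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: maps -0.0 to +0.0 and every NaN to the canonical `f32::NAN`.
fn normalize(x: f32) -> f32 {
    if x.is_nan() {
        f32::NAN
    } else if x == 0.0 {
        // `-0.0 == 0.0` is true under IEEE 754 comparison, so this branch catches both zeros.
        0.0
    } else {
        x
    }
}

/// Compares two values with `total_cmp` after normalizing both sides.
fn normalized_cmp(a: f32, b: f32) -> Ordering {
    normalize(a).total_cmp(&normalize(b))
}

fn main() {
    // Two NaNs with different sign bits (illustrative bit patterns).
    let pos_nan = f32::from_bits(0x7fc0_0000);
    let neg_nan = f32::from_bits(0xffc0_0000);

    // Without normalization, total ordering distinguishes these pairs.
    assert_eq!((-0.0f32).total_cmp(&0.0), Ordering::Less);
    assert_eq!(neg_nan.total_cmp(&pos_nan), Ordering::Less);

    // After normalization they compare equal.
    assert_eq!(normalized_cmp(-0.0, 0.0), Ordering::Equal);
    assert_eq!(normalized_cmp(neg_nan, pos_nan), Ordering::Equal);
}
```

Keeping the normalization outside the comparator leaves the comparison itself a standard predicate, which is the concern raised earlier in the thread.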

@github-actions github-actions bot removed the arrow Changes to the arrow crate label Jan 12, 2023
Comment on lines 366 to 372
if self.abs() == $zero && rhs.abs() == $zero {
    // `total_cmp` treats positive zero and negative zero as different.
    // But for computation system, it usually treats them as equal.
    Ordering::Equal
} else {
    <$t>::total_cmp(&self, &rhs)
}
Member Author

I removed these changes.

Comment on lines +694 to +695
/// Note that totalOrder treats positive and negative zeros as different. If it is necessary
/// to treat them as equal, please normalize zeros before calling this kernel.
Member Author

I updated these docs to make the behavior clear to users.

Comment on lines +361 to +362
assert_eq!(Ordering::Less, (cmp)(0, 1));
assert_eq!(Ordering::Greater, (cmp)(1, 0));
Member Author

build_compare's behavior when comparing zeros was inconsistent with the comparison kernels. Changed it to be consistent.
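
As a rough illustration of what those assertions check, the sketch below builds a boxed index comparator over plain vectors using `total_cmp`. It is a stand-in and not the arrow implementation: `DynCmp` and `build_total_cmp` are hypothetical names, and the input `[-0.0, 0.0]` is an assumption, since the excerpt does not show the array under test.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a dynamic comparator: compares left[i] with right[j].
type DynCmp = Box<dyn Fn(usize, usize) -> Ordering>;

/// Sketch of a comparator builder over plain vectors, using `total_cmp`.
fn build_total_cmp(left: Vec<f64>, right: Vec<f64>) -> DynCmp {
    Box::new(move |i, j| left[i].total_cmp(&right[j]))
}

fn main() {
    // Assumed input: negative zero at index 0, positive zero at index 1.
    let values = vec![-0.0_f64, 0.0];
    let cmp = build_total_cmp(values.clone(), values);

    // Under total ordering, -0.0 sorts strictly before +0.0.
    assert_eq!(Ordering::Less, cmp(0, 1));
    assert_eq!(Ordering::Greater, cmp(1, 0));
    assert_eq!(Ordering::Equal, cmp(0, 0));
}
```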

Contributor

@tustvold tustvold left a comment

Thank you

@jhorstmann
Contributor

Looks good to me!

Just a note that the min/max aggregation kernels also use a different definition, I think following the postgres behavior of considering NaN to be greater than any other value.
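
For contrast with total ordering, a "NaN is greater than everything" rule of the kind described here might be sketched as follows. This is only an illustration of that definition, not the code of the arrow aggregation kernels. `nan_greatest_cmp` and `max_nan_greatest` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Hypothetical comparator: all NaNs are equal to each other and greater than any non-NaN.
fn nan_greatest_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither side is NaN, so `partial_cmp` always returns `Some`.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}

/// Maximum under the NaN-greatest ordering; `None` for an empty slice.
fn max_nan_greatest(values: &[f64]) -> Option<f64> {
    values.iter().copied().max_by(|a, b| nan_greatest_cmp(*a, *b))
}

fn main() {
    // NaN with the sign bit set (illustrative bit pattern).
    let neg_nan = f64::from_bits(0xfff8_0000_0000_0000);

    // `total_cmp` places a negative NaN below everything, even negative infinity.
    assert_eq!(neg_nan.total_cmp(&f64::NEG_INFINITY), Ordering::Less);

    // The NaN-greatest rule places the same value above everything.
    assert_eq!(nan_greatest_cmp(neg_nan, f64::INFINITY), Ordering::Greater);

    let max = max_nan_greatest(&[1.0, neg_nan, f64::INFINITY]).unwrap();
    assert!(max.is_nan());
}
```

The two definitions disagree most visibly on NaNs with the sign bit set, which sort first under `total_cmp` and last under this rule.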

@tustvold
Contributor

I think following the postgres behavior of considering NaN to be greater than any other value

Yeah it is honestly baffling to me that they took so long to define a total ordering predicate, we now have a standard but few people follow it 😅

@tustvold tustvold merged commit d49cd21 into apache:master Jan 13, 2023
@ursabot

ursabot commented Jan 13, 2023

Benchmark runs are scheduled for baseline = 8688dba and contender = d49cd21. d49cd21 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
