Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregate functions Max/Min return different results based on order of NaN for floating point types. #21877

Closed
kgpai opened this issue Feb 7, 2024 · 13 comments

Comments

@kgpai
Copy link
Contributor

kgpai commented Feb 7, 2024

Aggregate functions Max/Min return different results based on when NaN is encountered in the input for floating point types. If Nan is the first value then irrespective of what the other values are the result is Nan. This seems wrong.

Expected Behavior

Max/Min shouldnt be sensitive to order of NaN and should return same result.

Current Behavior

presto:di> select max(x) from (values 4.0,nan(),null) T(x);;
 _col0
-------
   4.0
(1 row)

presto:di> select max(x) from (values nan(),4.0,null) T(x);
 _col0
-------
   NaN
(1 row)

Possible Solution

The bug is likely here (https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/aggregation/AbstractMinMaxAggregationFunction.java) where state is initially null and set to Nan and then subsequently all comparison against it fails.

@kgpai
Copy link
Contributor Author

kgpai commented Feb 7, 2024

cc: @spershin @amitkdutta

@mbasmanova
Copy link
Contributor

@hainenber
Copy link
Contributor

Hi, can I be assigned for this one? Thanks!

@tdcmeehan
Copy link
Contributor

tdcmeehan commented Feb 11, 2024

It seems that most systems treat NaN as higher any other floating point value1234, which is not consistent with IEEE 7545. We should revisit to see if the Presto behavior is too incongruent from larger industry expectations. But it seems like we can separate out the small inconsistency in ordering of the NaN value in our min/max aggregation functions for now.

Footnotes

  1. https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#max

  2. https://docs.snowflake.com/en/sql-reference/data-types-numeric#special-values

  3. https://spark.apache.org/docs/latest/api/sql/index.html#array_max

  4. https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-NUMERIC-DECIMAL

  5. https://en.wikipedia.org/wiki/NaN

@mbasmanova
Copy link
Contributor

It seems that most systems treat NaN as higher any other floating point value

@tdcmeehan Tim, thank you for the references. Small correction: it appears that BigQuery doesn't treat NaN as greater than any other value. Both MIN and MAX in BigQuery are documented as returning NaN if any input is NaN.

"If the argument is NaN for any row in the group, returns NaN."

@Yuhta
Copy link
Contributor

Yuhta commented Feb 14, 2024

If the argument is NaN for any row in the group, returns NaN.

I would say this makes most sense.

@kgpai
Copy link
Contributor Author

kgpai commented Feb 14, 2024

@Yuhta Only BigQuery does that though not the others (spark/presto etc..).

@spershin
Copy link
Contributor

Folks, anyhow we resolve this, we better make all the min/max of doubles consistent:
array_min/array_max currently both return NaN when found at least one NaN element.
So, min/max should to the same and if not, then array_min/array_max should be changed accordingly.
Makes sense to double check min_by/max_by too.

@steveburnett
Copy link
Contributor

However this is resolved, the docs should be checked if any update is needed to match the resolution of the issue. For example, http://prestodb.io/docs/current/functions/math.html.

@mbasmanova
Copy link
Contributor

We also noticed that min(x) != min(x, 1).

presto:di> select max(x), min(x) from unnest(array[1.0, nan(), infinity()]) as t(x);
  _col0   | _col1
----------+-------
 Infinity |   1.0
(1 row)

presto:di> select max(x, 1), min(x, 1) from unnest(array[1.0, nan(), infinity()]) as t(x);
 _col0 | _col1
-------+-------
 [NaN] | [1.0]
(1 row)

@mlyublena
Copy link
Contributor

It seems that most systems treat NaN as higher any other floating point value1234, which is not consistent with IEEE 7545. We should revisit to see if the Presto behavior is too incongruent from larger industry expectations. But it seems like we can separate out the small inconsistency in ordering of the NaN value in our min/max aggregation functions for now.

Footnotes

  1. https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#max
  2. https://docs.snowflake.com/en/sql-reference/data-types-numeric#special-values
  3. https://spark.apache.org/docs/latest/api/sql/index.html#array_max
  4. https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-NUMERIC-DECIMAL
  5. https://en.wikipedia.org/wiki/NaN

One comment to the above: BigQuery actually treats NaNs as special values and for both min and max returns NaN if any of the inputs is NaN: https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#min

@mlyublena
Copy link
Contributor

In addition to the various aggs and scalar functions listed above, can we make the behavior consistent across all functions + comparisons involving NaN? In particular how do we deal with comparisons and expressions of the sort:

x > nan()
order by <column_containing_nulls>

Note that for ORDER BY with nulls some systems allow specifying where NULLs get ordered with the NULLS FIRST/NULLS LAST construct.

@rschlussel
Copy link
Contributor

I've been looking into the NaN inconsistency and commented on this issue (since it had the more relevant title) with my update #21936 (comment). Would appreciate any feedback so we can move forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

10 participants