Improve correctness of stddev and variance with partial aggregation #23447

rschlussel · 2024-08-14T19:43:41Z

Description

When merging varianceStates for partial aggregation, if the current state has zero rows, use the values from the other state without doing computation. This prevents introducing error due to imprecision in floating point numbers.

Additionally, change the way we combine means so that no error is introduced by imprecise multiplication/division when the delta between the two means is 0. I think this should also reduce the error introduced by the mean computation in general, but I don't have a rigorous proof or even experimental data for this.
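For readers who want to see the shape of the merge, here is a minimal Python sketch of the two rules described above. It is not the code in this PR: `VarianceState`, `from_values`, `merge`, and `variance_samp` are hypothetical names, and the update formulas are the standard pairwise (count, mean, M2) combination, which I am assuming is close to what the Java implementation does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceState:
    """Partial aggregation state (hypothetical Python stand-in)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean


def from_values(values):
    """Build a state from raw rows with Welford's single-pass update."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return VarianceState(count, mean, m2)


def merge(state, other):
    """Combine two partial states into one."""
    # Rule 1: if either side has zero rows, reuse the other side untouched
    # instead of running arithmetic that could only add rounding error.
    if other.count == 0:
        return state
    if state.count == 0:
        return other

    new_count = state.count + other.count
    delta = other.mean - state.mean
    # Rule 2: move the mean by a fraction of delta. When delta is 0.0 the
    # added term is exactly 0.0, so the mean is returned unchanged.
    new_mean = state.mean + delta * other.count / new_count
    new_m2 = (state.m2 + other.m2
              + delta * delta * state.count * other.count / new_count)
    return VarianceState(new_count, new_mean, new_m2)


def variance_samp(state):
    """Sample variance, or None when fewer than two rows were seen."""
    if state.count < 2:
        return None
    return state.m2 / (state.count - 1)
```

In this sketch, merging two partial states built from the same constant leaves `mean` equal to that constant and `m2` equal to `0.0`, so the final variance is exactly zero regardless of how the rows were split.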

Motivation and Context

Queries with 0 variance on large values can return inconsistent and incorrect stddev due to error introduced by floating point arithmetic. For example, see the following result for stddev and variance computed over a constant:

presto> SELECT count(*), stddev(6523763181031200), variance(6523763181031200) from test_table;
  _col0   |       _col1        |      _col2       
----------+--------------------+------------------
 61782553 | 2.2677238191332383 | 5.14257131986424 
(1 row)

The same query with partial aggregation disabled returns correct results:


presto> set session prefer_partial_aggregation=false;
SET SESSION           
presto> SELECT count(*), stddev(6523763181031200), variance(6523763181031200) from test_table;
  _col0   | _col1 | _col2 
----------+-------+-------
 61782553 |   0.0 |   0.0 
(1 row)

For certain cases, this change reduces the error we introduce when merging variance states during partial aggregation, which improves the accuracy of our variance and stddev functions.
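To illustrate how a merge can disturb the mean of a constant column, the sketch below compares two ways of combining the means of two partial states. Both function names are hypothetical, the row counts of 45 are made up for the example, and treating the count-weighted average as the form used before this change is an assumption on my part. The counts are chosen because, for this constant, `45 * v` falls between two adjacent doubles and has to be rounded.

```python
def weighted_mean(n_a, mean_a, n_b, mean_b):
    """Count-weighted average: multiplies each mean by its count first."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)


def delta_mean(n_a, mean_a, n_b, mean_b):
    """Shift mean_a by a fraction of the difference between the means."""
    delta = mean_b - mean_a
    return mean_a + delta * n_b / (n_a + n_b)


v = 6523763181031200.0  # the constant from the query above
n_a = n_b = 45          # hypothetical partial-state row counts

# Difference between each combined mean and the true mean v.
print("weighted:", weighted_mean(n_a, v, n_b, v) - v)
print("delta:   ", delta_mean(n_a, v, n_b, v) - v)
```

The delta form returns `v` unchanged because the correction term is exactly zero. Once a combined mean has drifted away from `v`, the next merge sees a nonzero delta between states that should be identical, and that delta is squared and added to the running sum of squared deviations, which is how a constant input can end up with a nonzero variance.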

Impact

Ensures that when variance or stddev is zero, results are always correct, and reduces the error we introduce for other cases.

Test Plan

new unit tests
production verifier run (in progress)

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Improve stddev and variance functions to always return correct results when input is constant. :pr:`23447`

steveburnett · 2024-08-14T20:28:59Z

Nit: suggest a minor edit to release notes entry following the Order of Changes in the Release Notes Guidelines, based on the commit message. Please modify my suggestion if you think of a better wording!

== RELEASE NOTES ==

General Changes
* Improve stddev and variance functions to always return correct results when input is constant. :pr:`23447`

amitkdutta

Looks great. Thanks @rschlussel
Additionally, this will remove verification noise between the native and Java engines, since the native engine already computes statistical aggregates (e.g. stddev, variance) correctly over identical values.

zacw7

Thanks for the prompt fix!

rschlussel marked this pull request as ready for review August 14, 2024 20:43

rschlussel requested review from jaystarshot, feilong-liu, elharo, ClarenceThreepwood and a team as code owners August 14, 2024 20:43

rschlussel requested a review from presto-oss August 14, 2024 20:43

amitkdutta approved these changes Aug 14, 2024

zacw7 approved these changes Aug 14, 2024

elharo approved these changes Aug 14, 2024

feilong-liu approved these changes Aug 14, 2024

amitkdutta merged commit 4f74f52 into prestodb:master Aug 15, 2024
57 checks passed

tdcmeehan mentioned this pull request Aug 23, 2024

Add release notes for 0.289 #23513

Merged

34 tasks
