Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve correctness of stddev and variance with partial aggregation #23447

Merged
merged 1 commit into from
Aug 15, 2024

Conversation

rschlussel
Copy link
Contributor

@rschlussel rschlussel commented Aug 14, 2024

Description

When merging varianceStates for partial aggregation, if the current state has zero rows, use the values from the other state without doing computation. This prevents introducing error due to imprecision in floating point numbers.

Additionally, change the way we combine means. This ensures that we do not introduce error due to imprecision in multiplication/division when the delta is 0. I think it should in general improve the error introduced by the mean computation, but I don't have a rigorous proof or even experimental data for this.

Motivation and Context

Queries with 0 variance on large values can return inconsistent and incorrect stddev due to error introduced by floating point arithmetic. For example, see the following result for a stddev and variance computations over a constant.

presto> SELECT count(*), stddev(6523763181031200), variance(6523763181031200) from test_table;
  _col0   |       _col1        |      _col2       
----------+--------------------+------------------
 61782553 | 2.2677238191332383 | 5.14257131986424 
(1 row)

that same query with partial aggregation disabled returns correct results


presto> set session prefer_partial_aggregation=false;
SET SESSION           
presto> SELECT count(*), stddev(6523763181031200), variance(6523763181031200) from test_table;
  _col0   | _col1 | _col2 
----------+-------+-------
 61782553 |   0.0 |   0.0 
(1 row)

This change reduces the amount of error we introduce in merging variance states during partial aggregation for certain cases to improve the accuracy of our variance and stddev functions.

Impact

Ensures that when variance or stddev is zero, results are always correct, and reduces the error we introduce for other cases.

Test Plan

new unit tests
production verifier run (in progress)

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Improve stddev and variance functions to always return correct results when input is constant. :pr:`23447`

When merging varianceStates for partial aggregation, if the current
state has zero rows, use the values from the other state without doing
computation. This prevents introducing error due to imprecision in
floating point numbers.

Additionally, change the way we combine means. This ensures that we
do not introduce error due to imprecision in multiplication/division
when the delta is 0. I think it should in general improve the error
introduced by the mean computation, but I don't have a rigorous proof
or even experimental data for this.
@steveburnett
Copy link
Contributor

Nit: suggest a minor edit to release notes entry following the Order of Changes in the Release Notes Guidelines, based on the commit message. Please modify my suggestion if you think of a better wording!

== RELEASE NOTES ==

General Changes
* Improve stddev and variance functions to always return correct results when input is constant. :pr:`23447`

@rschlussel rschlussel marked this pull request as ready for review August 14, 2024 20:43
Copy link
Contributor

@amitkdutta amitkdutta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks @rschlussel
Additionally, this will remove verification noise between native and java engines, as native engine computes it properly today with identical values for statistical aggregates (e.g. stddev, variance)

Copy link
Member

@zacw7 zacw7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt fix!

@amitkdutta amitkdutta merged commit 4f74f52 into prestodb:master Aug 15, 2024
57 checks passed
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants