Improve correctness of stddev and variance with partial aggregation #23447
Description
When merging varianceStates for partial aggregation, if the current state has zero rows, copy the values from the other state directly instead of computing them. This avoids introducing error from floating point imprecision.
Additionally, change the way we combine means. This ensures that we do not introduce error due to imprecision in multiplication/division when the delta is 0. I think it should in general improve the error introduced by the mean computation, but I don't have a rigorous proof or even experimental data for this.
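As a rough illustration of the two changes described above, here is a minimal sketch of a variance-state merge. It is not the code from this PR: the `State` class, the `merge` method, and the field names are hypothetical, and the delta-based mean update (`mean += delta * otherCount / newCount`) is one common formulation that is assumed here because it leaves the mean untouched when the delta is 0.

```java
// Hypothetical sketch; class, method, and field names are illustrative only.
final class VarianceMergeSketch
{
    // Stand-in for a variance state: row count, running mean, and M2
    // (the sum of squared deviations from the mean).
    static final class State
    {
        long count;
        double mean;
        double m2;

        State(long count, double mean, double m2)
        {
            this.count = count;
            this.mean = mean;
            this.m2 = m2;
        }
    }

    // Merges `other` into `state` in place.
    static void merge(State state, State other)
    {
        if (other.count == 0) {
            return; // nothing to merge
        }
        if (state.count == 0) {
            // Current state is empty: copy the other state's values verbatim,
            // so no floating point arithmetic can perturb them.
            state.count = other.count;
            state.mean = other.mean;
            state.m2 = other.m2;
            return;
        }
        long newCount = state.count + other.count;
        double delta = other.mean - state.mean;
        // Combine the sums of squared deviations.
        state.m2 += other.m2 + delta * delta * state.count * other.count / (double) newCount;
        // Assumed mean update: shift the current mean by a fraction of the delta.
        // When delta == 0 the added term is exactly 0, so the mean is unchanged.
        state.mean += delta * other.count / (double) newCount;
        state.count = newCount;
    }

    public static void main(String[] args)
    {
        double constant = 1234567890123.456;
        State left = new State(5, constant, 0.0);
        State right = new State(7, constant, 0.0);
        merge(left, right);
        System.out.println("mean unchanged: " + (left.mean == constant));
        System.out.println("m2: " + left.m2);
    }
}
```

With two partial states built over the same constant, the delta is 0, so both the mean and M2 pass through the merge without any rounding.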
Motivation and Context
Queries with 0 variance on large values can return inconsistent and incorrect stddev due to error introduced by floating point arithmetic. For example, stddev and variance computations over a constant can return incorrect results, while the same query with partial aggregation disabled returns correct results.
This change reduces the error introduced when merging variance states during partial aggregation in certain cases, which improves the accuracy of our variance and stddev functions.
Impact
Ensures that when variance or stddev is zero, results are always correct, and reduces the error introduced in other cases.
Test Plan
new unit tests
production verifier run (in progress)
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.