Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert variance sample to udaf #10713

Merged
merged 12 commits into from
Jun 5, 2024

Conversation

yyin-dev
Copy link
Contributor

@yyin-dev yyin-dev commented May 29, 2024

Which issue does this PR close?

Closes #10667 .

Are these changes tested?

  • New proto roundtrip tests.
  • Moved tests into sqllogictests.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels May 29, 2024
@yyin-dev yyin-dev marked this pull request as draft May 29, 2024 12:10
@yyin-dev
Copy link
Contributor Author

yyin-dev commented May 29, 2024

@jayzhan211 I'm working on a change, but can you help me understand the semantics here:

# csv_query_distinct_variance
query R
SELECT var(distinct c2) FROM aggregate_test_100
----
2.5

statement error DataFusion error: This feature is not implemented: VAR\(DISTINCT\) aggregations are not available
SELECT var(c2), var(distinct c2) FROM aggregate_test_100

Why should the first query succeed but not the second one? Feel free to point me to any SQL / datafusion doc.

@jayzhan211
Copy link
Contributor

@jayzhan211 I'm working on a change, but can you help me understand the semantics here:

# csv_query_distinct_variance
query R
SELECT var(distinct c2) FROM aggregate_test_100
----
2.5

statement error DataFusion error: This feature is not implemented: VAR\(DISTINCT\) aggregations are not available
SELECT var(c2), var(distinct c2) FROM aggregate_test_100

Why should the first query succeed but not the second one? Feel free to point me to any SQL / datafusion doc.

I think it is because of optimize rule SingleDistinctToGroupBy, this rule convert distinct to group by, so the first query is no longer distinct, you can try adding explain to see the optimized logical plan.

@yyin-dev
Copy link
Contributor Author

@jayzhan211 I'm working on a change, but can you help me understand the semantics here:

# csv_query_distinct_variance
query R
SELECT var(distinct c2) FROM aggregate_test_100
----
2.5

statement error DataFusion error: This feature is not implemented: VAR\(DISTINCT\) aggregations are not available
SELECT var(c2), var(distinct c2) FROM aggregate_test_100

Why should the first query succeed but not the second one? Feel free to point me to any SQL / datafusion doc.

I think it is because of optimize rule SingleDistinctToGroupBy, this rule convert distinct to group by, so the first query is no longer distinct, you can try adding explain to see the optimized logical plan.

Ah that makes sense. Thanks!

@yyin-dev
Copy link
Contributor Author

yyin-dev commented Jun 1, 2024

@jayzhan211 I'm working on a change, but can you help me understand the semantics here:

# csv_query_distinct_variance
query R
SELECT var(distinct c2) FROM aggregate_test_100
----
2.5

statement error DataFusion error: This feature is not implemented: VAR\(DISTINCT\) aggregations are not available
SELECT var(c2), var(distinct c2) FROM aggregate_test_100

Why should the first query succeed but not the second one? Feel free to point me to any SQL / datafusion doc.

I think it is because of optimize rule SingleDistinctToGroupBy, this rule convert distinct to group by, so the first query is no longer distinct, you can try adding explain to see the optimized logical plan.

I'm thinking about the right way to implement error-raising. Before migration, the logic was implemented in physical-exp/src/aggregate/build_in.rs:create_aggregate_expr as a match statement.

After migration, the error should probably be raised in phyical-expr-common/src/aggregate/mod.rs:create_aggregate_expr. There are two options:

  1. Get the udaf's name and implement similar logic. This is simpler but less principled?

  2. Adds a fun support_distinct(&self) -> bool to the AggregateUDFImpl trait. This feels like a better solution.

What do you think?

@jayzhan211
Copy link
Contributor

@jayzhan211 I'm working on a change, but can you help me understand the semantics here:

# csv_query_distinct_variance
query R
SELECT var(distinct c2) FROM aggregate_test_100
----
2.5

statement error DataFusion error: This feature is not implemented: VAR\(DISTINCT\) aggregations are not available
SELECT var(c2), var(distinct c2) FROM aggregate_test_100

Why should the first query succeed but not the second one? Feel free to point me to any SQL / datafusion doc.

I think it is because of optimize rule SingleDistinctToGroupBy, this rule convert distinct to group by, so the first query is no longer distinct, you can try adding explain to see the optimized logical plan.

I'm thinking about the right way to implement error-raising. Before migration, the logic was implemented in physical-exp/src/aggregate/build_in.rs:create_aggregate_expr as a match statement.

After migration, the error should probably be raised in phyical-expr-common/src/aggregate/mod.rs:create_aggregate_expr. There are two options:

  1. Get the udaf's name and implement similar logic. This is simpler but less principled?
  2. Adds a fun support_distinct(&self) -> bool to the AggregateUDFImpl trait. This feels like a better solution.

What do you think?

We have is_distinct in AccumulatorArgs, so you can return errors if you found is_distinct is true

@yyin-dev yyin-dev marked this pull request as ready for review June 4, 2024 03:24
@yyin-dev yyin-dev marked this pull request as draft June 4, 2024 03:28
@yyin-dev yyin-dev force-pushed the convert-variance-sample-to-udaf branch from 4fbdb47 to d9a1562 Compare June 4, 2024 03:32
@yyin-dev yyin-dev force-pushed the convert-variance-sample-to-udaf branch from 4e043ac to a26170a Compare June 4, 2024 03:45
@yyin-dev yyin-dev marked this pull request as ready for review June 4, 2024 03:51
@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 4, 2024

@yyin-dev There are some error left to fix

You can try ./dev/rust_lint.sh, cargo test --test sqllogictests and cargo test --lib --tests --bins --features avro,json before running the CI

@yyin-dev
Copy link
Contributor Author

yyin-dev commented Jun 4, 2024

@yyin-dev There are some error left to fix

You can try ./dev/rust_lint.sh, cargo test --test sqllogictests and cargo test --lib --tests --bins --features avro,json before running the CI

Thanks! I just ran the 3 commands and fixed all problems. Can you trigger the CI again? I don't think I can trigger it myself.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yyin-dev and @jayzhan211 -- I think this looks really nice. I had one very minor comment about use but otherwise I think this is ready to go

@@ -651,6 +653,7 @@ async fn roundtrip_expr_api() -> Result<()> {
covar_pop(lit(1.5), lit(2.2)),
sum(lit(1)),
median(lit(2)),
var_sample(lit(2.2)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Comment on lines 34 to 35
use datafusion::functions_aggregate::covariance::{covar_pop, covar_samp};
use datafusion::functions_aggregate::expr_fn::first_value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please change this to use the exports from datafusion::functions_aggregate::expr_fn (which I think is the intended external facing API)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use exports from expr_fn.

@@ -149,7 +149,6 @@ impl From<protobuf::AggregateFunction> for AggregateFunction {
protobuf::AggregateFunction::Count => Self::Count,
protobuf::AggregateFunction::ApproxDistinct => Self::ApproxDistinct,
protobuf::AggregateFunction::ArrayAgg => Self::ArrayAgg,
protobuf::AggregateFunction::Variance => Self::Variance,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jayzhan211
Copy link
Contributor

Thanks @yyin-dev and @alamb

@jayzhan211 jayzhan211 merged commit 70744d5 into apache:main Jun 5, 2024
25 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Without migrating tests

* Should fail VAR(DISTINCT) but doesn't

* Pass all other tests.

* Return error for var(distinct)

* Migrate tests

* Fix tests

* Lint

* Fix tests

* Fix use
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert Variance Sample to UDAF
3 participants