Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add optimizer config param to avoid grouping partitions prefer_existing_union #10259

Merged
merged 3 commits into from
Apr 30, 2024

Conversation

NGA-TRAN
Copy link
Contributor

Which issue does this PR close?

Closes ##10257

Rationale for this change

What changes are included in this PR?

Add an option to avoid converting Union to Interleave

Are these changes tested?

Yes

Are there any user-facing changes?

No, because by the fault, nothing is changed

@github-actions github-actions bot added the core Core DataFusion crate label Apr 26, 2024
plan,
!first_enforce_distribution,
prefer_existing_sort,
prefer_existing_union
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observing the test union_to_interleave above it. If it covers all needed cases, these 2 tests will cover all needed cases for the not-convert, too. Let me know if it is not the case

@NGA-TRAN
Copy link
Contributor Author

@alamb Is this PR is good enough? We do not need to set prefer_existing_union in DF, right? It will be set in IOx?

plan = if plan.as_any().is::<UnionExec>() && can_interleave(children_plans.iter()) {

plan = if plan.as_any().is::<UnionExec>()
&& !config.optimizer.prefer_existing_union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that one of the motivations (influxdata#4) for this new flag is to preserve the sorting of the Union - would re-using the existing flag prefer_existing_sort make sense here?

One argument against that would be if you wanted prefer_existing_sort: false and prefer_existing_union: true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that one of the motivations (influxdata#4) for this new flag is to preserve the sorting of the Union - would re-using the existing flag prefer_existing_sort make sense here?

One argument against that would be if you wanted prefer_existing_sort: false and prefer_existing_union: true.

This is an excellent point @phillipleblanc . I just double checked and we actually have prefer_existing_sort set to true in IOx already (code ref, not public)

What do you think about using the existing flag @NGA-TRAN ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phillipleblanc and @alamb
@mustafasrepo 's comment #10259 (comment) suggest to use the current approach. Do I need to do anything more here?

Copy link
Contributor

@phillipleblanc phillipleblanc Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I think this approach looks good! Thanks!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NGA-TRAN -- this is looking like it is pretty close in my opinion

I think @phillipleblanc 's suggestions are quite good and we should consider them carefully

plan = if plan.as_any().is::<UnionExec>() && can_interleave(children_plans.iter()) {

plan = if plan.as_any().is::<UnionExec>()
&& !config.optimizer.prefer_existing_union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that one of the motivations (influxdata#4) for this new flag is to preserve the sorting of the Union - would re-using the existing flag prefer_existing_sort make sense here?

One argument against that would be if you wanted prefer_existing_sort: false and prefer_existing_union: true.

This is an excellent point @phillipleblanc . I just double checked and we actually have prefer_existing_sort set to true in IOx already (code ref, not public)

What do you think about using the existing flag @NGA-TRAN ?

};

($EXPECTED_LINES: expr, $PLAN: expr, $FIRST_ENFORCE_DIST: expr, $PREFER_EXISTING_SORT: expr) => {
assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, $PREFER_EXISTING_SORT, 10, false, 1024);
assert_optimized!($EXPECTED_LINES, $PLAN, $FIRST_ENFORCE_DIST, $PREFER_EXISTING_SORT, 10, false, 1024, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these macros are getting a bit hairy -- maybe we can clean them up (convert to using functions rather than macros) in a subsequent PR

@alamb
Copy link
Contributor

alamb commented Apr 27, 2024

cc @mustafasrepo and @ozankabak for your thoughts

@ozankabak
Copy link
Contributor

Thanks @NGA-TRAN and @alamb, we will take a look and circle back on Monday

@mustafasrepo
Copy link
Contributor

I think having dedicated config setting is more verbose and clear (as in prefer_existing_union). If we were to use prefer_existing_sort that might also work. However, if the condition to replace UnionExec to InterleaveExec is changed to

plan.as_any().is::<UnionExec>() 
&& !config.optimizer.prefer_existing_sort 
&& can_interleave(children_plans.iter())

this will prefer UnionExec instead of InterleaveExec even if inputs of the UnionExec is unordered when the config.optimizer.prefer_existing_sort flag is true. Which might be counter intuitive given there is no ordering to preserve. However, config.optimizer.prefer_existing_union does exactly what it says. Hence, it is a bit clearer to me. Hence, I think it is better to proceed with current approach in this PR.

In the future, if we add support for OrderPreservingInterleaveExec (this might be accomplished by replacing CombinedRecordBatchStream with streaming_merge in the fn execute method of the InterleaveExec.)
using the flag config.optimizer.prefer_existing_sort to decide between InterleaveExec and OrderPreservingInterleaveExec might solve the issue. This approach may invalidate the requirement for prefer_existing_union setting. However, until we have this support current approach is much more clear.

@alamb
Copy link
Contributor

alamb commented Apr 29, 2024

Hence, I think it is better to proceed with current approach in this PR. (as in prefer_existing_union)

Sounds good to me.

In the future, if we add support for OrderPreservingInterleaveExec (this might be accomplished by replacing CombinedRecordBatchStream with streaming_merge in the fn execute method of the InterleaveExec.)

This is an excellent idea, I will file a ticket to track the idea

@NGA-TRAN can you get the tests passing on this PR and we'll give it another review?

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 29, 2024
@NGA-TRAN
Copy link
Contributor Author

@alamb @phillipleblanc and @mustafasrepo
all the tests have passed

@alamb alamb changed the title feat: add a config param to avoid grouping partitions feat: add optimizer config param to avoid grouping partitions prefer_existing_union Apr 29, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me.

Thank you for the reviews @phillipleblanc @mustafasrepo and @ozankabak

I plan to file the follow on tickets (e.g. about sort preserving interleave) tomorrow

Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is LGTM! Thanks @NGA-TRAN

Copy link
Contributor

@phillipleblanc phillipleblanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

plan = if plan.as_any().is::<UnionExec>() && can_interleave(children_plans.iter()) {

plan = if plan.as_any().is::<UnionExec>()
&& !config.optimizer.prefer_existing_union
Copy link
Contributor

@phillipleblanc phillipleblanc Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I think this approach looks good! Thanks!

@alamb
Copy link
Contributor

alamb commented Apr 30, 2024

Since this PR is good to go, merging it in. I am also in the process of filing the follow on work

@alamb alamb merged commit 2231183 into apache:main Apr 30, 2024
25 checks passed
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Apr 30, 2024
…_existing_union` (apache#10259)

* feat: add a config param to avoid converting union to interleave

* chore: update config for the tests

* chore: update configs.md
@alamb
Copy link
Contributor

alamb commented Apr 30, 2024

Filed #10314 to track order preserving UnionExec

@alamb
Copy link
Contributor

alamb commented Apr 30, 2024

I also started collecting sorted related optimizations in #10313 as well

appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Apr 30, 2024
…_existing_union` (apache#10259)

* feat: add a config param to avoid converting union to interleave

* chore: update config for the tests

* chore: update configs.md
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request May 1, 2024
…_existing_union` (apache#10259)

* feat: add a config param to avoid converting union to interleave

* chore: update config for the tests

* chore: update configs.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants