[MINOR]:Do not introduce unnecessary repartition when row count is 1. #7832

mustafasrepo · 2023-10-16T10:24:36Z

Which issue does this PR close?

Closes #.

Rationale for this change

As observed in a discussion by @alamb, we currently add a RoundRobin repartition even when we know that the input row count is 1 (where repartitioning is not helpful). This PR fixes this problem.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

ozankabak

Thanks for the quick fix, did a review and it LGTM.

Dandandan · 2023-10-16T12:02:01Z

datafusion/core/src/physical_optimizer/enforce_distribution.rs

+            // Don't need to apply when the returned row count is not greater than 1:
+            let stats = child.statistics();
+            let repartition_beneficial_stats = if stats.is_exact {
+                stats.num_rows.map(|num_rows| num_rows > 1).unwrap_or(true)


Given that repartitioning is only useful when having multiple batches, we can consider changing this to:
num_rows > batch_size

Makes sense, I will change accordingly.

@Dandandan I updated the check as you suggested. Some of the existing tests change as a result. I think the changes are for the better; however, I would appreciate it if you could double-check them.
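For readers following the thread, the check discussed above can be sketched roughly as follows. This is a minimal illustration, not the code in `enforce_distribution.rs`: `SimpleStats` and `repartition_beneficial` are hypothetical names, and the `else` branch (treating inexact statistics as "repartition may help") is an assumption, since the diff shown above is truncated before that point.

```rust
/// Hypothetical, simplified stand-in for the statistics fields used in the diff
/// (`num_rows` and `is_exact`). Not the real DataFusion `Statistics` type.
#[derive(Debug, Clone, Copy)]
pub struct SimpleStats {
    pub num_rows: Option<usize>,
    pub is_exact: bool,
}

/// Returns `true` when adding a round-robin repartition might pay off.
///
/// Repartitioning only helps when there is more than one batch of data,
/// so an exactly known row count that fits in a single batch means "skip it".
pub fn repartition_beneficial(stats: &SimpleStats, batch_size: usize) -> bool {
    if stats.is_exact {
        // Exact row count known: beneficial only if it spans multiple batches.
        // If the count is missing, stay conservative and allow repartitioning.
        stats.num_rows.map(|num_rows| num_rows > batch_size).unwrap_or(true)
    } else {
        // Assumption: without exact statistics we cannot rule out a benefit.
        true
    }
}
```

With a batch size of 8192, for example, an input with exactly 1 row would not be repartitioned, while an input with exactly 10,000 rows, or with an unknown or inexact row count, still would be.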

alamb

Makes sense to me -- thank you @mustafasrepo and @ozankabak

Let's wait for @Dandandan to respond prior to merging though

alamb · 2023-10-16T15:52:49Z

datafusion/physical-plan/src/aggregates/mod.rs

+                        .map(|num_rows| num_rows <= 1)
+                        .unwrap_or(false));
+                Statistics {
+                    // the output row count is surely not larger than its input row count


I don't know if it matters, but the output rows could be larger than the input rows for COUNT(*) queries -- specifically if there are no input rows, COUNT(*) still produces an output row 🤔

❯ create table t(x int) as values (1);
0 rows in set. Query took 0.001 seconds.

❯ select count(*) from t where x > 1000;
+----------+
| COUNT(*) |
+----------+
| 0        |
+----------+

Yes, you are right, I missed that. I think the safest way is to check num_rows == 1. Changed accordingly.
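The reasoning behind the `num_rows == 1` check can be sketched as below. This is an illustration under stated assumptions rather than the code in `aggregates/mod.rs`: `SimpleStats` and `aggregate_output_stats` are hypothetical names, and the sketch only models the row count and its exactness flag.

```rust
/// Hypothetical, simplified stand-in for the statistics fields discussed here.
/// Not the real DataFusion `Statistics` type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SimpleStats {
    pub num_rows: Option<usize>,
    pub is_exact: bool,
}

/// Derives a row-count estimate for an aggregate's output from its input.
///
/// The input row count is reused as the estimate, but it is only marked exact
/// when the input has exactly one row:
/// - 1 input row always yields 1 output row (one group, or one global result).
/// - 0 input rows may yield 1 output row (e.g. `COUNT(*)` without grouping),
///   so the input count cannot be trusted as an exact output count.
/// - More than 1 input row may collapse into fewer groups.
pub fn aggregate_output_stats(input: &SimpleStats) -> SimpleStats {
    let is_exact = input.is_exact
        && input.num_rows.map(|num_rows| num_rows == 1).unwrap_or(false);
    SimpleStats {
        num_rows: input.num_rows,
        is_exact,
    }
}
```

The empty-input case is the one raised in the review: with `num_rows <= 1`, an exact input count of 0 would have been reported as an exact output count of 0, although `COUNT(*)` produces one row.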

Dandandan · 2023-10-17T08:18:54Z

Thanks @mustafasrepo and @ozankabak !

Initial commit

b98cfa2

mustafasrepo changed the title ~~Do not introduce unnecessary repartition when row count is 1.~~ [MINOR]:Do not introduce unnecessary repartition when row count is 1. Oct 16, 2023

mustafasrepo mentioned this pull request Oct 16, 2023

Refactor Statistics, introduce precision estimates (Exact, Inexact, Absent) #7793

Merged

Fix failing tests

cb0596c

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Oct 16, 2023

More idiomatic expressions

a82fa1c

ozankabak approved these changes Oct 16, 2023


Dandandan reviewed Oct 16, 2023


mustafasrepo added 2 commits October 16, 2023 16:25

Update tests, use batch size during partition benefit check

c6f188e

Fix failing tests

07027de

alamb approved these changes Oct 16, 2023


is_exact when row count is 1

233377f

Dandandan approved these changes Oct 17, 2023


Dandandan merged commit 9aacdee into apache:main Oct 17, 2023
22 checks passed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed
