-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MINOR]:Do not introduce unnecessary repartition when row count is 1. #7832
[MINOR]:Do not introduce unnecessary repartition when row count is 1. #7832
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix, did a review and it LGTM.
// Don't need to apply when the returned row count is not greater than 1: | ||
let stats = child.statistics(); | ||
let repartition_beneficial_stats = if stats.is_exact { | ||
stats.num_rows.map(|num_rows| num_rows > 1).unwrap_or(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that repartitioning is only useful when having multiple batches, we can consider changing this to:
num_rows > batch_size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I will change accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dandandan I updated check as you suggested. Some of the existing tests changes with this change. I think, changes are for the better. However, I would appreciate If you can double check them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me -- thank you @mustafasrepo and @ozankabak
Let's wait for @Dandandan to respond prior to merging though
.map(|num_rows| num_rows <= 1) | ||
.unwrap_or(false)); | ||
Statistics { | ||
// the output row count is surely not larger than its input row count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if it matters, but the output rows could be larger than the input rows for COUNT(*)
queries -- specifically if there are no input rows, COUNT(*) still produces an output row 🤔
❯ create table t(x int) as values (1);
0 rows in set. Query took 0.001 seconds.
❯ select count(*) from t where x > 1000;
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right. I missed that. I think the safest way is to check num_rows == 1
. Changed accordingly
Thanks @mustafasrepo and @ozankabak ! |
Which issue does this PR close?
Closes #.
Rationale for this change
As observed in discussion bu @alamb. Currently we add RoundRobin repartition when we know that input row number is 1(repartition is not helpful). This PR fixes this problem.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?