Combine low selectivity vectors generated by hash join filter #10987
@mbasmanova and @Yuhta, could you help take a look?
CI failure is caused by #10871, thanks.
I think the idea is great; leaving some initial comments.
Related issue #7801
Hello @Yuhta and @mbasmanova, could you take a look again? Thanks.
// For boolean type and if the offset is not multiple of 8, return a shifted
// copy; otherwise return a BufferView into the original buffer (with shared
// ownership of original buffer).
static BufferPtr sliceBuffer(
We can remove this
// Initialize 'leftSemiProjectIsNull_' for null aware left semi join.
if (isLeftSemiProjectJoin(joinType_) && nullAware_) {
leftSemiProjectIsNull_.clearAll();
Can we call leftSemiProjectIsNull_.resize(outputTableRowsCapacity_)
here so we don't need to do it in each loop iteration? SelectivityVector::resize
requires iterating over all the bits in it.
Updated, thanks.
CI failure is not related to this PR, thanks.
Combine low selectivity vectors generated by join filter
Problem
We found in TPC-DS query 72 that the join filter produces a large number of
low-selectivity result vectors, which hurts the performance of subsequent
operations.
Details:
The number of input vectors on the probe side of the corresponding join is
84,087, with a total of 743,851,486 rows (our batch size is set to 10,240). Due
to a large number of duplicate rows on the build side, the final result
inflates. The number of output vectors from the join is 2,806,054. The
corresponding join filter filters out a large portion of the results, so the
number of output rows is 1,430,253,235. This leads to the output of many
sparse vectors (the average batch row count is 504).
Solution
The original logic continues the loop to fill more rows if no rows pass the
filters. To resolve this issue, we can extend it to handle cases where only
partial rows pass the filters. We need to ensure that the indices
'outputRowMapping_' and 'outputTableRows_' are filled as much as possible until
we either reach the preferred batch size or have processed all rows in the
current input vector.
This approach will not only address the issue mentioned above but also avoid
unnecessary data copying.
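The accumulation strategy described above can be sketched roughly as follows. This is a simplified, standalone illustration and not the Velox implementation: `collectPassingRows`, its parameters, and the use of plain `std::vector<int>` batches are all hypothetical stand-ins for the real probe loop and for 'outputRowMapping_' / 'outputTableRows_'. Each loop iteration requests only as many candidate rows as still fit in the pending batch, and the batch is emitted only once it is full or the input is exhausted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch: groups rows that pass 'filter' into dense batches.
// Rather than emitting whatever survives each chunk of candidates (which
// yields sparse batches when the filter is selective), survivors are kept in
// 'pending' until 'preferredBatchSize' rows are collected or input runs out.
inline std::vector<std::vector<int>> collectPassingRows(
    const std::vector<int>& candidates,
    std::size_t preferredBatchSize,
    const std::function<bool(int)>& filter) {
  assert(preferredBatchSize > 0);
  std::vector<std::vector<int>> batches;
  std::vector<int> pending;
  std::size_t pos = 0;
  while (pos < candidates.size()) {
    // Take only as many candidates as the pending batch can still hold, so
    // the batch never exceeds the preferred size even if all of them pass.
    const std::size_t chunk = std::min(
        preferredBatchSize - pending.size(), candidates.size() - pos);
    for (std::size_t i = 0; i < chunk; ++i) {
      if (filter(candidates[pos + i])) {
        pending.push_back(candidates[pos + i]);
      }
    }
    pos += chunk;
    // Emit only when the batch is full; otherwise keep filling.
    if (pending.size() == preferredBatchSize) {
      batches.push_back(std::move(pending));
      pending.clear(); // Reset the moved-from vector to a known empty state.
    }
  }
  // Flush the final, possibly partial, batch.
  if (!pending.empty()) {
    batches.push_back(std::move(pending));
  }
  return batches;
}
```

For example, with 100 candidate rows, a filter that keeps every tenth row, and a preferred batch size of 4, this sketch returns three batches holding 4, 4, and 2 rows, whereas emitting after every chunk of 4 candidates would produce many batches with at most one row each.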