-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Iceberg read when positional delete files are unaligned #10261
Conversation
✅ Deploy Preview for meta-velox canceled.
|
40bf1c3
to
f965154
Compare
|
||
return rowsScanned; | ||
} | ||
|
||
void IcebergSplitReader::shiftDeleteBitmap() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use bits::copyBits
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes there is definitely room to improve copyBits
. The overall timing of that function in the query is not large though.
} | ||
|
||
mutation.deletedRows = | ||
deleteBitmap_ ? deleteBitmap_->as<uint64_t>() : nullptr; | ||
|
||
auto rowsScanned = baseRowReader_->next(size, output, &mutation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a function to get actual scanning size in row reader called nextReadSize
. There is also nextRowNumber
to help you track the number of rows scanned on base file. You can consider using these methods to make reading delta positions a little easier and fewer states to maintain (maybe in subsequent diffs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Thanks Jimmy! I checked these functions and yes we can get the actual rows scanned before calling base RowReader's next(). So I'll send a separate PR to simplify the logic using these functions.
For now I updated the PR with your another comment addressed. Would appreciate your review again.
f965154
to
b23f34c
Compare
This commit makes it possible to create base data files and delete fiels with un-aligned Rowgroup boundaries. It also added several new test cases and improved some variable and function namings.
b23f34c
to
f848d6e
Compare
@Yuhta Would you please review again? Thank you! |
@Yuhta Thanks for reviewing and approving this PR. Do we know when it will be merged? |
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some lint errors
When the base data file and positional delete files contains multiple unaligned RowGroups, some of the bits at the end of IcebergSplitReader::deleteBitmap_ could be mistakenly skipped, causing wrong result. This commit fixes it by introducing an offset into this deleteBitmap_ and shift the unused bits to the beginning for each batch.
f848d6e
to
e4df770
Compare
@Yuhta Thanks for importing the PR and let me know the failures! I have updated the PR with your comments addressed. |
@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
This pull request has been reverted by e8a73ae. |
When the base data file and positional delete files contains multiple
unaligned RowGroups, some of the bits at the end of
IcebergSplitReader::deleteBitmap_ could be mistakenly skipped, causing
wrong result. This commit fixes it by introducing an offset into this
deleteBitmap_ and shift the unused bits to the beginning for each batch.
Fixes #9856