PARQUET-2323: [C++] Use bitmap to store pre-buffered column chunks #36649
Thanks for the change! May I know more about the context? It seems that all your changes are related to minimizing the memory footprint.
Sure, the motivation is that we have a use case with a memory limit on IO. Then we found that currently the only option is to set the …
Would …
Thanks @mapleFU! Changed to AllocateBitmap.
Rest LGTM
LGTM, thanks!
cpp/src/parquet/file_reader.cc
int num_cols = file_metadata_->num_columns();
PARQUET_THROW_NOT_OK(
    AllocateBitmap(num_cols, properties_.memory_pool()).Value(&col_bitmap));
memset(col_bitmap->mutable_data(), 0, col_bitmap->size());
nit: Would you mind using AllocateEmptyBitmap instead?
sure, that's better, thanks!
replace mutable_data() with data() when checking bitmap in Buffer
Co-authored-by: Gang Wu <ustcwg@gmail.com>
CI failure looks unrelated, will merge.
…pache#36649) ### Rationale for this change In https://issues.apache.org/jira/browse/PARQUET-2316 we allowed partial buffering in the Parquet file reader by storing the prebuffered column chunk indices in a hash set, and making a copy of this hash set for each row group reader. In extreme conditions where numerous columns are prebuffered and multiple row group readers are created for the same row group, the hash set would incur significant overhead. Using a bitmap instead (with one bit per column chunk indicating whether it is prebuffered) would be a reasonable mitigation, taking 4KB for 32K columns. ### What changes are included in this PR? Switch from a hash set to a bitmap buffer. ### Are these changes tested? Yes, passed unit tests on partial prebuffer. ### Are there any user-facing changes? No. Lead-authored-by: jp0317 <zjpzlz@gmail.com> Co-authored-by: Jinpeng <zjpzlz@163.com> Co-authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…ltiple times (#36774) ### Rationale for this change Following #36192 and #36649, RowGroupReader uses a bitmap to control column-level prebuffering. However, if all columns are selected, building the bitmap multiple times incurs heavy overhead. ### What changes are included in this PR? Build the `Prebuffer` bitmap once, and reuse that vector. ### Are these changes tested? No. ### Are there any user-facing changes? No. * Closes: #36773 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7ad3003. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
In https://issues.apache.org/jira/browse/PARQUET-2316 we allowed partial buffering in the Parquet file reader by storing the prebuffered column chunk indices in a hash set, and making a copy of this hash set for each row group reader.
In extreme conditions where numerous columns are prebuffered and multiple row group readers are created for the same row group, the hash set would incur significant overhead.
Using a bitmap instead (with one bit per column chunk indicating whether it is prebuffered) would be a reasonable mitigation, taking 4KB for 32K columns.
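To make the sizing argument concrete, here is a minimal standard-library sketch of a one-bit-per-column bitmap. It is not the code from this PR: the class `PrebufferBitmap` and its method names are hypothetical, and it uses a `std::vector<uint8_t>` in place of an Arrow buffer. The least-significant-bit-first layout is an assumption chosen to mirror Arrow's bitmap convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the reader's pre-buffer bookkeeping: one bit per
// column chunk, zero-initialized so that no column starts out marked.
class PrebufferBitmap {
 public:
  explicit PrebufferBitmap(std::size_t num_columns)
      : num_columns_(num_columns), bytes_((num_columns + 7) / 8, 0) {}

  // Set the bit for `column` (least-significant bit first within each byte).
  void MarkPrebuffered(std::size_t column) {
    assert(column < num_columns_);
    bytes_[column / 8] |= static_cast<std::uint8_t>(1u << (column % 8));
  }

  // Return true if the bit for `column` is set.
  bool IsPrebuffered(std::size_t column) const {
    assert(column < num_columns_);
    return ((bytes_[column / 8] >> (column % 8)) & 1u) != 0;
  }

  // Storage used by the bitmap itself: ceil(num_columns / 8) bytes.
  std::size_t SizeInBytes() const { return bytes_.size(); }

 private:
  std::size_t num_columns_;
  std::vector<std::uint8_t> bytes_;
};
```

With `PrebufferBitmap bm(32 * 1024)`, `bm.SizeInBytes()` is 4096, which matches the 4KB figure quoted above. A hash set of column indices, by contrast, stores at least one integer per prebuffered column plus per-node and bucket overhead, and that cost is paid again for every copy handed to a row group reader.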
What changes are included in this PR?
Switch from a hash set to a bitmap buffer.
Are these changes tested?
Yes, passed unit tests on partial prebuffer.
Are there any user-facing changes?
No.