PARQUET-2323: [C++] Use bitmap to store pre-buffered column chunks #36649

jp0317 · 2023-07-12T18:18:39Z

Rationale for this change

In https://issues.apache.org/jira/browse/PARQUET-2316 we enabled partial prebuffering in the Parquet file reader by storing the indices of prebuffered column chunks in a hash set and copying that hash set into each row group reader.

In extreme cases, where many columns are prebuffered and multiple row group readers are created for the same row group, the hash set copies incur significant overhead.

Using a bitmap instead (one bit per column chunk, indicating whether it is prebuffered) is a reasonable mitigation: it takes 4KB for 32K columns.
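To make the 4KB figure concrete, here is a minimal standard-library sketch of a one-bit-per-column-chunk bitmap. The class name `PrebufferedColumns` and its methods are hypothetical and exist only for illustration; this is not the type used in file_reader.cc, which relies on an Arrow buffer instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only; not the type used in file_reader.cc.
class PrebufferedColumns {
 public:
  // One bit per column chunk, rounded up to whole bytes, all bits cleared.
  explicit PrebufferedColumns(std::size_t num_columns)
      : num_columns_(num_columns), bits_((num_columns + 7) / 8, 0) {}

  // Mark column chunk `col` as prebuffered.
  void Mark(std::size_t col) {
    assert(col < num_columns_);
    bits_[col / 8] |= static_cast<std::uint8_t>(1u << (col % 8));
  }

  // Return true if column chunk `col` was marked as prebuffered.
  bool IsPrebuffered(std::size_t col) const {
    assert(col < num_columns_);
    return ((bits_[col / 8] >> (col % 8)) & 1u) != 0;
  }

  // Bytes used by the bit storage itself (excludes vector bookkeeping).
  std::size_t SizeInBytes() const { return bits_.size(); }

 private:
  std::size_t num_columns_;
  std::vector<std::uint8_t> bits_;
};
```

With `PrebufferedColumns cols(32 * 1024)`, `SizeInBytes()` is 32768 / 8 = 4096 bytes, and that size does not depend on how many columns are marked. A hash set holding the same indices as `int` needs at least 4 bytes per entry before any node or bucket overhead, so at least 128KB when all 32K columns are prebuffered.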

What changes are included in this PR?

Switch from a hash set to a bitmap buffer.

Are these changes tested?

Yes, the unit tests on partial prebuffering pass.

Are there any user-facing changes?

No.

github-actions · 2023-07-12T18:19:08Z

https://issues.apache.org/jira/browse/PARQUET-2323

github-actions · 2023-07-12T18:19:10Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

wgtmac · 2023-07-13T03:19:21Z

Thanks for the change! May I know more about the context? It seems that all your changes are related to minimizing the memory footprint.

jp0317 · 2023-07-13T16:19:19Z

Sure. The motivation is a use case with a memory limit on I/O. We found that the only option today is to set the ReadProperties when creating the file reader, but then all column chunks, regardless of their sizes, are read in the same way as specified in the properties.

mapleFU · 2023-07-13T16:24:22Z

Would AllocateBitmap be better? Its memory would then be tracked by the MemoryPool.

jp0317 · 2023-07-13T18:42:44Z

Thanks @mapleFU! Changed to AllocateBitmap.

cpp/src/parquet/arrow/arrow_reader_writer_test.cc

cpp/src/parquet/file_reader.cc

mapleFU

Rest LGTM

cpp/src/parquet/file_reader.cc

wgtmac

LGTM, thanks!

mapleFU · 2023-07-19T04:16:00Z

cpp/src/parquet/file_reader.cc

+      int num_cols = file_metadata_->num_columns();
+      PARQUET_THROW_NOT_OK(
+          AllocateBitmap(num_cols, properties_.memory_pool()).Value(&col_bitmap));
+      memset(col_bitmap->mutable_data(), 0, col_bitmap->size());


nit: Would you mind using AllocateEmptyBitmap instead?

sure, that's better, thanks!
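As a rough analogy for the nit above, the sketch below contrasts a two-step "allocate, then clear" pattern with a single allocation that is already zeroed. It uses only the C++ standard library; the function names `AllocateThenClear` and `AllocateCleared` are hypothetical and do not exist in Arrow. They stand in for the AllocateBitmap plus memset pair in the diff and for AllocateEmptyBitmap, respectively.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Two steps: allocate uninitialized storage, then clear it explicitly.
// Forgetting the memset would leave the bits with indeterminate values.
std::unique_ptr<std::uint8_t[]> AllocateThenClear(std::size_t num_bits) {
  const std::size_t num_bytes = (num_bits + 7) / 8;
  std::unique_ptr<std::uint8_t[]> buf(new std::uint8_t[num_bytes]);
  std::memset(buf.get(), 0, num_bytes);
  return buf;
}

// One step: make_unique value-initializes the array, so every byte is zero.
std::unique_ptr<std::uint8_t[]> AllocateCleared(std::size_t num_bits) {
  const std::size_t num_bytes = (num_bits + 7) / 8;
  return std::make_unique<std::uint8_t[]>(num_bytes);
}
```

Both functions return the same all-zero bitmap; the one-step form simply removes the chance of forgetting the clear or passing it the wrong size.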

replace mutable_data() with data() when checking bitmap in Buffer Co-authored-by: Gang Wu <ustcwg@gmail.com>

pitrou · 2023-07-19T08:29:11Z

CI failure looks unrelated, will merge.

…pache#36649)

### Rationale for this change

In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffer in parquet File Reader by storing prebuffered column chunk index in a hash set, and make a copy of this hash set for each rowgroup reader.

In extreme conditions where numerous columns are prebuffered and multiple rowgroup readers are created for the same row group, the hash set would incur significant overhead.

Using a bitmap instead (with one bit per column chunk indicating whether it's prebuffered or not) would be a reasonable mitigation, taking 4KB for 32K columns.

### What changes are included in this PR?

Switch from a hash set to a bitmap buffer.

### Are these changes tested?

Yes, passed unit tests on partial prebuffer.

### Are there any user-facing changes?

No.

Lead-authored-by: jp0317 <zjpzlz@gmail.com>
Co-authored-by: Jinpeng <zjpzlz@163.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

…ltiple times (#36774)

### Rationale for this change

According to #36192 and #36649, RowGroupReader uses a bitmap to control column-level prebuffering. However, if all columns are selected, building the bitmap multiple times is a heavy overhead.

### What changes are included in this PR?

Build `Prebuffer` Bitmap once, and reuse that vector.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: #36773

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
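The "build once, reuse" idea in the follow-up commit can be sketched with the standard library as below. This is an assumption about the general shape of the approach, not the code merged in #36774: `SharedBitmap`, `BuildColumnBitmap`, `RowGroupPrebufferState`, and `PrebufferStates` are all hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical alias: an immutable bitmap that several owners can share.
using SharedBitmap = std::shared_ptr<const std::vector<std::uint8_t>>;

// Build the per-column bitmap a single time from the selected column indices.
SharedBitmap BuildColumnBitmap(std::size_t num_columns,
                               const std::vector<std::size_t>& column_indices) {
  auto bits = std::make_shared<std::vector<std::uint8_t>>((num_columns + 7) / 8, 0);
  for (std::size_t col : column_indices) {
    if (col >= num_columns) continue;  // this sketch ignores out-of-range indices
    (*bits)[col / 8] |= static_cast<std::uint8_t>(1u << (col % 8));
  }
  return bits;
}

// Hypothetical per-row-group record that points at the shared bitmap.
struct RowGroupPrebufferState {
  int row_group;
  SharedBitmap prebuffered_columns;
};

// Every row group receives a pointer to the same bitmap instead of a rebuilt copy.
std::vector<RowGroupPrebufferState> PrebufferStates(
    const std::vector<int>& row_groups, std::size_t num_columns,
    const std::vector<std::size_t>& column_indices) {
  const SharedBitmap shared = BuildColumnBitmap(num_columns, column_indices);
  std::vector<RowGroupPrebufferState> states;
  states.reserve(row_groups.size());
  for (int rg : row_groups) {
    states.push_back({rg, shared});
  }
  return states;
}
```

In this sketch the cost of building the bitmap is paid once per prebuffer call rather than once per row group, and each row group only stores a reference-counted pointer.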

conbench-apache-arrow · 2023-07-27T17:17:54Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7ad3003.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

jp0317 requested a review from wgtmac as a code owner July 12, 2023 18:18

github-actions bot added the Component: Parquet, Component: C++, and awaiting review labels Jul 12, 2023

wgtmac reviewed Jul 16, 2023

cpp/src/parquet/arrow/arrow_reader_writer_test.cc (resolved)

cpp/src/parquet/file_reader.cc (outdated, resolved)

cpp/src/parquet/file_reader.cc (outdated, resolved)

cpp/src/parquet/file_reader.cc (outdated, resolved)

github-actions bot added the awaiting committer review label and removed the awaiting review label Jul 16, 2023

jp0317 requested a review from wgtmac July 16, 2023 19:30

mapleFU reviewed Jul 17, 2023

cpp/src/parquet/file_reader.cc (outdated, resolved)

jp0317 force-pushed the bitvector branch from 5933fec to 1651f40 Compare July 17, 2023 15:56

wgtmac approved these changes Jul 19, 2023

cpp/src/parquet/file_reader.cc (outdated, resolved)

wgtmac changed the title ~~PARQUET-2323: [C++] use bit vector to store prebuffered column chunks~~ PARQUET-2323: [C++] Use bitmap to store pre-buffered column chunks Jul 19, 2023

jp0317 added 3 commits July 19, 2023 03:35

use bit vector to store prebuffered column chunks

a576829

use AllocateBitmap

fd1ae97

use memset to clear bit map and reword a few comments

04adeb2

jp0317 force-pushed the bitvector branch from 97c63fa to 0320a12 Compare July 19, 2023 03:35

wgtmac approved these changes Jul 19, 2023

mapleFU approved these changes Jul 19, 2023

mapleFU reviewed Jul 19, 2023

Update cpp/src/parquet/file_reader.cc

8c1482b

replace mutable_data() with data() when checking bitmap in Buffer Co-authored-by: Gang Wu <ustcwg@gmail.com>

jp0317 force-pushed the bitvector branch from 0320a12 to 8c1482b Compare July 19, 2023 05:17

pitrou merged commit 7ad3003 into apache:main Jul 19, 2023
30 of 32 checks passed

pitrou removed the awaiting committer review label Jul 19, 2023

jp0317 deleted the bitvector branch July 19, 2023 15:39

This was referenced Jul 19, 2023

[C++][Parquet] Prebuffer: Avoid calculating prebuffer column bitmap multiple times #36773

Closed

GH-36773: [C++][Parquet] Avoid calculating prebuffer column bitmap multiple times #36774

Merged

asfimport mentioned this pull request Jun 23, 2024

[C++][Parquet] Use bit vector to store Prebuffered column chunk index #43007

Closed
