Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2323: [C++] Use bitmap to store pre-buffered column chunks #36649

Merged
merged 4 commits into from
Jul 19, 2023

Conversation

jp0317
Copy link
Contributor

@jp0317 jp0317 commented Jul 12, 2023

Rationale for this change

In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffer in parquet File Reader by storing prebuffered column chunk index in a hash set, and make a copy of this hash set for each rowgroup reader.

In extreme conditions where numerous columns are prebuffered and multiple rowgroup readers are created for the same row group , the hash set would incur significant overhead. 

Using a bitmap instead (with one bit per column chunk indicating whether it's prebuffered or not) would be a reasonsable mitigation, taking 4KB for 32K columns.

What changes are included in this PR?

Switch from a hash set to a bitmap buffer.

Are these changes tested?

Yes, passed unit tests on partial prebuffer.

Are there any user-facing changes?

No.

@jp0317 jp0317 requested a review from wgtmac as a code owner July 12, 2023 18:18
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wgtmac
Copy link
Member

wgtmac commented Jul 13, 2023

Thanks for the change! May I know more about the context? It seems that all your changes are related to minimizing the memory footprint.

@jp0317
Copy link
Contributor Author

jp0317 commented Jul 13, 2023

sure, the motivation is that we have a use case with memory limit on IO. Then we found that currently the only option is to set the ReadProperties when creating the file reader, but with that all column chunks, disregarding their sizes, will be read in the same way as specified in the properties.

@mapleFU
Copy link
Member

mapleFU commented Jul 13, 2023

Would AllocateBitmap be better? Since memory of it would be tracked in MemoryPool.

@jp0317
Copy link
Contributor Author

jp0317 commented Jul 13, 2023

Thanks @mapleFU! Changed to AllocateBitmap.

cpp/src/parquet/file_reader.cc Outdated Show resolved Hide resolved
cpp/src/parquet/file_reader.cc Outdated Show resolved Hide resolved
cpp/src/parquet/file_reader.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jul 16, 2023
@jp0317 jp0317 requested a review from wgtmac July 16, 2023 19:30
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

cpp/src/parquet/file_reader.cc Outdated Show resolved Hide resolved
cpp/src/parquet/file_reader.cc Outdated Show resolved Hide resolved
@wgtmac wgtmac changed the title PARQUET-2323: [C++] use bit vector to store prebuffered column chunks PARQUET-2323: [C++] Use bitmap to store pre-buffered column chunks Jul 19, 2023
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

int num_cols = file_metadata_->num_columns();
PARQUET_THROW_NOT_OK(
AllocateBitmap(num_cols, properties_.memory_pool()).Value(&col_bitmap));
memset(col_bitmap->mutable_data(), 0, col_bitmap->size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would you mind use AllocateEmptyBitmap instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, that's better, thanks!

replace mutable_data() with data() when checking bitmap in Buffer

Co-authored-by: Gang Wu <ustcwg@gmail.com>
@pitrou
Copy link
Member

pitrou commented Jul 19, 2023

CI failure looks unrelated, will merge.

@pitrou pitrou merged commit 7ad3003 into apache:main Jul 19, 2023
30 of 32 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jul 19, 2023
@jp0317 jp0317 deleted the bitvector branch July 19, 2023 15:39
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this pull request Jul 20, 2023
…pache#36649)

### Rationale for this change

In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffer in parquet File Reader by storing prebuffered column chunk index in a hash set, and make a copy of this hash set for each rowgroup reader.

In extreme conditions where numerous columns are prebuffered and multiple rowgroup readers are created for the same row group , the hash set would incur significant overhead. 

Using a bitmap instead (with one bit per column chunk indicating whether it's prebuffered or not) would be a reasonsable mitigation, taking 4KB for 32K columns.

### What changes are included in this PR?

Switch from a hash set to a bitmap buffer.

### Are these changes tested?

Yes, passed unit tests on partial prebuffer.

### Are there any user-facing changes?

No.

Lead-authored-by: jp0317 <zjpzlz@gmail.com>
Co-authored-by: Jinpeng <zjpzlz@163.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou pushed a commit that referenced this pull request Jul 24, 2023
…ltiple times (#36774)

### Rationale for this change

According to #36192 and #36649 . RowGroupReader using a bitmap to control a column-level prebuffer.

However, if all columns are selected, this will be a heavy overhead for building a bitmap multiple times.

### What changes are included in this PR?

Build `Prebuffer` Bitmap once, and reuse that vector.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: #36773

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7ad3003.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

raulcd pushed a commit that referenced this pull request Jul 28, 2023
…36649)

### Rationale for this change

In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffer in parquet File Reader by storing prebuffered column chunk index in a hash set, and make a copy of this hash set for each rowgroup reader.

In extreme conditions where numerous columns are prebuffered and multiple rowgroup readers are created for the same row group , the hash set would incur significant overhead. 

Using a bitmap instead (with one bit per column chunk indicating whether it's prebuffered or not) would be a reasonsable mitigation, taking 4KB for 32K columns.

### What changes are included in this PR?

Switch from a hash set to a bitmap buffer.

### Are these changes tested?

Yes, passed unit tests on partial prebuffer.

### Are there any user-facing changes?

No.

Lead-authored-by: jp0317 <zjpzlz@gmail.com>
Co-authored-by: Jinpeng <zjpzlz@163.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
…pache#36649)

### Rationale for this change

In https://issues.apache.org/jira/browse/PARQUET-2316 we allow partial buffer in parquet File Reader by storing prebuffered column chunk index in a hash set, and make a copy of this hash set for each rowgroup reader.

In extreme conditions where numerous columns are prebuffered and multiple rowgroup readers are created for the same row group , the hash set would incur significant overhead. 

Using a bitmap instead (with one bit per column chunk indicating whether it's prebuffered or not) would be a reasonsable mitigation, taking 4KB for 32K columns.

### What changes are included in this PR?

Switch from a hash set to a bitmap buffer.

### Are these changes tested?

Yes, passed unit tests on partial prebuffer.

### Are there any user-facing changes?

No.

Lead-authored-by: jp0317 <zjpzlz@gmail.com>
Co-authored-by: Jinpeng <zjpzlz@163.com>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
…map multiple times (apache#36774)

### Rationale for this change

According to apache#36192 and apache#36649 . RowGroupReader using a bitmap to control a column-level prebuffer.

However, if all columns are selected, this will be a heavy overhead for building a bitmap multiple times.

### What changes are included in this PR?

Build `Prebuffer` Bitmap once, and reuse that vector.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: apache#36773

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…map multiple times (apache#36774)

### Rationale for this change

According to apache#36192 and apache#36649 . RowGroupReader using a bitmap to control a column-level prebuffer.

However, if all columns are selected, this will be a heavy overhead for building a bitmap multiple times.

### What changes are included in this PR?

Build `Prebuffer` Bitmap once, and reuse that vector.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: apache#36773

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants