[C++][Parquet] Use bit vector to store Prebuffered column chunk index #43007

asfimport · 2023-07-12T18:06:13Z

In https://issues.apache.org/jira/browse/PARQUET-2316 we allowed partial buffering in the Parquet file reader by storing the prebuffered column chunk indices in a hash set, and making a copy of this hash set for each row group reader.

In extreme conditions where numerous columns are prebuffered and multiple row group readers are created for the same row group, the hash set would incur significant overhead.

Using a bit vector (one bit per column) would be a reasonable mitigation, taking 4KB for 32K columns.
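To make the size estimate concrete, here is a minimal sketch of the idea, not the code from the linked pull request. `ColumnBitVector` and its methods are hypothetical names used only for illustration; it stores one bit per column index, so 32K columns need 32768 / 8 = 4096 bytes, and copying it for each row group reader is a flat byte copy rather than a rebuild of a hash set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only; not part of the Arrow/Parquet API.
// Tracks which column chunk indices of a row group have been prebuffered,
// using one bit per column.
class ColumnBitVector {
 public:
  explicit ColumnBitVector(std::size_t num_columns)
      : num_columns_(num_columns), bytes_((num_columns + 7) / 8, 0) {}

  // Mark a column chunk index as prebuffered.
  void Set(std::size_t column) {
    assert(column < num_columns_);
    bytes_[column / 8] |= static_cast<std::uint8_t>(1u << (column % 8));
  }

  // Return true if the column chunk index was marked as prebuffered.
  bool IsSet(std::size_t column) const {
    assert(column < num_columns_);
    return ((bytes_[column / 8] >> (column % 8)) & 1u) != 0;
  }

  // Storage used by the bit vector, excluding the object header itself.
  std::size_t size_in_bytes() const { return bytes_.size(); }

 private:
  std::size_t num_columns_;
  std::vector<std::uint8_t> bytes_;
};
```

The memory use depends only on the number of columns in the schema, not on how many of them are prebuffered, which is the trade-off against a hash set: slightly worse when very few columns are prebuffered, much better when many are.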

Reporter: Jinpeng Zhou / @jp0317
Assignee: Jinpeng Zhou / @jp0317

PRs and other links:

GitHub Pull Request #36649

Note: This issue was originally created as PARQUET-2323. Please see the migration documentation for further details.

asfimport · 2023-07-19T08:29:33Z

Antoine Pitrou / @pitrou:
Issue resolved by pull request #36649

asfimport · 2023-07-28T09:59:47Z

Raúl Cumplido / @raulcd:
@pitrou I don't seem to have permission to update the Fix Version to cpp-13.0.0, can you help me with that?

asfimport closed this as completed Jul 19, 2023
