Limit ArrowWriter Row Group Size by bytes in addition to rows #1213

Closed
tustvold opened this issue Jan 20, 2022 · 1 comment
Labels
enhancement: Any new improvement worthy of an entry in the changelog
parquet: Changes to the parquet crate

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, ArrowWriter uses max_row_group_size as a row count limit. Whilst this is significantly simpler to implement, it is at odds with other Arrow implementations that use a bytes threshold.

Describe the solution you'd like

Any or all of:

  • Clearly document what max_row_group_size is used for and how it differs from the other size quantities in WriterProperties
  • Assess whether the DEFAULT_MAX_ROW_GROUP_SIZE of 128 * 1024 * 1024 makes sense, given that it counts rows rather than bytes
  • Add functionality to flush based on a bytes threshold instead of, or in addition to, the current row threshold
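
To make the last two options concrete, here is a minimal sketch of a flush policy that checks a row limit and an optional byte limit together. Every name in it (`RowGroupLimits`, `RowGroupBuffer`, `min_column_bytes`) is hypothetical and is not part of the parquet crate; the byte counts are assumed to be estimates supplied by the caller.

```rust
/// Hypothetical limits for one row group; NOT the parquet crate's API.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupLimits {
    /// Flush once at least this many rows are buffered.
    pub max_rows: usize,
    /// Optionally flush once at least this many estimated bytes are buffered.
    pub max_bytes: Option<usize>,
}

/// Hypothetical running totals for the rows buffered so far.
#[derive(Debug, Default)]
pub struct RowGroupBuffer {
    rows: usize,
    bytes: usize,
}

impl RowGroupBuffer {
    /// Record a batch, using a caller-supplied estimate of its size in bytes.
    pub fn push_batch(&mut self, rows: usize, estimated_bytes: usize) {
        self.rows += rows;
        self.bytes += estimated_bytes;
    }

    /// True when either the row limit or the byte limit (if set) is reached.
    pub fn should_flush(&self, limits: &RowGroupLimits) -> bool {
        let rows_hit = self.rows >= limits.max_rows;
        let bytes_hit = limits.max_bytes.map_or(false, |max| self.bytes >= max);
        rows_hit || bytes_hit
    }

    /// Clear the totals after a row group has been written.
    pub fn reset(&mut self) {
        *self = Self::default();
    }
}

/// Raw size of one fixed-width column in a full row group,
/// ignoring encoding and compression.
pub fn min_column_bytes(max_rows: usize, value_width: usize) -> usize {
    max_rows.saturating_mul(value_width)
}
```

With a row limit of 128 * 1024 * 1024, `min_column_bytes(128 * 1024 * 1024, 8)` comes to 1 GiB of raw values for a single 8-byte column, which is why the default is questionable when it is read as a row count.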
tustvold added the enhancement label Jan 20, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Jan 21, 2022
alamb changed the title from "ArrowWriter Row Group Byte Size Limit" to "Limit ArrowWriter Row Group Size by bytes in addition to rows" Jan 23, 2022
alamb added the parquet label Jan 23, 2022
alamb pushed a commit that referenced this issue Feb 1, 2022
* Batch multiple records in ArrowWriter

* Document max_group_size and reduce default (#1213)

* Review feedback

* Write multiple arrays without concat

* Clippy

* Test aggregating complex types

* Test complex slice

* Clippy
@tustvold
Contributor Author

tustvold commented Jun 1, 2023

Closed by #4280

@tustvold tustvold closed this as completed Jun 1, 2023