
Potentially buffer multiple RecordBatches before writing a parquet row group in ArrowWriter #1214

Merged: 9 commits into apache:master on Feb 1, 2022

Conversation

@tustvold (Contributor) commented Jan 20, 2022

Which issue does this PR close?

Closes #1211.
Closes #1226.

Rationale for this change

See ticket

What changes are included in this PR?

Changes ArrowWriter to produce row groups with max_row_group_size rows except for the final row group in the file.
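
For illustration, a minimal sketch of how a caller might exercise this behaviour, assuming the existing ArrowWriter and WriterProperties APIs; the file name, schema, and sizes below are arbitrary and not part of this PR:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));

    // Request row groups of (at most) 1000 rows.
    let props = WriterProperties::builder()
        .set_max_row_group_size(1000)
        .build();

    let file = File::create("buffered.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Write many small batches: with this change the writer buffers them and
    // flushes row groups of 1000 rows, instead of one row group per batch.
    for _ in 0..25 {
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from_iter_values(0..100))],
        )?;
        writer.write(&batch)?;
    }

    // close() flushes any remaining buffered rows as the final row group.
    writer.close()?;
    Ok(())
}
```

The point of the sketch is only that incoming batch size and row group size are now decoupled: the row group boundary is driven by max_row_group_size, not by however the caller happened to chunk their data.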

Are there any user-facing changes?

Yes, ArrowWriter will now buffer up data prior to flush, producing larger batches in the process. This could be made an opt-in change, but I think this is probably what a lot of people, myself included, thought the writer did.

On a related note, I think the default max row group size is a tad high given it is used as a row threshold and not a bytes threshold; I've created #1213 to track this.

@github-actions bot added the parquet (Changes to the parquet crate) label Jan 20, 2022
```diff
 let offsets = offsets
     .to_vec()
     .into_iter()
-    .skip(offset)
+    .skip(array.offset() + offset)
```
@tustvold (Contributor, Author):

This was a pre-existing bug that people would have run into if they wrote a sliced RecordBatch.
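
For context, a sketch of the kind of write that hits this path (hypothetical data and file name): slicing a batch gives its arrays a non-zero offset, which the list offsets translation previously ignored:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::{Array, ListArray};
use arrow::datatypes::{Field, Int32Type, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A list column: [[0], [1, 2], [3, 4, 5], [6]]
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
        Some(vec![Some(0)]),
        Some(vec![Some(1), Some(2)]),
        Some(vec![Some(3), Some(4), Some(5)]),
        Some(vec![Some(6)]),
    ]);

    let schema = Arc::new(Schema::new(vec![Field::new(
        "l",
        list.data_type().clone(),
        true,
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(list)])?;

    // Slicing shifts the underlying array offset; before this fix that offset
    // was not accounted for when translating the list offsets, so the wrong
    // values were written.
    let sliced = batch.slice(1, 2); // [[1, 2], [3, 4, 5]]

    let mut writer = ArrowWriter::try_new(File::create("sliced.parquet")?, schema, None)?;
    writer.write(&sliced)?;
    writer.close()?;
    Ok(())
}
```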

@alamb (Contributor):

Filed #1226 to track (so that we can document this in the release notes)

@alamb (Contributor) commented Jan 20, 2022

will try to review this tomorrow

@tustvold (Contributor, Author):

I opted to update the default max row group size and clarify the docs, as discussed in #1213.

@codecov-commenter commented Jan 21, 2022

Codecov Report

Merging #1214 (326461b) into master (4b7afa6) will increase coverage by 0.22%.
The diff coverage is 95.88%.

```
@@            Coverage Diff             @@
##           master    #1214      +/-   ##
==========================================
+ Coverage   82.96%   83.18%   +0.22%
==========================================
  Files         178      179       +1
  Lines       51522    51950     +428
==========================================
+ Hits        42744    43216     +472
+ Misses       8778     8734      -44
```
| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/file/properties.rs | 95.74% <ø> (ø) |
| parquet/src/arrow/arrow_writer.rs | 97.58% <95.67%> (-0.40%) ⬇️ |
| arrow/src/util/pretty.rs | 96.66% <100.00%> (ø) |
| parquet/src/arrow/levels.rs | 84.67% <100.00%> (+0.10%) ⬆️ |
| arrow/src/datatypes/field.rs | 53.79% <0.00%> (-0.31%) ⬇️ |
| arrow/src/array/transform/mod.rs | 84.64% <0.00%> (-0.13%) ⬇️ |
| parquet_derive/src/parquet_field.rs | 66.21% <0.00%> (ø) |
| arrow/src/compute/kernels/comparison.rs | 91.70% <0.00%> (ø) |
| arrow/src/array/array_binary.rs | 93.65% <0.00%> (+0.17%) ⬆️ |

... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alamb added the api-change (Changes to the arrow API) label Jan 23, 2022
@alamb (Contributor) left a comment

Looks good to me. Thank you @tustvold

I am a little worried about the performance hit of the `take` invocation -- I wonder if it is possible to avoid it in the common case?

Otherwise, looks good to me 👍 thank you

parquet/src/arrow/arrow_writer.rs: outdated review threads (resolved)
@tustvold marked this pull request as draft January 25, 2022 15:44
@tustvold (Contributor, Author):

Moving back to draft as I'm currently working on other things; I will come back to this shortly.

@tustvold (Contributor, Author):

I've pushed code that eliminates the use of `concat`. I'm fairly confident it is correct, but I will write a few more tests before I mark this ready for review again.

@github-actions bot added the arrow (Changes to the arrow crate) label Feb 1, 2022
```diff
@@ -74,7 +74,7 @@ fn create_table(results: &[RecordBatch]) -> Result<Table> {
     let mut cells = Vec::new();
     for col in 0..batch.num_columns() {
         let column = batch.column(col);
-        cells.push(Cell::new(&array_value_to_string(&column, row)?));
+        cells.push(Cell::new(&array_value_to_string(column, row)?));
```
@tustvold (Contributor, Author):

These changes are needed because clippy now checks this file, since the prettyprint feature is enabled by parquet.

@tustvold marked this pull request as ready for review February 1, 2022 10:02
@alamb (Contributor) left a comment

I took a good look through this and it looks good to me 👍

@alamb requested a review from nevi-me February 1, 2022 14:26
@alamb (Contributor) commented Feb 1, 2022

cc @nevi-me should he want to review this as well

@alamb merged commit 891b8d0 into apache:master Feb 1, 2022
@alamb changed the title from "Batch multiple records in ArrowWriter" to "Potentially buffer multiple RecordBatches before writing a parquet row group in ArrowWriter" on Feb 3, 2022
Labels
api-change (Changes to the arrow API), arrow (Changes to the arrow crate), parquet (Changes to the parquet crate)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing sliced ListArray or MapArray ignore offsets
Write Multiple RecordBatch to Parquet Row Group
3 participants