
Potentially buffer multiple RecordBatches before writing a parquet row group in ArrowWriter #1214

Merged: 9 commits into apache:master on Feb 1, 2022

Conversation

@tustvold (Contributor) commented Jan 20, 2022

Which issue does this PR close?

Closes #1211.
Closes #1226.

Rationale for this change

See ticket

What changes are included in this PR?

Changes ArrowWriter to produce row groups with max_row_group_size rows except for the final row group in the file.
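
For illustration, a minimal sketch of how a caller might exercise this behaviour, assuming the existing ArrowWriter and WriterProperties APIs; the file name, schema, and sizes below are arbitrary and not part of this PR:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));

    // Request row groups of (at most) 1000 rows.
    let props = WriterProperties::builder()
        .set_max_row_group_size(1000)
        .build();

    let file = File::create("buffered.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Write many small batches: with this change the writer buffers them and
    // flushes row groups of 1000 rows, instead of one row group per batch.
    for _ in 0..25 {
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(Int32Array::from_iter_values(0..100))],
        )?;
        writer.write(&batch)?;
    }

    // close() flushes any remaining buffered rows as the final row group.
    writer.close()?;
    Ok(())
}
```

The point of the sketch is only that incoming batch size and row group size are now decoupled: the row group boundary is driven by max_row_group_size, not by however the caller happened to chunk their data.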

Are there any user-facing changes?

Yes, ArrowWriter will now buffer up data prior to flush, producing larger batches in the process. This could be made an opt-in change, but I think this is probably what a lot of people, myself included, thought the writer did.

On a related note, I think the default max row group size is a tad high given it is used as a row threshold and not a bytes threshold; I've created #1213 to track this.

@github-actions bot added the parquet (Changes to the parquet crate) label Jan 20, 2022
```diff
 let offsets = offsets
     .to_vec()
     .into_iter()
-    .skip(offset)
+    .skip(array.offset() + offset)
```
@tustvold (Contributor, Author):

This was a pre-existing bug that people would have run into if they wrote a sliced RecordBatch.
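
For context, a sketch of the kind of write that hits this path (hypothetical data and file name): slicing a batch gives its arrays a non-zero offset, which the list offsets translation previously ignored:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::{Array, ListArray};
use arrow::datatypes::{Field, Int32Type, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A list column: [[0], [1, 2], [3, 4, 5], [6]]
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
        Some(vec![Some(0)]),
        Some(vec![Some(1), Some(2)]),
        Some(vec![Some(3), Some(4), Some(5)]),
        Some(vec![Some(6)]),
    ]);

    let schema = Arc::new(Schema::new(vec![Field::new(
        "l",
        list.data_type().clone(),
        true,
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(list)])?;

    // Slicing shifts the underlying array offset; before this fix that offset
    // was not accounted for when translating the list offsets, so the wrong
    // values were written.
    let sliced = batch.slice(1, 2); // [[1, 2], [3, 4, 5]]

    let mut writer = ArrowWriter::try_new(File::create("sliced.parquet")?, schema, None)?;
    writer.write(&sliced)?;
    writer.close()?;
    Ok(())
}
```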

@alamb (Contributor):

Filed #1226 to track (so that we can document this in the release notes)

@alamb (Contributor) commented Jan 20, 2022

will try to review this tomorrow

@tustvold (Contributor, Author):

I opted to update the default max row group size and clarify the docs, as discussed in #1213.

@codecov-commenter commented Jan 21, 2022

Codecov Report

Merging #1214 (326461b) into master (4b7afa6) will increase coverage by 0.22%.
The diff coverage is 95.88%.

```
@@            Coverage Diff             @@
##           master    #1214      +/-   ##
==========================================
+ Coverage   82.96%   83.18%   +0.22%
==========================================
  Files         178      179       +1
  Lines       51522    51950     +428
==========================================
+ Hits        42744    43216     +472
+ Misses       8778     8734      -44
```
| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/file/properties.rs | 95.74% <ø> (ø) |
| parquet/src/arrow/arrow_writer.rs | 97.58% <95.67%> (-0.40%) ⬇️ |
| arrow/src/util/pretty.rs | 96.66% <100.00%> (ø) |
| parquet/src/arrow/levels.rs | 84.67% <100.00%> (+0.10%) ⬆️ |
| arrow/src/datatypes/field.rs | 53.79% <0.00%> (-0.31%) ⬇️ |
| arrow/src/array/transform/mod.rs | 84.64% <0.00%> (-0.13%) ⬇️ |
| parquet_derive/src/parquet_field.rs | 66.21% <0.00%> (ø) |
| arrow/src/compute/kernels/comparison.rs | 91.70% <0.00%> (ø) |
| arrow/src/array/array_binary.rs | 93.65% <0.00%> (+0.17%) ⬆️ |

... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@alamb added the api-change (Changes to the arrow API) label Jan 23, 2022
@alamb (Contributor) left a comment

Looks good to me. Thank you @tustvold

I am a little worried about the performance hit of the `take` invocation -- I wonder if it is possible to avoid it in the common case?

Otherwise, looks good to me 👍 thank you

parquet/src/arrow/arrow_writer.rs: outdated review threads (resolved)
@tustvold marked this pull request as draft January 25, 2022 15:44
@tustvold (Contributor, Author):

Moving back to draft as I'm currently working on other things; I will come back to this shortly.

@tustvold (Contributor, Author):

I've pushed code that eliminates the use of `concat`. I'm fairly confident it is correct, but I will write a few more tests before I mark this ready for review again.

@github-actions bot added the arrow (Changes to the arrow crate) label Feb 1, 2022
```diff
@@ -74,7 +74,7 @@ fn create_table(results: &[RecordBatch]) -> Result<Table> {
     let mut cells = Vec::new();
     for col in 0..batch.num_columns() {
         let column = batch.column(col);
-        cells.push(Cell::new(&array_value_to_string(&column, row)?));
+        cells.push(Cell::new(&array_value_to_string(column, row)?));
```
@tustvold (Contributor, Author):

These changes are needed because clippy now checks this file, since the prettyprint feature is enabled by parquet.

@tustvold marked this pull request as ready for review February 1, 2022 10:02
@alamb (Contributor) left a comment

I took a good look through this and it looks good to me 👍

@alamb requested a review from nevi-me February 1, 2022 14:26
@alamb (Contributor) commented Feb 1, 2022

cc @nevi-me should he want to review this as well

@alamb merged commit 891b8d0 into apache:master Feb 1, 2022
@alamb changed the title from "Batch multiple records in ArrowWriter" to "Potentially buffer multiple RecordBatches before writing a parquet row group in ArrowWriter" on Feb 3, 2022
Labels
api-change (Changes to the arrow API), arrow (Changes to the arrow crate), parquet (Changes to the parquet crate)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing sliced ListArray or MapArray ignore offsets
Write Multiple RecordBatch to Parquet Row Group
3 participants