Parquet writer v2: clear buffer after page flush #3447

askoa · 2023-01-04T01:27:25Z

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Please see #3408 (comment) for explanation of issue resolution.

Update: The issue described below is fixed. The PR is ready for review.

Expand to see issue

The issue is not completely fixed. I added a test `fallback_flush_data_page` and marked it as `ignore` because it is failing. I included the test output before and after the change. Before the change, every value after the first 32 (the page size) is garbage. After the change, only the values 32 through 39 are wrong, and everything from 40 onward matches.

As I am not familiar with the Parquet format, it might take me some time to analyze this. If anyone else wants to analyze it, feel free to go ahead.

cc @tustvold @alamb

difference before change:

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`,
 right: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "0", "01", "02", "03", "04", "05", "06", "07", "89", "81", "80", "81", "82", "83", "84", "85", "0", "01", "23", "24", "25", "26", "27", "28", "29", "21", "20", "21", "23", "24", "25"]`', parquet/src/arrow/arrow_writer/mod.rs:1887:21

difference after change:

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`,
 right: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "2", "23", "24", "25", "26", "27", "28", "29", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`', parquet/src/arrow/arrow_writer/mod.rs:1889:21
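For readers unfamiliar with prefix-delta encodings, here is a toy sketch of how state carried across a page flush can corrupt the next page. It is not the parquet crate's encoder: `ToyDeltaEncoder`, `flush_page`, its `clear_state` flag, and `decode_page` are all hypothetical names, and the sketch assumes each page is decoded on its own, starting from an empty previous value. Under those assumptions, a stale previous value makes the first entries of the next page lose their prefix, which gives a pattern similar to the `"2", "23", "24", ...` run above.

```rust
/// Toy prefix-delta encoder, loosely modelled on the idea behind
/// DELTA_BYTE_ARRAY: each value is stored as (shared prefix length, suffix).
/// Hypothetical illustration only, not the parquet crate's implementation.
#[derive(Default)]
struct ToyDeltaEncoder {
    /// Last value seen, used to compute the shared prefix.
    previous: Vec<u8>,
    /// Buffered (prefix_len, suffix) entries for the current page.
    entries: Vec<(usize, Vec<u8>)>,
}

impl ToyDeltaEncoder {
    fn put(&mut self, value: &[u8]) {
        // Count the leading bytes shared with the previous value.
        let prefix_len = self
            .previous
            .iter()
            .zip(value)
            .take_while(|(a, b)| a == b)
            .count();
        self.entries.push((prefix_len, value[prefix_len..].to_vec()));
        self.previous = value.to_vec();
    }

    /// Hands back the buffered page. `clear_state` is an assumption of this
    /// sketch: it toggles whether the previous value is reset on flush.
    fn flush_page(&mut self, clear_state: bool) -> Vec<(usize, Vec<u8>)> {
        let page = std::mem::take(&mut self.entries);
        if clear_state {
            self.previous.clear();
        }
        page
    }
}

/// Decodes a single page in isolation, starting from an empty previous value.
fn decode_page(page: &[(usize, Vec<u8>)]) -> Vec<String> {
    let mut previous: Vec<u8> = Vec::new();
    let mut out = Vec::new();
    for (prefix_len, suffix) in page {
        // A prefix that refers to an earlier page cannot be resolved here,
        // so clamp it to what this page has seen so far.
        let keep = (*prefix_len).min(previous.len());
        let mut value = previous[..keep].to_vec();
        value.extend_from_slice(suffix);
        out.push(String::from_utf8_lossy(&value).into_owned());
        previous = value;
    }
    out
}
```

In this toy model, encoding "30" and "31", flushing without clearing, and then encoding "32", "33", "40" produces a second page that decodes to "2", "23", "40". Clearing the state on flush makes the second page decode correctly.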

Are there any user-facing changes?

Users will see the issue fixed. No breaking change.

tustvold

Just had some questions about testing, nice one tracking this down 👍

tustvold · 2023-01-04T08:49:09Z

parquet/src/arrow/arrow_writer/mod.rs

+                    .build();
+
+                roundtrip_opts_with_array_validation(&expected_batch, props, |a, b| {
+                    let string_array_a = StringArray::from(a.clone());


Why not just compare the ArrayData for equality directly?

The ArrayData comparison output is in binary format and not easily comprehensible. I converted it to a string comparison so that I can see the difference, as below.

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`,
 right: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "2", "23", "24", "25", "26", "27", "28", "29", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`', parquet/src/arrow/arrow_writer/mod.rs:1889:21

Perhaps this could be simplified to `assert_eq!(string_array_a, string_array_b)` then?

I just tried comparing the string arrays locally, and the output is printed in a different format. I prefer the output from comparing Vecs.

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `StringArray [ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ...43 elements..., "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", ]`,
 right: `StringArray [ "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ...43 elements..., "53", "54", "55", "56", "57", "58", "59", "60", "61", "62", ]`: failed for encoder: DELTA_BYTE_ARRAY and row_group_size: 1024', parquet/src/arrow/arrow_writer/mod.rs:1885:21

tustvold · 2023-01-04T08:51:08Z

parquet/src/arrow/arrow_writer/mod.rs

@@ -1199,7 +1200,14 @@ mod tests {
        files
    }

-    fn roundtrip_opts(expected_batch: &RecordBatch, props: WriterProperties) -> File {
+    fn roundtrip_opts_with_array_validation<F>(


I'm not sure I understand the need for this?

I wanted to compare the underlying strings instead of the ArrayData, hence the new function, which takes a comparison function as input.
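As a rough illustration of the shape being discussed, a roundtrip helper can accept a caller-supplied validation closure and keep the plain helper as a thin wrapper. This is a minimal std-only sketch, not the code in this PR: `fake_roundtrip`, `roundtrip_with_validation`, and `roundtrip` are hypothetical names, and the "roundtrip" simply copies its input instead of writing and reading a Parquet file.

```rust
/// Hypothetical stand-in for a write-then-read roundtrip.
/// Here it only copies the input; a real test would write and re-read a file.
fn fake_roundtrip(input: &[String]) -> Vec<String> {
    input.to_vec()
}

/// Runs the roundtrip and passes the expected and actual values to a
/// caller-supplied validation closure, so each test chooses how to compare.
fn roundtrip_with_validation<F>(expected: &[String], validate: F)
where
    F: Fn(&[String], &[String]),
{
    let actual = fake_roundtrip(expected);
    validate(expected, &actual);
}

/// Default behavior: plain equality, expressed through the generic helper.
fn roundtrip(expected: &[String]) {
    roundtrip_with_validation(expected, |a, b| assert_eq!(a, b));
}
```

A test that wants a more readable failure message can pass its own closure, for example one that compares the two sides as slices of strings, while every other caller keeps using the default wrapper.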

ursabot · 2023-01-04T13:31:31Z

Benchmark runs are scheduled for baseline = 65ff80e and contender = 61a77a5. 61a77a5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb · 2023-01-04T18:41:45Z

Thank you @askoa

github-actions bot added the parquet (Changes to the parquet crate) label Jan 4, 2023

askoa added 4 commits January 3, 2023 20:42

parquet writer v2: clear buffer after page flush`

9570965

fix clippy issue

3e9cb0b

fmt fix

036a558

fixed issue with flush_page

c89b096

askoa marked this pull request as ready for review January 4, 2023 02:12

askoa added 2 commits January 3, 2023 21:12

fmt fix

6885848

fix clippy errors

37eac8b

tustvold approved these changes Jan 4, 2023

tustvold merged commit 61a77a5 into apache:master Jan 4, 2023

askoa deleted the parquet-encoder branch January 12, 2023 23:24
