Parquet writer v2: clear buffer after page flush #3447

Merged
merged 6 commits into from
Jan 4, 2023

@askoa askoa commented Jan 4, 2023

Which issue does this PR close?

Closes #3408

Rationale for this change

What changes are included in this PR?

Please see #3408 (comment) for explanation of issue resolution.

Update: The issue described below is fixed. The PR is ready for review.

The issue is not completely fixed. I added a test, `fallback_flush_data_page`, and marked it as `ignore` because it is failing. I have included the test output before and after the change. Before the change, the first 32 values (32 is the page size) are correct and everything after them is garbage. After the change, only the values 32 through 39 are wrong; the values match again from 40 onward.

As I am not familiar with the Parquet format, it might take me some time to analyze this. If anyone else wants to analyze it, feel free to go ahead.

cc @tustvold @alamb

difference before change:

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`,
 right: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "0", "01", "02", "03", "04", "05", "06", "07", "89", "81", "80", "81", "82", "83", "84", "85", "0", "01", "23", "24", "25", "26", "27", "28", "29", "21", "20", "21", "23", "24", "25"]`', parquet/src/arrow/arrow_writer/mod.rs:1887:21

difference after change:

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`,
 right: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "2", "23", "24", "25", "26", "27", "28", "29", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`', parquet/src/arrow/arrow_writer/mod.rs:1889:21
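One possible reading of the "after change" output, offered as a sketch rather than a diagnosis: the pattern `"2", "23", ..., "29", "40"` is what a prefix-delta encoder produces if its "previous value" state survives a page flush while the decoder starts each page with an empty previous value. The toy model below is not the parquet crate's code. `ToyEncoder`, `decode_page`, `roundtrip`, and the `reset_on_flush` flag are hypothetical names, and the decoder's clamping of an over-long prefix is an assumption made so the toy does not panic.

```rust
/// Page size used by the toy roundtrip (matches the 32 mentioned above).
const PAGE_SIZE: usize = 32;

/// Toy prefix-delta encoder: each value is stored as
/// (length of the prefix shared with the previous value, remaining suffix).
struct ToyEncoder {
    previous: Vec<u8>,
    page: Vec<(usize, Vec<u8>)>,
    /// Hypothetical switch: clear `previous` when a page is flushed.
    reset_on_flush: bool,
}

impl ToyEncoder {
    fn new(reset_on_flush: bool) -> Self {
        Self { previous: Vec::new(), page: Vec::new(), reset_on_flush }
    }

    fn put(&mut self, value: &str) {
        let bytes = value.as_bytes();
        // Count leading bytes shared with the previously encoded value.
        let prefix_len = self
            .previous
            .iter()
            .zip(bytes)
            .take_while(|(a, b)| a == b)
            .count();
        self.page.push((prefix_len, bytes[prefix_len..].to_vec()));
        self.previous = bytes.to_vec();
    }

    fn flush_page(&mut self) -> Vec<(usize, Vec<u8>)> {
        if self.reset_on_flush {
            // Without this, the next page's first value is encoded
            // relative to a value the decoder never sees.
            self.previous.clear();
        }
        std::mem::take(&mut self.page)
    }
}

/// Decodes one page, starting from an empty previous value.
fn decode_page(page: &[(usize, Vec<u8>)]) -> Vec<String> {
    let mut previous: Vec<u8> = Vec::new();
    let mut out = Vec::new();
    for (prefix_len, suffix) in page {
        // Assumption: clamp the prefix instead of failing on bad input.
        let keep = (*prefix_len).min(previous.len());
        let mut value = previous[..keep].to_vec();
        value.extend_from_slice(suffix);
        out.push(String::from_utf8_lossy(&value).into_owned());
        previous = value;
    }
    out
}

/// Writes "0".."62" in pages of PAGE_SIZE values and decodes every page.
fn roundtrip(reset_on_flush: bool) -> Vec<String> {
    let mut encoder = ToyEncoder::new(reset_on_flush);
    let mut decoded = Vec::new();
    for i in 0..63usize {
        encoder.put(&i.to_string());
        if (i + 1) % PAGE_SIZE == 0 {
            decoded.extend(decode_page(&encoder.flush_page()));
        }
    }
    decoded.extend(decode_page(&encoder.flush_page()));
    decoded
}
```

In this toy, `roundtrip(true)` returns the expected `"0"` through `"62"`, while `roundtrip(false)` returns `"2", "23", "24", "25", "26", "27", "28", "29"` in positions 32 to 39 and correct values from `"40"` onward, because `"40"` shares no prefix with `"39"` and so resynchronizes the decoder.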

Are there any user-facing changes?

Users will see the issue fixed. No breaking change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 4, 2023
@askoa askoa marked this pull request as ready for review January 4, 2023 02:12
@tustvold tustvold left a comment

Just had some questions about testing, nice one tracking this down 👍

.build();

roundtrip_opts_with_array_validation(&expected_batch, props, |a, b| {
let string_array_a = StringArray::from(a.clone());
Contributor

Why not just compare the ArrayData for equality directly?

Contributor Author

The ArrayData comparison output is in binary form and not easy to read. I converted it to a string comparison so that I can see the difference, as below.

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`,
 right: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "2", "23", "24", "25", "26", "27", "28", "29", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`', parquet/src/arrow/arrow_writer/mod.rs:1889:21

Contributor

Perhaps this could be simplified to `assert_eq!(string_array_a, string_array_b)` then?

Contributor Author

> Perhaps this could be simplified to `assert_eq!(string_array_a, string_array_b)` then?

I just tried comparing the StringArrays directly locally, and it prints the output in a different format, with the middle elements elided. I prefer the output from comparing Vecs.

running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
  left: `StringArray
[
  "0",
  "1",
  "2",
  "3",
  "4",
  "5",
  "6",
  "7",
  "8",
  "9",
  ...43 elements...,
  "53",
  "54",
  "55",
  "56",
  "57",
  "58",
  "59",
  "60",
  "61",
  "62",
]`,
 right: `StringArray
[
  "0",
  "1",
  "2",
  "3",
  "4",
  "5",
  "6",
  "7",
  "8",
  "9",
  ...43 elements...,
  "53",
  "54",
  "55",
  "56",
  "57",
  "58",
  "59",
  "60",
  "61",
  "62",
]`: failed for encoder: DELTA_BYTE_ARRAY and row_group_size: 1024', parquet/src/arrow/arrow_writer/mod.rs:1885:21

@@ -1199,7 +1200,14 @@ mod tests {
files
}

fn roundtrip_opts(expected_batch: &RecordBatch, props: WriterProperties) -> File {
fn roundtrip_opts_with_array_validation<F>(
Contributor

I'm not sure I understand the need for this?

Contributor Author

I wanted to compare the underlying strings instead of the ArrayData, hence the new function, which takes a comparison function as input.
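The shape of such a helper can be sketched without the arrow or parquet crates. This is only an illustration of the pattern described above, not the function added in this PR: `roundtrip_with_validation`, `roundtrip_default`, and the `roundtrip` closure parameter are hypothetical names, and the write-then-read step is replaced here by whatever closure the caller passes in.

```rust
use std::fmt::Debug;

/// Runs a (hypothetical) write-then-read roundtrip and hands both the
/// expected and the actual value to a caller-supplied validation closure.
fn roundtrip_with_validation<T, R, V>(expected: &T, roundtrip: R, validate: V)
where
    R: Fn(&T) -> T,
    V: Fn(&T, &T),
{
    let actual = roundtrip(expected);
    validate(expected, &actual);
}

/// Keeps the original behaviour available: plain equality on the values.
/// The roundtrip here is a stand-in that simply clones its input.
fn roundtrip_default<T: PartialEq + Debug + Clone>(expected: &T) {
    roundtrip_with_validation(expected, |v| v.clone(), |a, b| assert_eq!(a, b));
}
```

A caller that wants a readable failure message can pass a validator that first converts both sides to `Vec<&str>` and then calls `assert_eq!`, which prints every element on one line rather than an elided summary.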

@tustvold tustvold merged commit 61a77a5 into apache:master Jan 4, 2023
ursabot commented Jan 4, 2023

Benchmark runs are scheduled for baseline = 65ff80e and contender = 61a77a5. 61a77a5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java


alamb commented Jan 4, 2023

Thank you @askoa

@askoa askoa deleted the parquet-encoder branch January 12, 2023 23:24
Development

Successfully merging this pull request may close these issues.

parquet-fromcsv with writer version v2 does not stop