Parquet writer v2: clear buffer after page flush #3447
Just had some questions about testing, nice one tracking this down 👍
        .build();

    roundtrip_opts_with_array_validation(&expected_batch, props, |a, b| {
        let string_array_a = StringArray::from(a.clone());
Why not just compare the ArrayData for equality directly?
The output of an `ArrayData` comparison is in binary format and not easily comprehensible. I converted it to a string comparison so that I can see the difference, as in the output below.
running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
left: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`,
right: `["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "2", "23", "24", "25", "26", "27", "28", "29", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61", "62"]`', parquet/src/arrow/arrow_writer/mod.rs:1889:21
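The idea of decoding to strings before asserting can be sketched with the standard library alone. This is a hedged illustration, not the arrow crate's API: `ToyStringColumn` and `first_mismatch` are hypothetical names, and the flat byte buffer plus offsets is only an assumed stand-in for how a string array might be stored.

```rust
/// Hypothetical stand-in for a string column: one flat byte buffer plus
/// offsets marking where each element starts and ends.
#[derive(Debug, PartialEq)]
struct ToyStringColumn {
    offsets: Vec<usize>,
    bytes: Vec<u8>,
}

impl ToyStringColumn {
    fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut bytes = Vec::new();
        for item in items {
            bytes.extend_from_slice(item.as_bytes());
            offsets.push(bytes.len());
        }
        ToyStringColumn { offsets, bytes }
    }

    /// Decodes the raw buffer back into one `&str` per element, so that a
    /// failing assertion prints readable values instead of raw bytes.
    fn to_strs(&self) -> Vec<&str> {
        self.offsets
            .windows(2)
            .map(|w| std::str::from_utf8(&self.bytes[w[0]..w[1]]).expect("valid UTF-8"))
            .collect()
    }
}

/// Returns the index of the first element that differs, if any.
/// A length difference counts as a mismatch at the shorter length.
fn first_mismatch(a: &[&str], b: &[&str]) -> Option<usize> {
    a.iter().zip(b.iter()).position(|(x, y)| x != y).or_else(|| {
        if a.len() != b.len() {
            Some(a.len().min(b.len()))
        } else {
            None
        }
    })
}
```

Asserting on `expected.to_strs()` and `actual.to_strs()` rather than on the two columns directly makes the failure message list the decoded values, which is the readability trade-off described in the comment above.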
Perhaps this could be simplified to `assert_eq!(string_array_a, string_array_b)` then?
> Perhaps this could be simplified to `assert_eq!(string_array_a, string_array_b)` then?
I just tried comparing the `StringArray`s locally, and the output is printed in a different format. I prefer the output from comparing `Vec`s.
running 1 test
thread 'arrow::arrow_writer::tests::fallback_flush_data_page' panicked at 'assertion failed: `(left == right)`
left: `StringArray
[
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
...43 elements...,
"53",
"54",
"55",
"56",
"57",
"58",
"59",
"60",
"61",
"62",
]`,
right: `StringArray
[
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
...43 elements...,
"53",
"54",
"55",
"56",
"57",
"58",
"59",
"60",
"61",
"62",
]`: failed for encoder: DELTA_BYTE_ARRAY and row_group_size: 1024', parquet/src/arrow/arrow_writer/mod.rs:1885:21
@@ -1199,7 +1200,14 @@ mod tests {
        files
    }

    fn roundtrip_opts(expected_batch: &RecordBatch, props: WriterProperties) -> File {
    fn roundtrip_opts_with_array_validation<F>(
I'm not sure I understand the need for this?
I wanted to compare the underlying strings instead of the `ArrayData`, hence the new function, which takes a comparison function as input.
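The shape of such a helper, a roundtrip function that accepts the comparison as a closure, might look roughly like the sketch below. It uses only the standard library. The `encode`/`decode` pair is a made-up length-prefixed format standing in for the real write and read path, and `roundtrip_with_validation` is a hypothetical name, not the PR's actual `roundtrip_opts_with_array_validation` signature.

```rust
/// Toy encoder (assumption, not Parquet): each value is written as a
/// one-byte length followed by its UTF-8 bytes.
fn encode(values: &[String]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        assert!(v.len() < 256, "sketch only supports values shorter than 256 bytes");
        out.push(v.len() as u8);
        out.extend_from_slice(v.as_bytes());
    }
    out
}

/// Inverse of `encode`: reads length-prefixed values until the input ends.
fn decode(bytes: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        let len = bytes[pos] as usize;
        let start = pos + 1;
        out.push(String::from_utf8_lossy(&bytes[start..start + len]).into_owned());
        pos = start + len;
    }
    out
}

/// Writes and reads back `expected`, then hands both sides to the
/// caller-supplied `validate` closure instead of hard-coding the comparison.
fn roundtrip_with_validation<F>(expected: &[String], validate: F)
where
    F: Fn(&[String], &[String]),
{
    let actual = decode(&encode(expected));
    validate(expected, &actual);
}

/// Default behaviour: plain equality, delegating to the generic helper.
fn roundtrip(expected: &[String]) {
    roundtrip_with_validation(expected, |a, b| assert_eq!(a, b));
}
```

Keeping the plain `roundtrip` as a thin wrapper means existing callers are unaffected, while a single test can pass its own closure to compare in whatever representation gives the most readable failure output.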
Benchmark runs are scheduled for baseline = 65ff80e and contender = 61a77a5. 61a77a5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Thank you @askoa
Which issue does this PR close?
Closes #3408
Rationale for this change
What changes are included in this PR?
Please see #3408 (comment) for explanation of issue resolution.
Update: The issue described below is fixed. The PR is ready for review.
Expand to see issue
The issue is not completely fixed. I added a test, `fallback_flush_data_page`, and marked it as `ignore` because it is failing. I have included the difference before and after the change. The diff shows that, before the change, the values are garbage after 32 (which is the page size). After the change, there is an issue between 33 and 39, and the values match after 39. As I am not acquainted with the Parquet format, it might take some time for me to analyze this. If anyone else wants to analyze it, feel free to go ahead.
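One way to picture how encoder state that survives a page flush could produce a mismatch limited to a few values after the page boundary is a toy prefix-sharing (front-coding) model. This is only a sketch under stated assumptions: `ToyDeltaEncoder`, `decode_page`, `roundtrip_pages` and the `reset_on_flush` flag are hypothetical names, the format is a simplification, and none of it is taken from the parquet crate's DELTA_BYTE_ARRAY implementation. It assumes each page is decoded independently, starting from an empty previous value.

```rust
/// Toy front-coding encoder (assumption, not the parquet crate): each value is
/// stored as (length of the prefix shared with the previous value, suffix).
struct ToyDeltaEncoder {
    last_value: String,
    page: Vec<(usize, String)>,
    reset_on_flush: bool,
}

impl ToyDeltaEncoder {
    fn new(reset_on_flush: bool) -> Self {
        ToyDeltaEncoder { last_value: String::new(), page: Vec::new(), reset_on_flush }
    }

    /// Buffers one value. ASCII input is assumed so byte offsets are valid
    /// string slice boundaries.
    fn put(&mut self, value: &str) {
        let prefix = self
            .last_value
            .bytes()
            .zip(value.bytes())
            .take_while(|(a, b)| a == b)
            .count();
        self.page.push((prefix, value[prefix..].to_string()));
        self.last_value = value.to_string();
    }

    /// Hands out the buffered page. With `reset_on_flush` unset, `last_value`
    /// leaks into the next page, which is the stale-state case being modelled.
    fn flush_page(&mut self) -> Vec<(usize, String)> {
        if self.reset_on_flush {
            self.last_value.clear();
        }
        std::mem::take(&mut self.page)
    }
}

/// Decodes a single page on its own, starting from an empty previous value.
fn decode_page(page: &[(usize, String)]) -> Vec<String> {
    let mut previous = String::new();
    let mut out = Vec::with_capacity(page.len());
    for (prefix, suffix) in page {
        // Clamp so a stale prefix length cannot index past `previous`.
        let keep = (*prefix).min(previous.len());
        let value = format!("{}{}", &previous[..keep], suffix);
        previous = value.clone();
        out.push(value);
    }
    out
}

/// Encodes `values` in pages of `page_size` and decodes each page separately.
fn roundtrip_pages(values: &[String], page_size: usize, reset_on_flush: bool) -> Vec<String> {
    let mut encoder = ToyDeltaEncoder::new(reset_on_flush);
    let mut decoded = Vec::new();
    for chunk in values.chunks(page_size) {
        for v in chunk {
            encoder.put(v);
        }
        decoded.extend(decode_page(&encoder.flush_page()));
    }
    decoded
}
```

With the values "0" through "62" and a page size of 32, calling `roundtrip_pages(&values, 32, true)` returns the input unchanged, while passing `false` corrupts only the values that share a prefix chain with the last value of the previous page. The toy recovers as soon as a value shares no prefix with its predecessor, which is why the damage is bounded.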
cc @tustvold @alamb
difference before change:
difference after change:
Are there any user-facing changes?
Users will see the issue fixed. No breaking change.