GH-39965: [C++] DatasetWriter avoid creating zero-sized batch when `max_rows_per_file` enabled #39995

Conversation
@bkietz comment resolved
+1
@@ -198,9 +198,7 @@ class DatasetWriterTestFixture : public testing::Test {
    int num_batches = 0;
    AssertBatchesEqual(*MakeBatch(expected_file.start, expected_file.num_rows),
                       *ReadAsBatch(written_file->data, &num_batches));
    if (check_num_record_batches) {
Can we remove the `bool check_num_record_batches = true` argument?
Sure, this was added in my previous patch #38885. Here, without passing `check_num_record_batches = false`, the check would fail because of the zero-sized batch.
With this patch the zero-sized batch is no longer produced, so I can remove the argument.
OK. Please remove it before we merge this.
Will merge if there are no negative comments by tomorrow.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5f75dbf. There were 10 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
…hen `max_rows_per_file` enabled (apache#39995)

### Rationale for this change
`DatasetWriter` might create an empty `RecordBatch` when `max_rows_per_file` is enabled. This is because `NextWritableChunk` might return a zero-sized batch when the file already exactly contains the destination data.

### What changes are included in this PR?
Check for batch size == 0 when appending to the file queue.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Users can avoid zero-sized row-groups/batches.

* Closes: apache#39965

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
`DatasetWriter` might create an empty `RecordBatch` when `max_rows_per_file` is enabled. This is because `NextWritableChunk` might return a zero-sized batch when the file already exactly contains the destination data.

What changes are included in this PR?
Check for batch size == 0 when appending to the file queue.

Are these changes tested?
Yes.

Are there any user-facing changes?
Users can avoid zero-sized row-groups/batches, e.g. when writing with `ds.write_dataset`.

Closes #39965