GH-39965: [C++] DatasetWriter avoid creating zero-sized batch when max_rows_per_file enabled #39995

Merged
merged 8 commits into from
Feb 23, 2024

Conversation

mapleFU
Member

@mapleFU mapleFU commented Feb 8, 2024

Rationale for this change

DatasetWriter might create an empty RecordBatch when max_rows_per_file is enabled. This happens because NextWritableChunk might return a zero-sized batch when the data already in the file exactly fills it.

What changes are included in this PR?

Check for batch size == 0 when appending to the file queue, so that a zero-sized batch is not written.
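
To illustrate the idea, here is a minimal, self-contained sketch of the row-splitting logic. It is not the Arrow implementation: `FileSketch`, `WriteSketch`, and the `skip_empty` flag are hypothetical names invented for this example, and batches are modeled only by their row counts. It assumes the zero-sized chunk appears when the current file is already exactly full and the next batch arrives.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one output file: the row counts of the
// batches queued into it, plus the running total of rows.
struct FileSketch {
  std::vector<int64_t> batch_sizes;
  int64_t rows = 0;
};

// Splits incoming batches (given by row count) across files holding at
// most max_rows_per_file rows each. With skip_empty == false, a chunk of
// zero rows is queued whenever the current file is already exactly full;
// with skip_empty == true, such chunks are dropped before queuing.
std::vector<FileSketch> WriteSketch(const std::vector<int64_t>& incoming,
                                    int64_t max_rows_per_file,
                                    bool skip_empty) {
  assert(max_rows_per_file > 0);
  std::vector<FileSketch> files(1);
  for (int64_t remaining : incoming) {
    while (remaining > 0) {
      FileSketch& file = files.back();
      // Rows that still fit in the current file; 0 if it is exactly full.
      const int64_t chunk = std::min(remaining, max_rows_per_file - file.rows);
      if (chunk > 0 || !skip_empty) {
        file.batch_sizes.push_back(chunk);
        file.rows += chunk;
      }
      remaining -= chunk;
      if (remaining > 0) {
        // The current file is full: roll over to a new one.
        files.emplace_back();
      }
    }
  }
  return files;
}
```

In this sketch, calling `WriteSketch({10, 10}, 10, false)` leaves a trailing zero-row entry in the first file, while passing `skip_empty = true` yields two files with one 10-row batch each.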

Are these changes tested?

Yes

Are there any user-facing changes?

Users no longer get zero-sized row groups or batches.


github-actions bot commented Feb 8, 2024

⚠️ GitHub issue #39965 has been automatically assigned in GitHub to PR creator.

@mapleFU
Member Author

mapleFU commented Feb 8, 2024

cc @bkietz @kou

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 8, 2024
@mapleFU mapleFU requested a review from bkietz February 8, 2024 17:48
cpp/src/arrow/dataset/dataset_writer.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset_writer.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Feb 13, 2024
@mapleFU
Member Author

mapleFU commented Feb 13, 2024

@bkietz comment resolved

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 13, 2024
@mapleFU mapleFU requested a review from bkietz February 15, 2024 12:27
@mapleFU
Member Author

mapleFU commented Feb 21, 2024

Gentle ping @kou @bkietz for help...

Member

@kou kou left a comment


+1

@@ -198,9 +198,7 @@ class DatasetWriterTestFixture : public testing::Test {
int num_batches = 0;
AssertBatchesEqual(*MakeBatch(expected_file.start, expected_file.num_rows),
*ReadAsBatch(written_file->data, &num_batches));
if (check_num_record_batches) {
Member


Can we remove the bool check_num_record_batches = true argument?

Member Author


Sure. This was added in my previous patch #38885. There, without passing check_num_record_batches = false, the check would fail because of the zero-sized batch.

With this patch, the zero-sized batch is no longer produced, so I can remove it.

Member


OK. Please remove it before we merge this.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Feb 22, 2024
@mapleFU
Member Author

mapleFU commented Feb 22, 2024

Will merge tomorrow if there are no negative comments.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Feb 22, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 22, 2024
@mapleFU mapleFU merged commit 5f75dbf into apache:main Feb 23, 2024
33 of 34 checks passed
@mapleFU mapleFU removed the awaiting change review Awaiting change review label Feb 23, 2024

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5f75dbf.

There were 10 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…hen `max_rows_per_file` enabled (apache#39995)

### Rationale for this change

`DatasetWriter` might create empty `RecordBatch` when `max_rows_per_file` enabled. This is because `NextWritableChunk` might return a zero-sized batch when the file exactly contains the dest data.

### What changes are included in this PR?

Check batch-size == 0 when append to file queue

### Are these changes tested?

Yes

### Are there any user-facing changes?

User can avoid zero-sized row-group/batch.

* Closes: apache#39965

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
Development

Successfully merging this pull request may close these issues.

[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset
3 participants