
[Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset #39965

Closed
ion-elgreco opened this issue Feb 6, 2024 · 9 comments · Fixed by #39995

Comments

@ion-elgreco

Describe the bug, including details regarding any error messages, version, and platform.

The pyarrow.dataset.write_dataset function leaves an empty row group behind in the Parquet file after the writer hits the max_rows_per_file limit. See the reproducible example below:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# `data` was not defined in the report; any single-column table whose row
# count is an exact multiple of max_rows_per_file reproduces the issue.
data = pa.table({"value": [0] * (1024 * 32)})

ds.write_dataset(data, "test_dataset", max_rows_per_file=1024 * 32,
                 max_rows_per_group=1024 * 16, min_rows_per_group=8 * 1024,
                 format="parquet")
metadata = pq.read_metadata("test_dataset/part-0.parquet")
for i in range(metadata.num_row_groups):
    print(metadata.row_group(i))

<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5168180>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 16384
  total_byte_size: 61
<pyarrow._parquet.RowGroupMetaData object at 0x7f92b5185bc0>
  num_columns: 1
  num_rows: 0
  total_byte_size: 14
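
For completeness, here is a small check (not from the original report) that flags zero-row row groups in a written file; on an affected file it prints the index of the empty trailing group, and after a fix it should print an empty list:

import pyarrow.parquet as pq

def find_empty_row_groups(path):
    # Collect the indices of row groups that contain zero rows.
    metadata = pq.read_metadata(path)
    return [i for i in range(metadata.num_row_groups)
            if metadata.row_group(i).num_rows == 0]

print(find_empty_row_groups("test_dataset/part-0.parquet"))  # e.g. [2]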

Component(s)

Python

@ion-elgreco ion-elgreco changed the title Empty row groups left behind after hitting max_rows_per_file [Python][parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset Feb 6, 2024
@AlenkaF
Member

AlenkaF commented Feb 7, 2024

Thank you for opening an issue, @ion-elgreco!

I agree this is probably a bug in datasets; it would be in Arrow C++, I believe. @mapleFU, would you mind sharing your view on this?

@mapleFU
Member

mapleFU commented Feb 7, 2024

@ion-elgreco What version of Arrow are you using? I guess this comes from bad handling of max_rows_per_group 🤔

@ion-elgreco
Author

ion-elgreco commented Feb 7, 2024

@mapleFU it occurs with the latest one, so v15.

We figured this out while triaging another issue, where a user mentioned trying every version from v15 down to v9: delta-io/delta-rs#2169 (comment)

@mapleFU
Member

mapleFU commented Feb 7, 2024

I think this is a problem in the dataset writer, not in the core Parquet library. I'll try to reproduce and fix it tomorrow.

@mapleFU
Member

mapleFU commented Feb 7, 2024

Update: the dataset writer's DoWriteRecordBatch can cause this problem. I can fix it tomorrow, but I don't yet know whether anywhere else could cause it as well 🤔

@kou kou changed the title [Python][parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset [Python][Parquet] Empty row groups left behind after hitting max_rows_per_file in ds.write_dataset Feb 8, 2024
@mapleFU
Member

mapleFU commented Feb 8, 2024

@ion-elgreco Would you mind trying #39995? I think it might fix the problem.

@ion-elgreco
Author

@mapleFU is there a guide I can follow to compile from source? I've never touched a C++ codebase before.

@mapleFU
Member

mapleFU commented Feb 8, 2024

https://arrow.apache.org/docs/developers/python.html
It's a bit involved because it requires building Arrow C++ as well.

@ion-elgreco
Author

@mapleFU OK, I'll need to find some time to set this up, then. I'll try it out over the weekend.

mapleFU added a commit that referenced this issue Feb 23, 2024
…ax_rows_per_file` enabled (#39995)

### Rationale for this change

`DatasetWriter` might create an empty `RecordBatch` when `max_rows_per_file` is enabled. This happens because `NextWritableChunk` can return a zero-sized batch when the file already contains exactly the destination number of rows.

### What changes are included in this PR?

Check for batch size == 0 when appending to the file queue.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Users no longer get zero-sized row groups/batches.

* Closes: #39965

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
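
For readers who don't want to read the C++ diff, the logic of the fix can be sketched in Python; next_writable_chunk and enqueue_for_file below are illustrative stand-ins, not the actual Arrow internals:

def next_writable_chunk(batch, rows_in_file, max_rows_per_file):
    # Rows the current file can still accept; zero when the file already
    # holds exactly max_rows_per_file rows -- the case that used to yield
    # an empty trailing row group.
    remaining = max_rows_per_file - rows_in_file
    return batch.slice(0, remaining)

def enqueue_for_file(queue, chunk):
    # The guard added in #39995, in spirit: drop zero-sized chunks instead
    # of queuing them, so they are never flushed as empty row groups.
    if chunk.num_rows == 0:
        return
    queue.append(chunk)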
@mapleFU mapleFU added this to the 16.0.0 milestone Feb 23, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…hen `max_rows_per_file` enabled (apache#39995)
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…hen `max_rows_per_file` enabled (apache#39995)