GH-36280: [Parquet][C++] FileWriter supports WriteTable in the buffered mode #36377
Conversation
@jorisvandenbossche @pitrou Mind taking a look? WriteTable with buffering is added now.
LGTM, most comments address readability.
I guess this would also close: #31405
@pitrou Would you mind taking a look at this? This adds a buffering mode for parquet::arrow::FileWriter.
+1
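The buffered mode under discussion can be modeled with a small self-contained sketch. BufferedRowGroupWriter and its members below are hypothetical stand-ins, not the actual parquet::arrow API; the sketch only shows how buffering folds many small batches into few row groups instead of one row group per write call.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the buffered-write bookkeeping, NOT the real
// parquet::arrow API: it only models how buffered writes fold many small
// batches into few row groups.
class BufferedRowGroupWriter {
 public:
  explicit BufferedRowGroupWriter(int64_t max_row_group_size)
      : max_row_group_size_(max_row_group_size) {}

  // Buffer `num_rows` more rows; flush a row group each time the buffer
  // reaches max_row_group_size_.
  void WriteBatch(int64_t num_rows) {
    while (num_rows > 0) {
      int64_t take = std::min(num_rows, max_row_group_size_ - buffered_rows_);
      buffered_rows_ += take;
      num_rows -= take;
      if (buffered_rows_ == max_row_group_size_) Flush();
    }
  }

  // Flush any remaining buffered rows as a final (possibly short) row group.
  void Close() {
    if (buffered_rows_ > 0) Flush();
  }

  int num_row_groups() const { return num_row_groups_; }

 private:
  void Flush() {
    ++num_row_groups_;
    buffered_rows_ = 0;
  }

  const int64_t max_row_group_size_;
  int64_t buffered_rows_ = 0;
  int num_row_groups_ = 0;
};
```

Under this model, ten batches of 10 rows with a row-group limit of 100 yield a single row group, whereas an unbuffered path would emit one row group per write call.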
int column_index_start = 0;
for (int i = 0; i < batch.num_columns(); i++) {
  std::shared_ptr<ChunkedArray> chunked_array = GetColumnChunkedArray(batch, i);
It is a bit wasteful to call this for each chunk. Why not use a RecordBatchReader instead?
std::shared_ptr<ChunkedArray> GetColumnChunkedArray(const ::arrow::RecordBatch& value,
int column_id) {
return std::make_shared<::arrow::ChunkedArray>(value.column(column_id));
}
std::shared_ptr<ChunkedArray> GetColumnChunkedArray(const ::arrow::Table& value,
int column_id) {
return value.column(column_id);
}
It seems that here I want to call ArrowColumnWriterV2::Make, which requires a ChunkedArray, so I'm not sure why this is expensive, or how a RecordBatch can handle this...
Ok, then you can simply call GetColumnChunkedArray(batch, i) outside of WriteBatch. It does not depend on offset/size.
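The suggested hoisting can be sketched in a self-contained way. RecordBatch, ChunkedArray, and the slice loop below are simplified toy stand-ins for the Arrow types (not the real arrow:: classes); the point is only that the per-column wrapper is built once, outside the per-slice loop, since it does not depend on offset/size.

```cpp
#include <memory>
#include <vector>

// Toy stand-ins for the Arrow types, just to illustrate the hoisting; these
// are NOT the real arrow::RecordBatch / arrow::ChunkedArray.
struct Column { std::vector<int> values; };
struct RecordBatch { std::vector<Column> columns; };
struct ChunkedArray { const Column* column; };

int g_wrap_calls = 0;  // counts how often a column gets wrapped

std::shared_ptr<ChunkedArray> GetColumnChunkedArray(const RecordBatch& batch, int i) {
  ++g_wrap_calls;  // each call allocates a fresh wrapper
  return std::make_shared<ChunkedArray>(ChunkedArray{&batch.columns[i]});
}

// Write each column slice-by-slice; the wrapper is created once per column,
// hoisted out of the slice loop, because it does not depend on offset/size.
int WriteSliced(const RecordBatch& batch, int slice_size) {
  int writes = 0;
  for (int i = 0; i < static_cast<int>(batch.columns.size()); ++i) {
    auto chunked = GetColumnChunkedArray(batch, i);  // hoisted
    const int n = static_cast<int>(chunked->column->values.size());
    for (int offset = 0; offset < n; offset += slice_size) {
      ++writes;  // stand-in for WriteBatch(*chunked, offset, slice_size)
    }
  }
  return writes;
}
```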
@@ -5024,6 +5024,41 @@ class TestBufferedParquetIO : public TestParquetIO<TestType> {
  ASSERT_OK_NO_THROW(writer->Close());
}

void WriteBufferedTable(const std::shared_ptr<Array>& values,
Can you please refactor this with WriteBufferedFile to avoid copy-pasting entire chunks of code?
Hmmm, I guess I use some different logic here; I'll try to unify them.
}

TYPED_TEST(TestBufferedParquetIO, WriteTableLarge) {
  std::shared_ptr<Array> values;
Can you also avoid repeating yourself in the tests below?
this->WriteBufferedTable(values, write_table_batch_size, write_table_max_row_group_size,
                         write_max_row_group_size, &num_row_groups);
EXPECT_EQ(1, num_row_groups);
ASSERT_NO_FATAL_FAILURE(this->ReadAndCheckSingleColumnFile(*values, num_row_groups));
Are there tests with more than one column somewhere?
No. I'll add one.
Rationale for this change

Allow a buffered WriteTable.

What changes are included in this PR?

- use_buffering in WriteTable
- WriteBuffered in parquet::arrow::FileWriterImpl, and move the logic of parquet::arrow::FileWriterImpl::WriteRecordBatch to it

Are these changes tested?
Yes
Are there any user-facing changes?

Users can write a table with buffered row groups, instead of using write_batch to directly write batches. See #36280.