Separate metadata fetch from `ArrowReaderBuilder` construction (#4674) #4676

tustvold · 2023-08-10T11:41:31Z

Which issue does this PR close?

Related to #4674

Rationale for this change

This makes it easier to load the metadata once, and then use it to construct multiple readers
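To make the use case concrete, here is a minimal std-only sketch of the "load once, build many readers" pattern. It does not use the parquet crate: `Footer`, `ReaderMetadata`, `Reader`, `build_readers` and the `loads` counter are hypothetical stand-ins chosen for illustration, and the comma-separated "footer" is an assumption that replaces real metadata parsing.

```rust
use std::cell::Cell;
use std::sync::Arc;

/// Hypothetical stand-in for the parsed footer of a file.
#[derive(Debug)]
struct Footer {
    columns: Vec<String>,
}

/// Hypothetical stand-in for the reader metadata.
/// Cloning copies an `Arc` pointer, not the footer itself.
#[derive(Clone, Debug)]
struct ReaderMetadata {
    footer: Arc<Footer>,
}

impl ReaderMetadata {
    /// Pretends to parse a footer. `loads` counts how often this
    /// (notionally expensive) step runs.
    fn load(source: &str, loads: &Cell<usize>) -> Self {
        loads.set(loads.get() + 1);
        let columns = source.split(',').map(|c| c.trim().to_string()).collect();
        ReaderMetadata { footer: Arc::new(Footer { columns }) }
    }
}

/// Hypothetical reader constructed from already-loaded metadata.
struct Reader {
    metadata: ReaderMetadata,
    projection: usize,
}

impl Reader {
    fn new_with_metadata(metadata: ReaderMetadata, projection: usize) -> Self {
        Reader { metadata, projection }
    }

    /// Name of the single column this reader projects.
    fn projected_column(&self) -> &str {
        &self.metadata.footer.columns[self.projection]
    }
}

/// Loads the metadata once, then builds one reader per column from clones of it.
fn build_readers(source: &str, loads: &Cell<usize>) -> Vec<Reader> {
    let metadata = ReaderMetadata::load(source, loads);
    (0..metadata.footer.columns.len())
        .map(|i| Reader::new_with_metadata(metadata.clone(), i))
        .collect()
}
```

Calling `build_readers("id, name, score", &loads)` runs the load step a single time, and every resulting reader shares the same footer allocation.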

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2023-08-10T11:42:05Z

parquet/src/arrow/arrow_reader/mod.rs


 /// A synchronous builder used to construct [`ParquetRecordBatchReader`] for a file
 ///
 /// For an async API see [`crate::arrow::async_reader::ParquetRecordBatchStreamBuilder`]
 pub type ParquetRecordBatchReaderBuilder<T> = ArrowReaderBuilder<SyncReader<T>>;

-impl<T: ChunkReader + 'static> ArrowReaderBuilder<SyncReader<T>> {
+impl<T: ChunkReader + 'static> ParquetRecordBatchReaderBuilder<T> {


This does not change the type, but improves the docs rendering as the methods will be shown for the typedef
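As a small illustration of why the type is unchanged, the sketch below writes an inherent impl against a type alias and then calls it through both the alias and the spelled-out type. `Builder`, `SyncInput` and `SyncBuilder` are hypothetical names standing in for the real types, and the default batch size of 1024 is an assumption made for the example.

```rust
/// Hypothetical generic builder.
struct Builder<R> {
    input: R,
    batch_size: usize,
}

/// Hypothetical marker newtype for synchronous inputs.
struct SyncInput<T>(T);

/// Alias for the synchronous flavour of the builder.
type SyncBuilder<T> = Builder<SyncInput<T>>;

// An impl written against the alias is an impl on `Builder<SyncInput<T>>`:
// both paths name the same type, only the spelling differs.
impl<T> SyncBuilder<T> {
    fn new(input: T) -> Self {
        Builder { input: SyncInput(input), batch_size: 1024 }
    }

    /// Returns the wrapped input.
    fn into_inner(self) -> T {
        self.input.0
    }
}

/// Constructs the builder through the alias.
fn batch_size_via_alias() -> usize {
    let b: SyncBuilder<&str> = SyncBuilder::new("file");
    b.batch_size
}

/// Constructs the builder through the full type; the same `new` is found.
fn batch_size_via_full_type() -> usize {
    let b: Builder<SyncInput<&str>> = Builder::<SyncInput<&str>>::new("file");
    b.batch_size
}
```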

tustvold · 2023-08-10T11:44:14Z

parquet/src/arrow/arrow_reader/mod.rs

 #[doc(hidden)]
 /// A newtype used within [`ReaderOptionsBuilder`] to distinguish sync readers from async
-pub struct SyncReader<T: ChunkReader>(SerializedFileReader<T>);
+pub struct SyncReader<T: ChunkReader>(T);


Continuing to use SerializedFileReader would have meant adding APIs to load metadata and then construct it from said metadata. It seemed simpler to just use the lower-level SerializedPageReader, as is done by the async API. This does mean the arrow readers no longer make use of the file APIs, but I think this is fine.

tustvold · 2023-08-10T11:46:09Z

parquet/src/arrow/arrow_reader/mod.rs

+    /// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`]
+    ///
+    /// See [`Self::new_with_metadata`] for how this can be used
+    pub fn load_metadata(


I'm not totally sold on this method naming, nor on whether it would be better to define an ArrowReaderMetadata::load and ArrowReaderMetadata::load_async.

I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Thus I like the idea of ArrowReaderMetadata::load and ArrowReaderMetadata::load_async but I don't feel super strongly about this -- maybe others do

I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Agree with this. Personally it would make more sense to me if it was on ParquetRecordBatchReaderBuilder but I also don't have strong feelings either way

to me if it was on ParquetRecordBatchReaderBuilder

Do you mean ArrowReaderMetadata?

Yeah, sorry that's what I meant.

RinChanNOWWW · 2023-08-10T12:33:46Z

LGTM. Thanks!

tustvold · 2023-08-10T13:10:21Z

Why the API change label? I don't believe this breaks any public APIs?

alamb

What is the context for this PR? I feel like there must be some backstory but it doesn't appear to be linked. (Update: I found the related conversation #4674 and linked it here)

I think the code looks good and well documented to me.

I marked this PR as an api-change as I think it is -- let me know if you disagree

Finally, I think this PR would be stronger with a test / example of the stated use case: read metadata once, but create multiple readers for the same file.

cc @Dandandan and @thinkharderdev who I think are knowledgeable about the use of this functionality upstream in DataFusion

alamb · 2023-08-10T13:12:10Z

parquet/src/file/metadata.rs

@@ -155,13 +155,13 @@ impl ParquetMetaData {
    }

    /// Override the column index
-    #[allow(dead_code)]
+    #[cfg(feature = "arrow")]


parquet/src/file/serialized_reader.rs

alamb · 2023-08-10T13:15:38Z

parquet/src/arrow/arrow_reader/mod.rs

+    /// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`]
+    ///
+    /// See [`Self::new_with_metadata`] for how this can be used
+    pub fn load_metadata(


I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Thus I like the idea of ArrowReaderMetadata::load and ArrowReaderMetadata::load_async but I don't feel super strongly about this -- maybe others do

alamb · 2023-08-10T13:18:41Z

parquet/src/arrow/arrow_reader/mod.rs

@@ -234,48 +221,187 @@ impl ArrowReaderOptions {
    }
 }

+/// The clone-able metadata necessary to construct a [`ArrowReaderBuilder`]


Suggested change

-/// The clone-able metadata necessary to construct a [`ArrowReaderBuilder`]
+/// The metadata necessary to construct a [`ArrowReaderBuilder`].
+///
+/// This structure is inexpensive to clone.

tustvold · 2023-08-10T13:33:19Z

I marked this PR as an api-change as I think it is -- let me know if you disagree

Where is the breaking change you are referring to?

Finally, I think this PR would be stronger with a test / example of the stated use case: read metadata once, but create multiple readers for the same file.

https://github.com/apache/arrow-rs/pull/4676/files#diff-850b3a44587149637b8545f66603a2b1252959fd36f7ddc55f37d6b5357816c6R374

?

alamb · 2023-08-10T13:38:25Z

I marked this PR as an api-change as I think it is -- let me know if you disagree

Where is the breaking change you are referring to?

I was thinking of https://github.com/apache/arrow-rs/pull/4676/files#r1289997977 but I suppose that since SyncReader isn't documented it is unlikely that anyone is using it directly 🤔

Finally, I think this PR would be stronger with a test / example of the stated use case: read metadata once, but create multiple readers for the same file.

https://github.com/apache/arrow-rs/pull/4676/files#diff-850b3a44587149637b8545f66603a2b1252959fd36f7ddc55f37d6b5357816c6R374

Ah sorry, I missed that.

tustvold · 2023-08-10T13:39:58Z

I was thinking of https://github.com/apache/arrow-rs/pull/4676/files#r1289997977 but I suppose that since SyncReader isn't documented it is unlikely that anyone is using it directly 🤔

The tuple field is not public, so cannot be used by external code
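For readers less familiar with Rust visibility, a minimal sketch of the point: a public tuple struct whose field lacks `pub` can be named by external code, but not constructed or unpacked by it. The module name `reader` and the `new` / `get_ref` methods here are hypothetical, added only so the example has something to call.

```rust
mod reader {
    /// Public newtype; the tuple field has no `pub`, so it is private to this module.
    pub struct SyncReader<T>(T);

    impl<T> SyncReader<T> {
        /// Hypothetical constructor: the only way for outside code to build the type.
        pub fn new(inner: T) -> Self {
            SyncReader(inner)
        }

        /// Hypothetical accessor: the only way for outside code to see the field.
        pub fn get_ref(&self) -> &T {
            &self.0
        }
    }
}

/// Code outside `reader` must go through the public methods.
fn wrap_and_read() -> u32 {
    let r = reader::SyncReader::new(7u32);
    // Writing `r.0` or `reader::SyncReader(7u32)` here would not compile,
    // because the field is private to the `reader` module.
    *r.get_ref()
}
```

Because nothing outside the module can touch the field, changing what the newtype wraps only affects code inside the module.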

thinkharderdev

Makes sense to me

thinkharderdev · 2023-08-10T15:59:20Z

parquet/src/arrow/arrow_reader/mod.rs

+    /// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`]
+    ///
+    /// See [`Self::new_with_metadata`] for how this can be used
+    pub fn load_metadata(


I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Agree with this. Personally it would make more sense to me if it was on ParquetRecordBatchReaderBuilder but I also don't have strong feelings either way

Separate metadata fetch from builder construction (apache#4674)

6b58549

github-actions bot added the parquet Changes to the parquet crate label Aug 10, 2023

tustvold commented Aug 10, 2023

tustvold added 2 commits August 10, 2023 12:47

Clippy

230c8a9

Docs tweaks

ec0636f

alamb added the api-change Changes to the arrow API label Aug 10, 2023

alamb changed the title ~~Separate metadata fetch from builder construction (#4674)~~ Separate metadata fetch from ArrowReaderBuilder construction (#4674) Aug 10, 2023

alamb approved these changes Aug 10, 2023

tustvold removed the api-change Changes to the arrow API label Aug 10, 2023

thinkharderdev approved these changes Aug 10, 2023

tustvold added 2 commits August 10, 2023 17:13

Wrap ParquetField in Arc

3dff40c

Move load to ArrowReaderMetadata

9c06d89

tustvold merged commit ea19ce8 into apache:master Aug 10, 2023
16 checks passed
