-
Notifications
You must be signed in to change notification settings - Fork 770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate metadata fetch from ArrowReaderBuilder
construction (#4674)
#4676
Conversation
|
||
/// A synchronous builder used to construct [`ParquetRecordBatchReader`] for a file | ||
/// | ||
/// For an async API see [`crate::arrow::async_reader::ParquetRecordBatchStreamBuilder`] | ||
pub type ParquetRecordBatchReaderBuilder<T> = ArrowReaderBuilder<SyncReader<T>>; | ||
|
||
impl<T: ChunkReader + 'static> ArrowReaderBuilder<SyncReader<T>> { | ||
impl<T: ChunkReader + 'static> ParquetRecordBatchReaderBuilder<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[doc(hidden)] | ||
/// A newtype used within [`ReaderOptionsBuilder`] to distinguish sync readers from async | ||
pub struct SyncReader<T: ChunkReader>(SerializedFileReader<T>); | ||
pub struct SyncReader<T: ChunkReader>(T); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing to use SerializedFileReader would have meant adding APIs to load metadata and then construct it from said metadata. It seemed simpler to just use the lower level SerializedPageReader as is done by the async API. This does mean the arrow readers no longer make use of the file APIs, but I think this is fine
/// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`] | ||
/// | ||
/// See [`Self::new_with_metadata`] for how this can be used | ||
pub fn load_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally sold on this method naming, nor whether it would be better to define a ArrowReaderMetadata::load
and ArrowReaderMetadata::load_async
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata
struct rather than on the related (but different) ParquetRecordBatchReaderBuilder
etc
Thus I like the idea of ArrowReaderMetadata::load
and ArrowReaderMetadata::load_async
but I don't feel super strongly about this -- maybe others do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc
Agree with this. Personally it would make more sense to me if it was on ParquetRecordBatchReaderBuilder
but I also don't have strong feelings either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to me if it was on ParquetRecordBatchReaderBuilder
Do you mean ArrowReaderMetadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sorry that's what I meant.
LGTM. Thanks! |
ArrowReaderBuilder
construction (#4674)
Why the API change label? I don't believe this breaks any public APIs? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the context for this PR? I feel like there must be some backstory but it doesn't appear to be linked. Update: I found the related conversation #4674 and linked it here)
I think the code looks good and well documented to me.
I marked this PR as an api-change
as I think it is -- let me know if you disagree
Finally, I think this PR would be strongly with a test / example of the stated usecase: read metadata once, but create multiple readers for the same file.
cc @Dandandan and @thinkharderdev who I think are knowledgable about the use of this functionality upstream in DataFusion
@@ -155,13 +155,13 @@ impl ParquetMetaData { | |||
} | |||
|
|||
/// Override the column index | |||
#[allow(dead_code)] | |||
#[cfg(feature = "arrow")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`] | ||
/// | ||
/// See [`Self::new_with_metadata`] for how this can be used | ||
pub fn load_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata
struct rather than on the related (but different) ParquetRecordBatchReaderBuilder
etc
Thus I like the idea of ArrowReaderMetadata::load
and ArrowReaderMetadata::load_async
but I don't feel super strongly about this -- maybe others do
@@ -234,48 +221,187 @@ impl ArrowReaderOptions { | |||
} | |||
} | |||
|
|||
/// The clone-able metadata necessary to construct a [`ArrowReaderBuilder`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// The clone-able metadata necessary to construct a [`ArrowReaderBuilder`] | |
/// The metadata necessary to construct a [`ArrowReaderBuilder`]. | |
/// | |
/// This structure is inexpensive to clone. |
Where is the breaking change you are refering to?
? |
I was thinking of https://github.com/apache/arrow-rs/pull/4676/files#r1289997977 but I suppose that since
As sorry I missed that. |
The tuple field is not public, so cannot be used by external code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me
/// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`] | ||
/// | ||
/// See [`Self::new_with_metadata`] for how this can be used | ||
pub fn load_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc
Agree with this. Personally it would make more sense to me if it was on ParquetRecordBatchReaderBuilder
but I also don't have strong feelings either way
Which issue does this PR close?
Related to #4674
Rationale for this change
This makes it easier to load the metadata once, and then use it to construct multiple readers
What changes are included in this PR?
Are there any user-facing changes?