Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate metadata fetch from ArrowReaderBuilder construction (#4674) #4676

Merged
merged 5 commits into from
Aug 10, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Aug 10, 2023

Which issue does this PR close?

Related to #4674

Rationale for this change

This makes it easier to load the metadata once, and then use it to construct multiple readers

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 10, 2023

/// A synchronous builder used to construct [`ParquetRecordBatchReader`] for a file
///
/// For an async API see [`crate::arrow::async_reader::ParquetRecordBatchStreamBuilder`]
pub type ParquetRecordBatchReaderBuilder<T> = ArrowReaderBuilder<SyncReader<T>>;

impl<T: ChunkReader + 'static> ArrowReaderBuilder<SyncReader<T>> {
impl<T: ChunkReader + 'static> ParquetRecordBatchReaderBuilder<T> {
Copy link
Contributor Author

@tustvold tustvold Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not change the type, but improves the docs rendering as the methods will be shown for the typedef

image

#[doc(hidden)]
/// A newtype used within [`ReaderOptionsBuilder`] to distinguish sync readers from async
pub struct SyncReader<T: ChunkReader>(SerializedFileReader<T>);
pub struct SyncReader<T: ChunkReader>(T);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing to use SerializedFileReader would have meant adding APIs to load metadata and then construct it from said metadata. It seemed simpler to just use the lower level SerializedPageReader as is done by the async API. This does mean the arrow readers no longer make use of the file APIs, but I think this is fine

/// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`]
///
/// See [`Self::new_with_metadata`] for how this can be used
pub fn load_metadata(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sold on this method naming, nor whether it would be better to define a ArrowReaderMetadata::load and ArrowReaderMetadata::load_async.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Thus I like the idea of ArrowReaderMetadata::load and ArrowReaderMetadata::load_async but I don't feel super strongly about this -- maybe others do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Agree with this. Personally it would make more sense to me if it was on ParquetRecordBatchReaderBuilder but I also don't have strong feelings either way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me if it was on ParquetRecordBatchReaderBuilder

Do you mean ArrowReaderMetadata?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry that's what I meant.

@RinChanNOWWW
Copy link
Contributor

LGTM. Thanks!

@alamb alamb added the api-change Changes to the arrow API label Aug 10, 2023
@alamb alamb changed the title Separate metadata fetch from builder construction (#4674) Separate metadata fetch from ArrowReaderBuilder construction (#4674) Aug 10, 2023
@tustvold
Copy link
Contributor Author

Why the API change label? I don't believe this breaks any public APIs?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the context for this PR? I feel like there must be some backstory but it doesn't appear to be linked. Update: I found the related conversation #4674 and linked it here)

I think the code looks good and well documented to me.

I marked this PR as an api-change as I think it is -- let me know if you disagree

Finally, I think this PR would be strongly with a test / example of the stated usecase: read metadata once, but create multiple readers for the same file.

cc @Dandandan and @thinkharderdev who I think are knowledgable about the use of this functionality upstream in DataFusion

@@ -155,13 +155,13 @@ impl ParquetMetaData {
}

/// Override the column index
#[allow(dead_code)]
#[cfg(feature = "arrow")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

parquet/src/file/serialized_reader.rs Show resolved Hide resolved
/// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`]
///
/// See [`Self::new_with_metadata`] for how this can be used
pub fn load_metadata(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Thus I like the idea of ArrowReaderMetadata::load and ArrowReaderMetadata::load_async but I don't feel super strongly about this -- maybe others do

@@ -234,48 +221,187 @@ impl ArrowReaderOptions {
}
}

/// The clone-able metadata necessary to construct a [`ArrowReaderBuilder`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The clone-able metadata necessary to construct a [`ArrowReaderBuilder`]
/// The metadata necessary to construct a [`ArrowReaderBuilder`].
///
/// This structure is inexpensive to clone.

@tustvold tustvold removed the api-change Changes to the arrow API label Aug 10, 2023
@tustvold
Copy link
Contributor Author

I marked this PR as an api-change as I think it is -- let me know if you disagree

Where is the breaking change you are refering to?

Finally, I think this PR would be strongly with a test / example of the stated usecase: read metadata once, but create multiple readers for the same file.

https://github.com/apache/arrow-rs/pull/4676/files#diff-850b3a44587149637b8545f66603a2b1252959fd36f7ddc55f37d6b5357816c6R374

?

@alamb
Copy link
Contributor

alamb commented Aug 10, 2023

I marked this PR as an api-change as I think it is -- let me know if you disagree

Where is the breaking change you are refering to?

I was thinking of https://github.com/apache/arrow-rs/pull/4676/files#r1289997977 but I suppose that since SyncReader isn't documented it is unlikely that anyone is using it directly 🤔

Finally, I think this PR would be strongly with a test / example of the stated usecase: read metadata once, but create multiple readers for the same file.

https://github.com/apache/arrow-rs/pull/4676/files#diff-850b3a44587149637b8545f66603a2b1252959fd36f7ddc55f37d6b5357816c6R374

As sorry I missed that.

@tustvold
Copy link
Contributor Author

was thinking of https://github.com/apache/arrow-rs/pull/4676/files#r1289997977 but I suppose that since SyncReader isn't documented it is unlikely that anyone is using it directly thinking

The tuple field is not public, so cannot be used by external code

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

/// Loads [`ArrowReaderMetadata`] from the provided [`ChunkReader`]
///
/// See [`Self::new_with_metadata`] for how this can be used
pub fn load_metadata(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think it would be easier to reason about if the code to load metadata was on the ArrowReaderMetadata struct rather than on the related (but different) ParquetRecordBatchReaderBuilder etc

Agree with this. Personally it would make more sense to me if it was on ParquetRecordBatchReaderBuilder but I also don't have strong feelings either way

@tustvold tustvold merged commit ea19ce8 into apache:master Aug 10, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants