Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43427: [C++][Parquet] Deprecate ColumnChunk::file_offset field and no longer write Metadata at end of Chunk #43428

Merged
merged 11 commits into from
Aug 7, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Jul 26, 2024

Rationale for this change

See github issue

What changes are included in this PR?

Force ColumnChunk::file_offset tobe 0

Are these changes tested?

No

Are there any user-facing changes?

Might affect user using legacy ColumnChunk::file_offset

@mapleFU mapleFU requested a review from wgtmac as a code owner July 26, 2024 03:29
Copy link

⚠️ GitHub issue #43427 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Jul 26, 2024

Also cc @emkornfield @etseidl

@github-actions github-actions bot added the awaiting review Awaiting review label Jul 26, 2024
@wgtmac
Copy link
Member

wgtmac commented Jul 26, 2024

We should not write ColumnMetaData at the end of ColumnChunk any more:

metadata_->WriteTo(in_memory_sink_.get());

@mapleFU
Copy link
Member Author

mapleFU commented Aug 3, 2024

@wgtmac @etseidl I've update the comment and impl, would you mind take a look again? I'll check in before Auguest 6th if pitrou is not back 🤔

@mapleFU mapleFU changed the title GH-43427: [C++][Parquet] Deprecate ColumnChunk::file_offset field GH-43427: [C++][Parquet] Deprecate ColumnChunk::file_offset field and no longer write Metadata at end of Chunk Aug 3, 2024
// Write metadata at end of column chunk
metadata_->WriteTo(sink_.get());

// Not write metadata at end of column chunk since we will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather removing these lines to not confuse readers.

@@ -667,8 +669,9 @@ class BufferedPageWriter : public PageWriter {
has_dictionary, fallback, pager_->dict_encoding_stats_,
pager_->data_encoding_stats_, pager_->meta_encryptor_);

// Write metadata at end of column chunk
metadata_->WriteTo(in_memory_sink_.get());
// Not write metadata at end of column chunk since we will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment on lines 1540 to 1542
// https://github.com/apache/parquet-format/pull/440
// The `file_offset` field is deprecated and should be set to 0 for writer
// if the column chunk has not been written outsidethe footer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// https://github.com/apache/parquet-format/pull/440
// The `file_offset` field is deprecated and should be set to 0 for writer
// if the column chunk has not been written outsidethe footer.
// The `file_offset` field is deprecated and should be set to 0.
// See https://github.com/apache/parquet-format/pull/440 for detail.

// column chunk
// Byte offset of `ColumnMetaData` in `file_path()`.
//
// Note that the meaning of this field has been inconsistent between implementations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Note that the meaning of this field has been inconsistent between implementations
// Note that the meaning of this field has been inconsistent among implementations

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 3, 2024
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits. Looks great. Thanks for doing this @mapleFU!

// Byte offset of `ColumnMetaData` in `file_path()`.
//
// Note that the meaning of this field has been inconsistent between implementations
// so its use has since been deprecated in the Parquet specification. Moder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// so its use has since been deprecated in the Parquet specification. Moder
// so its use has since been deprecated in the Parquet specification. Modern

}
// https://github.com/apache/parquet-format/pull/440
// The `file_offset` field is deprecated and should be set to 0 for writer
// if the column chunk has not been written outsidethe footer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// if the column chunk has not been written outsidethe footer.
// if the column chunk has not been written outside the footer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is removed lol

@mapleFU
Copy link
Member Author

mapleFU commented Aug 3, 2024

Would fix this later:

    @pytest.mark.pandas
    def test_parquet_metadata_api():

@mapleFU
Copy link
Member Author

mapleFU commented Aug 4, 2024

Failed CI is unrelated. Will merge if no more negative comment in August 6th

@mapleFU mapleFU requested a review from pitrou August 6, 2024 02:32
@mapleFU
Copy link
Member Author

mapleFU commented Aug 7, 2024

Merge because collect 1 approve

@mapleFU mapleFU merged commit 9b58454 into apache:main Aug 7, 2024
30 of 33 checks passed
@mapleFU mapleFU removed the awaiting committer review Awaiting committer review label Aug 7, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 9b58454.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 31 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants