Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No longer write Parquet column metadata after column chunks *and* in the footer #6117

Merged
merged 21 commits into from
Aug 2, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jul 25, 2024

Which issue does this PR close?

Closes #6115.

Rationale for this change

This metadata is now deprecated.

What changes are included in this PR?

Leaves file_offset as 0 and deprecates setting the field. Also stops writing an extra copy of ColumnMetaData.

Are there any user-facing changes?

Yes, since some metadata is no longer written (but considering it was incorrect, likely it is not widely used). Also deprecates ColumnChunkMetaDataBuilder::set_file_offset().

BugenZhao and others added 13 commits July 16, 2024 18:05
)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* fix example tests

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
…ally copies data (apache#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy
* improve dispaly for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.
* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays
* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…MetaData` (apache#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 25, 2024
@alamb alamb deleted the branch apache:master July 26, 2024 10:11
@alamb alamb closed this Jul 26, 2024
@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

I merged the 53 dev branch and that seems to have closed this PR -- any chance you can retarget main?

Update: I restored the branch

@alamb alamb reopened this Jul 26, 2024
@alamb alamb added the api-change Changes to the arrow API label Jul 26, 2024
@alamb alamb changed the title No longer write inline Parquet column metadata No longer write Parquet column metadata after column chunks *and* in the footer Jul 26, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl

I reviewed this carefully, not just the code but also the other uses of this field. I think this is a good change.

I did find a few other places that I think could be useful to deprecate / add some warnings / notes about. For example

https://github.com/apache/arrow-rs/blob/978264d49dc27225b570378e491ad69ae19f45cb/parquet/src/file/metadata/mod.rs#L685-L684

@@ -1023,8 +1017,6 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
}

let metadata = builder.build()?;
self.page_writer.write_metadata(&metadata)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the major functional change I think -- I expect it would result in slightly smaller parquet files as the per column metadata is no longer written twice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac is here some impl using this metadata? Would a flag being better here? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has been discussed in the ML and should be safe to remove this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://lists.apache.org/thread/r6r2cvzrdoorq6h6gqwh0b1hbfbhxv29
apache/parquet-format#440

Would you mind find the detail message for that? I'll linking it to C++ issues there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best evidence that no readers use the copy outside the footer is this https://lists.apache.org/thread/ob2szsb0cv6fpwmvknss9jot1oqnxzsp

If any readers were using the alternate metadata they would have submitted bug reports by now 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've sync with gang and I got the info that Parquet-Java doesn't write that and doesn't impl doesn't read that, let's rush forward

parquet/src/file/metadata/mod.rs Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Can you please retarget this PR to the main branch and we'll get it merged in ?

let map_offset = |x| x - src_offset + write_offset as i64;
let mut builder = ColumnChunkMetaData::builder(metadata.column_descr_ptr())
.set_compression(metadata.compression())
.set_encodings(metadata.encodings().clone())
.set_file_offset(file_offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below in line 677 the metadata is still written to self.buf, should that also be skipped? And maybe the whole PageWriter::write_metadata method could be removed or deprecated?

Copy link
Contributor Author

@etseidl etseidl Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe this needs to be made clearer in the documentation. The spec still requires the field to be written so old readers don't break (i.e. it is still marked required in the thrift IDL). The correct behavior now is to write a 0 here.

Oh, I misunderstood your comment...the write below is correct as this is writing the copy of the ColumnMetaData in the footer. The change in this PR is to not also write a copy of ColumnMetaData at the end of each column chunk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the write below is correct as this is writing the copy of the ColumnMetaData in the footer

The code is hard to follow, but my understanding is that this method copies an existing ColumnChunk from a ChunkReader. The buf seems to be the same that page data is written too, and I assume that append_column could be called multiple times . The metadata seems to also get collected inside the on_close closure and will in the end be written by SerializedFileWriter::write_metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that this code:

close.metadata = builder.build()?;

Sets the metadata that will be eventually written into the file footer.

This code, perhaps what @jhorstmann is referring to (the line numbers appear to have changed):

        SerializedPageWriter::new(self.buf).write_metadata(&metadata)?;

Also seems to write the contents of close.metadata into the page.

In this case, it does seem like it could be avoided too

Copy link
Contributor Author

@etseidl etseidl Jul 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, I had somehow convinced myself that this method was in the normal write path and that the write in question was necessary :( I think I need to add a check that file_offset is 0 to several tests. And also check for gaps between column chunks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that rippled a bit. 😅 Let me know if this is ok, or if write_metadata should be deprecated first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing write_metadata looks good to me

@etseidl etseidl changed the base branch from 53.0.0-dev to master July 26, 2024 14:45
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Jul 26, 2024
@mapleFU
Copy link
Member

mapleFU commented Jul 26, 2024

Does this pr remain un-related patches? I found this is hard to review 🤔

@etseidl
Copy link
Contributor Author

etseidl commented Jul 26, 2024

Does this pr remain un-related patches? I found this is hard to review 🤔

Eek, I'm not quite sure what happened...let me try to merge with master and see if that clears things up. Thanks @mapleFU for pointing this out!

Ok, it's fixed now.

@github-actions github-actions bot removed arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Jul 26, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl and @jhorstmann -- I think this looks good to me. I'll leave this PR open for another day or two and then I plan to merge it

/// Byte offset in `file_path()`.
/// Byte offset of `ColumnMetaData` in `file_path()`.
///
/// Note that the meaning of this field has been inconsistent between implementations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let map_offset = |x| x - src_offset + write_offset as i64;
let mut builder = ColumnChunkMetaData::builder(metadata.column_descr_ptr())
.set_compression(metadata.compression())
.set_encodings(metadata.encodings().clone())
.set_file_offset(file_offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think removing write_metadata looks good to me

@alamb alamb merged commit f2de2cd into apache:master Aug 2, 2024
16 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 2, 2024

Thanks again 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ColumnMetaData should no longer be written inline with data
10 participants