-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No longer write Parquet column metadata after column chunks *and* in the footer #6117
Conversation
…ally copies data (apache#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy
* improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style.
* update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays
… end (#…" (apache#5933) This reverts commit 22e0b44.
This reverts commit 756b1fb.
* Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…MetaData` (apache#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
I merged the 53 dev branch Update: I restored the branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl
I reviewed this carefully, not just the code but also the other uses of this field. I think this is a good change.
I did find a few other places that I think could be useful to deprecate / add some warnings / notes about. For example
@@ -1023,8 +1017,6 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> { | |||
} | |||
|
|||
let metadata = builder.build()?; | |||
self.page_writer.write_metadata(&metadata)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the major functional change I think -- I expect it would result in slightly smaller parquet files as the per column metadata is no longer written twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac is here some impl using this metadata? Would a flag being better here? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this has been discussed in the ML and should be safe to remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://lists.apache.org/thread/r6r2cvzrdoorq6h6gqwh0b1hbfbhxv29
apache/parquet-format#440
Would you mind find the detail message for that? I'll linking it to C++ issues there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best evidence that no readers use the copy outside the footer is this https://lists.apache.org/thread/ob2szsb0cv6fpwmvknss9jot1oqnxzsp
If any readers were using the alternate metadata they would have submitted bug reports by now 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've sync with gang and I got the info that Parquet-Java doesn't write that and doesn't impl doesn't read that, let's rush forward
Can you please retarget this PR to the |
let map_offset = |x| x - src_offset + write_offset as i64; | ||
let mut builder = ColumnChunkMetaData::builder(metadata.column_descr_ptr()) | ||
.set_compression(metadata.compression()) | ||
.set_encodings(metadata.encodings().clone()) | ||
.set_file_offset(file_offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below in line 677 the metadata is still written to self.buf
, should that also be skipped? And maybe the whole PageWriter::write_metadata
method could be removed or deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, maybe this needs to be made clearer in the documentation. The spec still requires the field to be written so old readers don't break (i.e. it is still marked required
in the thrift IDL). The correct behavior now is to write a 0 here.
Oh, I misunderstood your comment...the write below is correct as this is writing the copy of the ColumnMetaData
in the footer. The change in this PR is to not also write a copy of ColumnMetaData
at the end of each column chunk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the write below is correct as this is writing the copy of the ColumnMetaData in the footer
The code is hard to follow, but my understanding is that this method copies an existing ColumnChunk
from a ChunkReader
. The buf
seems to be the same that page data is written too, and I assume that append_column
could be called multiple times . The metadata seems to also get collected inside the on_close
closure and will in the end be written by SerializedFileWriter::write_metadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that this code:
close.metadata = builder.build()?;
Sets the metadata that will be eventually written into the file footer.
This code, perhaps what @jhorstmann is referring to (the line numbers appear to have changed):
SerializedPageWriter::new(self.buf).write_metadata(&metadata)?;
Also seems to write the contents of close.metadata
into the page.
In this case, it does seem like it could be avoided too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, I had somehow convinced myself that this method was in the normal write path and that the write in question was necessary :( I think I need to add a check that file_offset
is 0 to several tests. And also check for gaps between column chunks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that rippled a bit. 😅 Let me know if this is ok, or if write_metadata
should be deprecated first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing write_metadata looks good to me
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Does this pr remain un-related patches? I found this is hard to review 🤔 |
Eek, I'm not quite sure what happened...let me try to merge with master and see if that clears things up. Thanks @mapleFU for pointing this out! Ok, it's fixed now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl and @jhorstmann -- I think this looks good to me. I'll leave this PR open for another day or two and then I plan to merge it
/// Byte offset in `file_path()`. | ||
/// Byte offset of `ColumnMetaData` in `file_path()`. | ||
/// | ||
/// Note that the meaning of this field has been inconsistent between implementations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let map_offset = |x| x - src_offset + write_offset as i64; | ||
let mut builder = ColumnChunkMetaData::builder(metadata.column_descr_ptr()) | ||
.set_compression(metadata.compression()) | ||
.set_encodings(metadata.encodings().clone()) | ||
.set_file_offset(file_offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing write_metadata looks good to me
Thanks again 🚀 |
Which issue does this PR close?
Closes #6115.
Rationale for this change
This metadata is now deprecated.
What changes are included in this PR?
Leaves
file_offset
as 0 and deprecates setting the field. Also stops writing an extra copy ofColumnMetaData
.Are there any user-facing changes?
Yes, since some metadata is no longer written (but considering it was incorrect, likely it is not widely used). Also deprecates
ColumnChunkMetaDataBuilder::set_file_offset()
.