Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move ParquetMetadataWriter to its own module, update documentation #6202

Merged
merged 8 commits into from
Aug 13, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Aug 6, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

@AdrianB added ParquetMetadataWriter in #6197 but the documentation was on ThriftMetadataWriter.

I also think this is a very important API and thus deserves special attention to documentation

What changes are included in this PR?

This PR sdds some documentation / examples (which also ensure
the code compiles and can be used by crate users), and makes a few tweaks:

  1. remove unused ParquetMetadataWriter::write_page_index (the page index is written based on its presence)
  2. Changes ParquetMetadataWriter::finish(&mut self) to ParquetMetadataWriter::finish(mut self) so it can be used without being mut

note I did the changes in two commits so it is clearer what I changed.

  • the first commit just moves the code with no changes
  • the second commit adds docs, examples, does some test cleanup, etc

Are there any user-facing changes?

Yes there are some API changes in an as yet unreleased API

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 6, 2024
@@ -730,281 +729,6 @@ impl<'a, W: Write + Send> PageWriter for SerializedPageWriter<'a, W> {
}
}

/// Writes `crate::file::metadata` structures to a thrift encdoded byte streams
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was mostly moved without modification - see the second commit to see what changed

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! Just a few minor suggestions/questions.

parquet/src/file/metadata/writer.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata/writer.rs Show resolved Hide resolved
parquet/src/file/metadata/writer.rs Show resolved Hide resolved
parquet/src/file/metadata/writer.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata/writer.rs Outdated Show resolved Hide resolved
///
/// [`FileMetaData`]: crate::format::FileMetaData
///
/// ```text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ascii art will get interesting when the new footer extension work begins 😅.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ready!

/// │┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ │ FileMetadata
/// │ OffsetIndex │ contains embedded
/// ││ (Optional) │◀┼ ─ │ offsets to
/// │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │ │ ColumnIndex and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do Bloom filters go? I've never used them, so genuinely don't know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bloom filters can theoretically be written anywhere, in practice they're either written after each row groups or just before the column index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact @progval added the ability to write after each row group in this release: #5860

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to explain more how you would write BloomFilters in d51c697. Any corrections / comments would be appreciated

/// Note: this writer does not directly write BloomFilters. In order to write
/// BloomFilters, write the bloom filters into the buffer before creating the
/// metadata writer. Then set the corresponding `bloom_filter_offset` and
/// `bloom_filter_length` on [`ColumnChunkMetaData`] passed to this writer.

parquet/src/file/metadata/writer.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last nit. Looks ready to ship 📦

parquet/src/file/metadata/writer.rs Outdated Show resolved Hide resolved
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@alamb alamb merged commit 468a564 into apache:master Aug 13, 2024
16 checks passed
@alamb
Copy link
Contributor Author

alamb commented Aug 13, 2024

Thanks again @etseidl for the review

@alamb alamb deleted the alamb/make_writer_pub branch August 13, 2024 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants