Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example of reading and writing parquet metadata outside the file #6081

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 17, 2024

(very much a WIP)

Which issue does this PR close?

Related to #6002

Rationale for this change

To figure out a good API we need an example of what we are trying to do

What changes are included in this PR?

  1. Adds an example, with comments
  2. The example is based on my interpretation of @adriangb's description here API for encoding/decoding ParquetMetadata with more control #6002 (comment)

Are there any user-facing changes?

Not yet, just an example

Comment on lines 33 to 39
/// Specifically it:
/// 1. It reads the metadata of a Parquet file
/// 2. Removes some column statistics from the metadata (to make them smaller)
/// 3. Stores the metadata in a separate file
/// 4. Reads the metadata from the separate file and uses that to read the Parquet file
Copy link
Contributor

@adriangb adriangb Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I feel like we can simplify this example a bit. My use case is essentially along the lines of https://github.com/apache/datafusion/pull/10701/files#diff-81450b08df2ee29b3a9069865fc4f0c94883023c9d75bde729756c6bb4ec630d but instead of the metadata cache being in-memory you can imagine it's on disk (so that e.g. I can cache more metadata than would fit in memory).

Maybe this can be modeled something like:

struct KeyValueStore {
    storage: HashMap<String, Vec<u8>>
}

impl KeyValueStore {
    pub async fn get(&self, key: String) -> &[u8];
    pub async fn set(&self, key: String, value: Vec<u8>);
}

The point being that we serialize the metadata to the key value store and then deserialize it from there, passing it into the reader instead of having the reader get it from the file itself. I don't think the editing of the metadata is necessary to get this example across.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, make sense. I agree maybe that is being overly ambitious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reading/writing to a file is pretty similar and actually using a kv store might make it more complicated so I kept a file for now

Copy link
Contributor

@adriangb adriangb Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File is fine by me 😄. Maybe a comment about storing the metadata in a fast cache like Redis or in a metadata store will be enough to spark imagination?

@alamb alamb force-pushed the alamb/parquet-stats-example branch from ea603d4 to ddd4240 Compare August 6, 2024 22:10
/// The data is stored using the same thrift format as the Parquet file metadata
fn write_metadata_to_file(metadata: ParquetMetaData, file: impl AsRef<Path>) {
let file = std::fs::File::create(file).unwrap();
let writer = ParquetMetaDataWriter::new(file, &metadata);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new API from #6197 and it is 👨‍🍳 👌 very nice

/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
let mut file = std::fs::File::open(file).unwrap();
// This API is kind of awkward compared to the writer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also filled out this part of the PR showing how to read the metadata back -- it is (very) ugly compared to the nice ParquetMetadataWriter

@adriangb any interest in creating a ParquetMetadataReader API similar to ParquetMetadataWriter that handles these details? If so I can create a ticket / review a PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes certainly interested!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this as a plan: #6002 (comment)

let file = std::fs::File::open(file).unwrap();
let options = ArrowReaderOptions::new()
// tell the reader to read the page index
.with_page_index(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also kind of akward -- it would be great if the actual reading of the parquet metadata could do this...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean loading the page index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what I was trying to get at was that since the ColumnIndex and OffsetIndex (aka the "Page index structures") are not store inline, decode_metadata doesn't read them -- the logic to do so is baked into this reader

https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to describe this more on #6002 (comment)

parquet/examples/external_metadata.rs Outdated Show resolved Hide resolved
/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
let mut file = std::fs::File::open(file).unwrap();
// This API is kind of awkward compared to the writer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes certainly interested!

let file = std::fs::File::open(file).unwrap();
let options = ArrowReaderOptions::new()
// tell the reader to read the page index
.with_page_index(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean loading the page index?

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let column_indexes = self.convert_column_indexes();
let offset_indexes = self.convert_offset_index();

let mut encoder = ThriftMetadataWriter::new(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would encoder better serializer here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants