-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Example of reading and writing parquet metadata outside the file #6081
base: master
Are you sure you want to change the base?
Conversation
/// Specifically it: | ||
/// 1. It reads the metadata of a Parquet file | ||
/// 2. Removes some column statistics from the metadata (to make them smaller) | ||
/// 3. Stores the metadata in a separate file | ||
/// 4. Reads the metadata from the separate file and uses that to read the Parquet file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I feel like we can simplify this example a bit. My use case is essentially along the lines of https://github.com/apache/datafusion/pull/10701/files#diff-81450b08df2ee29b3a9069865fc4f0c94883023c9d75bde729756c6bb4ec630d but instead of the metadata cache being in-memory you can imagine it's on disk (so that e.g. I can cache more metadata than would fit in memory).
Maybe this can be modeled something like:
struct KeyValueStore {
storage: HashMap<String, Vec<u8>>
}
impl KeyValueStore {
pub async fn get(&self, key: String) -> &[u8];
pub async fn set(&self, key: String, value: Vec<u8>);
}
The point being that we serialize the metadata to the key value store and then deserialize it from there, passing it into the reader instead of having the reader get it from the file itself. I don't think the editing of the metadata is necessary to get this example across.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, make sense. I agree maybe that is being overly ambitious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think reading/writing to a file is pretty similar and actually using a kv store might make it more complicated so I kept a file for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File is fine by me 😄. Maybe a comment about storing the metadata in a fast cache like Redis or in a metadata store will be enough to spark imagination?
ea603d4
to
ddd4240
Compare
/// The data is stored using the same thrift format as the Parquet file metadata | ||
fn write_metadata_to_file(metadata: ParquetMetaData, file: impl AsRef<Path>) { | ||
let file = std::fs::File::create(file).unwrap(); | ||
let writer = ParquetMetaDataWriter::new(file, &metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the new API from #6197 and it is 👨🍳 👌 very nice
/// This function reads the format written by `write_metadata_to_file` | ||
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData { | ||
let mut file = std::fs::File::open(file).unwrap(); | ||
// This API is kind of awkward compared to the writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also filled out this part of the PR showing how to read the metadata back -- it is (very) ugly compared to the nice ParquetMetadataWriter
@adriangb any interest in creating a ParquetMetadataReader
API similar to ParquetMetadataWriter
that handles these details? If so I can create a ticket / review a PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes certainly interested!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about this as a plan: #6002 (comment)
let file = std::fs::File::open(file).unwrap(); | ||
let options = ArrowReaderOptions::new() | ||
// tell the reader to read the page index | ||
.with_page_index(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also kind of akward -- it would be great if the actual reading of the parquet metadata could do this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean loading the page index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, what I was trying to get at was that since the ColumnIndex
and OffsetIndex
(aka the "Page index structures") are not store inline, decode_metadata
doesn't read them -- the logic to do so is baked into this reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to describe this more on #6002 (comment)
/// This function reads the format written by `write_metadata_to_file` | ||
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData { | ||
let mut file = std::fs::File::open(file).unwrap(); | ||
// This API is kind of awkward compared to the writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes certainly interested!
let file = std::fs::File::open(file).unwrap(); | ||
let options = ArrowReaderOptions::new() | ||
// tell the reader to read the page index | ||
.with_page_index(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean loading the page index?
Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let column_indexes = self.convert_column_indexes(); | ||
let offset_indexes = self.convert_offset_index(); | ||
|
||
let mut encoder = ThriftMetadataWriter::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would encoder
better serializer
here?
(very much a WIP)
Which issue does this PR close?
Related to #6002
Rationale for this change
To figure out a good API we need an example of what we are trying to do
What changes are included in this PR?
Are there any user-facing changes?
Not yet, just an example