
POC: Add ParquetMetaDataReader #6392

Draft · wants to merge 27 commits into base: master

Conversation

@etseidl (Contributor) commented Sep 13, 2024

Which issue does this PR close?

Relates to #6002

Rationale for this change

This is an attempt to consolidate Parquet footer/page index reading/parsing into a single place.

What changes are included in this PR?

The new ParquetMetaDataReader basically takes the code in parquet/src/file/footer.rs and parquet/src/arrow/async_reader/metadata.rs and mashes them together into a single API. Using this, the read_metadata_from_file call from #6081 would become:

fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
    let mut reader = ParquetMetaDataReader::new()
        .with_page_indexes(true);
    let file = std::fs::File::open(file).unwrap();
    reader.try_parse(&file).unwrap();
    // return ParquetMetaData with page indexes populated
    reader.finish().unwrap()
}

Also included are two async functions try_load() and try_load_from_tail(). The former is a combination of MetadataLoader::load() and MetadataLoader::load_page_index. The latter is an attempt at addressing the issue of loading the footer when the file size is not known, so it requires being able to seek from the end of the file.
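
For comparison, the async path would look something like this (a rough sketch; the exact try_load signature in this POC may differ, and the fetch/file_size arguments here are assumptions):

async fn load_metadata<F: MetadataFetch>(fetch: F, file_size: usize) -> ParquetMetaData {
    let mut reader = ParquetMetaDataReader::new()
        .with_page_indexes(true);
    // one fetch for the footer (plus any prefetch), possibly more for the page indexes
    reader.try_load(fetch, file_size).await.unwrap();
    reader.finish().unwrap()
}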

This implementation is very rough; it is still light on safety checks and documentation. At this point I'm hoping for feedback on the approach. If this seems at all useful, then a path forward would be to first add ParquetMetaDataReader alone, and then in subsequent PRs begin to use it as a replacement for other functions, which could then be deprecated. The idea is to get as much in as possible without breaking changes, and then introduce the breaking changes once 54.0.0 is open.

Are there any user-facing changes?

Eventually, yes.

@github-actions bot added the parquet (Changes to the parquet crate) label Sep 13, 2024
@etseidl (Contributor, Author) commented Sep 13, 2024

@alamb @adriangb I'm hoping you'll have time to take a look and see if this is a step in the right direction.

@alamb (Contributor) commented Sep 15, 2024

Thank you @etseidl, this looks amazing. I hope to give it a look over the next few days.

@adriangb (Contributor) left a comment

This generally looks great to me! The new API with the sync / async method is perfect. Amazing work!

The TODO about loading from a byte range that contains only the metadata is the one piece missing for this to satisfy our use case, but we can easily work around that by passing in a ChunkReader that lies about the offsets it's loading from and errors if you try to access an offset outside of its range.
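
For illustration, such a wrapper might look roughly like this (a hypothetical sketch; TailReader and its details are not part of this PR), assuming the buffer holds the tail of a file whose total length is known:

use bytes::{Buf, Bytes};
use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{ChunkReader, Length};

/// Hypothetical wrapper holding only the last data.len() bytes of a
/// file that is file_size bytes long overall.
struct TailReader {
    data: Bytes,
    file_size: u64,
}

impl TailReader {
    // Translate an absolute file offset into an offset within `data`,
    // erroring for offsets before the buffered tail
    fn buffer_offset(&self, start: u64) -> Result<usize> {
        let buf_start = self.file_size - self.data.len() as u64;
        start
            .checked_sub(buf_start)
            .map(|off| off as usize)
            .ok_or_else(|| ParquetError::General(format!("offset {start} is not buffered")))
    }
}

impl Length for TailReader {
    fn len(&self) -> u64 {
        self.file_size // report the full file length, not the buffer length
    }
}

impl ChunkReader for TailReader {
    type T = bytes::buf::Reader<Bytes>;

    fn get_read(&self, start: u64) -> Result<Self::T> {
        Ok(self.data.slice(self.buffer_offset(start)?..).reader())
    }

    fn get_bytes(&self, start: u64, length: usize) -> Result<Bytes> {
        let offset = self.buffer_offset(start)?;
        Ok(self.data.slice(offset..offset + length))
    }
}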

@etseidl (Contributor, Author) commented Sep 17, 2024

Thanks @adriangb. I've added the functionality from the TODO (try_parse_range()). Will this work for you?
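
For the speculative-read case that would look something like this (a sketch; tail_bytes, n, and file_len are illustrative names, and the exact signature may differ):

// tail_bytes: Bytes holding the last n bytes of a file of length file_len
let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
reader.try_parse_range(&tail_bytes, file_len - n..file_len)?;
let metadata = reader.finish()?;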

@adriangb (Contributor) commented Sep 17, 2024

Amazing speed, yes that's perfect!

@Xuanwo (Member) left a comment

Thank you very much for working on this. Here are some suggestions from my side.

prefetch_hint: Option<usize>,
}

impl Default for ParquetMetaDataReader {

Member:

Seems we can derive(Default) and use Self::default() in fn new()?
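
i.e., something like (a sketch):

#[derive(Default)]
pub struct ParquetMetaDataReader {
    // ... other fields ...
    prefetch_hint: Option<usize>,
}

impl ParquetMetaDataReader {
    pub fn new() -> Self {
        Self::default()
    }
}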


Contributor Author:

Of course. Done.

Comment on lines 164 to 167
if self.metadata.is_none() {
return Err(general_err!("could not parse parquet metadata"));
}
Ok(self.metadata.take().unwrap())

Member:

Hi, I know the unwrap() here is safe, but what about replacing it with

self.metadata
    .take()
    .ok_or_else(|| general_err!("could not parse parquet metadata"))

I believe it can eliminate an extra check and improve readability.


Contributor Author:

Thanks, still learning Rust idioms :)


// Get bounds needed for page indexes (if any are present in the file).
let range = self.range_for_page_index();
let range = match range {

Member:

How about using:

let Some(range) = self.range_for_page_index() else {
    return Ok(());
};

/// least two fetches, regardless of the value of `prefetch_hint`, if the page indexes are
/// requested.
#[cfg(feature = "async")]
pub async fn try_load_from_tail<R: AsyncFileReader + AsyncRead + AsyncSeek + Unpin + Send>(

Member:

Hi, requiring an additional bound of AsyncRead + AsyncSeek is a bit confusing. Could you provide more context?


Contributor Author:

I'm echoing the bounds from here. AsyncFileReader isn't really necessary, though.

@Xuanwo (Member) commented Sep 18, 2024

I'm not sure if this design is correct. How about removing this API until we resolve #6157? We can bring this API back while we have native suffix read support.


Contributor Author:

Ok, removed. TBH I don't fully understand the issue in #6157 and thought the approach in AsyncFileReader::get_metadata could be an alternative solution.

@alamb (Contributor) left a comment

Thank you very much @etseidl -- I basically agree with @adriangb and @Xuanwo that this is wonderful work

This is an attempt to consolidate Parquet footer/page index reading/parsing into a single place.

It is amazing

This implementation is very rough, with not enough safety checking and documentation. At this point I'm hoping for feedback on the approach.

My feedback is it is great -- thank you to @adriangb and @Xuanwo for their earlier feedback.

If this seems at all useful, then a path forward would be to first add ParquetMetaDataReader alone, and then in subsequent PRs begin to use it as a replacement for other functions which could then be deprecated. The idea is to get as much in without breaking changes, and then introduce the breaking changes once 54.0.0 is open.

I think this plan makes a lot of sense (and I think we can avoid most breaking changes -- deprecating is not a breaking change, from my perspective)

@@ -61,6 +66,8 @@ impl std::fmt::Display for ParquetError {
write!(fmt, "Index {index} out of bound: {bound}")
}
ParquetError::External(e) => write!(fmt, "External: {e}"),
ParquetError::NeedMoreData(needed) => write!(fmt, "NeedMoreData: {needed}"),

Contributor:

A nitpick is that these seem pretty similar. I wonder if it would make sense to combine them somehow 🤔


Contributor Author:

I wasn't intending to add two; it just turned out that way. I could make the second usize optional, with the understanding that a range is being requested.

Also, does adding to the enum make this a breaking change? If so, I could go back to my tortured use of IndexOutOfBound until it's open season on breaking changes.
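
Something like this, perhaps (a hypothetical sketch of the combined variant):

// NeedMoreData(n, None) would request n more bytes from the end of the file;
// NeedMoreData(start, Some(end)) would request the absolute range start..end
NeedMoreData(usize, Option<usize>),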


Contributor:

Also, does adding to the enum make this a breaking change? If so, I could go back to my tortured use of IndexOutOfBound until it's open season on breaking changes.

Yes, unfortunately, it does make it a breaking change

https://github.com/apache/arrow-rs/blob/master/parquet/src/errors.rs#L29

We should probably mark the error type as "non exhaustive", which would make adding variants a non-breaking change in the future
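
For example (a sketch of the standard Rust attribute applied to the existing enum):

// downstream crates can no longer match exhaustively, so adding a
// variant later stops being a breaking change
#[non_exhaustive]
pub enum ParquetError {
    // ... existing variants ...
}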

@@ -237,8 +236,10 @@ where
Fut: Future<Output = Result<Bytes>> + Send,
{
let fetch = MetadataFetchFn(fetch);
let loader = MetadataLoader::load(fetch, file_size, prefetch).await?;
Ok(loader.finish())
// TODO(ets): should add option to read page index to this function

Contributor:

An alternative perhaps would be to deprecate fetch_parquet_metadata entirely and suggest people use ParquetMetaDataReader, which is more complete and full featured -- I think we could deprecate this function in a minor release (we can't remove it until a major release)
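
For example (a sketch; the note text is illustrative and the real signature would be unchanged):

// Existing callers keep compiling but get a deprecation warning
#[deprecated(note = "use ParquetMetaDataReader instead")]
pub async fn fetch_parquet_metadata(/* existing arguments unchanged */) {
    // existing implementation unchanged
}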

/// arguments).
///
/// [Page Index]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
pub fn parquet_metadata_from_file<R: ChunkReader>(

Contributor:

A nitpick is maybe we can call this `parquet_metadata_from_reader`.

Also, I wonder if instead of a new API it would make sense to always use ParquetMetaDataReader directly. That would certainly be more verbose, but it also might be more explicit.

For the common case where the wrapping code won't retry (aka all the callsites of parquet_metadata_from_file), we could also add some sort of consuming API that combines try_parse and finish to make it less verbose. Something like

    let metadata = ParquetMetaDataReader::new()
        .with_column_indexes(column_index)
        .with_offset_indexes(offset_index)
        .parse(file)?;


Contributor Author:

Yes, that seems reasonable. And yes, I struggle with naming things 😄.

}

/// Same as [`Self::try_parse()`], but only `file_range` bytes of the original file are
/// available. `file_range.end` must point to the end of the file.

Contributor:

I was a little confused about how file_range works in this case (given that it seems to me that ChunkReader would in theory allow reading arbitrary ranges)

Is the idea that try_parse_range limits the requests to the reader so they are only within file_range?


Contributor Author:

It's for the case of impl ChunkReader for Bytes. Say you have a 2000 byte file and you've speculatively read the last 1000 bytes into a buffer, and let's say that's actually sufficient for the page indexes. But the locations of the page indexes are absolute (in this case they're in the range 1100..1500), so you can't actually find the indexes in the buffer unless you know the file offset of the beginning of the buffer (trying to seek to 1100 in a 1000 byte buffer clearly won't work...you need to subtract the buffer's start offset of 1000 from the absolute offsets and read 100..500 from the buffer).

I also wanted different error behavior for the File vs Bytes use cases. If a File is passed, there's no sense in asking for a larger file; if the metadata can't be read there's either some I/O error going on or a corrupted file.

Perhaps we could instead just pass in the file size, while still mandating that if Bytes are passed they must include the footer. We could then infer the range as file_size-reader.len()..file_size, and then errors could simply return the number of bytes needed from the tail of the file. This would perhaps also solve the issue above of having two new and very similar error types.
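
Concretely, something like this (a hypothetical sketch of a method inside ParquetMetaDataReader; the name try_parse_sized is illustrative):

/// Sketch: `reader` is assumed to hold the *last* `reader.len()` bytes of a
/// file that is `file_size` bytes long in total (a full File also satisfies this).
fn try_parse_sized<R: ChunkReader>(&mut self, reader: &R, file_size: usize) -> Result<()> {
    // infer the absolute range covered by the reader from its length
    let range = file_size - reader.len() as usize..file_size;
    self.try_parse_range(reader, range)
}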

}

/// Attempts to (asynchronously) parse the footer metadata (and optionally page indexes)
/// given a [`MetadataFetch`]. The file size must be known to use this function.

Contributor:

It might also be good to note here that try_load will attempt to minimize the number of calls to fetch by prefetching, but may potentially make multiple requests depending on how the data is laid out.

As an aside (and not changed in this PR), I found the use of MetadataFetch as basically an async version of ChunkReader confusing when trying to understand this API


Contributor Author:

It might also be good to note here that try_load will attempt to minimize the number of calls to fetch by prefetching, but may potentially make multiple requests depending on how the data is laid out.

I can add a reference back to with_prefetch_hint where this is explained already.

As an aside (and not changed in this PR), I found the use of MetadataFetch as basically an async version of ChunkReader confusing when trying to understand this API

I'll admit to not being well versed in the subtleties of async code. And I am trying for a drop-in replacement for MetadataLoader initially. Do you think using AsyncFileReader would be cleaner/clearer?

@alamb (Contributor) commented Sep 19, 2024

The TODO about loading from a byte range that contains only the metadata is the one piece missing for this to satisfy our use case, but we can easily work around that by passing in a ChunkReader that lies about the offsets it's loading from and errors if you try to access an offset outside of its range.

@adriangb I also found the use of "offsets" confusing, as the parquet metadata has and uses offsets that are always "absolute" offsets within the overall file. Maybe we can make / update the example in #6081. Speaking of which, I will go do that now.

@etseidl (Contributor, Author) commented Sep 19, 2024

Thank you @adriangb @Xuanwo @alamb for the helpful comments 🙏. I'll incorporate your suggestions and then resubmit a PR with just the ParquetMetaDataReader. Once that's mergeworthy I can start deprecating things.

I'll keep this draft open for a time in case others want to chime in.

Labels: parquet (Changes to the parquet crate)

4 participants