Concatenate parquet files without deserializing? #1711

Closed
wjones127 opened this issue May 18, 2022 · 2 comments
Labels: enhancement

wjones127 commented May 18, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This is a random idea, but it seems like it would be valuable to be able to concatenate parquet files without deserializing to Arrow and re-serializing back to Parquet. I'm not 100% sure it's possible, but in theory it seems like you should be able to just copy the row group buffers and then update the offsets in the row group metadata in the footer.

You can only do this if the schemas match, of course.
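To make the idea concrete, here is a minimal sketch of the byte-level approach. Note that `Footer`, `read_footer`, `data_len`, `shift_row_group_offsets`, and `write_footer` are hypothetical stand-ins for the thrift metadata handling, not real parquet crate APIs:

```rust
use std::fs::File;
use std::io::{copy, Read, Seek, SeekFrom, Write};

// Sketch only: `Footer`, `read_footer`, `shift_row_group_offsets`, and
// `write_footer` are hypothetical stand-ins for the thrift FileMetaData
// handling; they are not real parquet crate APIs.
fn concat_parquet(inputs: &[&str], output: &str) -> std::io::Result<()> {
    let mut out = File::create(output)?;
    out.write_all(b"PAR1")?; // parquet magic header

    let mut merged_row_groups = Vec::new();
    for path in inputs {
        let mut f = File::open(path)?;
        let footer = read_footer(&mut f)?; // hypothetical: parse the thrift footer
        // Byte offset in the output where this file's row group data will start.
        let base = out.stream_position()?;
        // Copy the raw row group bytes (everything between the 4-byte header
        // magic and the footer) without decoding any pages.
        f.seek(SeekFrom::Start(4))?;
        copy(&mut f.take(footer.data_len()), &mut out)?;
        // Shift each column chunk's data/dictionary page offsets by (base - 4)
        // so they point at the copied bytes in the output file.
        merged_row_groups.extend(shift_row_group_offsets(footer.row_groups, base - 4));
    }

    // Hypothetical: serialize the merged thrift metadata, then write the
    // footer length and the trailing "PAR1" magic.
    write_footer(&mut out, merged_row_groups)?;
    Ok(())
}
```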

Describe the solution you'd like

If this is indeed possible, then some function like this (apologies, my Rust interface design isn't great yet):

fn merge_files(readers: Vec<SerializedFileReader>, writer: impl FileWriter) -> Result<()>;
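
For illustration, usage might look something like the following. This is hypothetical, since `merge_files` does not exist; `paths` and `writer` are assumed to be in scope:

```rust
// Hypothetical usage of the proposed (non-existent) merge_files:
let readers: Vec<SerializedFileReader<File>> = paths
    .iter()
    .map(|p| SerializedFileReader::new(File::open(p).unwrap()).unwrap())
    .collect();
merge_files(readers, writer)?; // `writer` targets the compacted output file
```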

Describe alternatives you've considered

The obvious alternative is to simply read as Arrow, concatenate, and then serialize back, but reading and writing parquet is famously compute intensive, so it would be nice if we could avoid that.
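
For reference, that alternative looks roughly like the following with the parquet crate's Arrow APIs (a sketch against recent versions, with minimal error handling; it pays the full decode/re-encode cost):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

// Decode-and-re-encode concatenation: simple and correct (assuming matching
// schemas), but pays the full decompression, decoding, and re-encoding cost.
fn concat_via_arrow(inputs: &[&str], output: &str) -> parquet::errors::Result<()> {
    // Take the Arrow schema from the first input; all inputs must match.
    let schema = ParquetRecordBatchReaderBuilder::try_new(File::open(inputs[0])?)?
        .schema()
        .clone();
    let mut writer = ArrowWriter::try_new(File::create(output)?, schema, None)?;
    for path in inputs {
        let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
        for batch in reader {
            writer.write(&batch?)?;
        }
    }
    writer.close()?;
    Ok(())
}
```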

Additional context

Concatenating parquet files is a common operation in Delta Lake tables, which may initially write out many small files that later need to be merged for better read performance. See delta-io/delta-rs#98.

wjones127 added the enhancement label May 18, 2022
tustvold commented

This sounds like a good idea to me, and could possibly feed into some sort of story for parallel writing 👍

It is probably worth highlighting, though, that whilst merging parquet files without rewriting the row groups will reduce the IO required to fetch them from object storage, along with any catalog overheads, it likely won't help with the CPU-bound portion of actually decoding the bytes, nor with compression, since the original (small) row groups and their encodings are preserved as-is.

tustvold commented Jun 1, 2023

Closed by #4269

tustvold closed this as completed Jun 1, 2023