Concatenate parquet files without deserializing? #1711

Closed
wjones127 opened this issue May 18, 2022 · 2 comments
Labels: enhancement

wjones127 commented May 18, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This is a random idea, but it seems like it would be valuable to be able to concatenate parquet files without deserializing to Arrow and re-serializing back to Parquet. I'm not 100% sure it's possible, but in theory it seems like you should be able to just copy the row group buffers and then update the offsets in the row group metadata in the footer.

You can only do this if the schemas match, of course.
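To make the idea concrete, here is a minimal sketch of the byte-level approach. Note that `Footer`, `read_footer`, `data_len`, `shift_row_group_offsets`, and `write_footer` are hypothetical stand-ins for the thrift metadata handling, not real parquet crate APIs:

```rust
use std::fs::File;
use std::io::{copy, Read, Seek, SeekFrom, Write};

// Sketch only: `Footer`, `read_footer`, `shift_row_group_offsets`, and
// `write_footer` are hypothetical stand-ins for the thrift FileMetaData
// handling; they are not real parquet crate APIs.
fn concat_parquet(inputs: &[&str], output: &str) -> std::io::Result<()> {
    let mut out = File::create(output)?;
    out.write_all(b"PAR1")?; // parquet magic header

    let mut merged_row_groups = Vec::new();
    for path in inputs {
        let mut f = File::open(path)?;
        let footer = read_footer(&mut f)?; // hypothetical: parse the thrift footer
        // Byte offset in the output where this file's row group data will start.
        let base = out.stream_position()?;
        // Copy the raw row group bytes (everything between the 4-byte header
        // magic and the footer) without decoding any pages.
        f.seek(SeekFrom::Start(4))?;
        copy(&mut f.take(footer.data_len()), &mut out)?;
        // Shift each column chunk's data/dictionary page offsets by (base - 4)
        // so they point at the copied bytes in the output file.
        merged_row_groups.extend(shift_row_group_offsets(footer.row_groups, base - 4));
    }

    // Hypothetical: serialize the merged thrift metadata, then write the
    // footer length and the trailing "PAR1" magic.
    write_footer(&mut out, merged_row_groups)?;
    Ok(())
}
```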

Describe the solution you'd like

If this is indeed possible, then some function like this (apologies, my Rust interface design isn't great yet):

fn merge_files(readers: Vec<SerializedFileReader>, writer: impl FileWriter) -> Result<()>;
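
For illustration, usage might look something like the following. This is hypothetical, since `merge_files` does not exist; `paths` and `writer` are assumed to be in scope:

```rust
// Hypothetical usage of the proposed (non-existent) merge_files:
let readers: Vec<SerializedFileReader<File>> = paths
    .iter()
    .map(|p| SerializedFileReader::new(File::open(p).unwrap()).unwrap())
    .collect();
merge_files(readers, writer)?; // `writer` targets the compacted output file
```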

Describe alternatives you've considered

The obvious alternative is to simply read as Arrow, concatenate, and then serialize back, but reading and writing parquet is famously compute intensive, so it would be nice if we could avoid that.
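
For reference, that alternative looks roughly like the following with the parquet crate's Arrow APIs (a sketch against recent versions, with minimal error handling; it pays the full decode/re-encode cost):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

// Decode-and-re-encode concatenation: simple and correct (assuming matching
// schemas), but pays the full decompression, decoding, and re-encoding cost.
fn concat_via_arrow(inputs: &[&str], output: &str) -> parquet::errors::Result<()> {
    // Take the Arrow schema from the first input; all inputs must match.
    let schema = ParquetRecordBatchReaderBuilder::try_new(File::open(inputs[0])?)?
        .schema()
        .clone();
    let mut writer = ArrowWriter::try_new(File::create(output)?, schema, None)?;
    for path in inputs {
        let reader = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
        for batch in reader {
            writer.write(&batch?)?;
        }
    }
    writer.close()?;
    Ok(())
}
```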

Additional context

Concatenating parquet files is a common operation in Delta Lake tables, which may initially write out many small files that later need to be merged for better read performance. See delta-io/delta-rs#98.

wjones127 added the enhancement label May 18, 2022
tustvold commented

This sounds like a good idea to me, and could possibly feed into some sort of story for parallel writing 👍

It is probably worth highlighting, though, that whilst merging parquet files without rewriting the row groups will reduce the IO required to fetch them from object storage, along with any catalog overheads, it likely won't help with the CPU-bound portion of actually decoding the bytes, nor with compression, since the original (small) row groups and their encodings are preserved as-is.

tustvold commented Jun 1, 2023

Closed by #4269

tustvold closed this as completed Jun 1, 2023