feat: support async writer (#1269) #3957

Merged
merged 4 commits into from
Mar 28, 2023

ShiKaiWi
Member

@ShiKaiWi ShiKaiWi commented Mar 27, 2023

Which issue does this PR close?

Closes #1269.

Rationale for this change

Currently, the arrow writer has no async API. If the underlying storage's API is async, the arrow writer is difficult to use: the workaround is to collect all the bytes in memory and then feed them to the storage through its async API, which can lead to high memory consumption when the final parquet file is large.

So we need an async API for the arrow writer, so that callers can easily integrate an async underlying storage, e.g. an object store.

What changes are included in this PR?

  • Implement the async arrow writer on top of the sync arrow writer.
  • The inner buffer of the async arrow writer can be configured with an option called buffer_flush_threshold, which lets the caller control memory usage.
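
To make the role of buffer_flush_threshold concrete, here is a minimal sketch of threshold-based buffering. It is not the code in this PR: `ThresholdBuffer`, `flush_buffer` and `flush_count` are hypothetical names, the sink is a synchronous `std::io::Write` so the example stays dependency-free (in the async writer the flush would be an awaited write to the async storage), and flushing when the buffer length reaches the threshold is an assumption about the exact comparison.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the async writer's inner buffer.
/// Encoded bytes pile up in `buffer` and are handed to `sink`
/// once `buffer_flush_threshold` bytes have accumulated.
struct ThresholdBuffer<W: Write> {
    sink: W,
    buffer: Vec<u8>,
    buffer_flush_threshold: usize,
    /// Number of flushes performed, kept only for illustration.
    flush_count: usize,
}

impl<W: Write> ThresholdBuffer<W> {
    fn new(sink: W, buffer_flush_threshold: usize) -> Self {
        Self {
            sink,
            buffer: Vec::new(),
            buffer_flush_threshold,
            flush_count: 0,
        }
    }

    /// Append encoded bytes, then flush if the threshold is reached
    /// (assumed here to be `>=`).
    fn write(&mut self, encoded: &[u8]) -> io::Result<()> {
        self.buffer.extend_from_slice(encoded);
        if self.buffer.len() >= self.buffer_flush_threshold {
            self.flush_buffer()?;
        }
        Ok(())
    }

    /// Hand the buffered bytes to the sink and release them.
    fn flush_buffer(&mut self) -> io::Result<()> {
        if self.buffer.is_empty() {
            return Ok(());
        }
        self.sink.write_all(&self.buffer)?;
        self.buffer.clear();
        self.flush_count += 1;
        Ok(())
    }

    /// Flush whatever remains and return the sink.
    fn close(mut self) -> io::Result<W> {
        self.flush_buffer()?;
        self.sink.flush()?;
        Ok(self.sink)
    }
}
```

With a threshold of 8 bytes, writing 3 bytes leaves them buffered, and writing 6 more triggers a flush. Memory held by the writer therefore stays near the threshold plus the size of one write, rather than growing to the size of the whole file as in the collect-everything workaround.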

Are there any user-facing changes?

A new async API for the arrow writer: AsyncArrowWriter.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 27, 2023
@ShiKaiWi
Member Author

@tustvold PTAL

Contributor

@tustvold tustvold left a comment

Thank you, looks good to me 👍

@ShiKaiWi
Member Author

The CI is broken; I will fix it.

Development

Successfully merging this pull request may close these issues.

Provide an async ParquetWriter for arrow
2 participants