Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async writer tweaks #3967

Merged
merged 2 commits into from
Mar 30, 2023
Merged

Async writer tweaks #3967

merged 2 commits into from
Mar 30, 2023

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

More consistent buffer handling

What changes are included in this PR?

This makes two relatively minor tweaks to the AsyncArrowWriter added in #3957 by @ShiKaiWi:

  • Uses a futures::Mutex to avoid needing to take and replace the SharedBuffer
  • Explicit sizing of intermediate buffer, with eager allocation, to avoid expensive bump allocation

Are there any user-facing changes?

No changes to released APIs

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 28, 2023
@tustvold
Copy link
Contributor Author

@ShiKaiWi perhaps you could take a look and let me know what you think

props: Option<WriterProperties>,
) -> Result<Self> {
let shared_buffer = SharedBuffer::default();
let shared_buffer = SharedBuffer::new(buffer_size);
Copy link
Contributor Author

@tustvold tustvold Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the major motivation for this PR, being able to avoid bump allocation where the Vec is repeatedly resized is important for performance

Copy link
Member

@ShiKaiWi ShiKaiWi Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in the #3957, buffer_flush_threshold is designed to be able to be usize::MAX in order to let the async writer not flush until all the encoding work is done. And for this reason, the buffer can't be pre-allocated at initialization.

And now I think it looks good here because of its efficiency, and it may be a fake feature to let the writer do flush only when all encoded bytes are ready. 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fake feature to let the writer do flush only when all encoded bytes are read

Yeah, at that point you might as well just use the sync writer 😅


Ok(metadata)
}

/// Flush the data in the [`SharedBuffer`] into the `async_writer` if its size
/// exceeds the threshold.
async fn try_flush(
shared_buffer: &SharedBuffer,
shared_buffer: &mut SharedBuffer,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mutable reference isn't technically required here, but acts as a lint that shared_buffer shouldn't be shared

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we actually just remove the Mutex entirely? Hold a Arc<SharedBuffer> and use Arc::get_mut to grab a mutable reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc::get_mut only works if there are no other Arc references, which in this case wouldn't be the case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right the async writer would also need a reference. I suppose you could hold an Arc<Vec<u8>> in the the async writer and then have SharedBuffer hold a Weak<Vec<u8>>. Not sure that would end up pencilling out just to remove an uncontended mutex lock though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of try_lock here boils down to much the same thing - https://docs.rs/futures-util/0.3.27/src/futures_util/lock/mutex.rs.html#103

@ShiKaiWi
Copy link
Member

ShiKaiWi commented Mar 28, 2023

@ShiKaiWi perhaps you could take a look and let me know what you think

This PR looks fairly pretty for me. Learns a lot from it.

@tustvold tustvold merged commit 9eb3490 into apache:master Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants