Async writer tweaks #3967

tustvold · 2023-03-28T11:22:11Z

Which issue does this PR close?

Closes #.

Rationale for this change

More consistent buffer handling

What changes are included in this PR?

This makes two relatively minor tweaks to the AsyncArrowWriter added in #3957 by @ShiKaiWi:

- Uses a futures::Mutex to avoid needing to take and replace the SharedBuffer
- Explicit sizing of intermediate buffer, with eager allocation, to avoid expensive bump allocation
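
As a rough illustration of the second point, the sketch below uses only `std` and is not code from this PR. The `count_reallocations` helper is hypothetical. It counts how often a `Vec<u8>` changes capacity while chunks are appended, comparing a buffer grown from empty with one sized eagerly up front.

```rust
/// Hypothetical helper (not part of the parquet crate): append `chunks`
/// chunks of `chunk_len` bytes to `buf` and count how many times the
/// Vec's capacity changed, i.e. how many times it had to grow.
fn count_reallocations(mut buf: Vec<u8>, chunks: usize, chunk_len: usize) -> usize {
    let chunk = vec![0u8; chunk_len];
    let mut reallocations = 0;
    let mut capacity = buf.capacity();
    for _ in 0..chunks {
        buf.extend_from_slice(&chunk);
        if buf.capacity() != capacity {
            reallocations += 1;
            capacity = buf.capacity();
        }
    }
    reallocations
}

fn main() {
    let (chunks, chunk_len) = (1024, 1024);
    // Starts with zero capacity and grows on demand.
    let grown = count_reallocations(Vec::new(), chunks, chunk_len);
    // Allocated once, eagerly, with room for every chunk.
    let presized =
        count_reallocations(Vec::with_capacity(chunks * chunk_len), chunks, chunk_len);
    println!("grown from empty: {grown} growth steps, pre-sized: {presized}");
}
```

The pre-sized buffer never grows because `Vec::with_capacity` guarantees room for at least the requested number of bytes. The exact number of growth steps for the empty buffer depends on the standard library's growth strategy.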

Are there any user-facing changes?

No changes to released APIs

tustvold · 2023-03-28T11:22:37Z

@ShiKaiWi perhaps you could take a look and let me know what you think

tustvold · 2023-03-28T11:23:13Z

parquet/src/arrow/async_writer/mod.rs

        props: Option<WriterProperties>,
    ) -> Result<Self> {
-        let shared_buffer = SharedBuffer::default();
+        let shared_buffer = SharedBuffer::new(buffer_size);


This is the major motivation for this PR: avoiding bump allocation, where the Vec is repeatedly resized, is important for performance.

Actually, in #3957, buffer_flush_threshold was designed to allow usize::MAX, so that the async writer does not flush until all the encoding work is done. For this reason, the buffer couldn't be pre-allocated at initialization.

Now I think the change here looks good because of its efficiency, and it may be a fake feature to let the writer do flush only when all encoded bytes are ready. 😆

> fake feature to let the writer do flush only when all encoded bytes are ready

Yeah, at that point you might as well just use the sync writer 😅
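
To make the threshold discussion concrete, here is a simplified, synchronous sketch of flush-on-threshold logic. It is an assumption-laden stand-in rather than the crate's implementation: the real `try_flush` is async and works on a `SharedBuffer`, whereas this version uses a plain `Vec<u8>` and `std::io::Write`, and its signature is hypothetical.

```rust
use std::io::Write;

/// Hypothetical, synchronous stand-in for flush-on-threshold: drain `buffer`
/// into `sink` only once it holds at least `threshold` bytes.
/// Returns whether a flush happened.
fn try_flush<W: Write>(
    buffer: &mut Vec<u8>,
    sink: &mut W,
    threshold: usize,
) -> std::io::Result<bool> {
    if buffer.len() < threshold {
        return Ok(false);
    }
    sink.write_all(buffer.as_slice())?;
    sink.flush()?;
    // `clear` keeps the allocation, so the eagerly sized capacity is reused.
    buffer.clear();
    Ok(true)
}

fn main() -> std::io::Result<()> {
    let threshold = 8;
    let mut buffer = Vec::with_capacity(threshold * 2);
    let mut sink: Vec<u8> = Vec::new();
    let chunks: [&[u8]; 3] = [b"abc", b"defgh", b"ij"];
    for chunk in chunks {
        buffer.extend_from_slice(chunk);
        let flushed = try_flush(&mut buffer, &mut sink, threshold)?;
        println!("buffered {} bytes, flushed: {flushed}", buffer.len());
    }
    Ok(())
}
```

With a threshold of `usize::MAX` this function never flushes while data is being written, which is the "flush only when everything is encoded" behaviour discussed above.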

tustvold · 2023-03-28T11:24:42Z

parquet/src/arrow/async_writer/mod.rs


        Ok(metadata)
    }

    /// Flush the data in the [`SharedBuffer`] into the `async_writer` if its size
    /// exceeds the threshold.
    async fn try_flush(
-        shared_buffer: &SharedBuffer,
+        shared_buffer: &mut SharedBuffer,


A mutable reference isn't technically required here, but it acts as a lint that shared_buffer shouldn't be shared.

Could we actually just remove the Mutex entirely? Hold an Arc<SharedBuffer> and use Arc::get_mut to grab a mutable reference.

Arc::get_mut only works if there are no other Arc references, which wouldn't be the case here.
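
The behaviour described here can be checked with a small `std`-only sketch. The `can_mutate` helper and the `writer_handle` name are hypothetical and only illustrate the point; they are not taken from the PR.

```rust
use std::sync::Arc;

/// Hypothetical helper: reports whether `Arc::get_mut` would hand out a
/// mutable reference, which it does only when this is the sole reference.
fn can_mutate(buffer: &mut Arc<Vec<u8>>) -> bool {
    Arc::get_mut(buffer).is_some()
}

fn main() {
    let mut buffer = Arc::new(Vec::<u8>::new());
    // Sole owner: a mutable reference is available.
    assert!(can_mutate(&mut buffer));

    // A second handle, standing in for the one held by the async writer.
    let writer_handle = Arc::clone(&buffer);
    // Now shared: get_mut returns None.
    assert!(!can_mutate(&mut buffer));

    // Once the other handle is gone, mutation works again.
    drop(writer_handle);
    Arc::get_mut(&mut buffer).unwrap().extend_from_slice(b"ok");
    assert_eq!(buffer.len(), 2);
}
```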

Ah right, the async writer would also need a reference. I suppose you could hold an Arc<Vec<u8>> in the async writer and then have SharedBuffer hold a Weak<Vec<u8>>. Not sure that would end up pencilling out just to remove an uncontended mutex lock though.

The use of try_lock here boils down to much the same thing - https://docs.rs/futures-util/0.3.27/src/futures_util/lock/mutex.rs.html#103
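
A minimal sketch of the uncontended `try_lock` pattern follows. It deliberately uses `std::sync::Mutex` as a stand-in so that it runs without extra dependencies; the PR itself uses the futures mutex, whose `try_lock` returns an `Option` rather than a `Result`. The `SharedBuffer` shape and the `try_append` method shown here are assumptions for illustration, not the crate's actual definition.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the shared buffer: a byte Vec behind a mutex,
/// with one handle for the encoder side and one for the writer side.
#[derive(Clone)]
struct SharedBuffer {
    inner: Arc<Mutex<Vec<u8>>>,
}

impl SharedBuffer {
    /// Allocate the full capacity eagerly.
    fn new(capacity: usize) -> Self {
        Self {
            inner: Arc::new(Mutex::new(Vec::with_capacity(capacity))),
        }
    }

    /// Append bytes without blocking. Returns false if the lock is currently
    /// held elsewhere (or the mutex is poisoned).
    fn try_append(&self, bytes: &[u8]) -> bool {
        match self.inner.try_lock() {
            Ok(mut guard) => {
                guard.extend_from_slice(bytes);
                true
            }
            Err(_) => false,
        }
    }

    fn len(&self) -> usize {
        self.inner.lock().unwrap().len()
    }
}

fn main() {
    let buffer = SharedBuffer::new(1024);
    let writer_side = buffer.clone();
    // Nothing else holds the lock, so try_lock succeeds immediately.
    assert!(buffer.try_append(b"page"));
    assert_eq!(writer_side.len(), 4);
}
```

When the two sides never hold the lock at the same time, `try_lock` always succeeds, so the mutex mainly serves to satisfy the borrow checker.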

ShiKaiWi · 2023-03-28T11:55:42Z

> @ShiKaiWi perhaps you could take a look and let me know what you think

This PR looks pretty good to me. I learned a lot from it.

github-actions bot added the parquet (Changes to the parquet crate) label Mar 28, 2023

Async writer tweaks

8660dd4

tustvold force-pushed the async-writer-tweaks branch from 3a51a2e to 8660dd4 Compare March 28, 2023 11:24

Use capacity

5ac41eb

tustvold merged commit 9eb3490 into apache:master Mar 30, 2023
