Implement multipart partition object upload #99
Conversation
Also, handle read errors during multipart uploads.
…mum payload size requirements
As of the last implementation, this is the memory profile for the dataset addition from the docs: [memory profile screenshot]

However, when running

```sql
CREATE EXTERNAL TABLE area1 STORED AS PARQUET LOCATION '/home/ubuntu/area1.parquet';
CREATE TABLE area1 AS SELECT * FROM staging.area1;
```

via a psql terminal (using area1.parquet, about 2.45 GB in size), I'm seeing: [memory profile screenshot]
src/context.rs
Outdated
```diff
@@ -304,6 +311,8 @@ pub async fn plan_to_object_store(
     }
     writer.close().map_err(DataFusionError::from).map(|_| ())?;

+    warn!("Starting upload of partition objects");
```
warn!("Starting upload of partition objects"); | |
info!("Starting upload of partition objects"); |
src/context.rs
Outdated
```diff
     }

+    let part_size = part_buffer.len();
+    warn!("Uploading part with {} bytes", part_size);
```
warn!("Uploading part with {} bytes", part_size); | |
info!("Uploading part with {} bytes", part_size); |
(or even debug?)
src/lib.rs
Outdated
```diff
+extern crate core;
```
Suggested change (remove the line):

```diff
-extern crate core;
```
src/context.rs
Outdated
```diff
+        Ok(size) => {
+            if size == 0 && part_buffer.is_empty() {
+                // We've reached EOF and there are no pending writes to flush
+                // TODO: as per the docs size = 0 doesn't actually guarantee that we've reached EOF
```
Is this still true? Weird, since there's no other way to signal EOF (unless it returns a special Err type)?
Yes, this is still true. I added a comment with a potential workaround using stream_len (a nightly-only experimental API at the moment). I've also tried to get the true file size from metadata beforehand and keep track of the total bytes read, but somehow they never matched.
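For illustration only, here is a stable-Rust approximation of that workaround; the three seeks mirror what the nightly `Seek::stream_len` does internally, and tracking total bytes read against the reported length removes the ambiguity of a zero-length read. This is a sketch under those assumptions, not the code in this PR:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Stable stand-in for the nightly-only `Seek::stream_len`: seek to the end
// to learn the total length, then restore the original position.
fn stream_len(file: &mut File) -> std::io::Result<u64> {
    let old_pos = file.stream_position()?;
    let len = file.seek(SeekFrom::End(0))?;
    file.seek(SeekFrom::Start(old_pos))?;
    Ok(len)
}

fn read_to_end_unambiguously(path: &str) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    let total = stream_len(&mut file)?;
    let mut read_so_far = 0u64;
    let mut chunk = vec![0u8; 64 * 1024];

    while read_so_far < total {
        let n = file.read(&mut chunk)? as u64;
        if n == 0 {
            // A zero-length read before consuming `total` bytes is now a
            // hard error instead of a (possibly wrong) EOF signal.
            return Err(std::io::Error::new(
                std::io::ErrorKind::UnexpectedEof,
                "zero-length read before reported stream length",
            ));
        }
        read_so_far += n;
    }
    // EOF is certain here: exactly `total` bytes have been consumed.
    Ok(read_so_far)
}
```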
As a comparison, the current main shows the following memory profiles for the two examples above: [memory profile screenshots]
… of simultaneous multipart sending
Instead of loading up the entire file into memory and uploading it in one big chunk, perform a multipart upload of small buffered file chunks.
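To make that concrete, here is a minimal sketch of the chunked approach. The `upload_part` and `complete_multipart` functions are hypothetical stand-ins for the real object store client (neither name comes from this repo), and the 5 MiB part size is an assumption based on S3's minimum payload size requirement, not a value taken from the PR:

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Hypothetical stand-ins for the object store client's multipart API.
fn upload_part(_part_number: usize, _data: &[u8]) { /* PUT one part */ }
fn complete_multipart(_total_parts: usize) { /* finalize the upload */ }

// Assumed minimum part size: S3-style stores require every part except
// the last to be at least 5 MiB.
const PART_SIZE: usize = 5 * 1024 * 1024;

fn upload_chunked(path: &str) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut part_buffer = vec![0u8; PART_SIZE];
    let mut part_number = 0;

    loop {
        // Fill the buffer completely (or until EOF) so that every part
        // except the final one satisfies the minimum payload size.
        let mut filled = 0;
        while filled < PART_SIZE {
            let n = reader.read(&mut part_buffer[filled..])?;
            if n == 0 {
                break;
            }
            filled += n;
        }
        if filled == 0 {
            break; // nothing left to send
        }
        part_number += 1;
        upload_part(part_number, &part_buffer[..filled]);
    }
    complete_multipart(part_number);
    Ok(())
}
```

The inner fill loop matters because a plain `read` may return fewer bytes than requested; without it, undersized parts could violate the store's minimum part size. Peak memory usage is bounded by a single part buffer rather than the whole file.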