
Implement multipart partition object upload #99

Merged
13 commits merged into main on Sep 9, 2022

Conversation

@gruuya (Contributor) commented Sep 6, 2022

Instead of loading up the entire file into memory and uploading it in one big chunk, perform a multipart upload of small buffered file chunks.
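A minimal sketch of the general approach (not the exact code from this PR), assuming the object_store crate's `put_multipart` API from around that time, which hands back an `AsyncWrite` that buffers data and uploads it in parts as it is written:

```rust
use object_store::{path::Path, ObjectStore};
use tokio::fs::File;
use tokio::io::{AsyncReadExt, AsyncWriteExt};

/// Stream a local file to the object store in fixed-size chunks so that only
/// one chunk needs to live in memory at a time, rather than the whole file.
async fn multipart_upload(
    store: &dyn ObjectStore,
    local_path: &str,
    dest: &Path,
) -> Result<(), Box<dyn std::error::Error>> {
    let (_multipart_id, mut writer) = store.put_multipart(dest).await?;

    let mut file = File::open(local_path).await?;
    let mut chunk = vec![0u8; 5 * 1024 * 1024]; // 5 MiB read buffer

    loop {
        let read = file.read(&mut chunk).await?;
        if read == 0 {
            break; // assume we've hit EOF
        }
        writer.write_all(&chunk[..read]).await?;
    }
    writer.shutdown().await?; // completes the multipart upload
    Ok(())
}
```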

@gruuya gruuya requested a review from mildbyte September 6, 2022 08:09
@gruuya gruuya self-assigned this Sep 6, 2022
@gruuya gruuya linked an issue Sep 6, 2022 that may be closed by this pull request
@gruuya (Contributor, Author) commented Sep 7, 2022

As of the latest implementation, this is the memory profile for the dataset-addition example from the docs (13:08:21.452Z marks the start of the actual multipart upload of the partition files):
[image: memory profile]

However, when running

CREATE EXTERNAL TABLE area1 STORED AS PARQUET LOCATION '/home/ubuntu/area1.parquet';
CREATE TABLE area1 AS SELECT * FROM staging.area1;

via a psql terminal (using area1.parquet, which is about 2.45 GB in size), I'm seeing:
[image: memory profile]

src/context.rs Outdated
@@ -304,6 +311,8 @@ pub async fn plan_to_object_store(
}
writer.close().map_err(DataFusionError::from).map(|_| ())?;

warn!("Starting upload of partition objects");
Contributor

Suggested change
warn!("Starting upload of partition objects");
info!("Starting upload of partition objects");

src/context.rs Outdated
}

let part_size = part_buffer.len();
warn!("Uploading part with {} bytes", part_size);
Contributor

Suggested change
warn!("Uploading part with {} bytes", part_size);
info!("Uploading part with {} bytes", part_size);

(or even debug?)

src/lib.rs Outdated
Comment on lines 1 to 2
extern crate core;

Contributor

Suggested change
extern crate core;

src/context.rs Outdated
Ok(size) => {
if size == 0 && part_buffer.is_empty() {
// We've reached EOF and there are no pending writes to flush
// TODO: as per the docs size = 0 doesn't actually guarantee that we've reached EOF
Contributor

Is this still true? Weird, since there's no other way to signal EOF (unless it returns a special Err type)?

Contributor Author

Yes, this is still true. I added a comment with a potential workaround using stream_len (a nightly-only experimental API at the moment).

I've also tried to get the true file size from metadata beforehand and keep track of the total read bytes, but they never matched somehow.
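For context, a minimal sketch of the kind of read loop under discussion (the chunk size and the `upload_part` helper are hypothetical stand-ins); it shows why a read returning 0 bytes is treated as EOF even though, per the `std::io::Read` docs, a return of 0 does not strictly guarantee end of file:

```rust
use std::fs::File;
use std::io::{self, Read};

const PART_SIZE: usize = 5 * 1024 * 1024; // e.g. the S3 minimum part size

fn upload_file_in_parts(path: &str) -> io::Result<()> {
    let mut file = File::open(path)?;
    let mut chunk = vec![0u8; 1024 * 1024];
    let mut part_buffer: Vec<u8> = Vec::with_capacity(PART_SIZE);

    loop {
        let size = file.read(&mut chunk)?;
        if size == 0 && part_buffer.is_empty() {
            // We've reached EOF and there are no pending writes to flush.
            // Per the Read docs, size == 0 usually (but not always) means EOF.
            break;
        }
        part_buffer.extend_from_slice(&chunk[..size]);

        // Ship a part once enough bytes are buffered, or on the final read
        if part_buffer.len() >= PART_SIZE || size == 0 {
            upload_part(&part_buffer)?;
            part_buffer.clear();
        }
    }
    Ok(())
}

// Hypothetical stand-in for the actual part upload
fn upload_part(_part: &[u8]) -> io::Result<()> {
    Ok(())
}
```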

@gruuya (Contributor, Author) commented Sep 7, 2022

For comparison, the current main branch shows the following memory profiles for the two examples above:
supply_chains
[image: memory profile]

area1
[image: memory profile]

@gruuya (Contributor, Author) commented Sep 7, 2022

Repeating the profiles once more (as a sanity check) for the current branch:
supply_chains
[image: memory profile]

area1
[image: memory profile]

@gruuya (Contributor, Author) commented Sep 9, 2022

Another set of memory profiles, now with the upload task semaphore and the object store multipart race fix (a sketch of the semaphore pattern follows the profiles below):
supply_chains
[image: memory profile]

area1
[image: memory profile]
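For reference, this is the general shape of bounding the number of concurrent upload tasks with a tokio `Semaphore`; the limit of 4 and the `upload_partition` helper are illustrative, not the actual code from this PR:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    // Allow at most 4 partition uploads to run at the same time.
    let semaphore = Arc::new(Semaphore::new(4));
    let mut handles = Vec::new();

    for partition in 0..16u32 {
        // Wait for a permit before spawning the next upload task.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // The permit is held for the duration of the upload and
            // released when it is dropped at the end of the task.
            let _permit = permit;
            upload_partition(partition).await;
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}

// Stand-in for the real partition upload.
async fn upload_partition(partition: u32) {
    println!("uploading partition {partition}");
}
```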

@gruuya gruuya merged commit a311282 into main Sep 9, 2022
@gruuya gruuya deleted the partition-multipart-upload-cu-2v11y42 branch September 9, 2022 10:43

Successfully merging this pull request may close these issues.

Workaround for having to load Parquet files in-memory before uploading them