
Multipart upload: read directly from file #494

Open
2 tasks done
ALRBP opened this issue Mar 22, 2022 · 11 comments
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue

Comments

@ALRBP commented Mar 22, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue, please leave a comment

Prerequisites

Question Description

I am currently developing a program that will (among various things) upload large files to S3.

I found how to perform normal uploads directly from files (without having to load them into RAM) in the docs, but I have an issue with multipart uploads (required for larger files).

If I understand things correctly, Client.put_object() will not perform a multipart upload, and, unlike the SDKs for other languages, this SDK does not provide a high-level API to do one automatically: you have to do it manually using Client.create_multipart_upload(), Client.upload_part(), and Client.complete_multipart_upload().

The issue is that, while for a normal upload I can just call .body(ByteStream::from_path("some/path")), the same does not seem possible with a multipart upload. Since .body() must be called for each part, creating a new ByteStream from the path each time won't work (only the beginning of the file would be read every time), and I did not find a proper way to seek in a ByteStream (other than reading individual bytes in a loop, which would cause useless IO).

The .take() function of the StreamExt trait sounds interesting, but its result is not usable as a parameter for .body(), and simply passing the same ByteStream on each iteration does not compile: the Rust compiler considers, likely for good reasons, that the ByteStream is moved when passed to .body(). It is also unclear to me whether .content_length() has any effect on how much of the stream is read, so the call might not even know when to stop reading.

The only workaround I can imagine is to read each whole part from the file into RAM and use ByteStream::from(), but that is definitely not a proper way to perform an upload from a file.
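For reference, the in-RAM workaround described above can be sketched with only the standard library (synchronous IO for brevity; with the SDK you would use tokio::fs::File with AsyncSeekExt/AsyncReadExt and then wrap the buffer with ByteStream::from(buf)). read_part here is a hypothetical helper, not an SDK API:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Hypothetical helper: read the bytes of one part, [offset, offset + len),
// into memory so the buffer can be handed to .body() as ByteStream::from(buf).
fn read_part(path: &str, offset: u64, len: u64) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = Vec::with_capacity(len as usize);
    // `take` stops after `len` bytes, so the final (shorter) part reads cleanly.
    file.take(len).read_to_end(&mut buf)?;
    Ok(buf)
}
```

This trades memory for simplicity: each part (at least 5 MB for S3) is held in RAM while it uploads.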

So, did I miss something? Is there a proper way to perform a multipart upload reading data directly from a file? Is there an issue with the API, or is this a feature left for a future version?

I am not very experienced with Rust, and I am discovering the AWS SDK as well as the third-party libraries it relies on, so maybe it's just me not understanding how to do things properly. In that case, I would appreciate some help; otherwise, I hope this SDK can be improved to allow multipart uploads directly from a file.

Platform/OS/Device

GNU/Linux (amd64)

Language Version

Rust 1.59.0

@ALRBP ALRBP added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Mar 22, 2022
@jdisanti (Contributor) commented Mar 22, 2022

We haven't implemented any high-level APIs for S3 yet, so as you've discovered, this is manual and tedious right now. I will convert this issue into a feature request so that people can upvote it, and in the meantime, I'll see if we can throw together an example.

@jdisanti jdisanti added feature-request A feature should be added or improved. and removed needs-triage This issue or PR still needs to be triaged. labels Mar 22, 2022
@Velfi Velfi removed the guidance Question that needs advice or information. label Apr 1, 2022
@darrow-olykos commented Apr 25, 2022

Hi @Velfi and @jdisanti, out of curiosity I started this trivial example (with only one part). Once this basic example works, I can wire up the Rust logic that calls upload_part for each chunk of bytes read from a large file.

use std::error::Error;
use std::path::Path;

use aws_sdk_s3::types::ByteStream; // re-export location may differ between SDK versions
use aws_sdk_s3::Client;

pub async fn upload_object_with_multipart(
    client: &Client,
    bucket_name: &str,
    file_path: &str,
    object_key: &str,
) -> Result<(), Box<dyn Error>> {
    // instantiate a multipart upload
    let create_multipart_upload_response = client
        .create_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .send()
        .await?;

    // grab the upload_id that we'll use when uploading parts
    let upload_id = match create_multipart_upload_response.upload_id() {
        Some(upload_id) => upload_id.to_owned(),
        None => panic!("Something went wrong when creating multipart upload."),
    };

    // upload a file in parts
    // TODO: after this trivial example works, demonstrate this with actual
    // multiple parts, not just as one part (single byte stream)
    let body = match ByteStream::from_path(Path::new(file_path)).await {
        Ok(byte_stream) => byte_stream,
        Err(e) => panic!("{}", e),
    };

    let _upload_part_response = client
        .upload_part()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .part_number(1)
        .body(body)
        .send()
        .await?;

    let list_parts_response = client
        .list_parts()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .send()
        .await?;

    let _parts_list = list_parts_response.parts();

    // complete the multipart upload
    client
        .complete_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(upload_id)
        // .body(parts_list) // ? .body() does not exist on fluent_builders::CompleteMultipartUpload, even though body is required according to AWS S3 CompleteMultipartUpload documentation
        .send()
        .await?;

    println!("Uploaded file in parts: {}", file_path);
    Ok(())
}

Is there a way to set the POST body on the S3 client's complete_multipart_upload builder?
If not, is adding one necessary pre-work before an example with the AWS S3 Rust SDK is actionable?

Without the ability to set the HTTP POST body on the underlying HTTP call, the AWS S3 API will respond with an error like:

Error: ServiceError { err: CompleteMultipartUploadError { kind: Unhandled(Error { code: Some("InvalidRequest"), message: Some("You must specify at least one part"), 

Source: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html#API_CompleteMultipartUpload_RequestSyntax

If producing an example is workable, I'd be happy to help do so. Please let me know if I'm missing some obvious method that I should be using instead.

P.S. Searching for this error message on Stack Overflow shows it caused confusion for other language SDKs as well (for example PHP: https://stackoverflow.com/a/31413193/18257455 ). Those developers resolved it by passing an array of parts to a MultipartUpload method available on PHP's completeMultipartUpload builder chain.

In this Rust SDK, there does not (at a glance) appear to be a way for an SDK user to construct the CompletedMultipartUpload object, which looks to be the equivalent type that should be set to satisfy the AWS S3 API requirement (the POST request body for a CompleteMultipartUpload action must be set, as described in the link above).

If .multipart_upload(_) is the method on the complete_multipart_upload builder that lets us set the required list of parts, then the question remains:

How do we construct a CompletedMultipartUpload (note the ed) struct? It is marked as non-exhaustive, so presumably we must rely on a builder pattern and cannot use struct literal syntax to instantiate it?
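The restriction can be illustrated in plain Rust with a made-up stand-in (CompletedPartMock and its builder are hypothetical, not SDK types): #[non_exhaustive] forbids struct-literal construction outside the defining crate, so consumers must go through a builder.

```rust
// A minimal mock of the SDK pattern. Outside the defining crate, a
// #[non_exhaustive] struct cannot be built with literal syntax, so a
// consuming fluent builder is provided instead.
#[non_exhaustive]
#[derive(Debug, PartialEq)]
pub struct CompletedPartMock {
    pub e_tag: Option<String>,
    pub part_number: i32,
}

#[derive(Default)]
pub struct CompletedPartMockBuilder {
    e_tag: Option<String>,
    part_number: i32,
}

impl CompletedPartMockBuilder {
    // Each setter consumes and returns the builder, enabling chaining.
    pub fn e_tag(mut self, v: impl Into<String>) -> Self {
        self.e_tag = Some(v.into());
        self
    }
    pub fn part_number(mut self, v: i32) -> Self {
        self.part_number = v;
        self
    }
    pub fn build(self) -> CompletedPartMock {
        CompletedPartMock { e_tag: self.e_tag, part_number: self.part_number }
    }
}
```

The SDK's CompletedPart::builder() follows this same shape.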

Thanks!

@rcoh (Contributor) commented Apr 26, 2022

I believe this is the field you are missing: https://docs.rs/aws-sdk-s3/0.10.1/aws_sdk_s3/client/fluent_builders/struct.CompleteMultipartUpload.html#method.multipart_upload

it will be serialized into the body

@darrow-olykos commented Apr 26, 2022

@rcoh Thanks! I took another look at the method and now see how the SDK's fluent builders are structured. While this is a bit verbose, I at least have a working multipart example with one part:

(See #494 (comment) for the section of code that does the upload)

// just showing the parts relevant for the HTTP POST body serialization

// get the list of uploaded parts
let parts_list = list_parts_response.parts().unwrap();

// construct a Vec of CompletedParts (required to build CompletedMultipartUpload)
let completed_parts: Vec<aws_sdk_s3::model::CompletedPart> = parts_list
    .iter()
    .map(|part| {
        aws_sdk_s3::model::CompletedPart::builder()
            .e_tag(part.e_tag.as_ref().unwrap())
            .part_number(part.part_number)
            .build()
    })
    .collect();

// construct CompletedMultipartUpload (required to construct CompleteMultipartUpload
// in a way that correctly serializes the part list to the HTTP POST body)
let completed_multipart_upload = aws_sdk_s3::model::CompletedMultipartUpload::builder()
    .set_parts(Some(completed_parts))
    .build();

// complete the multipart upload
client
    .complete_multipart_upload()
    .bucket(bucket_name)
    .key(object_key)
    .upload_id(upload_id)
    .multipart_upload(completed_multipart_upload)
    .send()
    .await?;

Output:

S3 client version: 0.10.1
Region:            us-east-1
Bucket:            test-bucket-for-multipart-1919
Object Key:        abcdef123123
File uploaded:     ./crates/s3/create_object/tests/testfile.txt

Uploaded file in parts: ./crates/s3/create_object/tests/testfile.txt

Next, this example can be cleaned up to demonstrate the Rust way of reading bytes of file incrementally and what that could look like. (My trivial multipart example only uses one "part")

If it's OK, I'd like to take a stab at that tonight or tomorrow.

@rcoh (Contributor) commented Apr 26, 2022

nice! yeah this looks good so far. Note that the eventual home for this is https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/rust_dev_preview

the examples in this repo get built from there at release / codegen time

The multi-part will be a bit tricky until we add support for #519 — if you were interested in making a PR for that we'd gladly accept it.

@darrow-olykos commented May 1, 2022

@rcoh Thank you for that info! Just curious: do you know (or know someone who would) what the relationship is between aws_sdk_s3 and aws_smithy_http?

Context

Objective: Implement example AWS S3 Rust SDK Multipart upload (potentially in a simpler way than what I see in #519?) (Edit: I see now that the file-seeking logic that I see in #519 is what an SDK consumer will need to write if the SDK maintainer's goal is to add this from_file_chunk method on to ByteStream.)

Actions: I have an example that uses the AWS S3 SDK's ByteStream to read from a file and StreamExt's try_next() to call S3's upload_part operation for each part. I think things are okay, except that the default buffer size is 4096 bytes while the minimum part size the S3 API accepts is 5 MB.
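To make the part-size constraint concrete, here is a small, hypothetical planning helper (plan_parts is not an SDK function) that splits a file into (offset, length) pairs while clamping the requested part size to the 5 MiB S3 minimum:

```rust
// S3 requires every part except the last to be at least 5 MiB.
const MIN_PART_SIZE: u64 = 5 * 1024 * 1024;

// Hypothetical planner (not an SDK API): split `file_size` bytes into
// (offset, length) pairs of at most `part_size` bytes each. The final
// part may be shorter than the minimum, which S3 allows.
fn plan_parts(file_size: u64, requested_part_size: u64) -> Vec<(u64, u64)> {
    let part_size = requested_part_size.max(MIN_PART_SIZE);
    let mut parts = Vec::new();
    let mut offset = 0;
    while offset < file_size {
        let len = part_size.min(file_size - offset);
        parts.push((offset, len));
        offset += len;
    }
    parts
}
```

Each (offset, length) pair could then drive one upload_part call, with part numbers assigned from the pair's index.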

Significant detail:
When inspecting the source code of various crates, I noticed that this ByteStream shows up in two different AWS crates: aws_sdk_s3 and aws_smithy_http.

Why is this significant:
This seems significant to my objective because only one of these two otherwise-identical ByteStream structs has a read_from method, which reveals an FsBuilder: the type that lets me override the ByteStream's default buffer size of 4096 bytes. (I hope to use this to specify a size greater than the 5 MB minimum that the AWS S3 UploadPart API requires.) And the ByteStream that has this read_from method is not the one exposed by aws_sdk_s3.

Hopefully that makes sense; I just started looking at this. I know the AWS Rust SDK is only in preview mode, but I'm excited to see it get used more!

Thanks!

@darrow-olykos

Oh, whoops. You've answered this question in #519: I see that aws-smithy-http is the source of truth and that aws-sdk-s3 is generated from it. I also see that the author of #519 added a from_file_chunk method to the ByteStream (just in aws-sdk-s3).

@rcoh (Contributor) commented May 2, 2022

The ByteStream struct is the same; it's just re-exported for convenience. aws_smithy_http is the crate that contains the core HTTP abstractions the SDK is built on. SDK crates use it to actually send/receive HTTP requests and re-export some of its types for convenience.

@Velfi (Contributor) commented May 3, 2022

I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a ByteStream from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.

@ALRBP (Author) commented May 3, 2022

I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a ByteStream from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.

Great!
I think this is exactly what is needed.

@forficate

It'd be nice if the multipart upload took an AsyncRead instead of a file.

The AsyncRead would work with raw files and would also allow compressing data from a file as a lazy stream while uploading to S3.
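As a sketch of how such a reader-based API might look, here is a synchronous std::io::Read analogue (for_each_chunk is hypothetical; in the real SDK the upload callback would call client.upload_part() with ByteStream::from(chunk), and the async version would take impl AsyncRead):

```rust
use std::io::Read;

// Hypothetical sketch: drive a multipart upload from any `Read` source
// (a raw file, a compressing adapter, a network stream) by pulling
// fixed-size chunks and handing each one to an upload callback.
fn for_each_chunk<R: Read>(
    mut src: R,
    chunk_size: usize,
    mut upload: impl FnMut(i32, Vec<u8>) -> std::io::Result<()>,
) -> std::io::Result<()> {
    let mut part_number = 1; // S3 part numbers start at 1
    loop {
        let mut chunk = Vec::with_capacity(chunk_size);
        // `take` caps the read at `chunk_size`; the final chunk may be shorter.
        (&mut src).take(chunk_size as u64).read_to_end(&mut chunk)?;
        if chunk.is_empty() {
            break; // EOF
        }
        upload(part_number, chunk)?;
        part_number += 1;
    }
    Ok(())
}
```

Because the function only sees a Read, wrapping the file in, say, a compression adapter that implements Read would work without changing the upload loop.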
