
Multipart upload: read directly from file #494

Open
2 tasks done
ALRBP opened this issue Mar 22, 2022 · 11 comments
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue

Comments

@ALRBP commented Mar 22, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue, please leave a comment

Prerequisites

Question Description

I am currently developing a program that will (among various things) upload large files to S3.

I found how to perform normal uploads directly from files (without having to load them into RAM) in the docs, but I have an issue with multipart uploads (required for larger files).

If I understand things correctly, Client.put_object() will not perform a multipart upload, and, unlike the SDKs for other languages, this SDK does not provide a high-level API to do one automatically: you have to do it manually using Client.create_multipart_upload(), Client.upload_part(), and Client.complete_multipart_upload().

The issue is that, while for a normal upload I can just call .body(ByteStream::from_path("some/path")), the same does not seem possible with a multipart upload. Since .body() must be called for each part, creating a new ByteStream from the path each time won't work (only the beginning of the file would be read every time), and I did not find a proper way to seek in a ByteStream (other than reading individual bytes in a loop, which would cause useless IO).

The .take() function of the StreamExt trait sounds interesting, but its result is not usable as a parameter for .body(), and simply passing the same ByteStream on each iteration does not compile: the Rust compiler considers, likely for good reasons, that the ByteStream is moved when passed to .body(). It is also unclear to me whether .content_length() has any effect on how much of the stream is read, so the call might not even know when to stop reading.

The only workaround I can imagine is to read each whole part from the file into RAM and use ByteStream::from(), but that is definitely not a proper way to perform an upload from a file.
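For reference, the in-RAM workaround described above can be sketched with only the standard library (synchronous IO for brevity; with the SDK you would use tokio::fs::File with AsyncSeekExt/AsyncReadExt and then wrap the buffer with ByteStream::from(buf)). read_part here is a hypothetical helper, not an SDK API:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Hypothetical helper: read the bytes of one part, [offset, offset + len),
// into memory so the buffer can be handed to .body() as ByteStream::from(buf).
fn read_part(path: &str, offset: u64, len: u64) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = Vec::with_capacity(len as usize);
    // `take` stops after `len` bytes, so the final (shorter) part reads cleanly.
    file.take(len).read_to_end(&mut buf)?;
    Ok(buf)
}
```

This trades memory for simplicity: each part (at least 5 MB for S3) is held in RAM while it uploads.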

So, did I miss something? Is there a proper way to perform a multipart upload reading data directly from a file? Is there an issue with the API, or is this a feature left for a future version?

I am not very experienced with Rust, and I am discovering the AWS SDK as well as the third-party libraries it relies on, so maybe it's just me not understanding how to do things properly. In that case, I would appreciate some help; otherwise, I hope this SDK can be improved to allow multipart uploads directly from a file.

Platform/OS/Device

GNU/Linux (amd64)

Language Version

Rust 1.59.0

@ALRBP ALRBP added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Mar 22, 2022
@jdisanti (Contributor) commented Mar 22, 2022

We haven't implemented any high-level APIs for S3 yet, so as you've discovered, this is manual and tedious right now. I will convert this issue into a feature request so that people can upvote it, and in the meantime, I'll see if we can throw together an example.

@jdisanti jdisanti added feature-request A feature should be added or improved. and removed needs-triage This issue or PR still needs to be triaged. labels Mar 22, 2022
@Velfi Velfi removed the guidance Question that needs advice or information. label Apr 1, 2022
@darrow-olykos commented Apr 25, 2022

Hi @Velfi and @jdisanti, out of curiosity I started this trivial example (with only one part). Once this basic example works, I can wire up the Rust logic that calls upload_part for each chunk of bytes read from a large file.

use std::error::Error;
use std::path::Path;

use aws_sdk_s3::types::ByteStream; // re-export location may differ between SDK versions
use aws_sdk_s3::Client;

pub async fn upload_object_with_multipart(
    client: &Client,
    bucket_name: &str,
    file_path: &str,
    object_key: &str,
) -> Result<(), Box<dyn Error>> {
    // instantiate a multipart upload
    let create_multipart_upload_response = client
        .create_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .send()
        .await?;

    // grab the upload_id that we'll use when uploading parts
    let upload_id = match create_multipart_upload_response.upload_id() {
        Some(upload_id) => upload_id.to_owned(),
        None => panic!("Something went wrong when creating multipart upload."),
    };

    // upload a file in parts
    // TODO: after this trivial example works, demonstrate this with actual
    // multiple parts, not just as one part (single byte stream)
    let body = match ByteStream::from_path(Path::new(file_path)).await {
        Ok(byte_stream) => byte_stream,
        Err(e) => panic!("{}", e),
    };

    let _upload_part_response = client
        .upload_part()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .part_number(1)
        .body(body)
        .send()
        .await?;

    let list_parts_response = client
        .list_parts()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(&upload_id)
        .send()
        .await?;

    let _parts_list = list_parts_response.parts();

    // complete the multipart upload
    client
        .complete_multipart_upload()
        .bucket(bucket_name)
        .key(object_key)
        .upload_id(upload_id)
        // .body(parts_list) // ? .body() does not exist on fluent_builders::CompleteMultipartUpload, even though body is required according to AWS S3 CompleteMultipartUpload documentation
        .send()
        .await?;

    println!("Uploaded file in parts: {}", file_path);
    Ok(())
}

Is there a way to set the POST body on the S3 client's complete_multipart_upload builder?
If not, is adding one necessary pre-work before an example with the AWS S3 Rust SDK is actionable?

Without the ability to set the HTTP POST body on the underlying HTTP call, the AWS S3 API will respond with an error like:

Error: ServiceError { err: CompleteMultipartUploadError { kind: Unhandled(Error { code: Some("InvalidRequest"), message: Some("You must specify at least one part"), 

Source: https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html#API_CompleteMultipartUpload_RequestSyntax

If producing an example is workable, I'd be happy to help do so. Please let me know if I'm missing some obvious method that I should be using instead.

P.S. Searching for this error message on Stack Overflow shows it caused confusion for other language SDKs as well (for example PHP: https://stackoverflow.com/a/31413193/18257455 ). Those developers resolved it by passing an array of parts to a MultipartUpload method available on PHP's completeMultipartUpload builder chain.

In this Rust SDK, there does not (at a glance) appear to be a way for an SDK user to construct the CompletedMultipartUpload object, which looks to be the equivalent type that should be set to satisfy the AWS S3 API requirement (the POST request body for a CompleteMultipartUpload action must be set, as described in the link above).

If .multipart_upload(_) is the method on the complete_multipart_upload builder that lets us set the required list of parts, then the question remains:

How do we construct a CompletedMultipartUpload (note the ed) struct? It is marked as non-exhaustive, so presumably we must rely on a builder pattern and cannot use struct literal syntax to instantiate it?
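The restriction can be illustrated in plain Rust with a made-up stand-in (CompletedPartMock and its builder are hypothetical, not SDK types): #[non_exhaustive] forbids struct-literal construction outside the defining crate, so consumers must go through a builder.

```rust
// A minimal mock of the SDK pattern. Outside the defining crate, a
// #[non_exhaustive] struct cannot be built with literal syntax, so a
// consuming fluent builder is provided instead.
#[non_exhaustive]
#[derive(Debug, PartialEq)]
pub struct CompletedPartMock {
    pub e_tag: Option<String>,
    pub part_number: i32,
}

#[derive(Default)]
pub struct CompletedPartMockBuilder {
    e_tag: Option<String>,
    part_number: i32,
}

impl CompletedPartMockBuilder {
    // Each setter consumes and returns the builder, enabling chaining.
    pub fn e_tag(mut self, v: impl Into<String>) -> Self {
        self.e_tag = Some(v.into());
        self
    }
    pub fn part_number(mut self, v: i32) -> Self {
        self.part_number = v;
        self
    }
    pub fn build(self) -> CompletedPartMock {
        CompletedPartMock { e_tag: self.e_tag, part_number: self.part_number }
    }
}
```

The SDK's CompletedPart::builder() follows this same shape.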

Thanks!

@rcoh (Contributor) commented Apr 26, 2022

I believe this is the field you are missing: https://docs.rs/aws-sdk-s3/0.10.1/aws_sdk_s3/client/fluent_builders/struct.CompleteMultipartUpload.html#method.multipart_upload

it will be serialized into the body

@darrow-olykos commented Apr 26, 2022

@rcoh Thanks! I took another look at the method and now see how the SDK's fluent builders are structured. While this is a bit verbose, I at least have a working multipart example with one part:

(See #494 (comment) for the section of code that does the upload)

// just showing the parts relevant for the HTTP POST body serialization

// get the list of uploaded parts
let parts_list = list_parts_response.parts().unwrap();

// construct a Vec of CompletedParts (required to build CompletedMultipartUpload)
let completed_parts: Vec<aws_sdk_s3::model::CompletedPart> = parts_list
    .iter()
    .map(|part| {
        aws_sdk_s3::model::CompletedPart::builder()
            .e_tag(part.e_tag.as_ref().unwrap())
            .part_number(part.part_number)
            .build()
    })
    .collect();

// construct CompletedMultipartUpload (required to construct CompleteMultipartUpload
// in a way that correctly serializes the part list to the HTTP POST body)
let completed_multipart_upload = aws_sdk_s3::model::CompletedMultipartUpload::builder()
    .set_parts(Some(completed_parts))
    .build();

// complete the multipart upload
client
    .complete_multipart_upload()
    .bucket(bucket_name)
    .key(object_key)
    .upload_id(upload_id)
    .multipart_upload(completed_multipart_upload)
    .send()
    .await?;

Output:

S3 client version: 0.10.1
Region:            us-east-1
Bucket:            test-bucket-for-multipart-1919
Object Key:        abcdef123123
File uploaded:     ./crates/s3/create_object/tests/testfile.txt

Uploaded file in parts: ./crates/s3/create_object/tests/testfile.txt

Next, this example can be cleaned up to demonstrate the Rust way of reading bytes of file incrementally and what that could look like. (My trivial multipart example only uses one "part")

If it's OK, I'd like to take a stab at that tonight or tomorrow.

@rcoh (Contributor) commented Apr 26, 2022

nice! yeah this looks good so far. Note that the eventual home for this is https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/rust_dev_preview

the examples in this repo get built from there at release / codegen time

The multi-part will be a bit tricky until we add support for #519 — if you were interested in making a PR for that we'd gladly accept it.

@darrow-olykos commented May 1, 2022

@rcoh Thank you for that info! Just curious: do you know (or know someone who would) what the relationship is between aws_sdk_s3 and aws_smithy_http?

Context

Objective: Implement example AWS S3 Rust SDK Multipart upload (potentially in a simpler way than what I see in #519?) (Edit: I see now that the file-seeking logic that I see in #519 is what an SDK consumer will need to write if the SDK maintainer's goal is to add this from_file_chunk method on to ByteStream.)

Actions: I have an example that uses the AWS S3 SDK's ByteStream to read from a file and StreamExt's try_next() to call S3's upload_part operation for each part. I think things are okay, except that the default buffer size is 4096 bytes while the minimum part size the S3 API accepts is 5 MB.
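To make the part-size constraint concrete, here is a small, hypothetical planning helper (plan_parts is not an SDK function) that splits a file into (offset, length) pairs while clamping the requested part size to the 5 MiB S3 minimum:

```rust
// S3 requires every part except the last to be at least 5 MiB.
const MIN_PART_SIZE: u64 = 5 * 1024 * 1024;

// Hypothetical planner (not an SDK API): split `file_size` bytes into
// (offset, length) pairs of at most `part_size` bytes each. The final
// part may be shorter than the minimum, which S3 allows.
fn plan_parts(file_size: u64, requested_part_size: u64) -> Vec<(u64, u64)> {
    let part_size = requested_part_size.max(MIN_PART_SIZE);
    let mut parts = Vec::new();
    let mut offset = 0;
    while offset < file_size {
        let len = part_size.min(file_size - offset);
        parts.push((offset, len));
        offset += len;
    }
    parts
}
```

Each (offset, length) pair could then drive one upload_part call, with part numbers assigned from the pair's index.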

Significant detail:
When inspecting the source code of various crates, I noticed that this ByteStream shows up in two different AWS crates: aws_sdk_s3 and aws_smithy_http.

Why is this significant:
This seems significant to my objective because only one of these two otherwise-identical ByteStream structs has a read_from method, which reveals an FsBuilder: the type that lets me override the ByteStream's default buffer size of 4096 bytes. (I hope to use this to specify a size greater than the 5 MB minimum that the AWS S3 UploadPart API requires.) And the ByteStream that has this read_from method is not the one exposed by aws_sdk_s3.

Hopefully that makes sense; I just started looking at this. I know the AWS Rust SDK is only in preview mode, but I'm excited to see it get used more!

Thanks!

@darrow-olykos

Oh, whoops. You've answered this question in #519: I see that aws-smithy-http is the source of truth and that aws-sdk-s3 is generated from it. I also see that the author of #519 added a from_file_chunk method to the ByteStream (just in aws-sdk-s3).

@rcoh (Contributor) commented May 2, 2022

The ByteStream struct is the same; it's just re-exported for convenience. aws_smithy_http is the crate that contains the core HTTP abstractions the SDK is built on. SDK crates use it to actually send/receive HTTP requests and re-export some of its types for convenience.

@Velfi (Contributor) commented May 3, 2022

I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a ByteStream from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.

@ALRBP (Author) commented May 3, 2022

I have a PR in progress that should make this much easier. It allows for setting a byte offset when creating a ByteStream from a file and also includes a test that demonstrates how you could dynamically divide a file into a given number of chunks.

Great!
I think this is exactly what is needed.

@forficate

It'd be nice if the multipart upload took an AsyncRead instead of a file.

The AsyncRead would work with raw files and would also allow compressing data from a file as a lazy stream while uploading to S3.
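As a sketch of how such a reader-based API might look, here is a synchronous std::io::Read analogue (for_each_chunk is hypothetical; in the real SDK the upload callback would call client.upload_part() with ByteStream::from(chunk), and the async version would take impl AsyncRead):

```rust
use std::io::Read;

// Hypothetical sketch: drive a multipart upload from any `Read` source
// (a raw file, a compressing adapter, a network stream) by pulling
// fixed-size chunks and handing each one to an upload callback.
fn for_each_chunk<R: Read>(
    mut src: R,
    chunk_size: usize,
    mut upload: impl FnMut(i32, Vec<u8>) -> std::io::Result<()>,
) -> std::io::Result<()> {
    let mut part_number = 1; // S3 part numbers start at 1
    loop {
        let mut chunk = Vec::with_capacity(chunk_size);
        // `take` caps the read at `chunk_size`; the final chunk may be shorter.
        (&mut src).take(chunk_size as u64).read_to_end(&mut chunk)?;
        if chunk.is_empty() {
            break; // EOF
        }
        upload(part_number, chunk)?;
        part_number += 1;
    }
    Ok(())
}
```

Because the function only sees a Read, wrapping the file in, say, a compression adapter that implements Read would work without changing the upload loop.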
