
Added S3::Object#upload_stream #1711

Merged
13 commits merged into aws:master on Jun 11, 2018

Conversation

Fonsan
Contributor

@Fonsan Fonsan commented Feb 11, 2018

This solves #1580 and #1351

s3 = Aws::S3::Resource.new
s3.bucket('aws-sdk').object('key').upload_stream do |write_stream|
  10.times { write_stream << 'foo' }
end

s3.bucket('aws-sdk').object('key').upload_stream do |write_stream|
  IO.copy_stream(IO.popen('ls'), write_stream)
end

s3.bucket('aws-sdk').object('key').upload_stream do |write_stream|
  IO.copy_stream(STDIN, write_stream)
end

Added a utility method for uploading a stream to Amazon S3 in a streaming manner. The method automatically manages uploading large streams using the multipart upload APIs. The data never touches disk, which makes it a good fit for environments with limited disk access. In theory it should also be faster than #put_object due to the concurrent implementation. I have left out the #multipart_threshold option found in #upload_file, but it could easily be added if needed.
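
For context, a hedged sketch of the low-level multipart calls that #upload_stream manages on the caller's behalf; the bucket and key are placeholders and error handling (aborting the upload on failure) is omitted, so this is an illustration rather than the PR's actual implementation:

require 'aws-sdk-s3'

client = Aws::S3::Client.new
mpu    = client.create_multipart_upload(bucket: 'aws-sdk', key: 'key')

# Each chunk is sent as a numbered part; every part except the last must be
# at least 5 MB, which is why parts are buffered before being uploaded.
part = client.upload_part(
  bucket: 'aws-sdk', key: 'key',
  upload_id: mpu.upload_id,
  part_number: 1,
  body: 'foo' * 10
)

client.complete_multipart_upload(
  bucket: 'aws-sdk', key: 'key',
  upload_id: mpu.upload_id,
  multipart_upload: { parts: [{ etag: part.etag, part_number: 1 }] }
)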

An important aspect is that back pressure is implemented by having #<< block when the underlying upload resources are fully saturated, to avoid unbounded memory consumption. I have not yet found a good way to test this.
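
Not the PR's actual code, but a minimal sketch of how a bounded queue gives exactly this blocking behaviour; fake_upload_part is a hypothetical stand-in for the real per-part upload call:

def fake_upload_part(part)
  sleep 0.01 # stand-in for the network call that uploads one part
end

queue   = SizedQueue.new(10) # at most 10 buffered parts; #push blocks beyond that
workers = Array.new(10) do
  Thread.new do
    while (part = queue.pop)
      fake_upload_part(part)
    end
  end
end

# Plays the role of write_stream#<<: when all workers are busy and the queue
# is full, the producer blocks here instead of buffering unbounded data.
100.times { |i| queue.push("chunk-#{i}") }
10.times { queue.push(nil) } # tell each worker to finish
workers.each(&:join)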

The feature was greatly inspired by the existing #upload_file and #get_object methods, and the implementation intentionally tries to match the existing style of the project.

A utility method for uploading chunks in a streaming manner to Amazon S3. This method automatically manages uploading large streams using the multipart upload APIs. The data never touches disk, which makes it a good fit for environments with limited disk access.

The feature was greatly inspired by the existing #upload_file method, and the implementation intentionally tries to match the existing style.
@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

I am toying with the idea of implementing an #upload_stream that yields an IO pipe, enabling the use of IO.copy_stream and friends, which offer built-in back pressure.

@Fonsan Fonsan changed the title from "Added S3::Object#upload_chunks" to "Added S3::Object#upload_stream" on Feb 11, 2018
@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

The documentation should discuss memory usage: the default strategy uses 10 threads, and when they are saturated each is uploading 5 MB of data, which brings memory consumption in #upload_stream to roughly 50 MB. I am guessing that lowering the default number of threads would cause performance losses in the 90% of use cases where 50 MB is readily available.

We could offer a strategy that has the same memory characteristics as #upload_file (a few KB by my estimates) by writing a tempfile for each part. But for most use cases, writing and reading via disk seems like a detour.

@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

The tempfile feature is now implemented as an optional boolean that forces minimal memory usage by never keeping a full part's content in memory.

A part_size option has been added to customize the size of parts. I do not have any insight into what would be a good default, so for now it is 5 MB.
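
A hedged usage sketch combining the two options described above; the option names (:tempfile and :part_size) follow this PR's description and could differ in the released API:

s3  = Aws::S3::Resource.new
obj = s3.bucket('aws-sdk').object('key')

# Stream STDIN to S3 using 10 MB parts, buffering each part in a tempfile
# instead of holding it in memory.
obj.upload_stream(tempfile: true, part_size: 10 * 1024 * 1024) do |write_stream|
  IO.copy_stream(STDIN, write_stream)
end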

@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

The failing JRuby spec is due to a bug in JRuby 9.1.5.0; it is fixed from 9.1.15.0 onward.
jruby/jruby@cd40c31

@tmedford

Could we get this merged in?

@tmedford

@Fonsan How would I reference your version in my local Gemfile?

@Fonsan
Contributor Author

Fonsan commented Feb 14, 2018

@tmedford Funnily enough, I had the same problem and went down the rabbit hole that is how Bundler searches for gemspecs. Unfortunately I found no way of using the :git directive, since Bundler does not traverse the folder structure, which it would need to do in order to find gems/aws-sdk-s3/aws-sdk-s3.gemspec. If you check out the project locally, run gem build aws-sdk-s3.gemspec, and then gem install the produced .gem inside your project's gemset, you might be able to get it to work.

@tmedford

tmedford commented Feb 14, 2018

@Fonsan So if I absolutely needed the stream and tempfile (memory) option, I would have to deploy my own gem version and source from that until this is merged into the repo and another release is pushed? Currently I am seeing a 200 MB file reach up to 600 MB of memory when using upload_file (multipart). Messed around with the threads to no avail. Also, do you have a file example for the stream that I could validate against?

@cjyclaire
Contributor

Thanks for the ping, will take a look shortly

@Fonsan
Contributor Author

Fonsan commented Feb 14, 2018

@tmedford

So if I absolutely needed the stream and tempfile (memory) option, I would have to deploy my own gem version and source from that until this is merged into the repo and another release is pushed?

There might be a better way, but that is the suggestion I could come up with.

Currently I am seeing a 200 MB file reach up to 600 MB of memory when using upload_file (multipart). Messed around with the threads to no avail.

Looking at the implementation, I would not have guessed that #upload_file would produce a large memory footprint. Given that the method has been around for a few years, it is unlikely that there is a correlation between the file size and the memory footprint. It would be very interesting if you could come up with a clean example case that uses excessive amounts of memory.

Also, do you have a file example for the stream that I could validate against?

An example usage of #upload_stream can be found in the step definitions here: https://github.com/aws/aws-sdk-ruby/pull/1711/files#diff-08088f982fdbffd39e194fe7bb233d09

@Fonsan
Contributor Author

Fonsan commented Feb 24, 2018

@cjyclaire Let me know if you have any questions :)

janko added a commit to shrinerb/shrine that referenced this pull request Mar 3, 2018
Currently Shrine requires all IO objects to respond to #size. This is
not necessary, because storages like FileSystem and S3 allow uploading
IO objects of unknown size (with some modifications). The size of the IO
object can be unknown if it's a pipe, socket, or a general stream of
data.

We remove #size from the required IO methods. We also deprecate the
Shrine::IO_METHODS constant, because now that it's public we cannot
remove :size from it without affecting backwards compatibility.
Shrine::IO_METHODS will be private in Shrine 3.

We modify the S3 storage to use multipart upload in case the size is
unknown. There is aws/aws-sdk-ruby#1711, which adds
Aws::S3::Object#upload_stream, but it hasn't been merged at the time of
writing, and we unfortunately don't control the aws-sdk-s3 version, so
we have to support older versions anyway.

We also modify FileSystem and S3 storages to backfill the "size"
metadata of the IO object in case it's not known.

Fixes #252
@tmedford

@cjyclaire can we get an update?

@awood45
Member

awood45 commented Mar 16, 2018

Sorry for the silence; let me give an update. We're looking at this as a PR we intend to merge; however, we're very limited on bandwidth at this point to give it a proper review. It may be a week, or a few weeks, before we can review this. I can say it's fairly high on the priority list.

@Fonsan
Contributor Author

Fonsan commented Mar 16, 2018

@awood45 great!

@Fonsan
Contributor Author

Fonsan commented Apr 27, 2018

@awood45 What is the next step?

@awood45
Member

awood45 commented Jun 11, 2018

As for the questions on memory usage, we've certainly had cases with our multipart support where performance tweaks were later needed. I'll take another pass with that in mind, but if it's functional and not causing regressions to existing methods, I'm more inclined to accept the PR.

@awood45
Member

awood45 commented Jun 11, 2018

Nothing is standing out as a red flag memory usage wise from what I can read of the implementation. Integration tests are passing for me, so I'm going to merge this and it should go out with our next SDK release!

@awood45 awood45 merged commit e8f3dee into aws:master Jun 11, 2018
@awood45
Member

awood45 commented Jun 11, 2018

Belay that, this is failing in JRuby (and not just the inability to install the older version). I'm looking at why now.

@awood45
Member

awood45 commented Jun 11, 2018

Looks like this line is sometimes problematic in JRuby - I'm working on isolating why: https://github.com/aws/aws-sdk-ruby/pull/1711/files#diff-739272f1b1c14844fd5d13e3e702850aR108

@awood45
Member

awood45 commented Jun 11, 2018

Furthermore, in JRuby 9.1.6.0 where I was testing, this seems to be the offending line: https://github.com/jruby/jruby/blob/9.1.6.0/core/src/main/java/org/jruby/RubyIO.java#L4083

@awood45
Member

awood45 commented Jun 11, 2018

I'm going to take some time to try and debug this - but if I can't figure out the cause of the breakage, I'll need to revert this change, and create a new PR (or let you re-submit this PR).

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

@tmedford Did you read my earlier comment? #1711 (comment)

If keeping support for older patch versions of JRuby is critical, then the equivalent of

bytes_copied = IO.copy_stream(read_pipe, temp_io, @part_size)

without the #copy_stream call would be:

buffer = read_pipe.read(@part_size)
bytes_copied = buffer.bytesize
temp_io.write(buffer)

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

9.1.6.0 is 18 months old, and there have been more than 10 patch releases for 9.1 since:
http://jruby.org/2016/11/09/jruby-9-1-6-0.html

Where can I read up on the policy you have for supporting older patch releases of Rubies?

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

@awood45 I pinged the wrong guy above.

@awood45
Member

awood45 commented Jun 11, 2018

There's some chatter about JRuby's implementation of IO.copy_stream that indicates it may be buggy.

@awood45
Member

awood45 commented Jun 11, 2018

@awood45
Member

awood45 commented Jun 11, 2018

We do place importance on keeping old versions of Ruby/JRuby working; we don't want to break customers or force upgrades whenever avoidable. I'll try dropping in that code equivalent. Is there a reason for using IO.copy_stream that we would lose by switching to the equivalent code?

@awood45
Member

awood45 commented Jun 11, 2018

That equivalent code often fails with undefined method `bytesize' for nil:NilClass.

I'm investigating, but will need to resolve that.

@awood45
Member

awood45 commented Jun 11, 2018

It does look like JRuby fixed this as of 9.1.16.0. So, we know that this is a JRuby standard library issue.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

While I do agree on keeping compatibility with old versions of any Ruby, the limit for me really lies at the minor version (https://semver.org/). Since any patch release may contain a bug for some special case, writing code in a third-party library that handles all historic patch versions of all minor versions would be near impossible.

I would urge you to reconsider the policy of keeping compatibility with anything other than the latest patch version. One could conceive of a scenario in which a bug in one patch version is followed by another patch version with a different bug, leaving no way to conform to both versions without some obscure handling.

@awood45
Member

awood45 commented Jun 11, 2018

I'm not going to worry about reverting this in that case - it's new functionality, no regression has been introduced. But I don't want to trigger a release until we can get the right code working for older JRuby versions.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

@awood45 I did some digging when posting this PR; #1711 (comment) points to the commit that fixes the issue.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

My bad, I missed the case where there is no data left to read:

buffer = read_pipe.read(@part_size)
bytes_copied = if buffer
  temp_io.write(buffer)
  buffer.bytesize
else
  0
end
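
For anyone following along, the nil case exists because IO#read with a length argument returns nil at end of stream rather than an empty string, which is presumably what the equivalent code hit. A quick illustration outside the SDK:

r, w = IO.pipe
w.write('abc')
w.close
r.read(1024) # => "abc"
r.read(1024) # => nil at EOF, so calling #bytesize on it raises NoMethodError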

@awood45
Member

awood45 commented Jun 11, 2018

I think that's actually reasonable in this specific case, but if possible, I'd like to find a solution that works on these versions. It's not quite a non-starter, since there is no forced upgrade here (if you don't opt in to this new functionality, you don't break), but ideally we can avoid forcing end users to upgrade solely to maintain compatibility with their AWS SDK dependency.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

Perhaps, as an exercise, we should run the current stable release's tests against all historic patch versions of JRuby.

My money is on multiple releases failing to pass the tests.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

I see your point, and it is a noble effort to avoid forcing users to upgrade.

@awood45
Member

awood45 commented Jun 11, 2018

Your above posted change does seem to work, stand by.

@awood45
Member

awood45 commented Jun 11, 2018

I think this is the right approach - I don't see anything additional that IO.copy_stream was giving us other than brevity. I'll link the new PR to this PR.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

The backwards-compatible snippet above might use more memory (twice as much, come to think of it) for all versions and all Rubies.

@awood45
Member

awood45 commented Jun 11, 2018

That's an interesting issue to consider. I'm going to look into that before merging the linked PR.
