
Added S3::Object#upload_stream #1711

Merged
13 commits merged into aws:master on Jun 11, 2018

Conversation

Fonsan
Contributor

@Fonsan Fonsan commented Feb 11, 2018

This solves #1580 and #1351

s3 = Aws::S3::Resource.new
s3.bucket('aws-sdk').object('key').upload_stream do |write_stream|
  10.times { write_stream << 'foo' }
end

s3.bucket('aws-sdk').object('key').upload_stream do |write_stream|
  IO.copy_stream(IO.popen('ls'), write_stream)
end

s3.bucket('aws-sdk').object('key').upload_stream do |write_stream|
  IO.copy_stream(STDIN, write_stream)
end

Added a utility method for uploading a stream to Amazon S3 in a streaming manner. The method automatically manages uploading large streams using the multipart upload APIs. The data never touches disk, which makes it a good fit for environments with limited disk access. In theory it should also be faster than #put_object due to the concurrent implementation. I have left out the #multipart_threshold option found in #upload_file, but it could easily be added if needed.
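
For context, a hedged sketch of the low-level multipart calls that #upload_stream manages on the caller's behalf; the bucket and key are placeholders and error handling (aborting the upload on failure) is omitted, so this is an illustration rather than the PR's actual implementation:

require 'aws-sdk-s3'

client = Aws::S3::Client.new
mpu    = client.create_multipart_upload(bucket: 'aws-sdk', key: 'key')

# Each chunk is sent as a numbered part; every part except the last must be
# at least 5 MB, which is why parts are buffered before being uploaded.
part = client.upload_part(
  bucket: 'aws-sdk', key: 'key',
  upload_id: mpu.upload_id,
  part_number: 1,
  body: 'foo' * 10
)

client.complete_multipart_upload(
  bucket: 'aws-sdk', key: 'key',
  upload_id: mpu.upload_id,
  multipart_upload: { parts: [{ etag: part.etag, part_number: 1 }] }
)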

An important aspect is that back pressure is implemented by having #<< block when the underlying upload resources are fully saturated, to avoid unbounded memory consumption. I have not yet found a good way to test this.
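
Not the PR's actual code, but a minimal sketch of how a bounded queue gives exactly this blocking behaviour; fake_upload_part is a hypothetical stand-in for the real per-part upload call:

def fake_upload_part(part)
  sleep 0.01 # stand-in for the network call that uploads one part
end

queue   = SizedQueue.new(10) # at most 10 buffered parts; #push blocks beyond that
workers = Array.new(10) do
  Thread.new do
    while (part = queue.pop)
      fake_upload_part(part)
    end
  end
end

# Plays the role of write_stream#<<: when all workers are busy and the queue
# is full, the producer blocks here instead of buffering unbounded data.
100.times { |i| queue.push("chunk-#{i}") }
10.times { queue.push(nil) } # tell each worker to finish
workers.each(&:join)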

The feature was greatly inspired by the existing #upload_file and #get_object methods, and the implementation intentionally tries to match the existing style of the project.

A utility method for uploading chunks in a streaming manner to Amazon S3. This method automatically manages uploading large streams using the multipart upload APIs. The data never touches disk, which makes it a good fit for environments with limited disk access.

The feature was greatly inspired by the existing #upload_file method, and the implementation intentionally tries to match the existing style.
@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

I am toying with the idea of implementing an #upload_stream that yields an IO pipe, enabling the use of IO.copy_stream and friends, which offer built-in back pressure.

@Fonsan Fonsan changed the title from "Added S3::Object#upload_chunks" to "Added S3::Object#upload_stream" on Feb 11, 2018
@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

The documentation should discuss memory usage: the default strategy uses 10 threads, and when they are saturated each is uploading 5 MB of data, which brings memory consumption in #upload_stream to roughly 50 MB. I am guessing that lowering the default number of threads would cause performance losses in the 90% of use cases where 50 MB is readily available.

We could offer a strategy that has the same memory characteristics as #upload_file (a few KB by my estimates) by writing a tempfile for each part. But for most use cases, writing and reading via disk seems like a detour.

@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

The tempfile feature is now implemented as an optional boolean that forces minimal memory usage by never keeping a full part's content in memory.

A part_size option has been added to customize the size of parts. I do not have any insight into what would be a good default, so for now it is 5 MB.
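
A hedged usage sketch combining the two options described above; the option names (:tempfile and :part_size) follow this PR's description and could differ in the released API:

s3  = Aws::S3::Resource.new
obj = s3.bucket('aws-sdk').object('key')

# Stream STDIN to S3 using 10 MB parts, buffering each part in a tempfile
# instead of holding it in memory.
obj.upload_stream(tempfile: true, part_size: 10 * 1024 * 1024) do |write_stream|
  IO.copy_stream(STDIN, write_stream)
end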

@Fonsan
Contributor Author

Fonsan commented Feb 11, 2018

The failing JRuby spec is due to a bug in JRuby 9.1.5.0; it is fixed from 9.1.15.0 onward.
jruby/jruby@cd40c31

@tmedford

Could we get this merged in?

@tmedford

@Fonsan How would I reference your version in my local Gemfile?

@Fonsan
Contributor Author

Fonsan commented Feb 14, 2018

@tmedford Funnily enough, I had the same problem and went down the rabbit hole that is how Bundler searches for gemspecs. Unfortunately I found no way of using the :git directive, since Bundler does not traverse the folder structure, which it would need to do in order to find gems/aws-sdk-s3/aws-sdk-s3.gemspec. If you check out the project locally, run gem build aws-sdk-s3.gemspec, and then gem install the produced .gem inside your project's gemset, you might be able to get it to work.

@tmedford

tmedford commented Feb 14, 2018

@Fonsan So if I absolutely needed the stream and tempfile (memory) option, I would have to deploy my own gem version and source from that until this is merged into the repo and another release is pushed? Currently I am seeing a 200 MB file reach up to 600 MB of memory when using upload_file (multipart). Messed around with the threads to no avail. Also, do you have a file example for the stream that I could validate against?

@cjyclaire
Contributor

Thanks for the ping, will take a look shortly

@Fonsan
Contributor Author

Fonsan commented Feb 14, 2018

@tmedford

So if I absolutely needed the stream and tempfile (memory) option, I would have to deploy my own gem version and source from that until this is merged into the repo and another release is pushed?

There might be a better way, but that is the suggestion I could come up with.

Currently I am seeing a 200 MB file reach up to 600 MB of memory when using upload_file (multipart). Messed around with the threads to no avail.

Looking at the implementation, I would not have guessed that #upload_file would produce a large memory footprint. Given that the method has been around for a few years, it is unlikely that there is a correlation between the file size and the memory footprint. It would be very interesting if you could come up with a clean example case that uses excessive amounts of memory.

Also, do you have a file example for the stream that I could validate against?

An example usage of #upload_stream can be found in the step definitions here: https://github.com/aws/aws-sdk-ruby/pull/1711/files#diff-08088f982fdbffd39e194fe7bb233d09

@Fonsan
Contributor Author

Fonsan commented Feb 24, 2018

@cjyclaire Let me know if you have any questions :)

janko added a commit to shrinerb/shrine that referenced this pull request Mar 3, 2018
Currently Shrine requires all IO objects to respond to #size. This is
not necessary, because storages like FileSystem and S3 allow uploading
IO objects of unknown size (with some modifications). The size of the IO
object can be unknown if it's a pipe, socket, or a general stream of
data.

We remove #size from the required IO methods. We also deprecate the
Shrine::IO_METHODS constant, because now that it's public we cannot
remove :size from it without affecting backwards compatibility.
Shrine::IO_METHODS will be private in Shrine 3.

We modify the S3 storage to use multipart upload in case the size is
unknown. There is aws/aws-sdk-ruby#1711, which adds
Aws::S3::Object#upload_stream, but it hasn't been merged at the time of
writing, and we unfortunately don't control the aws-sdk-s3 version, so
we have to support older versions anyway.

We also modify FileSystem and S3 storages to backfill the "size"
metadata of the IO object in case it's not known.

Fixes #252
@tmedford

@cjyclaire can we get an update?

@awood45
Member

awood45 commented Mar 16, 2018

Sorry for the silence; let me give an update. We're looking at this as a PR we intend to merge; however, we're very limited on bandwidth at this point to give it a proper review. It may be a week, or a few weeks, before we can review this. I can say it's fairly high on the priority list.

@Fonsan
Contributor Author

Fonsan commented Mar 16, 2018

@awood45 great!

@Fonsan
Contributor Author

Fonsan commented Apr 27, 2018

@awood45 What is the next step?

@awood45
Member

awood45 commented Jun 11, 2018

As for the questions on memory usage, we've certainly had cases with our multipart support where performance tweaks were later needed. I'll take another pass with that in mind, but if it's functional and not causing regressions to existing methods, I'm more inclined to accept the PR.

@awood45
Member

awood45 commented Jun 11, 2018

Nothing is standing out as a red flag memory usage wise from what I can read of the implementation. Integration tests are passing for me, so I'm going to merge this and it should go out with our next SDK release!

@awood45 awood45 merged commit e8f3dee into aws:master Jun 11, 2018
@awood45
Member

awood45 commented Jun 11, 2018

Belay that, this is failing in JRuby (and not just the inability to install the older version). I'm looking at why now.

@awood45
Member

awood45 commented Jun 11, 2018

Looks like this line is sometimes problematic in JRuby - I'm working on isolating why: https://github.com/aws/aws-sdk-ruby/pull/1711/files#diff-739272f1b1c14844fd5d13e3e702850aR108

@awood45
Member

awood45 commented Jun 11, 2018

Furthermore, in JRuby 9.1.6.0 where I was testing, this seems to be the offending line: https://github.com/jruby/jruby/blob/9.1.6.0/core/src/main/java/org/jruby/RubyIO.java#L4083

@awood45
Member

awood45 commented Jun 11, 2018

I'm going to take some time to try and debug this - but if I can't figure out the cause of the breakage, I'll need to revert this change, and create a new PR (or let you re-submit this PR).

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

@tmedford Did you read my earlier comment? #1711 (comment)

If keeping support for older patch versions of JRuby is critical, then the equivalent of

bytes_copied = IO.copy_stream(read_pipe, temp_io, @part_size)

without the #copy_stream call would be:

buffer = read_pipe.read(@part_size)
bytes_copied = buffer.bytesize
temp_io.write(buffer)

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

9.1.6.0 is 18 months old, and there have been more than 10 patch releases for 9.1 since:
http://jruby.org/2016/11/09/jruby-9-1-6-0.html

Where can I read up on the policy you have for supporting older patch releases of Rubies?

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

@awood45 I pinged the wrong guy above.

@awood45
Member

awood45 commented Jun 11, 2018

There's some chatter about JRuby's implementation of IO.copy_stream that indicates it may be buggy.

@awood45
Member

awood45 commented Jun 11, 2018

@awood45
Member

awood45 commented Jun 11, 2018

We do place importance on keeping old versions of Ruby/JRuby working; we don't want to break customers or force upgrades whenever avoidable. I'll try dropping in that code equivalent. Is there a reason for using IO.copy_stream that we would lose by switching to the equivalent code?

@awood45
Member

awood45 commented Jun 11, 2018

That equivalent code often fails with undefined method `bytesize' for nil:NilClass.

I'm investigating, but will need to resolve that.

@awood45
Member

awood45 commented Jun 11, 2018

It does look like JRuby fixed this as of 9.1.16.0. So, we know that this is a JRuby standard library issue.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

While I do agree on keeping compatibility with old versions of any Ruby, the limit for me really lies at the minor version (https://semver.org/). Since any patch release may contain a bug for some special case, writing code in a third-party library that handles all historic patch versions of all minor versions would be near impossible.

I would urge you to reconsider the policy of keeping compatibility with anything other than the latest patch version. One could conceive of a scenario in which a bug in one patch version is followed by another patch version with a different bug, leaving no way to conform to both versions without some obscure handling.

@awood45
Member

awood45 commented Jun 11, 2018

I'm not going to worry about reverting this in that case - it's new functionality, no regression has been introduced. But I don't want to trigger a release until we can get the right code working for older JRuby versions.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

@awood45 I did some digging when posting this PR; #1711 (comment) points to the commit that fixes the issue.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

My bad, I missed the case where there is no data left to read:

buffer = read_pipe.read(@part_size)
bytes_copied = if buffer
  temp_io.write(buffer)
  buffer.bytesize
else
  0
end
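
For anyone following along, the nil case exists because IO#read with a length argument returns nil at end of stream rather than an empty string, which is presumably what the equivalent code hit. A quick illustration outside the SDK:

r, w = IO.pipe
w.write('abc')
w.close
r.read(1024) # => "abc"
r.read(1024) # => nil at EOF, so calling #bytesize on it raises NoMethodError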

@awood45
Member

awood45 commented Jun 11, 2018

I think that's actually reasonable in this specific case, but if possible, I'd like to find a solution that works on these versions. It's not quite a non-starter, since there is no forced upgrade here (if you don't opt in to this new functionality, you don't break), but ideally we can avoid forcing end users to upgrade solely to maintain compatibility with their AWS SDK dependency.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

Perhaps, as an exercise, we should run the current stable release's tests against all historic patch versions of JRuby.

My money is on multiple releases failing to pass the tests.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

I see your point, and it is a noble effort to avoid forcing users to upgrade.

@awood45
Member

awood45 commented Jun 11, 2018

Your above posted change does seem to work, stand by.

@awood45
Member

awood45 commented Jun 11, 2018

I think this is the right approach - I don't see anything additional that IO.copy_stream was giving us other than brevity. I'll link the new PR to this PR.

@Fonsan
Contributor Author

Fonsan commented Jun 11, 2018

The backwards-compatible snippet above might use more memory (twice as much, come to think of it) for all versions and all Rubies.

@awood45
Member

awood45 commented Jun 11, 2018

That's an interesting issue to consider. I'm going to look into that before merging the linked PR.
