Added S3::Object#upload_stream #1711
Conversation
A utility method for uploading chunks in a streaming manner to Amazon S3. This method automatically manages uploading large streams using the multipart upload APIs. The data never touches disk, which makes it a good fit for environments that have limited disk access. The feature was greatly inspired by the existing #upload_file method, and the style of implementation intentionally tries to match the existing style.
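A minimal usage sketch of the proposed API, based on the block/`#<<` writer interface described in this PR (the bucket and key names are hypothetical):

```ruby
require 'aws-sdk-s3'

object = Aws::S3::Resource.new(region: 'us-east-1')
                          .bucket('my-bucket')       # hypothetical bucket
                          .object('streamed-object') # hypothetical key

# Each chunk written with #<< is buffered in memory and uploaded via the
# multipart upload APIs; nothing is written to disk.
object.upload_stream do |write_stream|
  10.times { write_stream << ('a' * 1024 * 1024) } # stream ~10MB of data
end
```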
I am playing with the thought of implementing a …
The documentation should talk about memory usage: the default strategy uses 10 threads, and if they are saturated they are each uploading 5MB of data, which brings memory consumption to roughly 50MB in total. We could offer a strategy that has the same memory characteristics as …
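For reference, a back-of-the-envelope version of that worst-case memory figure (the numbers simply mirror the defaults mentioned above):

```ruby
thread_count = 10                       # default number of upload threads
part_size    = 5 * 1024 * 1024          # 5MB buffered per in-flight part
peak_memory  = thread_count * part_size # => 52_428_800 bytes, roughly 50MB
```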
The failing JRuby spec fails due to a bug in JRuby 9.1.5.0; it is, however, fixed from 9.1.15.0.
Could we get this merged in?
@Fonsan How would I reference your version in my local Gemfile?
@tmedford Funnily enough I had the same problem and went down the rabbit hole that is how Bundler searches for gemspecs. Unfortunately I found no way of using the …
@Fonsan So if I absolutely needed the stream (and tempfile/memory) behavior, I would have to deploy my own gem version and source from that until this is merged into the repo and another release is pushed? Currently I am seeing a 200MB file reach up to 600MB when using upload_file (multipart). I messed around with the threads to no avail. Also, do you have a file example for the stream I could validate against?
Thanks for the ping, will take a look shortly.
There might be a better way, but that is the suggestion I could come up with.
I would not have guessed that
An example usage of …
@cjyclaire Let me know if you have any questions :)
Currently Shrine requires all IO objects to respond to #size. This is not necessary, because storages like FileSystem and S3 allow uploading IO objects of unknown size (with some modifications). The size of the IO object can be unknown if it's a pipe, socket, or a general stream of data. We remove #size from the required IO methods. We also deprecate the Shrine::IO_METHODS constant, because, now that it's public, we cannot remove :size from it without affecting backwards compatibility. Shrine::IO_METHODS will be private in Shrine 3. We modify the S3 storage to use multipart upload in case the size is unknown. There is aws/aws-sdk-ruby#1711 which adds Aws::S3::Object#upload_stream, but it hasn't been merged at this moment, and we unfortunately don't control the aws-sdk-s3 version, so we have to support older versions anyway. We also modify the FileSystem and S3 storages to backfill the "size" metadata of the IO object in case it's not known. Fixes #252
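A rough sketch of that kind of fallback, using the low-level multipart calls on `Aws::S3::Client`; the method name, the 5MB part size, and the omission of error handling/abort are illustrative assumptions, not Shrine's actual implementation:

```ruby
require 'aws-sdk-s3'

# Illustrative only: upload an IO of unknown size by reading fixed-size
# chunks and driving the multipart upload APIs directly.
def multipart_upload(client, bucket, key, io, part_size: 5 * 1024 * 1024)
  upload = client.create_multipart_upload(bucket: bucket, key: key)
  parts = []
  part_number = 1
  while (chunk = io.read(part_size))
    part = client.upload_part(
      bucket: bucket, key: key, upload_id: upload.upload_id,
      part_number: part_number, body: chunk
    )
    parts << { etag: part.etag, part_number: part_number }
    part_number += 1
  end
  client.complete_multipart_upload(
    bucket: bucket, key: key, upload_id: upload.upload_id,
    multipart_upload: { parts: parts }
  )
end
```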
@cjyclaire can we get an update?
Sorry for the silence, let me give an update. We're looking at this as a PR we intend to merge, but we're very limited on bandwidth at this point to give a proper review. It may be a week, or a few weeks, before we can review this. I can say it's fairly high on the priority list.
@awood45 great!
@awood45 What is the next step?
As for the questions on memory usage, we've certainly had cases with our multipart support where performance tweaks were later needed. I'll take another pass with that in mind, but if it's functional and not causing regressions to existing methods, I'm more inclined to accept the PR.
Nothing is standing out as a red flag memory-usage-wise from what I can read of the implementation. Integration tests are passing for me, so I'm going to merge this and it should go out with our next SDK release!
Belay that, this is failing in JRuby (and not just the inability to install the older version). I'm looking at why now.
Looks like this line is sometimes problematic in JRuby - I'm working on isolating why: https://github.com/aws/aws-sdk-ruby/pull/1711/files#diff-739272f1b1c14844fd5d13e3e702850aR108
Furthermore, in JRuby 9.1.6.0 where I was testing, this seems to be the offending line: https://github.com/jruby/jruby/blob/9.1.6.0/core/src/main/java/org/jruby/RubyIO.java#L4083
I'm going to take some time to try and debug this - but if I can't figure out the cause of the breakage, I'll need to revert this change and create a new PR (or let you re-submit this PR).
@tmedford Did you read my earlier comment #1711 (comment)? If keeping support for older patch versions of JRuby is critical, then the equivalent without `IO.copy_stream` would be:

```ruby
buffer = read_pipe.read(@part_size)
bytes_copied = buffer.bytesize
temp_io.write(buffer)
```
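For context, the `IO.copy_stream` form this would replace is presumably something along these lines (the exact call is in the linked diff):

```ruby
# Copy up to @part_size bytes from the pipe into the buffer IO and
# return the number of bytes copied.
bytes_copied = IO.copy_stream(read_pipe, temp_io, @part_size)
```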
9.1.6.0 is 18 months old and there have been more than 10 patch releases for 9.1 since. Where can I read up on the policy you have for supporting older patch releases of Rubies?
@awood45 pinged the wrong guy above.
There's some evidence from chatter about JRuby's implementation of …
We do place importance on keeping old versions of Ruby/JRuby intact; we don't want to break customers or force upgrades wherever possible. I'll try dropping in that code equivalent. Is there a reason for using …
That equivalent code fails often with … I'm investigating, but will need to resolve that.
It does look like JRuby fixed this as of …
While I do agree on keeping compatibility for old versions of any Ruby, the limit for me really goes at the minor version (https://semver.org/). Any patch release may contain a bug for some special case, so writing code in a third-party library that handles all historic patch versions of all minor versions would be near impossible. I would urge you to reconsider the policy of keeping compatibility with anything other than the latest patch version. One could conceive a scenario in which a bug in one patch version is followed by another patch version with a different bug, which yields no way to conform to both versions without some obscure handling.
I'm not going to worry about reverting this in that case: it's new functionality, and no regression has been introduced. But I don't want to trigger a release until we can get the right code working for older JRuby versions.
@awood45 I did some digging when posting this PR, and #1711 (comment) is the commit that fixes the issue.
My bad, I missed the case of reading when there was no data left to read:

```ruby
buffer = read_pipe.read(@part_size)
bytes_copied = if buffer
  temp_io.write(buffer)
  buffer.bytesize
else
  0
end
```
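The guard matters because `IO#read` with a length argument returns `nil` (not an empty string) once the stream is exhausted; a quick illustration:

```ruby
r, w = IO.pipe
w.write('abc')
w.close
r.read(5) #=> "abc"
r.read(5) #=> nil at EOF, hence the `if buffer` branch above
```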
I think that's actually reasonable in this specific case, but if possible, I'd like to find a solution which is going to work on these versions. It's not quite a non-starter, since there is no forced upgrade here (if you don't opt in to use this new functionality, you don't break), but ideally we can avoid forcing end users to upgrade solely to maintain compatibility with their AWS SDK dependency.
Perhaps we should, as an exercise, run the current stable release tests against all historic patch versions of JRuby. My money is on multiple releases failing to pass the tests.
I see your point, and it is a noble effort to avoid forcing users to upgrade.
Your above-posted change does seem to work; stand by.
I think this is the right approach. I don't see anything additional that IO.copy_stream was giving us other than brevity. I'll link the new PR to this PR.
The backwards-compatible snippet above might use more memory (twice as much, if I think about it) for all versions and all Rubies.
That's an interesting issue to consider. I'm going to look into that before merging the linked PR.
This solves #1580 and #1351
Added a utility method for uploading a stream in a streaming manner to Amazon S3. This method automatically manages uploading large streams using the multipart upload APIs. The data never touches disk, which makes it a good fit for environments that have limited disk access. In theory it should also be faster compared to `#put_object` due to the concurrent implementation. I have left out the `#multipart_threshold` found in `#upload_file`, but it could easily be added if needed. An important aspect is that back pressure is implemented by causing `#<<` to block if the underlying upload resources are fully saturated, to avoid boundless consumption of memory. I have not yet found a good way to test this. The feature was greatly inspired by the existing `#upload_file` and `get_object` methods, and the style of implementation intentionally tries to match the existing style in the project.
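One way such back pressure could be implemented (a simplified sketch under assumed defaults, not necessarily how this PR does it) is a bounded queue between the `#<<` writer and the upload threads:

```ruby
require 'thread'

# SizedQueue#push blocks once the queue holds `max` items, so a fast
# producer is throttled instead of buffering unbounded data in memory.
queue = SizedQueue.new(10) # assumed bound of 10 buffered parts

workers = Array.new(5) do
  Thread.new do
    while (part = queue.pop)   # a trailing nil tells each worker to stop
      upload_part(part)        # hypothetical method wrapping the S3 multipart API
    end
  end
end

# Producer side, roughly what #<< would do with each chunk:
while (chunk = $stdin.read(5 * 1024 * 1024))
  queue.push(chunk)            # blocks when uploads fall behind
end
workers.size.times { queue.push(nil) }
workers.each(&:join)
```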