Sending streams to s3manager uses gobs of ram #2591
Thanks for reaching out to us @erikh with the detailed information. The S3 Upload Manager does use significantly more memory than it should for streaming uploads. Refactoring the S3 Upload Manager is in our backlog to improve its performance. Simplifying the differences between single and multipart uploads, and between streaming and file uploads, will help us improve the performance of the uploader. An additional improvement that can be made is to not compute the SHA of the object's payload at all, since that is not required for HTTPS uploads. This does mean that the ContentMD5 would also not be set, which may need to be considered. We're looking at other techniques for ensuring data-integrity metadata is available with the uploaded object, such as trailing checksums, which can be computed inline. Related to #2036
Sweet! Thanks for the detailed response!
I will follow the other ticket.
We recently introduced an improvement to the Upload manager that removes the usage of …
@skmcgrail Could you elaborate on why the part pool throws away all buffers between uploads? This is causing my service to allocate like crazy when doing lots of streaming uploads, even though I'm re-using the same … The pooling would be a lot more useful if it were shared between calls to Upload. I could make a PR to address this if that would help? EDIT: alternatively, making the part pool pluggable would be really helpful as well, because then I could at least supply my own implementation.
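For illustration, here is a minimal sketch of the kind of shared, reusable part pool being suggested. The type and constructor are hypothetical, not part of the SDK's API:

```go
package main

import "sync"

// sharedPartPool is a hypothetical buffer pool that could be shared across
// Upload calls, so part buffers survive between uploads instead of being
// discarded when each upload finishes.
type sharedPartPool struct {
	partSize int64
	pool     sync.Pool
}

func newSharedPartPool(partSize int64) *sharedPartPool {
	p := &sharedPartPool{partSize: partSize}
	p.pool.New = func() interface{} {
		// Allocated once, then recycled across all uploads sharing the pool.
		return make([]byte, partSize)
	}
	return p
}

func (p *sharedPartPool) Get() []byte  { return p.pool.Get().([]byte) }
func (p *sharedPartPool) Put(b []byte) { p.pool.Put(b) }

func main() {
	pool := newSharedPartPool(5 * 1024 * 1024)
	buf := pool.Get() // served from the pool on subsequent calls
	pool.Put(buf)     // returned for the next upload, not thrown away
}
```

"Pluggable" would then just mean letting callers hand the Uploader something with this Get/Put shape instead of the per-upload pool it builds today.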
We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue.
Version of AWS SDK for Go?
Master, verified in code (see links).
Version of Go (go version)?
1.12.4
What issue did you see?
Large memory usage when sending a stream (not a physical file; I will explain with links in a bit) with a multi-gigabyte part size to S3 via s3manager. The RAM usage is roughly 3x the part size per concurrent upload.
The problem seems to be that your multipart upload implementation depends on io.ReadSeeker so that it can SHA/MD5 the content after the fact. This is further compounded by the use of SectionReaders (which require ReadSeekers). Instead, I think using io.CopyN for the sections in the streaming case, combined with a hash.Hash reader wrapper that computes the sums inline, would be much more space-efficient; the sync.Pool buffer implementation could be ditched. From my progress-meter measurements it would be much more time-efficient too: the whole upload stalls while the buffer fills, which gives me misleading data about the transfer rate since the bytes are just being directed to RAM, with no option to measure the bandwidth usage myself inline.
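As a rough illustration of that suggestion (not the SDK's actual code; readPart is a hypothetical helper and the part size is arbitrary), the hashes can ride along with io.CopyN like this:

```go
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"fmt"
	"io"
	"io/ioutil"
	"strings"
)

// readPart streams up to partSize bytes from src into dst while computing the
// part's MD5 and SHA-256 inline: no part-sized buffer, no Seek required.
func readPart(dst io.Writer, src io.Reader, partSize int64) (n int64, md5Sum, shaSum []byte, err error) {
	m := md5.New()
	s := sha256.New()
	n, err = io.CopyN(io.MultiWriter(dst, m, s), src, partSize)
	if err == io.EOF {
		err = nil // short final part, not an error
	}
	return n, m.Sum(nil), s.Sum(nil), err
}

func main() {
	src := strings.NewReader("pretend this is a large, non-seekable stream")
	for {
		n, md5Sum, shaSum, err := readPart(ioutil.Discard, src, 16)
		if err != nil {
			panic(err)
		}
		if n == 0 {
			break
		}
		fmt.Printf("part: %2d bytes md5=%x sha256=%x\n", n, md5Sum, shaSum)
	}
}
```

The catch, as the maintainer's comment above notes, is that ContentMD5 has to be known before a request is sent, so sums computed while the body streams out only help with something like trailing checksums.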
I don't have time to do this; I'm just wiring up backups for a large data set. I just thought you should be aware if you weren't, and be willing to consider an alternative to your existing implementation to make the library more efficient. It's not really a complaint, but needing ~18GB of RAM to upload 500GB at 1Gbps, with parts sized large enough to fit a 500GB object within S3's part-count limit, is pretty upsetting.
Refs:
buffer pool initialized: https://github.com/aws/aws-sdk-go/blob/master/service/s3/s3manager/upload.go#L392-L394
stream (no seek capability) forces a buffer fill here: https://github.com/aws/aws-sdk-go/blob/master/service/s3/s3manager/upload.go#L451
this is where the seek implementation is actually required, and probably the only spot: https://github.com/aws/aws-sdk-go/blob/master/service/s3/body_hash.go#L122
Steps to reproduce
https://github.com/erikh/snowplow exhibits this behavior.
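For context, the pattern that triggers this boils down to handing the uploader a non-seekable Body, along the lines of this sketch (the bucket, key, part size, and data source are placeholders, not taken from the repo above):

```go
package main

import (
	"crypto/rand"
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())

	// An io.Pipe has no Seek, which forces the uploader down the buffered
	// streaming path described above.
	pr, pw := io.Pipe()
	go func() {
		defer pw.Close()
		// Stand-in for the real data source, e.g. a tar stream of backups.
		io.Copy(pw, io.LimitReader(rand.Reader, 8<<30))
	}()

	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 1 << 30 // multi-gigabyte parts, as described above
		u.Concurrency = 3
	})

	_, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("example-bucket"), // placeholder
		Key:    aws.String("example-key"),    // placeholder
		Body:   pr,                           // non-seekable stream
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

With a multi-gigabyte PartSize and a Concurrency of 3, the roughly 3x-part-size-per-worker behavior reported above translates directly into many gigabytes of resident memory.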