-
Notifications
You must be signed in to change notification settings - Fork 500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
10977 Globus Upload optimization - batch file size lookups #11040
Conversation
…le size lookup method for the entire batch. #10977
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. I have one question related to whether this changes the behavior for direct uploads (and maybe shouldn't) and one about whether the checks handle 0 length files. Otherwise I think this looks fine.
One other question - is it worth addressing #11030 in this. Globus related warnings for missing Bundle entries. Unrelated but maybe small enough to just drop in?
// to S3 that go through the jsf dataset page. Or the Globus | ||
// uploads, where the file sizes are looked up in bulk on | ||
// the completion of the remote upload task. | ||
if (dataFile.getFilesize() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have 0 length files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, so that line wasn't really breaking support for 0-size files - it was only potentially forcing Dataverse to make extra lookups of the size of 0-size files unnecessarily...
newCheckSum = optionalFileParams.getCheckSum(); | ||
newCheckSumType = optionalFileParams.getCheckSumType(); | ||
} | ||
if (optionalFileParams.hasFileSize()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to make fileSize a valid param for direct uploads, etc. now too? I've been hesitant to trust the client about this (can I say my files are 10 bytes and get a lot of free storage?). Would it make sense to allow the Globus code to set this, e.g. passing a boolean through the addFiles call on #1092 that determines whether file size is read here? Not sure we need this or that this is the best way, but thought I'd raise the question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine with me. To confirm: we'll be allowing Globus to do this, internally; but not accepting these entries in the json passed to /addFiles
.
Also (somewhat unrelated) it would be fairly easy to implement a very similar mass lookup on an s3 folder inside an /addFiles
call finalizing a large batch of direct s3 uploads. If it is ever registered to be a bottleneck. I never got around to measuring just how much these individual s3 lookups cost, but it always bothered me a little bit that we make separate network/http calls for them. (they definitely do not cost anything remotely approaching what I got with globus and NESE though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
re: S3: Makes sense. It is nice to have the size lookup in the S3AccessIO class as long as it is working. (I think we have switched to having one shared AWS client, which hopefully keeps an HTTP connection open, across S3AccessIO instances, so hopefully it's not too bad. That said, it wouldn't hurt to see how long it takes when its thousands of files.)
…the /addFiles API (i.e., we don't want to trust the users of the direct s3 upload api when it comes to file sizes). #10977
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
…ndle, per #11030 (these are used by the notification page, apparently??)
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
@@ -362,13 +372,16 @@ public List<DataFile> saveAndAddFilesToDataset(DatasetVersion version, | |||
if (fileSizeLimit == null || confirmedFileSize < fileSizeLimit) { | |||
|
|||
//set file size | |||
logger.fine("Setting file size: " + confirmedFileSize); | |||
dataFile.setFilesize(confirmedFileSize); | |||
if (dataFile.getFilesize() < 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
< 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did stare at this line last night. If getFilesize() == 0
at this point, so is confirmedFileSize
, so this shouldn't really matter. But, why not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I think there's one more place to handle 0 size files, but I'll approve now.
📦 Pushed preview images as
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name. |
Tested in Internal. No issues found. Uploading Screen Recording 2024-12-02 at 4.43.15 PM.mov… |
What this PR does / why we need it:
See the linked issue.
Which issue(s) this PR closes:
Special notes for your reviewer:
The guide entry explains what was done in the PR.
My own big concern (aside from a large portion of the code in GlobusServiceBean needing an overall refactoring) is that we have very few tests for the Globus functionality, and I am at a bit of loss how to create such tests. I may go ahead and open a separate issue for that effort.
Suggestions on how to test this:
Globus functionality can be tested on dataverse-internal. There is one NESE storage volume (labeled
NESEtapeDemo
) that's identical to that in prod., except it is connected to a NESE volume specifically reserved for testing. The instructions for the prod. end users should work exactly the same way on dataverse-internal.The actual test would involve successfully uploading a large enough number of files in a single batch to trigger the new optimization. This would be 50+ files by default; alternatively, the
:GlobusBatchLookupSize
setting could be set to some lower number for this test.Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: