Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

10977 Globus Upload optimization - batch file size lookups #11040

Merged
merged 12 commits into from
Dec 2, 2024

Conversation

landreev
Copy link
Contributor

@landreev landreev commented Nov 20, 2024

What this PR does / why we need it:

See the linked issue.

Which issue(s) this PR closes:

Special notes for your reviewer:

The guide entry explains what was done in the PR.
My own big concern (aside from a large portion of the code in GlobusServiceBean needing an overall refactoring) is that we have very few tests for the Globus functionality, and I am at a bit of loss how to create such tests. I may go ahead and open a separate issue for that effort.

Suggestions on how to test this:

Globus functionality can be tested on dataverse-internal. There is one NESE storage volume (labeled NESEtapeDemo) that's identical to that in prod., except it is connected to a NESE volume specifically reserved for testing. The instructions for the prod. end users should work exactly the same way on dataverse-internal.

The actual test would involve successfully uploading a large enough number of files in a single batch to trigger the new optimization. This would be 50+ files by default; alternatively, the :GlobusBatchLookupSize setting could be set to some lower number for this test.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

This comment has been minimized.

@landreev landreev marked this pull request as ready for review November 21, 2024 13:35
@landreev landreev added this to the 6.5 milestone Nov 21, 2024
@cmbz cmbz added FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) GREI 5 Use Cases labels Nov 21, 2024
Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. I have one question related to whether this changes the behavior for direct uploads (and maybe shouldn't) and one about whether the checks handle 0 length files. Otherwise I think this looks fine.

One other question - is it worth addressing #11030 in this. Globus related warnings for missing Bundle entries. Unrelated but maybe small enough to just drop in?

// to S3 that go through the jsf dataset page. Or the Globus
// uploads, where the file sizes are looked up in bulk on
// the completion of the remote upload task.
if (dataFile.getFilesize() > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have 0 length files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so that line wasn't really breaking support for 0-size files - it was only potentially forcing Dataverse to make extra lookups of the size of 0-size files unnecessarily...

newCheckSum = optionalFileParams.getCheckSum();
newCheckSumType = optionalFileParams.getCheckSumType();
}
if (optionalFileParams.hasFileSize()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to make fileSize a valid param for direct uploads, etc. now too? I've been hesitant to trust the client about this (can I say my files are 10 bytes and get a lot of free storage?). Would it make sense to allow the Globus code to set this, e.g. passing a boolean through the addFiles call on #1092 that determines whether file size is read here? Not sure we need this or that this is the best way, but thought I'd raise the question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me. To confirm: we'll be allowing Globus to do this, internally; but not accepting these entries in the json passed to /addFiles.

Also (somewhat unrelated) it would be fairly easy to implement a very similar mass lookup on an s3 folder inside an /addFiles call finalizing a large batch of direct s3 uploads. If it is ever registered to be a bottleneck. I never got around to measuring just how much these individual s3 lookups cost, but it always bothered me a little bit that we make separate network/http calls for them. (they definitely do not cost anything remotely approaching what I got with globus and NESE though)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

re: S3: Makes sense. It is nice to have the size lookup in the S3AccessIO class as long as it is working. (I think we have switched to having one shared AWS client, which hopefully keeps an HTTP connection open, across S3AccessIO instances, so hopefully it's not too bad. That said, it wouldn't hurt to see how long it takes when its thousands of files.)

…the /addFiles API (i.e., we don't want to trust the users of the direct s3 upload api when it comes to file sizes). #10977

This comment has been minimized.

1 similar comment

This comment has been minimized.

…ndle, per #11030

(these are used by the notification page, apparently??)

This comment has been minimized.

This comment has been minimized.

@@ -362,13 +372,16 @@ public List<DataFile> saveAndAddFilesToDataset(DatasetVersion version,
if (fileSizeLimit == null || confirmedFileSize < fileSizeLimit) {

//set file size
logger.fine("Setting file size: " + confirmedFileSize);
dataFile.setFilesize(confirmedFileSize);
if (dataFile.getFilesize() < 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

< 0 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did stare at this line last night. If getFilesize() == 0 at this point, so is confirmedFileSize, so this shouldn't really matter. But, why not.

Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think there's one more place to handle 0 size files, but I'll approve now.

@qqmyers qqmyers removed their assignment Nov 26, 2024
Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10977-globus-filesize-lookup
ghcr.io/gdcc/configbaker:10977-globus-filesize-lookup

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@ofahimIQSS ofahimIQSS self-assigned this Nov 27, 2024
@ofahimIQSS
Copy link
Contributor

Tested in Internal. No issues found.

Uploading Screen Recording 2024-12-02 at 4.43.15 PM.mov…

@ofahimIQSS ofahimIQSS merged commit f5f5cd2 into develop Dec 2, 2024
22 checks passed
@ofahimIQSS ofahimIQSS deleted the 10977-globus-filesize-lookup branch December 2, 2024 21:50
@ofahimIQSS ofahimIQSS removed their assignment Dec 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) FY25 Sprint 11 FY25 Sprint 11 (2024-11-20 - 2024-12-04) GREI 5 Use Cases
Projects
None yet
4 participants