9725 files lookup performance #9736

landreev · 2023-07-26T14:52:11Z

What this PR does / why we need it:

A few quick refinements to the optimizations added in #9558 to address some performance issues found during the testing of the 5.14 release candidate.

Which issue(s) this PR closes:

Closes #9725

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

… "findDeep" on the requested version, instead of the full dataset. (Needs more testing). #9725

qqmyers · 2023-07-26T15:16:20Z

src/main/java/edu/harvard/iq/dataverse/DatasetVersionServiceBean.java

+            // Optimization hints: retrieve all data in one query; this prevents point queries when iterating over the files 
+            .setHint("eclipselink.left-join-fetch", "o.fileMetadatas.dataFile.ingestRequest")
+            .setHint("eclipselink.left-join-fetch", "o.fileMetadatas.dataFile.thumbnailForDataset")
+            .setHint("eclipselink.left-join-fetch", "o.fileMetadatas.dataFile.dataTables")


Any idea whether things like this, that I think only get used when a file is displayed, would be better w/o the hint? I.e. would that allow getting datatables for the ten displayed files (as separate less efficient queries) rather than getting them for 10K files. (Bad example in that this is only for tabular files, but the same question for anything related to creating the view for a file versus deciding which files should be displayed.)

This is a very good point. I just haven't gotten around to finding a dataset with a ton of tabular files to confirm that this is indeed going to result in observable savings... But I can't think of a good reason not to drop this hint, no.
(will find a test dataset now)

And yes, I get the part that it's not just the DataTables.

So, does this really mean that we don't need the hint for roleassignments either? Or rather, that the only hints we need are for the 1:1 relations on DataFile? (that would otherwise result in additional individual queries?)

Sorry to report that dropping the DataTables hint does not appear to make things noticeably better for large datasets with many tab. files; but also makes things worse for those without any tab. files - individual lookup queries are then issued on every file , on account of this in the page, maybe:

for(DataFile f : dataset.getFiles()) { if(f.isTabularData()) { hasTabular = true; break; } }

Running out of steam with this thing a little bit.

(The awful loop above can be fixed of course; it needs to be fixed. I was just demoralized a little by seeing things like that. The above happens 40 lines into the init method. There's almost certainly more of the same throughout the rest of the page. Things that were already in 5.13, that had gradually crept into the code, that were making things slow on some large datasets in 5.13 and are offsetting the gains from the 5.14 optimizations now. I wasn't mentally prepared for going this deep. But will keep looking into it.)

Ah fun - needed for the download all button I guess. This type of thing definitely looks like a great little post 5.14 spin-out issue (that the SPA will also have).

Hmm. I became convinced last night that this really needs to be addressed as part of this PR. At least some of it, the more glaring stuff. The "awful loop" above - it is awful because it goes through every file in the entire dataset; negating the benefits of trying to only initialize one version-worth of files. AND it doesn't appear to be necessary - because for all the useful purposes the page and its fragment only need to know if tab files are present in the current version. (There is a separate loop for the tab files in the version there).
Dropping the hint for DataTable for the version doesn't appear to be as straightforward through - the page in its current form needs all the DataTables in the version, both for checking if there are tab files, and for calculating the "original" sizes, if tab files are present. Can (and probably should) be replaced with custom queries, but this may be an effort that can be delayed - will see how it goes.

... the full sentence should have been "negating the benefits of trying to only initialize one version-worth of files in some large datasets with sparseley-populated versions" of course.

I should mention that dropping most of the other hints on OneToMany objects in that method appears to be a less problematic affair. Thank you for the tip. Now seems like such an obvious thing in retrospect, of course.

scolapasta

Just some small comments / questions - nothing critical. Can you also ee the question @qqmeyers asked? I'm ready to approve, but want to let @landreev have a chance to review what we wrote before that. Let me know when it's set.

scolapasta · 2023-07-26T20:10:57Z

src/main/java/edu/harvard/iq/dataverse/DatasetPage.java

+
+                // We are only performing these lookups to obtain the database id
+                // of the version that we are displaying, and then we will use it
+                // to perform a .findFind(versionId); see below. 


small typo in the comment

scolapasta · 2023-07-26T20:14:46Z

src/main/java/edu/harvard/iq/dataverse/DatasetPage.java

                if (dataset == null) {
                    logger.warning("No such dataset: "+dataset);
                    return permissionsWrapper.notFound();
                }
                //retrieveDatasetVersionResponse = datasetVersionService.retrieveDatasetVersionById(dataset.getId(), version);
                retrieveDatasetVersionResponse = datasetVersionService.selectRequestedVersion(dataset.getVersions(), version);
+                if (retrieveDatasetVersionResponse == null) {


is this necessary? i.e. we already found the dataset, can there be one without a version here? (is this some edge case I'm not immediately seeing?)

I guess I see now that it was moved up from lower down. That said see my comment lower.

We may have had a real data corruption case where a dataset failed to get deleted, and was left in the database without versions. I wasn't thinking about it honestly, just feel like it's a good idea to check for nulls. (but yes, it was a copy-and-paste from further below)

scolapasta · 2023-07-26T20:18:44Z

src/main/java/edu/harvard/iq/dataverse/DatasetPage.java

@@ -1958,7 +1980,7 @@ private String init(boolean initFull) {
                return permissionsWrapper.notAuthorized();
            }

-            if (!retrieveDatasetVersionResponse.wasRequestedVersionRetrieved()) {
+            if (retrieveDatasetVersionResponse != null && !retrieveDatasetVersionResponse.wasRequestedVersionRetrieved()) {


Is this null check necessary? I guess its being extra safe, but you do check for mull further up, no?

This one IS actually necessary. Because in this PR I finally fixed the bug where the page was not working when called by versionId=. So yes, there can be a legit case where retrieveDatasetVersionResponse is going to be null here.
(and I figured when the page gets called by the numeric id, if it doesn't exist, it should always be 404/not found, "no substitutes").

…prod. datasets. #9725

landreev · 2023-07-27T17:56:23Z

I may have made enough progress addressing the performance on the remaining "outlier" - very high file count - prod. datasets, making it significantly better than under 5.13, as to finally stop (will need to confirm in QA that the numbers I'm seeing are legit, of course).
Thanks again to @qqmyers for the tip on dropping the unnecessary relations from the optimization query.

landreev · 2023-07-28T15:03:28Z

We should probably move this into QA now, so that the RC can be properly re-tested. Testing it thoroughly, especially on large datasets takes time.
When I said earlier that the performance on

"outlier" - very high file count - prod. datasets [is now] significantly better than under 5.13,

just as a reference, for that infamous 42K files dataset the page takes almost a minute to load in prod. On the perf. test instance (which is less beefy than prod.), it takes closer to 2 min. under 5.13, and it was even worse in the previous iteration of this branch. With the current code, it loads in 9 sec. on the perf. instance and I expect it to be below 5 sec. in prod. The real meaning of this is less about the optimization in the current branch being particularly good and more about how awful it was in 5.13. We apparently missed the point at which something had been added to the page that compromised the effectiveness of the earlier optimizations measures. I was wrong to even consider accepting a further drop in performance, if only on a few large datasets. We should be grateful to @ErykKul for bringing the performance of the page to our attention.

github-actions · 2023-07-28T15:23:34Z

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:9725-files-performance

ghcr.io/gdcc/configbaker:9725-files-performance

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

landreev added 4 commits July 24, 2023 14:16

work in progress, ongoing performance experiments. #9725

5c0acbc

A potential solution? Makes the dataset page use a single query-based…

3672517

… "findDeep" on the requested version, instead of the full dataset. (Needs more testing). #9725

some refinements (#9725)

db23a03

Merge branch 'develop' into 9725-files-performance

7441272

scolapasta self-assigned this Jul 26, 2023