Faster combined query for retrieving datasets via API #9684

ErykKul · 2023-06-28T08:41:12Z

What this PR does / why we need it:
It reduces the number of queries needed for retrieving a dataset via API. Especially for large datasets, but even for smaller datasets when there is heavy traffic on the server, this makes resource usage more efficient.

Which issue(s) this PR closes:

Closes #9683

Observed time for retrieving a dataset with 10000 files went from 1 minute to 35 seconds with this PR. Real life applications would be less spectacular, but still useful.

coveralls · 2023-06-28T08:43:52Z

coverage: 20.58% (-0.004%) from 20.584%
when pulling fb4eb8b on ErykKul:9683_get_dataset_api_in_single_query
into 2bf05c1 on IQSS:develop.

ErykKul · 2023-06-29T10:44:34Z

I have run the integration tests locally, and they passed. I have no idea why the tests failed on Jenkins. Can someone check it for me? Thanks!

ErykKul · 2023-07-04T13:18:32Z

Tests passed, it looks OK now.

ErykKul · 2023-09-01T07:55:36Z

I have closed it: it is better to go for a real solution; this one is more of a workaround. Also, it is not essential and does require merging...

ErykKul · 2023-09-01T14:59:04Z

Reopened and merged.

ErykKul · 2023-10-20T16:43:56Z

@pdurbin
I did not investigate deeply, but it looks to me like this pull request might still be useful. That is a little unexpected, since the linked issue #9763 is closed. If it is still useful, let me know and I will merge the latest develop branch into it and resolve the conflicts. If I am missing something and it can be closed, feel free to close this without merging.

pdurbin · 2023-10-20T18:36:57Z

@ErykKul I'm not the strongest with databases so when I see eclipselink.left-join-fetch my inclination is to let another developer here take a look: @scolapasta @landreev or @sekmiller

src/main/java/edu/harvard/iq/dataverse/api/AbstractApiBean.java

qqmyers · 2024-04-04T14:40:41Z

src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java

@@ -127,7 +127,8 @@ public Dataset findDeep(Object pk) {
            .setHint("eclipselink.left-join-fetch", "o.files.dataFileTags")
            .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas")
            .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.fileCategories")
-            //.setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses")
+            .setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.varGroups")


Is this part of the optimization here? I would guess this isn't often accessed, so I'm not sure how much of a performance help it would be.

ErykKul · 2024-04-17

I had seen it in the query log even when it was not used. For large datasets it does make a difference (one query fewer for each file).
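The "one query per file" effect described here can be illustrated without a database. The following is a minimal sketch, not Dataverse or EclipseLink code: the class `JoinFetchSketch` and its methods are hypothetical, and the "queries" are only counted, never executed. It contrasts lazily loading a per-file relation with fetching it in one combined, join-fetch-style query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation: counts round trips only, no real database or JPA involved.
class JoinFetchSketch {
    // fileId -> variable group names (stand-in for a per-file relation such as varGroups)
    static final Map<Integer, List<String>> VAR_GROUPS = new HashMap<>();

    static void seed(int files) {
        VAR_GROUPS.clear();
        for (int i = 0; i < files; i++) {
            VAR_GROUPS.put(i, List.of("group-" + i));
        }
    }

    // Lazy style: one query for the file list, then one more per file for the relation.
    static int lazyLoad() {
        int queries = 1; // simulated: SELECT the files of the dataset
        List<Integer> fileIds = new ArrayList<>(VAR_GROUPS.keySet());
        for (Integer id : fileIds) {
            VAR_GROUPS.get(id); // simulated: SELECT the relation for this one file
            queries++;
        }
        return queries;
    }

    // Join-fetch style: files and relation arrive together in one combined query.
    static int joinFetch() {
        int queries = 1; // simulated: SELECT files LEFT JOIN relation
        Map<Integer, List<String>> loaded = new HashMap<>(VAR_GROUPS);
        return loaded.size() == VAR_GROUPS.size() ? queries : -1;
    }

    public static void main(String[] args) {
        seed(10000);
        System.out.println("lazy loading: " + lazyLoad() + " queries");
        System.out.println("join fetch:   " + joinFetch() + " queries");
    }
}
```

With 10000 simulated files, the lazy path counts 10001 queries and the combined path counts 1. Real query counts depend on how the persistence provider resolves each relation, so this only shows the shape of the saving.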

qqmyers

Overall, looks OK. I requested a couple changes and there are merge conflicts to resolve, but the overall change to get the 'deep' version of a dataset for the GET api/datasets/{id} looks fine. Once this is updated, should be ready to go.

pdurbin · 2024-04-09T18:01:14Z

@ErykKul heads up that there are merge conflicts.

ErykKul · 2024-04-19T14:59:19Z

@qqmyers I merged the develop branch into this PR and made some changes.
@jp-tosca I have seen that you are doing some fixes and tests for large datasets. Can you test this PR with your test case? Also, this PR might have an impact on the SPA. The API calls that are "performance tuned" (are you seeing any difference in performance?) are in the Datasets API:

GET path:{id}/versions/{versionId}/files
GET path:{id}

I hope I did not break anything...

jp-tosca · 2024-04-19T15:10:57Z

👋🏼 @ErykKul I created a large Dataset at https://beta.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/F1MCGB and also published the tools on https://github.com/IQSS/dataverse-sample-data to create this so let me know if that is helpful 😄

ErykKul · 2024-04-22T11:32:55Z

@jp-tosca I just tested with a dataset with 10000 files, like I did for other PRs. Getting that dataset on the current develop branch took 24 seconds. With this PR it took 8 seconds, 3 times faster! I think this PR aged well (it is from 10 months ago). I will try the {id}/versions/{versionId}/files call with the new parameters (limit and offset). Is this one used by the SPA?

ErykKul · 2024-04-22T12:00:01Z

The files API got worse: for 10 files from a 10000-file dataset it went from 100 ms to 3 seconds. For getting all 10000 files it went from 22 seconds to 25 seconds. It looks like it adds about 3 seconds to the result (I guess that is the time findDeep needs to finish?). I will remove findDeep from that method and leave it only for getting the dataset.
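The arithmetic behind that observation can be written out. This is a small sketch using only the timings reported in this comment; the class and method names are hypothetical. It shows that the added cost is roughly constant (about 3 seconds), which is why it dominates a small paginated request and barely registers on a full listing.

```java
// Hypothetical helper: works only with the timings quoted in this thread.
class FindDeepOverhead {
    static final long PAGE_BEFORE_MS = 100;    // 10 files, before the change
    static final long PAGE_AFTER_MS = 3_000;   // 10 files, with the deep fetch
    static final long ALL_BEFORE_MS = 22_000;  // 10000 files, before the change
    static final long ALL_AFTER_MS = 25_000;   // 10000 files, with the deep fetch

    // Absolute time added by the change, in milliseconds.
    static long overhead(long beforeMs, long afterMs) {
        return afterMs - beforeMs;
    }

    // Relative slowdown factor (after / before).
    static double slowdown(long beforeMs, long afterMs) {
        return (double) afterMs / beforeMs;
    }

    public static void main(String[] args) {
        System.out.println("page of 10 files: +" + overhead(PAGE_BEFORE_MS, PAGE_AFTER_MS)
                + " ms, x" + slowdown(PAGE_BEFORE_MS, PAGE_AFTER_MS));
        System.out.println("all 10000 files:  +" + overhead(ALL_BEFORE_MS, ALL_AFTER_MS)
                + " ms, x" + slowdown(ALL_BEFORE_MS, ALL_AFTER_MS));
    }
}
```

The overhead comes out at 2900 ms for the 10-file page and 3000 ms for the full listing: a 30-fold slowdown in the first case and under 1.2-fold in the second.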

poikilotherm · 2024-04-22T12:50:57Z

Is it just me or would some Continuous Benchmarking be good here?

https://github.com/marketplace/actions/continuous-benchmark

ErykKul · 2024-04-22T13:53:43Z

It looks great! It would be very useful for some of the frequently used calls, like getting datasets. Probably out of scope for this PR. I am not sure why integration tests fail now, but I have seen on Slack that there might be a problem with Jenkins. Other than that, I think that this PR can go to QA?
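As a rough idea of what such a check could measure, here is a minimal timing sketch. It is not the linked GitHub Action and not part of this PR; `TimingSketch`, the placeholder workload, and the budget value are all assumptions made for illustration. A real benchmark would replace the placeholder with the API call under test.

```java
import java.util.Arrays;

// Hypothetical timing helper for illustration only.
class TimingSketch {
    // Median of the samples; returns the upper middle element for even counts.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    // Runs the task `runs` times and returns the median wall-clock time in milliseconds.
    static long medianMillis(Runnable task, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return median(samples);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real check would issue the API call under test.
        Runnable fakeCall = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum < 0) throw new IllegalStateException("unreachable");
        };
        long budgetMs = 8_000; // hypothetical budget, borrowed from the 8 s figure above
        long observed = medianMillis(fakeCall, 5);
        System.out.println(observed <= budgetMs ? "within budget" : "regression");
    }
}
```

Using the median rather than a single run reduces the effect of one-off pauses, though a sketch like this still ignores warm-up and server-side caching.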

jp-tosca · 2024-04-22T13:57:02Z

> @jp-tosca I just tested with a dataset with 10000 files, like I did for other PRs. Getting that dataset on the current develop branch took 24 seconds. With this PR it took 8 seconds, 3 times faster! I think this PR aged well (it is from 10 months ago). I will try the {id}/versions/{versionId}/files call with the new parameters (limit and offset). Is this one used by the SPA?

I would ask maybe @GPortas or @ekraffmiller to check what method is used by the SPA 😄

ErykKul · 2024-05-02T14:08:52Z

@qqmyers Only one API call is impacted and it does work faster, regardless of where it is called from. I think that this PR can go to "ready for QA".

ErykKul · 2024-05-02T15:06:26Z

Jenkins failed, I am trying to figure out how to check why (I remember that there was a way to see the build log).

ErykKul · 2024-05-06T12:03:56Z

I have checked the action log, it looks like all tests passed, unit and integration tests (maybe I am missing something):

qqmyers

This looks OK. For some reason, the last run couldn't launch the AWS machine so the IT tests didn't run, so perhaps a rerun should get triggered before this passes QA.

combined query for retrieving datasets with API

2968ba2

ErykKul and others added 3 commits June 28, 2023 11:33

Merge branch 'IQSS:develop' into 9683_get_dataset_api_in_single_query

c8a1352

Merge branch 'IQSS:develop' into 9683_get_dataset_api_in_single_query

cdc3e3d

better error handling - should fix failed integration test

6083ead

find deep in file listing

8e75241

ErykKul closed this Sep 1, 2023

ErykKul reopened this Sep 1, 2023

Merge branch 'develop' into 9683_get_dataset_api_in_single_query

ca5b28c

ErykKul mentioned this pull request Sep 1, 2023

Performance: Slow response for the versions API call with large number of files or versions #9763

Closed

ErykKul added 2 commits September 6, 2023 10:51

Merge branch 'IQSS:develop' into 9683_get_dataset_api_in_single_query

2f98f23

Merge branch 'IQSS:develop' into 9683_get_dataset_api_in_single_query

53d0d85

pdurbin added the Feature: Performance & Stability label Oct 12, 2023

pdurbin added the labels "Size: 3" (a percentage of a sprint; 2.1 hours), "Component: Code Infrastructure" (formerly "Feature: Code Infrastructure"), and "Type: Feature" (a feature request) Feb 28, 2024

pdurbin changed the title ~~combined query for retrieving datasets with API~~ Faster combined query for retrieving datasets via API Feb 28, 2024

qqmyers reviewed Apr 4, 2024

src/main/java/edu/harvard/iq/dataverse/api/AbstractApiBean.java

qqmyers reviewed Apr 4, 2024

src/main/java/edu/harvard/iq/dataverse/api/AbstractApiBean.java (outdated)

qqmyers reviewed Apr 4, 2024

qqmyers requested changes Apr 4, 2024

cmbz assigned ErykKul Apr 10, 2024

merged develop branch

9599f44

ErykKul added 2 commits April 17, 2024 16:16

fixed error type: badRequest -> notFound

773e2e2

if global Id not found, try alt global Id

6d0a1ba

Merge branch 'IQSS:develop' into 9683_get_dataset_api_in_single_query

f558e33

find deep disabled for the files API

60c53af

Merge branch 'IQSS:develop' into 9683_get_dataset_api_in_single_query

1eea73c

removed newline

3b97add

qqmyers approved these changes May 6, 2024

qqmyers unassigned ErykKul May 6, 2024

sekmiller self-assigned this May 23, 2024

Merge branch 'IQSS:develop' into 9683_get_dataset_api_in_single_query

fb4eb8b

sekmiller merged commit 1f9a682 into IQSS:develop May 24, 2024
10 of 11 checks passed

pdurbin added this to the 6.3 milestone May 28, 2024

This was referenced May 30, 2024

HarvestingServerIT.testSingleRecordOaiSet failing after PR 9684 was merged #10599

Closed

Revert "Faster combined query for retrieving datasets via API" #10600

Closed

catch exceptions from DatasetServiceBean.findDeep #10601

Merged
