Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster combined query for retrieving datasets via API #9684

Merged
merged 16 commits into from
May 24, 2024

Conversation

ErykKul
Copy link
Collaborator

@ErykKul ErykKul commented Jun 28, 2023

What this PR does / why we need it:
It reduces the number of queries needed for retrieving a dataset via API. Especially for large dataset, but even for smaller datasets when there is large traffic on the server, this makes the usage of resources more efficient.

Which issue(s) this PR closes:

Closes #9683

Observed time for retrieving a dataset with 10000 files went from 1 minute to 35 seconds with this PR. Real life applications would be less spectacular, but still useful.

@coveralls
Copy link

coveralls commented Jun 28, 2023

Coverage Status

coverage: 20.58% (-0.004%) from 20.584%
when pulling fb4eb8b on ErykKul:9683_get_dataset_api_in_single_query
into 2bf05c1 on IQSS:develop.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Jun 29, 2023

I have run the integration tests locally, and they have passed. I have no idea why the tests failed on jenkins. Can someone check it for me? Thanks!

@ErykKul
Copy link
Collaborator Author

ErykKul commented Jul 4, 2023

Tests passed, it looks OK now.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Sep 1, 2023

I have closed it: it is better to go for a real solution, this one is more a workaround. Also, it is not essential and does require merging...

@ErykKul ErykKul closed this Sep 1, 2023
@ErykKul ErykKul reopened this Sep 1, 2023
@ErykKul
Copy link
Collaborator Author

ErykKul commented Sep 1, 2023

Reopened and merged.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 20, 2023

@pdurbin
I did not investigate deeply, but it looks to me that this pull request might be still useful. Little bit unexpected since the linked issue #9763 is closed. If it is still useful, let me know and I will merge the latest develop branch to it, and resolve the conflicts. If I am missing something and it can be closed, feel free to close this without merging.

@pdurbin
Copy link
Member

pdurbin commented Oct 20, 2023

@ErykKul I'm not the strongest with databases so when I see eclipselink.left-join-fetch my inclination is to let another developer here take a look: @scolapasta @landreev or @sekmiller

@pdurbin pdurbin added Size: 3 A percentage of a sprint. 2.1 hours. Component: Code Infrastructure formerly "Feature: Code Infrastructure" Type: Feature a feature request labels Feb 28, 2024
@pdurbin pdurbin changed the title combined query for retrieving datasets with API Faster combined query for retrieving datasets via API Feb 28, 2024
@@ -127,7 +127,8 @@ public Dataset findDeep(Object pk) {
.setHint("eclipselink.left-join-fetch", "o.files.dataFileTags")
.setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas")
.setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.fileCategories")
//.setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses")
.setHint("eclipselink.left-join-fetch", "o.files.fileMetadatas.varGroups")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part of the optimization here? I would guess this isn't often accessed, so I'm not sure how much of a performance help it would be.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had seen in in the query log even when it was not used. For large datasets it does make a difference (one query less for each file).

Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks OK. I requested a couple changes and there are merge conflicts to resolve, but the overall change to get the 'deep' version of a dataset for the GET api/datasets/{id} looks fine. Once this is updated, should be ready to go.

@pdurbin
Copy link
Member

pdurbin commented Apr 9, 2024

@ErykKul heads up that there are merge conflicts.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 19, 2024

@qqmyers I merged the devlop branch into this PR and done some changes.
@jp-tosca I have seen that you are doing some fixes and tests for large datasets. Can you test this PR with your test case? Also, this PR might have impact on the SPA. The API calls that are "performance tuned" (are you seeing any difference in performance?) are in the Datasets API:

  • GET path:{id}/versions/{versionId}/files
  • GET path:{id}

I hope I did not break anything...

@jp-tosca
Copy link
Contributor

👋🏼 @ErykKul I created a large Dataset at https://beta.dataverse.org/dataset.xhtml?persistentId=doi:10.5072/FK2/F1MCGB and also published the tools on https://github.com/IQSS/dataverse-sample-data to create this so let me know if that is helpful 😄

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 22, 2024

@jp-tosca I just tested with a dataset with 10000 files, like I did for other PRs. Getting that dataset on the current develop branch took 24 seconds. With this PR it took 8 seconds, 3 times faster! I think this PR aged well (it is from 10 months ago). I will try the {id}/versions/{versionId}/files call with the new parameters (limit and offset). Is this one used by the SPA?

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 22, 2024

The files API got worse: for 10 files from a 10000 files dataset it went from 100 ms to 3 seconds. For getting all 10000 files it went from 22 seconds to 25 seconds. It looks like it adds 3 seconds to the result (I guess it is the time needed for find deep to finish?). I will remove the find deep from that method, and leave it only for getting the dataset.

@poikilotherm
Copy link
Contributor

Is it just me or would some Continuous Benchmarking be good here?

https://github.com/marketplace/actions/continuous-benchmark

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 22, 2024

It looks great! It would be very useful for some of the frequently used calls, like getting datasets. Probably out of scope for this PR. I am not sure why integration tests fail now, but I have seen on Slack that there might be a problem with Jenkins. Other than that, I think that this PR can go to QA?

@jp-tosca
Copy link
Contributor

@jp-tosca I just tested with a dataset with 10000 files, like I did for other PRs. Getting that dataset on the current develop branch took 24 seconds. With this PR it took 8 seconds, 3 times faster! I think this PR aged well (it is from 10 months ago). I will try the {id}/versions/{versionId}/files call with the new parameters (limit and offset). Is this one used by the SPA?

I would ask maybe @GPortas or @ekraffmiller to check what method is used by the SPA 😄

@ErykKul
Copy link
Collaborator Author

ErykKul commented May 2, 2024

@qqmyers Only one API call is impacted and it does work faster, regardless of from where it is called. I think that this PR can go to "ready for QA"

@ErykKul
Copy link
Collaborator Author

ErykKul commented May 2, 2024

Jenkins failed, I am trying to figure out how to check why (I remember that there was a way to see the build log).

@ErykKul
Copy link
Collaborator Author

ErykKul commented May 6, 2024

I have checked the action log, it looks like all tests passed, unit and integration tests (maybe I am missing something):
image

Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks OK. For some reason, the last run couldn't launch the AWS machine so the IT tests didn't run, so perhaps a rerun should get triggered before this passes QA.

@sekmiller sekmiller self-assigned this May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Code Infrastructure formerly "Feature: Code Infrastructure" Feature: Performance & Stability Size: 3 A percentage of a sprint. 2.1 hours. Type: Feature a feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Large amount of queries when getting a dataset with API
7 participants