Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search API results payload extension #10811

Merged
merged 18 commits into from
Sep 11, 2024

Conversation

GPortas
Copy link
Contributor

@GPortas GPortas commented Aug 29, 2024

What this PR does / why we need it:

Currently, the search API does not return all the attributes needed for files and collections, required to display the file and collection item cards.

I have extended the Search API result payloads to include:

  • For the Dataverse entity: "affiliation", "parentDataverseName", "parentDataverseIdentifier" and "image_url" (optional).
  • For the DataFile entity: "releaseOrCreateDate" and "image_url" (optional).
  • For the Dataset entity: "image_url" (optional).

The schema.xml file for Solr has been updated to include a new field called dvParentAlias for the parentDataverseIdentifier. This field is now indexed in the IndexServiceBean.

The image_url field was already included in the SolrSearchResult JSON, but it wasn’t returned by the API because it was appended only after the Solr query was processed in the SearchIncludeFragment of JSF. Now, the field is set in SearchServiceBean, ensuring it is always returned by the API when an image is available.

However, in the search.rst documentation, the field was incorrectly listed as being returned by the API, even though it never was since it was only set in JSF.

I have reused the same URL generation logic from JSF. But currently, URL generation differs between datasets and dataverses&files. For datasets, an absolute URL is generated, while for files and dataverses, a Base64 URL is generated. In a future iteration, it would be beneficial to standardize the URL format so that the same type of URL is returned for all entities. I have mentioned this in the docs.

Which issue(s) this PR closes:

Special notes for your reviewer:

For the next released version, a Solr reindex will be necessary to apply the new schema.xml version. (Mentioned in the release notes)

Suggestions on how to test this:

Create dataverses, datasets and files and test the search endpoint to verify that new params are correctly retrieved.

curl "http://localhost:8080/api/search?q=*&sort=date&order=desc" -H "X-Dataverse-Key: <YOUR_API_KEY>"

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No

Is there a release notes update needed for this change?:
Yes

Additional documentation:
No

Preview docs at https://dataverse-guide--10811.org.readthedocs.build/en/10811/api/search.html

Related issues:

This comment has been minimized.

1 similar comment

This comment has been minimized.

This comment has been minimized.

1 similar comment

This comment has been minimized.

This comment has been minimized.

1 similar comment

This comment has been minimized.

@johannes-darms
Copy link
Contributor

Does this PR also resolved/includes the requested changes of #9788?

@GPortas
Copy link
Contributor Author

GPortas commented Aug 30, 2024

Does this PR also resolved/includes the requested changes of #9788?

@johannes-darms No. This PR adds new fields to the Search API result, without changing existing ones. The scopes are different.

I think what is proposed in #9788 requires further discussion, in particular for maintaining backward compatibility in the endpoint, since existing fields would be redefined / renamed.

@GPortas GPortas added Size: 3 A percentage of a sprint. 2.1 hours. SPA These changes are required for the Dataverse SPA User Role: API User Makes use of APIs GREI Re-arch Issues related to the GREI Dataverse rearchitecture labels Aug 30, 2024

This comment has been minimized.

2 similar comments

This comment has been minimized.

This comment has been minimized.

@GPortas GPortas marked this pull request as ready for review August 30, 2024 17:29

This comment has been minimized.

@johannes-darms
Copy link
Contributor

Does this PR also resolved/includes the requested changes of #9788?

@johannes-darms No. This PR adds new fields to the Search API result, without changing existing ones. The scopes are different.

Partly the issue list difference between list/search and get, but also the missing items needed to recreate the dataverse/collection cards based on the search response like logo, affiliation and description. Some of these are already implemented in this PR. Could we add the rest in this PR as well?

I think what is proposed in #9788 requires further discussion, in particular for maintaining backward compatibility in the endpoint, since existing fields would be redefined / renamed.

Only suggestion two of the problem changes a property, the rest are just additions. Phil has already proposed a solution that would work. I think this is a great PR to fix the problems listed in #9788 and #9713.

This comment has been minimized.

@GPortas
Copy link
Contributor Author

GPortas commented Sep 6, 2024

@johannes-darms

I am working on adding your requested properties. imageUrl was straightforward to expose, as you describe.

I see that the linked icon depends on the isInTree boolean flag property from the Solr result. Actually this property is not indexed; instead, it is post-calculated in the UI logic and appended to the SolrSearchResult, specifically in SearchIncludeFragment, as can be seen here: https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SearchIncludeFragment.java#L1475.

If we want to expose the flag as it is used by the JSF UI, we need to move the calculation logic out of SearchIncludeFragment, potentially to SearchServiceBean or IndexServiceBean.

What is already indexed and therefore straightforward to expose is subtreePaths, but for now I'm not sure if this information is enough to calculate the linked status, as in my tests I can't get the complete subtree starting from the root dataverse, which I believe is what we would need to determine the linked status. This field only returns the first dataverse below root. For instance, if I have this hierarchy: root > dv1 > dv2 > dv3, subtreePaths for entity dv2 returns "/<dv1_id>", instead of "/<dv1_id>/<dv2_id>" which is what I expect (the entire subtree).

I'm still looking into this, but I think determining the linked status may require exploring a solution that would go beyond the scope of this PR / issue.

@johannes-darms
Copy link
Contributor

johannes-darms commented Sep 6, 2024

I am working on adding your requested properties. imageUrl was straightforward to expose, as you describe.

Many thanks!

I see that the linked icon depends on the isInTree boolean flag property from the Solr result. Actually this property is not indexed; instead, it is post-calculated in the UI logic and appended to the SolrSearchResult, specifically in SearchIncludeFragment, as can be seen here: https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SearchIncludeFragment.java#L1475. [...]
I'm still looking into this, but I think determining the linked status may require exploring a solution that would go beyond the scope of this PR / issue.

Bummer, I totally agree with you that this refactoring should be a separated PR. Especially as there are two for loops to determine the harvested and linked status of an item, refactoring could improve the performance.

@GPortas GPortas force-pushed the 10810-search-api-payload-extensions branch from 8d048f3 to 29056fb Compare September 9, 2024 12:09
@GPortas GPortas removed their assignment Sep 9, 2024
@coveralls
Copy link

coveralls commented Sep 9, 2024

Coverage Status

coverage: 20.74% (+0.005%) from 20.735%
when pulling e517183 on 10810-search-api-payload-extensions
into 4143031 on develop.

This comment has been minimized.

@pdurbin pdurbin self-assigned this Sep 9, 2024
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks great. I'm asking some questions in this review.

@@ -151,6 +173,8 @@ https://demo.dataverse.org/api/search?q=trees
}
}

Note that the image_url field, if exists, will be returned as a regular URL for Datasets, while for Files and Dataverses, it will be returned as a Base64 URL.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels really weird. image_url is sometimes a URL (good!) and sometimes a base64 value (weird!)? Can we pick one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for keeping track in the PR. I repost the same info I sent to you on Slack.

For the SPA, using an absolute URL is more suitable than Base64, as it allows the browser to automatically cache the resources. However, I believe there might be performance-related reasons for using Base64. For reference, see: https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/search/SolrSearchResult.java#L42.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Browser caching should work for the current UI as well, so I'm doubtful that there's a big performance gain embedding base64.

doc/release-notes/10810-search-api-payload-extensions.md Outdated Show resolved Hide resolved
],
"affiliation": "Dataverse.org",
"parentDataverseName": "Root",
"parentDataverseIdentifier": "root"
},
{
"name":"Darwin's Finches",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datasets can have identifier_of_dataverse but not parentDataverseIdentifier, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. parentDataverseIdentifier is only for Dataverses.

"publicationStatuses": [
"Published"
],
"releaseOrCreateDate": "2016-05-10T12:53:39Z"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Files do not get parentDataverseIdentifier, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right! Just for Dataverses

Co-authored-by: Philip Durbin <philip_durbin@harvard.edu>

This comment has been minimized.

Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:10810-search-api-payload-extensions
ghcr.io/gdcc/configbaker:10810-search-api-payload-extensions

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't run the code but it seems reasonable and @GPortas addressed my concern about image_url by explaining in the guides that it is expected to change in #10831. Tests are passing. Approved!

@sekmiller sekmiller self-assigned this Sep 10, 2024
@sekmiller sekmiller merged commit 2792cf9 into develop Sep 11, 2024
15 checks passed
@pdurbin pdurbin added this to the 6.4 milestone Sep 11, 2024
@pdurbin pdurbin mentioned this pull request Sep 18, 2024
@GPortas GPortas added SPA.Q3.1 Collection page results of all types Original size: 3 labels Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GREI Re-arch Issues related to the GREI Dataverse rearchitecture Original size: 3 Size: 3 A percentage of a sprint. 2.1 hours. SPA.Q3.1 Collection page results of all types SPA These changes are required for the Dataverse SPA User Role: API User Makes use of APIs
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

Extend Search API object type payloads to include necessary information for collection item cards
6 participants