From 3c55c3fa503b200fbe3b20e8d7d8965424e71160 Mon Sep 17 00:00:00 2001 From: Philip Durbin Date: Tue, 4 Jun 2024 15:23:14 -0400 Subject: [PATCH] avoid expensive Solr join for public dvObjects in search (experimental) (#10555) * avoid expensive Solr join when guest users search (affect IP Groups) #10554 * fix copy/past error, target doc for file, not dataset #10554 * Checking a few experimental changes into the branch: Jim's soft commit fixes from 10547; A quick experiment, replacing join on public objects with a boolean publicObject_b:true for logged-in users as well (with a join added for just for their own personal documents; groups are ignored for now). #10554 * Step 3, of the performance improvement effort relying on a boolean "publicObject" flag for published documents - now for logged-in users, AND with support for groups. Group support experimental, but appears to be working. #10554 * Modified the implementation for the guest user, to support ip groups. #10554 * Removed the few autocommit-related changes previously borrowed from 10547, to keep things separate and clear, for testing etc. #10554 * Reorganized the optimized code in SearchServiceBean; combined the code block for the guest and authenticated users. #10554 * updated the release note. #10554 * Removed the warning from the ip groups guide about the effect of the new search optimization feture that was no longer true. #10554 * Updated the section of the guide describing the new Solr optimization feature flags. #10554 * Updated the performance section of the guide. #10554 * Modified IndexServiceBean to use the new feature flag, that has been separated from the flag that enables the search-side optimization; Fixed the groups sub-query for the guest user. #10554 * cosmetic #10554 * doc tweaks #10554 * no-op code cleanup, correct case of publicObject_b #10554 --------- Co-authored-by: Leonid Andreev --- .../10554-avoid-solr-join-guest.md | 5 + .../source/developers/performance.rst | 4 + .../source/installation/config.rst | 6 + .../iq/dataverse/search/IndexServiceBean.java | 10 + .../iq/dataverse/search/SearchFields.java | 9 + .../dataverse/search/SearchServiceBean.java | 172 +++++++++++++----- .../iq/dataverse/settings/FeatureFlags.java | 22 +++ 7 files changed, 181 insertions(+), 47 deletions(-) create mode 100644 doc/release-notes/10554-avoid-solr-join-guest.md diff --git a/doc/release-notes/10554-avoid-solr-join-guest.md b/doc/release-notes/10554-avoid-solr-join-guest.md new file mode 100644 index 00000000000..956c658dbed --- /dev/null +++ b/doc/release-notes/10554-avoid-solr-join-guest.md @@ -0,0 +1,5 @@ +Two experimental features flag called "add-publicobject-solr-field" and "avoid-expensive-solr-join" have been added to change how Solr documents are indexed for public objects and how Solr queries are constructed to accommodate access to restricted content (drafts, etc.). It is hoped that it will help with performance, especially on large instances and under load. + +Before the search feature flag ("avoid-expensive...") can be turned on, the indexing flag must be enabled, and a full reindex performed. Otherwise publicly available objects are NOT going to be shown in search results. + +For details see https://dataverse-guide--10555.org.readthedocs.build/en/10555/installation/config.html#feature-flags and #10555. diff --git a/doc/sphinx-guides/source/developers/performance.rst b/doc/sphinx-guides/source/developers/performance.rst index 46c152f322e..562fa330d75 100644 --- a/doc/sphinx-guides/source/developers/performance.rst +++ b/doc/sphinx-guides/source/developers/performance.rst @@ -118,6 +118,10 @@ Solr While in the past Solr performance hasn't been much of a concern, in recent years we've noticed performance problems when Harvard Dataverse is under load. Improvements were made in `PR #10050 `_, for example. +We are tracking performance problems in `#10469 `_. + +In a meeting with a Solr expert on 2024-05-10 we were advised to avoid joins as much as possible. (It was acknowledged that many Solr users make use of joins because they have to, like we do, to keep some documents private.) Toward that end we have added two feature flags called ``avoid-expensive-solr-join`` and ``add-publicobject-solr-field`` as explained under :ref:`feature-flags`. It was confirmed experimentally that performing the join on all the public objects (published collections, datasets and files), i.e., the bulk of the content in the search index, was indeed very expensive, especially on a large instance the size of the IQSS prod. archive, especially under indexing load. We confirmed that it was in fact unnecessary and were able to replace it with a boolean field directly in the indexed documents, which is achieved by the two feature flags above. However, as of writing this, this mechanism should still be considered experimental. + Datasets with Large Numbers of Files or Versions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/sphinx-guides/source/installation/config.rst b/doc/sphinx-guides/source/installation/config.rst index 907631e6236..8fb9460892b 100644 --- a/doc/sphinx-guides/source/installation/config.rst +++ b/doc/sphinx-guides/source/installation/config.rst @@ -3268,6 +3268,12 @@ please find all known feature flags below. Any of these flags can be activated u * - api-session-auth - Enables API authentication via session cookie (JSESSIONID). **Caution: Enabling this feature flag exposes the installation to CSRF risks!** We expect this feature flag to be temporary (only used by frontend developers, see `#9063 `_) and for the feature to be removed in the future. - ``Off`` + * - avoid-expensive-solr-join + - Changes the way Solr queries are constructed for public content (published Collections, Datasets and Files). It removes a very expensive Solr join on all such documents, improving overall performance, especially for large instances under heavy load. Before this feature flag is enabled, the corresponding indexing feature (see next feature flag) must be turned on and a full reindex performed (otherwise public objects are not going to be shown in search results). See :doc:`/admin/solr-search-index`. + - ``Off`` + * - add-publicobject-solr-field + - Adds an extra boolean field `PublicObject_b:true` for public content (published Collections, Datasets and Files). Once reindexed with these fields, we can rely on it to remove a very expensive Solr join on all such documents in Solr queries, significantly improving overall performance (by enabling the feature flag above, `avoid-expensive-solr-join`). These two flags are separate so that an instance can reindex their holdings before enabling the optimization in searches, thus avoiding having their public objects temporarily disappear from search results while the reindexing is in progress. + - ``Off`` **Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable ``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development `, you can set them in the `docker compose `_ file. diff --git a/src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java b/src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java index e61b93a741f..24efec4cca3 100644 --- a/src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java +++ b/src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java @@ -12,6 +12,7 @@ import edu.harvard.iq.dataverse.datavariable.VariableMetadataUtil; import edu.harvard.iq.dataverse.datavariable.VariableServiceBean; import edu.harvard.iq.dataverse.harvest.client.HarvestingClient; +import edu.harvard.iq.dataverse.settings.FeatureFlags; import edu.harvard.iq.dataverse.settings.JvmSettings; import edu.harvard.iq.dataverse.settings.SettingsServiceBean; import edu.harvard.iq.dataverse.util.FileUtil; @@ -214,6 +215,9 @@ public Future indexDataverse(Dataverse dataverse, boolean processPaths) solrInputDocument.addField(SearchFields.DATAVERSE_CATEGORY, dataverse.getIndexableCategoryName()); if (dataverse.isReleased()) { solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING); + if (FeatureFlags.ADD_PUBLICOBJECT_SOLR_FIELD.enabled()) { + solrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true); + } solrInputDocument.addField(SearchFields.RELEASE_OR_CREATE_DATE, dataverse.getPublicationDate()); } else { solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, UNPUBLISHED_STRING); @@ -878,6 +882,9 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set groups; + + if (user instanceof GuestUser) { + // Yes, GuestUser may be part of one or more groups; such as IP Groups. + groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest)); + } else { + if (!(user instanceof AuthenticatedUser)) { + logger.severe("Should never reach here. A User must be an AuthenticatedUser or a Guest"); + throw new IllegalStateException("A User must be an AuthenticatedUser or a Guest"); + } + + au = (AuthenticatedUser) user; + + // ---------------------------------------------------- + // (3) Is this a Super User? + // If so, they can see everything + // ---------------------------------------------------- + if (au.isSuperuser()) { + // Somewhat dangerous because this user (a superuser) will be able + // to see everything in Solr with no regard to permissions. But it's + // been this way since Dataverse 4.0. So relax. :) + + return dangerZoneNoSolrJoin; + } + + // ---------------------------------------------------- + // (4) User is logged in AND onlyDatatRelatedToMe == true + // Yes, give back everything -> the settings will be in + // the filterqueries given to search + // ---------------------------------------------------- + if (onlyDatatRelatedToMe == true) { + if (systemConfig.myDataDoesNotUsePermissionDocs()) { + logger.fine("old 4.2 behavior: MyData is not using Solr permission docs"); + return dangerZoneNoSolrJoin; + } else { + // fall-through + logger.fine("new post-4.2 behavior: MyData is using Solr permission docs"); + } + } + + // ---------------------------------------------------- + // (5) Work with Authenticated User who is not a Superuser + // ---------------------------------------------------- + + groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest)); + } + + if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) { + /** + * Instead of doing a super expensive join, we will rely on the + * new boolean field PublicObject:true for public objects. This field + * is indexed on the content document itself, rather than a permission + * document. An additional join will be added only for any extra, + * more restricted groups that the user may be part of. + * **Note the experimental nature of this optimization**. + */ + StringBuilder sb = new StringBuilder(); + StringBuilder sbgroups = new StringBuilder(); + + // All users, guests and authenticated, should see all the + // documents marked as publicObject_b:true, at least: + sb.append(SearchFields.PUBLIC_OBJECT + ":" + true); + + // One or more groups *may* also be available for this user. Once again, + // do note that Guest users may be part of some groups, such as + // IP groups. + + int groupCounter = 0; + + // An AuthenticatedUser should also be able to see all the content + // on which they have direct permissions: + if (au != null) { + groupCounter++; + sbgroups.append(IndexServiceBean.getGroupPerUserPrefix() + au.getId()); + } + + // In addition to the user referenced directly, we will also + // add joins on all the non-public groups that may exist for the + // user: + for (Group group : groups) { + String groupAlias = group.getAlias(); + if (groupAlias != null && !groupAlias.isEmpty() && !groupAlias.startsWith("builtIn")) { + groupCounter++; + if (groupCounter > 1) { + sbgroups.append(" OR "); + } + sbgroups.append(IndexServiceBean.getGroupPrefix() + groupAlias); + } + } + + if (groupCounter > 1) { + // If there is more than one group, the parentheses must be added: + sbgroups.insert(0, "("); + sbgroups.append(")"); + } + + if (groupCounter > 0) { + // If there are any groups for this user, an extra join must be + // added to the query, and the extra sub-query must be added to + // the combined Solr query: + sb.append(" OR {!join from=" + SearchFields.DEFINITION_POINT + " to=id v=$q1}"); + // Add the subquery to the combined Solr query: + solrQuery.setParam("q1", SearchFields.DISCOVERABLE_BY + ":" + sbgroups.toString()); + logger.info("The sub-query q1 set to " + SearchFields.DISCOVERABLE_BY + ":" + sbgroups.toString()); + } + + String ret = sb.toString(); + logger.info("Returning experimental query: " + ret); + return ret; + } + + // END OF EXPERIMENTAL OPTIMIZATION + + // Old, un-optimized way of handling permissions. + // Largely left intact, minus the lookups that have already been performed + // above. + // ---------------------------------------------------- // (1) Is this a GuestUser? - // Yes, see if GuestUser is part of any groups such as IP Groups. // ---------------------------------------------------- if (user instanceof GuestUser) { - String groupsFromProviders = ""; - Set groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest)); + StringBuilder sb = new StringBuilder(); + + String groupsFromProviders = ""; for (Group group : groups) { logger.fine("found group " + group.getIdentifier() + " with alias " + group.getAlias()); String groupAlias = group.getAlias(); @@ -1025,51 +1144,11 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ return guestWithGroups; } - // ---------------------------------------------------- - // (2) Retrieve Authenticated User - // ---------------------------------------------------- - if (!(user instanceof AuthenticatedUser)) { - logger.severe("Should never reach here. A User must be an AuthenticatedUser or a Guest"); - throw new IllegalStateException("A User must be an AuthenticatedUser or a Guest"); - } - - AuthenticatedUser au = (AuthenticatedUser) user; - - // if (addFacets) { - // // Logged in user, has publication status facet - // // - // solrQuery.addFacetField(SearchFields.PUBLICATION_STATUS); - // } - - // ---------------------------------------------------- - // (3) Is this a Super User? - // Yes, give back everything - // ---------------------------------------------------- - if (au.isSuperuser()) { - // Somewhat dangerous because this user (a superuser) will be able - // to see everything in Solr with no regard to permissions. But it's - // been this way since Dataverse 4.0. So relax. :) - - return dangerZoneNoSolrJoin; - } - - // ---------------------------------------------------- - // (4) User is logged in AND onlyDatatRelatedToMe == true - // Yes, give back everything -> the settings will be in - // the filterqueries given to search - // ---------------------------------------------------- - if (onlyDatatRelatedToMe == true) { - if (systemConfig.myDataDoesNotUsePermissionDocs()) { - logger.fine("old 4.2 behavior: MyData is not using Solr permission docs"); - return dangerZoneNoSolrJoin; - } else { - logger.fine("new post-4.2 behavior: MyData is using Solr permission docs"); - } - } - // ---------------------------------------------------- // (5) Work with Authenticated User who is not a Superuser - // ---------------------------------------------------- + // ---------------------------------------------------- + // It was already confirmed, that if the user is not GuestUser, we + // have an AuthenticatedUser au which is not null. /** * @todo all this code needs cleanup and clarification. */ @@ -1100,7 +1179,6 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ * a given "content document" (dataset version, etc) in Solr. */ String groupsFromProviders = ""; - Set groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest)); StringBuilder sb = new StringBuilder(); for (Group group : groups) { logger.fine("found group " + group.getIdentifier() + " with alias " + group.getAlias()); diff --git a/src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java b/src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java index afa5a1c986a..14a7ab86f22 100644 --- a/src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java +++ b/src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java @@ -36,6 +36,28 @@ public enum FeatureFlags { * @since Dataverse @TODO: */ API_BEARER_AUTH("api-bearer-auth"), + /** + * For published (public) objects, don't use a join when searching Solr. + * Experimental! Requires a reindex with the following feature flag enabled, + * in order to add the boolean publicObject_b:true field to all the public + * Solr documents. + * + * @apiNote Raise flag by setting + * "dataverse.feature.avoid-expensive-solr-join" + * @since Dataverse 6.3 + */ + AVOID_EXPENSIVE_SOLR_JOIN("avoid-expensive-solr-join"), + /** + * With this flag enabled, the boolean field publicObject_b:true will be + * added to all the indexed Solr documents for publicly-available collections, + * datasets and files. This flag makes it possible to rely on it in searches, + * instead of the very expensive join (the feature flag above). + * + * @apiNote Raise flag by setting + * "dataverse.feature.add-publicobject-solr-field" + * @since Dataverse 6.3 + */ + ADD_PUBLICOBJECT_SOLR_FIELD("add-publicobject-solr-field"), ; final String flag;