Modify Solr suggestions search to handle wonky case with stemming and wildcards #3990

rominail · 2024-10-07T15:20:17Z

For titles ending with a dot, when the user searches for the whole title including the dot, the suggestions are empty.
Should I add the required extension ext-intl to composer.json, under provide or require?

demiankatz

Thanks, @rominail -- one question:

module/VuFind/src/VuFind/Autocomplete/Solr.php

demiankatz · 2024-10-07T16:26:05Z

Thanks, @rominail. On further inspection, though, is there a reason you didn't just add the '.' character to the existing $forbidden array? I wonder if that would have the same desired effect...

Also, regarding your question about composer.json, we do already have a dependency on the intl extension, but the composer.json has not been updated to reflect all of the PHP extension requirements. It might be beneficial to better reflect those requirements, but I don't think we need to do it here -- that might be a project for a separate PR if it's worth the effort.

rominail · 2024-10-07T17:19:17Z

Should we then just get rid of any symbols / non-alphanumerical values?

demiankatz · 2024-10-07T18:20:54Z

> Should we then just get rid of any symbols / non-alphanumerical values?

It depends on how Solr analysis is set up. Historically I've tried to make as few changes as possible to the input string to let Solr's native text processing do most of the work... but there are edge cases where things can go wrong, which is why we're doing some sanitization. I wouldn't want to over-process text unnecessarily because that might introduce new and different problems. I'm not even sure if replacing the . character is a good idea -- we would have to do some tests on strings that have periods in the middle as well as at the end to be sure things behave as expected.

rominail · 2024-10-10T18:08:54Z

What should I do? Any idea on how to deal with this?

demiankatz · 2024-10-10T18:20:59Z

> What should I do? Any idea on how to deal with this?

My first suggestion would be to try adding '.' to the existing "forbidden" array, instead of making end-of-string-specific changes. Then do some testing and see if it solves your problem and also works appropriately for titles that contain . characters within the string. If that works, it might be the smallest/simplest solution. But if it causes weird side effects, we may need to discuss further.

rominail · 2024-10-15T19:14:15Z

We found out that wildcards and stemming don't go well together.
The solution I implemented re-runs the Solr search without the wildcard if the first search returns no results.
What is your opinion on adding a configuration setting to turn the feature on/off? Should everybody have the feature or not?
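
As a rough illustration of the fallback described above (not the actual code in Solr.php), the idea can be sketched in Python. Everything here is hypothetical: `suggest_with_fallback`, `fake_search`, and `toy_stem` are made-up names, and the toy backend only mimics the common behaviour where wildcard terms skip stemming while plain terms are stemmed.

```python
from typing import Callable, List


def suggest_with_fallback(query: str, search: Callable[[str], List[str]]) -> List[str]:
    """Try a wildcard search first; retry without the wildcard if it finds nothing."""
    query = query.strip()
    if not query:
        return []
    results = search(query + "*")
    if not results:
        # Second attempt: the plain term goes through normal analysis (stemming).
        results = search(query)
    return results


def toy_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: drop a trailing 'y'."""
    return word[:-1] if word.endswith("y") else word


def fake_search(q: str) -> List[str]:
    """Hypothetical backend: the index stores stemmed terms only."""
    index = {"dermatolog": ["Clinical dermatology"]}
    if q.endswith("*"):
        # Wildcard terms are compared as raw prefixes against the stemmed terms.
        prefix = q[:-1].lower()
        return [t for term, titles in index.items()
                if term.startswith(prefix) for t in titles]
    return index.get(toy_stem(q.lower()), [])


print(suggest_with_fallback("dermatology", fake_search))
```

In this toy setup the full word "dermatology" finds nothing as a wildcard prefix, because the indexed term was stemmed to "dermatolog", and only the second, non-wildcard attempt matches. The cost is one extra query, paid only when the first attempt comes back empty.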

demiankatz

Thanks, @rominail, this seems like a reasonable approach; see below for a few minor comments.

module/VuFind/src/VuFind/Autocomplete/Solr.php

module/VuFind/src/VuFind/Autocomplete/SolrCN.php

demiankatz

A couple more small comment changes that I'll go ahead and apply myself...

module/VuFind/src/VuFind/Autocomplete/Solr.php

module/VuFind/src/VuFind/Autocomplete/SolrCN.php

demiankatz

Thanks, @rominail, this looks good to me, and all tests are passing.

I'm going to leave this open for a couple of days in case anyone else has feedback, but I think it should be safe to merge!

demiankatz · 2024-10-25T12:03:26Z

No one has objected, so I'm merging now. Thanks again, @rominail!

damien-git · 2024-10-25T20:12:24Z

@demiankatz There are some other issues with this implementation. Take the following title, for instance:
Habif's clinical dermatology : a color guide to diagnosis and therapy / James G.H. Dinulos.
And this query:
Habif's clinical dermatology
The title will not match because of the removed "forbidden" characters. For instance, title_full is not found because ' is replaced by a space in the query, whereas stemming removes the 's in the indexed field.

Could we keep some of these characters before sending the query to Solr? Ultimately, having a different normalizer for indexed data and queries is going to cause problems like this. And ' and : are often part of titles.
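
The mismatch described above can be reproduced with a small Python sketch. It is only an approximation: `FORBIDDEN` is an assumed subset of the real list, and `index_analyze` is a toy stand-in for whatever analysis chain the Solr field actually uses.

```python
import re
from typing import List

# Assumed subset of the "forbidden" characters; the thread only indicates
# that the apostrophe (and apparently the colon) are replaced by spaces.
FORBIDDEN = ["'", ":"]


def sanitize_query(query: str) -> str:
    """Query-side normalization: replace each forbidden character with a space."""
    for ch in FORBIDDEN:
        query = query.replace(ch, " ")
    return query


def tokens(text: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()


def index_analyze(text: str) -> List[str]:
    """Toy index-side analysis: drop a trailing possessive 's from each token."""
    return [re.sub(r"'s$", "", t) for t in tokens(text)]


title = "Habif's clinical dermatology"
query_terms = tokens(sanitize_query(title))
index_terms = index_analyze(title)
unmatched = [t for t in query_terms if t not in index_terms]

print(query_terms)  # ['habif', 's', 'clinical', 'dermatology']
print(index_terms)  # ['habif', 'clinical', 'dermatology']
print(unmatched)    # ['s']
```

The stray "s" term exists only on the query side, so a search that requires every query term to match has nothing to match it against in the indexed field.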

demiankatz · 2024-10-28T12:49:11Z

@damien-git, those characters are being stripped because of their potential to cause Solr syntax errors. However, the stripping code goes back many years, and it's possible that the edismax handler that we're using in most situations is more robust than whatever was used in the past. Is it possible we need to add even more fallback logic -- e.g. a switch to toggle the query sanitization -- so we have another option to try? I don't really like the idea of chaining so many potential Solr queries inside an action that needs to be very fast in order to be useful, but if it's only needed for rare edge cases, it may not be such a problem.

damien-git · 2024-10-28T14:05:44Z

@demiankatz I would like to consider removing this stripping rather than making it more complex. I thought maybe it was done to reduce the risk of injections (hence the name "forbidden"). If we want to reduce Solr syntax errors that list is not right: for instance the single quote is not a special Solr character. Also, some people might want to use the Solr syntax, for instance with boolean expressions. Most importantly, we should have consistency between autocomplete and the normal search, and currently the query filters are not consistent.

demiankatz · 2024-10-28T14:09:14Z

The original intent of the stripping was to prevent situations like a book called "Book: the subtitle" from yielding a Solr "no field called Book:" error, and that type of thing. The "forbidden" wasn't really about security as much as stability. As noted, this is very old code, so if you have a better vision for it, we can certainly revisit it.
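
To make the "Book:" case concrete: in the standard Lucene/Solr query syntax a colon separates a field name from its value, while the single quote has no special meaning. The Python sketch below is purely illustrative and is not something proposed in this thread; `escape_lucene` is a hypothetical helper that backslash-escapes the parser's special characters instead of replacing them with spaces.

```python
# Characters with special meaning in the standard Lucene/Solr query syntax.
# '&' and '|' are escaped individually to cover the '&&' and '||' operators.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene(text: str) -> str:
    """Backslash-escape every special character so it is treated literally."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


print(escape_lucene("Book: the subtitle"))            # Book\: the subtitle
print(escape_lucene("Habif's clinical dermatology"))  # unchanged
```

With the colon escaped, "Book" is no longer read as a field name, and the apostrophe passes through untouched, so index-time analysis sees the same text on both sides.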

damien-git · 2024-10-28T16:51:04Z

Good to know. I am not getting this kind of error with edismax. Maybe Solr queries were using q to specify fields at the time this code was written?

demiankatz · 2024-10-28T16:54:00Z

> Good to know. I am not getting this kind of error with edismax. Maybe Solr queries were using q to specify fields at the time this code was written?

That code does go all the way back to VuFind 1.x, so it probably predates our use of Dismax, and definitely predates our use of eDismax.

damien-git · 2024-10-29T13:50:59Z

I just found another bug in Solr.php. For alphabrowse, pickBestMatch() uses title_sort and sometimes discards the top result as a non-exact match because it is missing a word like "The", which was removed by a 245 indicator. This only happens when there are other exact matches.
I will experiment locally with a custom version resolving the issues I mentioned and plan a VuFind PR later.

Fix Solr suggestions search

509ca95

rominail force-pushed the searchFix branch from 6750ba3 to 509ca95 Compare October 7, 2024 15:26

demiankatz requested changes Oct 7, 2024

module/VuFind/src/VuFind/Autocomplete/Solr.php (outdated)

Discussion changes

ab29f98

rominail added 2 commits October 15, 2024 15:05

Merge branch 'dev' of https://github.com/vufind-org/vufind into searc…

ce6eef4

…hFix

Run solr search without wildcard if empty with

cc17460

rominail changed the title from "Fix Solr suggestions search" to "Modify Solr suggestions search to handle wonky case with stemming and wildcards" Oct 15, 2024

Fix pipelines

f85a395

Fix pipelines

4373487

demiankatz requested changes Oct 17, 2024

module/VuFind/src/VuFind/Autocomplete/Solr.php (outdated)

module/VuFind/src/VuFind/Autocomplete/Solr.php (outdated)

rominail force-pushed the searchFix branch from fa5382a to b65a36e Compare October 18, 2024 14:09

Discussion changes

b018b5d

rominail force-pushed the searchFix branch from b65a36e to b018b5d Compare October 18, 2024 15:31

demiankatz reviewed Oct 18, 2024

module/VuFind/src/VuFind/Autocomplete/SolrCN.php

Update module/VuFind/src/VuFind/Autocomplete/SolrCN.php

27740a9

demiankatz added this to the 10.1 milestone Oct 18, 2024

demiankatz added the improvement label Oct 18, 2024

demiankatz requested changes Oct 18, 2024

module/VuFind/src/VuFind/Autocomplete/Solr.php (outdated)

module/VuFind/src/VuFind/Autocomplete/SolrCN.php (outdated)

demiankatz added 2 commits October 18, 2024 14:12

Update module/VuFind/src/VuFind/Autocomplete/Solr.php

3fe043b

Update module/VuFind/src/VuFind/Autocomplete/SolrCN.php

aae772e

demiankatz approved these changes Oct 18, 2024

demiankatz merged commit e4f564c into vufind-org:dev Oct 25, 2024
6 checks passed

demiankatz deleted the searchFix branch October 25, 2024 12:03
