Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify Solr suggestions search to handle wonky case with stemming and wildcards #3990

Merged
merged 10 commits into from
Oct 25, 2024

Conversation

rominail
Copy link
Contributor

@rominail rominail commented Oct 7, 2024

For titles ending with dot, when the user search the whole title including the dot the suggestions are empty.
Should I add, in composer.json, the required exention ext-intl within provide or require?

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @rominail -- one question:

module/VuFind/src/VuFind/Autocomplete/Solr.php Outdated Show resolved Hide resolved
@demiankatz
Copy link
Member

Thanks, @rominail. On further inspection, though, is there a reason you didn't just add the '.' character to the existing $forbidden array? I wonder if that would have the same desired effect...

Also, regarding your question about composer.json, we do already have a dependency on the intl extension, but the composer.json has not been updated to reflect all of the PHP extension requirements. It might be beneficial to better reflect those requirements, but I don't think we need to do it here -- that might be a project for a separate PR if it's worth the effort.

@rominail
Copy link
Contributor Author

rominail commented Oct 7, 2024

Should we then just get rid of any symbols / non-alphanumerical values?

@demiankatz
Copy link
Member

Should we then just get rid of any symbols / non-alphanumerical values?

It depends on how Solr analysis is set up. Historically I've tried to make as few changes as possible to the input string to let Solr's native text processing do most of the work... but there are edge cases where things can go wrong, which is why we're doing some sanitization. I wouldn't want to over-process text unnecessarily because that might introduce new and different problems. I'm not even sure if replacing the . character is a good idea -- we would have to do some tests on strings that have periods in the middle as well as at the end to be sure things behave as expected.

@rominail
Copy link
Contributor Author

What should I do? Any idea on how to deal with this?

@demiankatz
Copy link
Member

What should I do? Any idea on how to deal with this?

My first suggestion would be to try adding '.' to the existing "forbidden" array, instead of making end-of-string-specific changes. Then do some testing and see if it solves your problem and also works appropriately for titles that contain . characters within the string. If that works, it might be the smallest/simplest solution. But if it causes weird side effects, we may need to discuss further.

@rominail rominail changed the title Fix Solr suggestions search Modify Solr suggestions search to handle wonky case with stemming and wildcards Oct 15, 2024
@rominail
Copy link
Contributor Author

We found out that wildcards and stemming don't go well together.
The solution I implemented consists in re-doing the Solr search without the wildcard if the first search doesn't return results.
What is your opinion on making a configuration to turn the feature on/off? Should everybody has the feature or not?

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @rominail, this seems like a reasonable approach; see below for a few minor comments.

module/VuFind/src/VuFind/Autocomplete/Solr.php Outdated Show resolved Hide resolved
module/VuFind/src/VuFind/Autocomplete/Solr.php Outdated Show resolved Hide resolved
@demiankatz demiankatz added this to the 10.1 milestone Oct 18, 2024
Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more small comment changes that I'll go ahead and apply myself...

module/VuFind/src/VuFind/Autocomplete/Solr.php Outdated Show resolved Hide resolved
module/VuFind/src/VuFind/Autocomplete/SolrCN.php Outdated Show resolved Hide resolved
Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @rominail, this looks good to me, and all tests are passing.

I'm going to leave this open for a couple of days in case anyone else has feedback, but I think it should be safe to merge!

@demiankatz
Copy link
Member

No one has objected, so I'm merging now. Thanks again, @rominail!

@demiankatz demiankatz merged commit e4f564c into vufind-org:dev Oct 25, 2024
6 checks passed
@demiankatz demiankatz deleted the searchFix branch October 25, 2024 12:03
@damien-git
Copy link
Contributor

@demiankatz There are some other issues with this implementation. Take the following title, for instance:
Habif's clinical dermatology : a color guide to diagnosis and therapy / James G.H. Dinulos.
And this query:
Habif's clinical dermatology
The title will not match because of the removed "forbidden" characters. For instance, title_full is not found because ' is replaced by space for the query but stemming removes 's in the field.

Could we keep some of these characters before sending the query to Solr ? Ultimately having a different normalizer for indexed data and queries is going to cause problems like this. And ' and : are often part of titles.

@demiankatz
Copy link
Member

@damien-git, those characters are being stripped because of their potential to cause Solr syntax errors. However, the stripping code goes back many years, and it's possible that the edismax handler that we're using in most situations is more robust than whatever was used in the past. Is it possible we need to add even more fallback logic -- e.g. a switch to toggle the query sanitization -- so we have another option to try? I don't really like the idea of chaining so many potential Solr queries inside an action that needs to be very fast in order to be useful, but if it's only needed for rare edge cases, it may not be such a problem.

@damien-git
Copy link
Contributor

@demiankatz I would like to consider removing this stripping rather than making it more complex. I thought maybe it was done to reduce the risk of injections (hence the name "forbidden"). If we want to reduce Solr syntax errors that list is not right: for instance the single quote is not a special Solr character. Also, some people might want to use the Solr syntax, for instance with boolean expressions. Most importantly, we should have consistency between autocomplete and the normal search, and currently the query filters are not consistent.

@demiankatz
Copy link
Member

The original intent of the stripping was to prevent situations like a book called "Book: the subtitle" from yielding a solr "no field called Book:" error, and that type of thing. The "forbidden" wasn't really about security as much as stability. As noted, this is very old code, so if you have a better vision for it, we can certainly revisit it.

@damien-git
Copy link
Contributor

Good to know. I am not getting this kind of error with edismax. Maybe Solr queries were using q to specify fields at the time this code was written ?

@demiankatz
Copy link
Member

Good to know. I am not getting this kind of error with edismax. Maybe Solr queries were using q to specify fields at the time this code was written ?

That code does go all the way back to VuFind 1.x, so it probably predates our use of Dismax, and definitely predates our use of eDismax.

@damien-git
Copy link
Contributor

I just found another bug in Solr.php (for alphabrowse it uses title_sort in pickBestMatch(), and sometimes discards the top result because it is not an exact match because it is missing a word like "The" which was removed by a 245 indicator (this only happens when there are other exact matches)).
I will experiment locally with a custom version resolving the issues I mentioned and plan a VuFind PR later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants