-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify Solr suggestions search to handle wonky case with stemming and wildcards #3990
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @rominail -- one question:
Thanks, @rominail. On further inspection, though, is there a reason you didn't just add the '.' character to the existing $forbidden array? I wonder if that would have the same desired effect... Also, regarding your question about composer.json, we do already have a dependency on the intl extension, but the composer.json has not been updated to reflect all of the PHP extension requirements. It might be beneficial to better reflect those requirements, but I don't think we need to do it here -- that might be a project for a separate PR if it's worth the effort. |
Should we then just get rid of any symbols / non-alphanumerical values? |
It depends on how Solr analysis is set up. Historically I've tried to make as few changes as possible to the input string to let Solr's native text processing do most of the work... but there are edge cases where things can go wrong, which is why we're doing some sanitization. I wouldn't want to over-process text unnecessarily because that might introduce new and different problems. I'm not even sure if replacing the |
What should I do? Any idea on how to deal with this? |
My first suggestion would be to try adding |
We found out that wildcards and stemming don't go well together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @rominail, this seems like a reasonable approach; see below for a few minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more small comment changes that I'll go ahead and apply myself...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @rominail, this looks good to me, and all tests are passing.
I'm going to leave this open for a couple of days in case anyone else has feedback, but I think it should be safe to merge!
No one has objected, so I'm merging now. Thanks again, @rominail! |
@demiankatz There are some other issues with this implementation. Take the following title, for instance: Could we keep some of these characters before sending the query to Solr ? Ultimately having a different normalizer for indexed data and queries is going to cause problems like this. And |
@damien-git, those characters are being stripped because of their potential to cause Solr syntax errors. However, the stripping code goes back many years, and it's possible that the edismax handler that we're using in most situations is more robust than whatever was used in the past. Is it possible we need to add even more fallback logic -- e.g. a switch to toggle the query sanitization -- so we have another option to try? I don't really like the idea of chaining so many potential Solr queries inside an action that needs to be very fast in order to be useful, but if it's only needed for rare edge cases, it may not be such a problem. |
@demiankatz I would like to consider removing this stripping rather than making it more complex. I thought maybe it was done to reduce the risk of injections (hence the name "forbidden"). If we want to reduce Solr syntax errors that list is not right: for instance the single quote is not a special Solr character. Also, some people might want to use the Solr syntax, for instance with boolean expressions. Most importantly, we should have consistency between autocomplete and the normal search, and currently the query filters are not consistent. |
The original intent of the stripping was to prevent situations like a book called "Book: the subtitle" from yielding a solr "no field called Book:" error, and that type of thing. The "forbidden" wasn't really about security as much as stability. As noted, this is very old code, so if you have a better vision for it, we can certainly revisit it. |
Good to know. I am not getting this kind of error with edismax. Maybe Solr queries were using q to specify fields at the time this code was written ? |
That code does go all the way back to VuFind 1.x, so it probably predates our use of Dismax, and definitely predates our use of eDismax. |
I just found another bug in |
For titles ending with dot, when the user search the whole title including the dot the suggestions are empty.
Should I add, in composer.json, the required exention
ext-intl
withinprovide
orrequire
?