Multilingual search #1
Haha! Thought you'd pick up on that... 😄 I was hoping you'd see my tweet last night when I was asking about this. So when I store the value I have to use a specific analyser? I didn't know that. That might explain why I couldn't get highlighted fragments working when I was applying individual analysers at search time.
(Note the culture list is incomplete there as I didn't have "de", "fr" etc.)

Do you know of any way to create and set up an index programmatically? Having to do it via configuration is such a chore, and it would be great if I could use Umbraco's API on startup to check that the indexes exist and create one if not.

Cheers for raising this. I'd love to get it absolutely correct, so any other feedback from the team, please let me know. I'm trying to build something that would let us compete with other CMSs in a "We can do this already" sense.

Cheers

James
Another thought is maybe creating a new indexer that chooses the correct analyser for operations based on the culture and uses that for each field. Maybe then we could have one index but separate fields... I dunno, though.
You can create an index programmatically, see the unit tests in Examine core, but that's not really going to get you much more flexibility than you have with the config setup - but you can have a look. Also, if you set up an index via code it won't be added to the global index collection (as far as I remember), so you won't really be able to access it via the ExamineManager APIs.

Instead, you have full control of the Lucene document being written in the DocumentWriting event. You can literally do anything to the Lucene document and its fields there.

You can also create a custom analyzer that uses PerFieldAnalyzerWrapper https://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html which may make things a little quicker/easier for you if you want to 'just' put different language elements in different fields (i.e. perhaps with prefixed culture names).

The more indexes you have, the slower things will be to rebuild, so keep that in mind; it's best to limit the number of indexes you have when possible! Otherwise doing a cold boot is going to take forever, since each index rebuild is independent of the others and each one will need to re-look-up all of that Umbraco data again.
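For reference, a minimal sketch of hooking the DocumentWriting event, assuming Umbraco 7 with the legacy Examine 0.1.x APIs; the "ExternalIndexer" alias, handler class and field names are placeholders, not anything from Zoombraco:

```csharp
// Minimal sketch, assuming Umbraco 7 with legacy Examine (0.1.x) and Lucene.Net 3.0.3.
// The index alias "ExternalIndexer" and the field names are placeholders.
using Examine;
using Examine.LuceneEngine;
using Examine.LuceneEngine.Providers;
using Lucene.Net.Documents;
using Umbraco.Core;

public class MultilingualIndexingEvents : ApplicationEventHandler
{
    protected override void ApplicationStarted(
        UmbracoApplicationBase umbracoApplication,
        ApplicationContext applicationContext)
    {
        // Grab the configured indexer and subscribe to the low-level document event.
        var indexer = (LuceneIndexer)ExamineManager.Instance.IndexProviderCollection["ExternalIndexer"];
        indexer.DocumentWriting += OnDocumentWriting;
    }

    private static void OnDocumentWriting(object sender, DocumentWritingEventArgs e)
    {
        // e.Document is the raw Lucene document about to be written; fields can be
        // added, removed or rewritten here before it hits the index.
        string value;
        if (e.Fields.TryGetValue("mergedData_ar", out value))
        {
            // e.g. add an extra analysed copy of the Arabic merged data under its own field name.
            e.Document.Add(new Field("arabicContent", value, Field.Store.NO, Field.Index.ANALYZED));
        }
    }
}
```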
Thanks for that @Shazwazza. I had a dig around and settled for inheriting PerFieldAnalyzerWrapper and adding the fields that we can support with the contrib Analyzers. It all seems to work well.

It seems, though, that only the standard analyser supports highlighting when using wildcards, which is a shame. I'll close this once I can confirm that is the case.
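In case it helps anyone else, a minimal sketch of that approach, assuming Lucene.Net 3.0.3 with the Lucene.Net.Contrib analyzers; the culture-suffixed field names are illustrative, not taken from the Zoombraco source:

```csharp
// Minimal sketch: a PerFieldAnalyzerWrapper that routes culture-suffixed fields to the
// matching contrib analyzer (Lucene.Net 3.0.3 + Lucene.Net.Contrib.Analyzers).
// The "mergedData_*" field names are assumptions, not Zoombraco's actual names.
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.AR;
using Lucene.Net.Analysis.Cn;
using Lucene.Net.Analysis.De;
using Lucene.Net.Analysis.Fr;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

public class MultilingualAnalyzer : PerFieldAnalyzerWrapper
{
    public MultilingualAnalyzer()
        : base(new StandardAnalyzer(Version.LUCENE_30))
    {
        // Route each culture-specific field to its language analyzer;
        // anything not listed falls back to the standard analyzer.
        AddAnalyzer("mergedData_ar", new ArabicAnalyzer(Version.LUCENE_30));
        AddAnalyzer("mergedData_de", new GermanAnalyzer(Version.LUCENE_30));
        AddAnalyzer("mergedData_fr", new FrenchAnalyzer(Version.LUCENE_30));
        AddAnalyzer("mergedData_zh", new ChineseAnalyzer());
    }
}
```

Using the same wrapper as the analyser at both index and query time keeps tokenisation consistent for each field.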
James,
Not really an issue, more an observation: the multilingual content is all being stored in the one index and analysed with the one analyser set in the Examine config. The issue with that is you cannot then apply different analysers that are language specific. So where you are doing
```csharp
e.Fields[string.Format(ZoombracoConstants.SearchConstants.MergedDataFieldTemplate, languages[i].CultureInfo.Name)] = mergedDataStringBuilders[i].ToString().Trim();
```
you have fields for each language instance, but they are all using the standard analyser. Ideally Arabic should be using the Arabic analyser, as that will give you better tokens than the standard one. The same goes for Chinese and Korean: those languages are pictorial and, as far as I am aware, each pictogram can represent more than one word. The standard analyser tokenises on whitespace, whereas the Chinese analyser does morphological analysis to create tokens. A language-specific analyser will also remove stop words and stem.
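To make that concrete, a small sketch (assuming Lucene.Net 3.0.3 plus the contrib ArabicAnalyzer) that dumps the tokens each analyser produces for the same input; the sample text and field name are arbitrary:

```csharp
// Small sketch, assuming Lucene.Net 3.0.3 plus Lucene.Net.Contrib.Analyzers.
// Prints the tokens each analyser produces so their output can be compared directly.
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.AR;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

public static class AnalyzerComparison
{
    public static void PrintTokens(Analyzer analyzer, string text)
    {
        // "content" is just a placeholder field name used for per-field routing.
        TokenStream stream = analyzer.TokenStream("content", new StringReader(text));
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();
        while (stream.IncrementToken())
        {
            Console.WriteLine(term.Term);
        }
    }

    public static void Main()
    {
        string sample = "any Arabic sample text here";

        // StandardAnalyzer only splits on whitespace/punctuation and lower-cases;
        // ArabicAnalyzer additionally removes Arabic stop words and applies light stemming,
        // so the two token lists will differ for real Arabic input.
        PrintTokens(new StandardAnalyzer(Version.LUCENE_30), sample);
        PrintTokens(new ArabicAnalyzer(Version.LUCENE_30), sample);
    }
}
```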
You could implement DocumentWriting and then index a specific field using a specific analyser, but at that point you would need to get the Vorto property, figure out its language, and then use the appropriate analyser for it. I'm thinking a better route would be another field attribute on the poco property that allows you to specify the analyser to use for a specific language instance, something along the lines of the sketch below.
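A purely hypothetical sketch; the attribute and class names are made up, not taken from Zoombraco:

```csharp
// Hypothetical sketch only - the attribute and poco names are invented for illustration.
using System;

[AttributeUsage(AttributeTargets.Property, AllowMultiple = true)]
public class LanguageAnalyzerAttribute : Attribute
{
    public LanguageAnalyzerAttribute(string culture, Type analyzerType)
    {
        Culture = culture;
        AnalyzerType = analyzerType;
    }

    // Culture code of the Vorto language instance this mapping applies to, e.g. "ar".
    public string Culture { get; private set; }

    // Lucene analyzer type to use for that language instance.
    public Type AnalyzerType { get; private set; }
}

// Example usage on a Vorto-backed poco property:
public class NewsPage
{
    [LanguageAnalyzer("ar", typeof(Lucene.Net.Analysis.AR.ArabicAnalyzer))]
    [LanguageAnalyzer("de", typeof(Lucene.Net.Analysis.De.GermanAnalyzer))]
    public string BodyText { get; set; }
}
```

The indexer could then read these attributes at startup and build a PerFieldAnalyzerWrapper from them.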
Loving the project; I have sent it to the Cogworks team to have a play.
Ismail