Multilingual search #1

Open
ismailmayat opened this issue Feb 9, 2017 · 4 comments

@ismailmayat

James,

Not really an issue, more an observation: the multilingual content is all being stored in one index and uses the single analyser set in the Examine config. The issue with that is you cannot then apply different, language-specific analysers. So where you are doing

e.Fields[string.Format(ZoombracoConstants.SearchConstants.MergedDataFieldTemplate, languages[i].CultureInfo.Name)] = mergedDataStringBuilders[i].ToString().Trim();

You have fields for each language instance, but they are all using the standard analyser. Ideally Arabic should use the Arabic analyser, as that will give you better tokens than standard. Chinese and Korean are a different case again: they are pictorial and, as far as I am aware, each pictogram can be more than one word. The standard analyser tokenises largely on whitespace and punctuation, whereas the cn analyser does morphological analysis to create tokens. A language-specific analyser will also remove stop words and stem.

You could handle the DocumentWriting event and index a specific field using a specific analyser, but at that point you would need to get the Vorto property, figure out its language and then use the appropriate analyser for it. I'm thinking a better route would be another field attribute on the property in the POCO that lets you specify the analyser to use for a specific language instance.

Loving the project; I have sent it to the Cogworks team to have a play.

Ismail

@JimBobSquarePants
Owner

Haha! Thought you'd pick up on that... 😄 I was hoping you'd see my tweet last night when I was asking about this.

So when I store the value I have to use a specific analyser? Didn't know that.

That might explain why I couldn't get highlighted fragments working when I was returning individual analysers on search.

private static Analyzer GetAnalyserForCulture(CultureInfo culture)

(Note the culture list is incomplete there, as I didn't have "de", "fr" etc.)
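One way to make a culture lookup like that tolerant of gaps is to walk up `CultureInfo.Parent`, so that "de-DE" falls back to "de" and anything unknown falls back to a default. The sketch below is only an illustration: `CultureFallbackMap<T>` and its members are hypothetical names, not part of Zoombraco, Examine or Lucene.Net, and `T` could just as well be an `Analyzer` as the plain strings used in the usage line.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of Zoombraco, Examine or Lucene.Net.
public sealed class CultureFallbackMap<T>
{
    private readonly Dictionary<string, T> values =
        new Dictionary<string, T>(StringComparer.OrdinalIgnoreCase);

    private readonly T fallback;

    public CultureFallbackMap(T fallback)
    {
        this.fallback = fallback;
    }

    // Registers a value under a culture name such as "ar", "de" or "fr-CA".
    public void Register(string cultureName, T value)
    {
        this.values[cultureName] = value;
    }

    // Tries the exact culture first, then each parent ("de-DE" -> "de"),
    // and returns the fallback once the invariant culture (empty name) is reached.
    public T Resolve(CultureInfo culture)
    {
        for (CultureInfo current = culture;
             !string.IsNullOrEmpty(current.Name);
             current = current.Parent)
        {
            T value;
            if (this.values.TryGetValue(current.Name, out value))
            {
                return value;
            }
        }

        return this.fallback;
    }
}
```

With `"de"` registered, `Resolve(new CultureInfo("de-DE"))` returns the German entry, while a culture with no registered ancestor returns whatever was passed to the constructor.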

Do you know of any way to create and set up an index programmatically? Having to do it via configuration is such a chore, and it would be great if I could use Umbraco's API on start to check that the indexes exist and create one if not.

Cheers for raising this. I'd love to get it absolutely correct.

Any other feedback from the team, please let me know. I'm trying to build something that would let us compete with other CMSs in a "we can do this already" sense.

Cheers

James

@JimBobSquarePants
Owner

Another thought is maybe creating a new indexer that chooses the correct analyser for operations based on the culture and uses that for each field. Maybe then we can have one index but separate fields... Dunno though.

@Shazwazza

You can create an index programmatically (see the unit tests in Examine core), but that's not really going to get you much more flexibility than you have with the config setup. You can have a look, though. Also, if you set up an index via code, it won't be added to the global index collection (as far as I remember), so you won't really be able to access it via the ExamineManager APIs.

Instead, you have full control of the Lucene document being written in the DocumentWriting event. You can literally do anything to the Lucene document and fields there. You can also create a custom analyzer that uses PerFieldAnalyzerWrapper (https://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html), which may make things a little quicker/easier for you if you want to 'just' put different language elements in different fields (e.g. with culture names as prefixes).
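As a rough illustration of the PerFieldAnalyzerWrapper idea, here is a minimal sketch assuming Lucene.Net 2.9.x/3.0.x. `MultilingualAnalyzerFactory`, the `fieldTemplate` parameter and the example template `"MergedData_{0}"` are hypothetical; the real template lives in `ZoombracoConstants.SearchConstants.MergedDataFieldTemplate` and its value is not shown here. The language-specific analyzers are passed in by the caller, because the contrib analyzer constructors differ between Lucene.Net versions.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using LuceneVersion = Lucene.Net.Util.Version;

// Hypothetical factory, not taken from Zoombraco or Examine.
public static class MultilingualAnalyzerFactory
{
    // fieldTemplate: an assumed format string such as "MergedData_{0}";
    //   use whatever template the indexer actually writes fields with.
    // analyzersByCulture: culture name -> analyzer, for example
    //   "ar" -> an Arabic analyzer from the contrib package.
    public static Analyzer Create(
        string fieldTemplate,
        IDictionary<string, Analyzer> analyzersByCulture)
    {
        // Any field without an explicit entry falls back to the standard analyzer.
        var wrapper = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(LuceneVersion.LUCENE_29));

        foreach (KeyValuePair<string, Analyzer> pair in analyzersByCulture)
        {
            // One field per culture, named the same way as at indexing time.
            wrapper.AddAnalyzer(string.Format(fieldTemplate, pair.Key), pair.Value);
        }

        return wrapper;
    }
}
```

The same wrapper would need to be used both when indexing and when parsing queries, so that the tokens produced at search time match the ones stored in each field.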

The more indexes you have, the slower things will be to rebuild, so it's best to limit the number of indexes when possible. Otherwise a cold boot is going to take forever, since each index rebuild is independent of the others and has to look up all of that Umbraco data again.

@JimBobSquarePants
Owner

JimBobSquarePants commented Feb 10, 2017

Thanks for that @Shazwazza

I had a dig around and settled for inheriting the PerFieldAnalyzerWrapper and adding the fields that we can support with the contrib Analyzers. It all seems to work well.

It seems, though, that only the standard analyzer supports highlighting when using wildcards, which is a shame. I'll close this once I can confirm that is the case.
