Decompound words #42
Comments
Hmm, the plugin could be overkill, as we only need decompounding for a small subset of the German language. Decompounding could also hurt relevance if we split 'baerwaldstraße' into "baer", "wald" and "straße". So we should just use the normal decompounding support from elasticsearch and provide our own word list.
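To make the "own word list" idea concrete, here is a minimal sketch of index settings that use elasticsearch's built-in `dictionary_decompounder` token filter with a small inline `word_list`. The filter type and the `word_list` parameter are part of elasticsearch; the filter name, the analyzer name, and the words themselves are assumptions chosen for illustration, not what photon uses.

```python
import json

# Hypothetical, deliberately small word list: only street-type suffixes,
# so that 'baerwaldstraße' is not broken up into "baer" and "wald".
STREET_WORDS = ["straße", "strasse", "weg", "platz", "allee"]


def build_index_settings(word_list):
    """Build analysis settings with a dictionary-based decompounder.

    The names 'german_street_decompounder' and 'german_street_analyzer'
    are made up for this sketch.
    """
    return {
        "analysis": {
            "filter": {
                "german_street_decompounder": {
                    "type": "dictionary_decompounder",
                    "word_list": list(word_list),
                }
            },
            "analyzer": {
                "german_street_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so tokens match the lowercase word list.
                    "filter": ["lowercase", "german_street_decompounder"],
                }
            },
        }
    }


if __name__ == "__main__":
    settings = build_index_settings(STREET_WORDS)
    print(json.dumps(settings, indent=2, ensure_ascii=False))
```

With a restricted list like this, only subwords that appear in the list can be emitted next to the original token, which is the point of avoiding a full German dictionary.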
I explored elasticsearch's hyphenation_decompounder. Currently, there exist some issues which need further work:
I'd prepare a WIP PR. The custom plugin should become an artifact of its own, so what would you recommend: creating a separate GitHub project, or adding it as a Maven submodule?
Thanks, sounds good to me! I'll then close #46? Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?
I would prefer this, yes.
Yes. I'll reuse the patterns in de.txt and see if applying the decompounder only for .de. helps.
Yes, internal changes are required to allow a null word_list and to tweak the subword offsets. To avoid JarHell exceptions, I won't patch the original code but copy and adapt it. If I get things working, I'd submit them to ES/Lucene as well. But I have no idea whether hyphenation is really used in the ES/Lucene community. The current behavior looks too odd, IMHO.
One more normalization has to be done to improve search: e.g. Erlangerstraße should be split into "Erlanger straße". This has to be done both while indexing and while searching.
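The normalization described above can be sketched as a small suffix split. This is only an illustration in Python, assuming a hypothetical suffix list and function name; a real implementation would live in the analysis chain so that the same step runs at index time and at query time.

```python
# Hypothetical list of street-type suffixes; not taken from photon.
STREET_SUFFIXES = ("straße", "strasse", "weg", "platz", "gasse")


def split_street_suffix(token: str) -> str:
    """Split a trailing street-type suffix off a compound token.

    Tokens that consist only of the suffix, or that do not end in a
    known suffix, are returned unchanged.
    """
    lowered = token.lower()
    for suffix in STREET_SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            head = token[: -len(suffix)]
            tail = token[-len(suffix):]
            return f"{head} {tail}"
    return token


if __name__ == "__main__":
    # Apply the same function to indexed names and to user queries.
    for name in ("Erlangerstraße", "Hauptbahnhof", "Straße"):
        print(name, "->", split_street_suffix(name))
```

Running the same function on both sides is what lets a query for "Erlanger straße" match a document indexed as "Erlangerstraße", and the other way round.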
There is a plugin, but it is GPL-licensed because of one library it uses. The license could be less restrictive, but the Python code would have to be ported to Java: jprante/elasticsearch-analysis-decompound#5
For POIs there is also often the Bahnhof vs. Hauptbahnhof problem. But we should probably also find a main railway station that is not named like this, which matters in other countries as well. This should probably be handled via a different fix: #318