
Decompound words #42

Open
karussell opened this issue May 16, 2014 · 4 comments

karussell commented May 16, 2014

One more normalization has to be done to improve search: e.g. Erlangerstraße should be split into "Erlanger straße". This has to be done both while indexing and while searching.

There is a plugin, but it is GPL due to one of the libraries it uses; the license could be less restrictive, but Python code would have to be ported to Java: jprante/elasticsearch-analysis-decompound#5

For POIs there is also often the Bahnhof vs. Hauptbahnhof problem. But a main railway station that is not named like this should probably still rank as important, also in a different country. This should probably be handled via a different fix, though: #318

karussell commented

Hmmh, the plugin could be overhead, as we only need decompounding for a small subset of the German language. It could also influence the relevance negatively if we decompound 'baerwaldstraße' into "baer", "wald" and "straße". So we should just use the normal decompound filter from elasticsearch and provide our own word list.
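For illustration, a minimal Lucene-level sketch of that approach (not photon's actual code; the tiny inline word list and the Lucene 6.x-era imports are assumptions). In elasticsearch the same mechanism is exposed as the dictionary_decompounder token filter configured with a word_list:

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecompoundDemo {
    public static void main(String[] args) throws Exception {
        // Tiny inline stand-in for our own curated word list; only subwords
        // contained in this set are split off, so 'baer' is never produced
        // unless we deliberately put it in the list.
        CharArraySet wordList = new CharArraySet(
                Arrays.asList("erlanger", "straße"), true);

        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("erlangerstraße"));

        // Emits the original token plus every word-list subword found in it.
        TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, wordList);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // erlangerstraße, erlanger, straße
        }
        stream.end();
        stream.close();
    }
}
```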

hbruch commented Jan 17, 2018

I explored elasticsearch's hyphenation_decompounder. Currently there are some issues that need further work:

  • ES currently requires an explicit dictionary (word_list) that must contain all subwords to be returned as tokens. The underlying Lucene token filter does not, so a custom plugin that instantiates the token filter without a word list would work (see this discussion and the sketch after this list).
  • The hyphenation token filter returns all subwords with offsets identical to the compound word, which results in all subwords being treated as synonyms in the query phase. As a consequence, searching e.g. 'Erlangerstraße' would treat 'erlanger' and 'strasse' as synonyms, which is not intended (see the same discussion).
  • Lucene's hyphenation pattern handling has a bit-shifting issue (see LUCENE-8124), so patterns are restricted to hyphenation markers 1 to 6. As we'd provide a custom hyphenation pattern file, this is just a minor, avoidable issue.
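A minimal sketch of the first two points, calling the Lucene filter directly; the hyphenation grammar path (e.g. de_DR.xml from the OFFO distribution) and the Lucene 6.x-era imports are assumptions:

```java
import java.io.FileInputStream;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.xml.sax.InputSource;

public class HyphenationDemo {
    public static void main(String[] args) throws Exception {
        // FOP-style German hyphenation grammar; the path is assumed here.
        HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
                .getHyphenationTree(new InputSource(new FileInputStream("de_DR.xml")));

        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("erlangerstraße"));

        // Unlike ES's hyphenation_decompounder, the Lucene filter can be
        // constructed without any dictionary: all hyphenation-derived
        // fragments within the subword size limits are emitted.
        TokenStream stream = new HyphenationCompoundWordTokenFilter(tokenizer, hyphenator);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            // Every subword reports the offsets of the whole compound,
            // here [0,14] -- the synonym-like behavior described above.
            System.out.printf("%s [%d,%d]%n",
                    term.toString(), offset.startOffset(), offset.endOffset());
        }
        stream.end();
        stream.close();
    }
}
```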

I'd prepare a WIP PR. The custom plugin should become an artifact of its own, so what would you recommend: creating a GitHub project of its own, or creating it as a Maven submodule?

karussell commented

Thanks, sounds good to me! I'll then close #46?

Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

> creating it as a Maven submodule?

I would prefer this, yes.

hbruch commented Jan 17, 2018

> Thanks, sounds good to me! I'll then close #46?

Yes. I'll reuse the patterns in de.txt and see if applying the decompounder only for 'de' helps.

> Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

Yes, internal changes are required to allow a null word_list and to tweak the subword offsets. To avoid JarHell exceptions, I won't patch the original code but copy and adapt it. If I get things working, I'd submit them to ES/Lucene as well. But I have no idea if hyphenation is really used in the ES/Lucene community. The current behavior looks too odd, IMHO.
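To make the offset tweak concrete, a hypothetical sketch (not the actual patch) of how the copied filter could derive per-subword offsets from the subword's position inside the compound:

```java
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

final class SubwordOffsets {
    /**
     * Hypothetical helper: compute a subword's own offsets from its
     * position inside the compound term, instead of inheriting the
     * compound's full span as CompoundWordTokenFilterBase currently does.
     */
    static int[] subwordOffsets(OffsetAttribute compound, int subwordStart, int subwordLength) {
        int start = compound.startOffset() + subwordStart;
        int end = start + subwordLength;
        // e.g. 'erlangerstraße' at [0,14]: the subword 'straße'
        // (subwordStart 8, length 6) would be reported at [8,14].
        return new int[] { start, end };
    }
}
```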
