
Decompound words #42

Open
karussell opened this issue May 16, 2014 · 4 comments

karussell commented May 16, 2014

One more normalization has to be done to improve search: e.g. Erlangerstraße should be split into "Erlanger straße". This has to be done both while indexing and while searching.

There is a plugin, but it is GPL due to one of the libraries it uses; the license could be less restrictive, but Python code would have to be ported to Java: jprante/elasticsearch-analysis-decompound#5

For POIs there is also often the Bahnhof vs. Hauptbahnhof problem. But a main railway station that is not named like this should probably still rank as important, also in a different country. This should probably be handled via a different fix, though: #318

karussell commented

Hmmh, the plugin could be overhead, as we only need decompounding for a small subset of the German language. It could also influence the relevance negatively if we decompound 'baerwaldstraße' into "baer", "wald" and "straße". So we should just use the normal decompound filter from elasticsearch and provide our own word list.
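For illustration, a minimal Lucene-level sketch of that approach (not photon's actual code; the tiny inline word list and the Lucene 6.x-era imports are assumptions). In elasticsearch the same mechanism is exposed as the dictionary_decompounder token filter configured with a word_list:

```java
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecompoundDemo {
    public static void main(String[] args) throws Exception {
        // Tiny inline stand-in for our own curated word list; only subwords
        // contained in this set are split off, so 'baer' is never produced
        // unless we deliberately put it in the list.
        CharArraySet wordList = new CharArraySet(
                Arrays.asList("erlanger", "straße"), true);

        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("erlangerstraße"));

        // Emits the original token plus every word-list subword found in it.
        TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, wordList);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // erlangerstraße, erlanger, straße
        }
        stream.end();
        stream.close();
    }
}
```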

hbruch commented Jan 17, 2018

I explored elasticsearch's hyphenation_decompounder. Currently there are some issues that need further work:

  • ES currently requires an explicit dictionary (word_list) that must contain all subwords to be returned as tokens. The underlying Lucene token filter does not, so a custom plugin that instantiates the token filter without a word list would work (see this discussion and the sketch after this list).
  • The hyphenation token filter returns all subwords with offsets identical to the compound word, which results in all subwords being treated as synonyms in the query phase. As a consequence, searching e.g. 'Erlangerstraße' would treat 'erlanger' and 'strasse' as synonyms, which is not intended (see the same discussion).
  • Lucene's hyphenation pattern handling has a bit-shifting issue (see LUCENE-8124), so patterns are restricted to hyphenation markers 1 to 6. As we'd provide a custom hyphenation pattern file, this is just a minor, avoidable issue.
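A minimal sketch of the first two points, calling the Lucene filter directly; the hyphenation grammar path (e.g. de_DR.xml from the OFFO distribution) and the Lucene 6.x-era imports are assumptions:

```java
import java.io.FileInputStream;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.xml.sax.InputSource;

public class HyphenationDemo {
    public static void main(String[] args) throws Exception {
        // FOP-style German hyphenation grammar; the path is assumed here.
        HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter
                .getHyphenationTree(new InputSource(new FileInputStream("de_DR.xml")));

        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("erlangerstraße"));

        // Unlike ES's hyphenation_decompounder, the Lucene filter can be
        // constructed without any dictionary: all hyphenation-derived
        // fragments within the subword size limits are emitted.
        TokenStream stream = new HyphenationCompoundWordTokenFilter(tokenizer, hyphenator);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            // Every subword reports the offsets of the whole compound,
            // here [0,14] -- the synonym-like behavior described above.
            System.out.printf("%s [%d,%d]%n",
                    term.toString(), offset.startOffset(), offset.endOffset());
        }
        stream.end();
        stream.close();
    }
}
```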

I'd prepare a WIP PR. The custom plugin should become an artifact of its own, so what would you recommend: creating a GitHub project of its own, or creating it as a Maven submodule?

karussell commented

Thanks, sounds good to me! I'll then close #46?

Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

> creating it as a Maven submodule?

I would prefer this, yes.

hbruch commented Jan 17, 2018

> Thanks, sounds good to me! I'll then close #46?

Yes. I'll reuse the patterns in de.txt and see if applying the decompounder only for 'de' helps.

> Is a separate plugin necessary because of the required internal changes? Or could we utilize it directly in photon somehow?

Yes, internal changes are required to allow a null word_list and to tweak the subword offsets. To avoid JarHell exceptions, I won't patch the original code but copy and adapt it. If I get things working, I'd submit them to ES/Lucene as well. But I have no idea if hyphenation is really used in the ES/Lucene community. The current behavior looks too odd, IMHO.
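To make the offset tweak concrete, a hypothetical sketch (not the actual patch) of how the copied filter could derive per-subword offsets from the subword's position inside the compound:

```java
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

final class SubwordOffsets {
    /**
     * Hypothetical helper: compute a subword's own offsets from its
     * position inside the compound term, instead of inheriting the
     * compound's full span as CompoundWordTokenFilterBase currently does.
     */
    static int[] subwordOffsets(OffsetAttribute compound, int subwordStart, int subwordLength) {
        int start = compound.startOffset() + subwordStart;
        int end = start + subwordLength;
        // e.g. 'erlangerstraße' at [0,14]: the subword 'straße'
        // (subwordStart 8, length 6) would be reported at [8,14].
        return new int[] { start, end };
    }
}
```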
