Input unicode normalization affects search results #600
hi @pmezard see pelias/schema#146
Thank you for the link, @missinglink. I suspect both issues are related, but fixing mine looks less debatable: in general, unicode normalization does not lose information, and when it does, the impact can be ignored for a service such as mapzen search (it does not try to preserve the original document content). The kind of normalization mentioned in that issue is not reversible, which is what causes the problem of searching Ö vs O vs Oe. But I admit I am guessing wildly at this point; I have no idea how the backend data is processed/mapped.
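To make the reversibility point concrete, here is a minimal Python sketch (standard library only) contrasting canonical normalization, which round-trips losslessly, with a folding such as ö → oe, which cannot be undone. The strings come from the example in this issue; the "Köln" folding at the end is only an illustration of the pelias/schema#146 kind of folding, not Pelias code:

```python
import unicodedata

nfc = "Chamb\u00e9ry"    # precomposed: contains U+00E9 (é)
nfd = "Chambe\u0301ry"   # decomposed: U+0065 (e) followed by U+0301 (combining acute)

# Canonical normalization is lossless: each form converts exactly
# into the other, and NFC -> NFD -> NFC round-trips to the original.
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", unicodedata.normalize("NFD", nfc)) == nfc

# By contrast, a folding like "Köln" -> "Koeln" (or "Koln") discards
# information: nothing in "Koeln" says whether the source was "Köln".
```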
Maybe you can help me understand the issue better? I can try to answer some of your questions about the system internals first.

Pelias stores the original names verbatim as entered in the source data set, and the original name is returned to the user in the results. If the source data contained the decomposed form then we would return the decomposed form.

Pelias is based on an inverted index, so at index time we tokenize the name and attempt to expand any contracted forms of the language; we try our best to expand abbreviations and can also expand accented characters.

I think you are correct in saying this is a different issue. We currently use the ascii folding functionality in elasticsearch, which seems to not have proper unicode normalization support. When I decoded the name you provided above, I got two different sequences of codepoints.
Looking over the wikipedia page it seems the official name is the first form; I didn't see any reference to the second form. So am I correct in saying that it is also commonly written in the second form? Also, my French is not very good; is it common for the name to be spelled that way?

Are you referring to combining characters, i.e. in this case the combining acute accent? It looks like in order to correctly normalize combining characters we will need to use the icu_normalizer, which in previous versions of elasticsearch required each installation to include an additional plugin; it might possibly be included in the core engine now.
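For reference, the analysis-icu plugin exposes this normalizer as a character filter. A minimal sketch of index settings wiring it into a custom analyzer might look like the following; the filter and analyzer names here are hypothetical, and the exact setup depends on the elasticsearch version and plugin installation:

```python
import json

# Hypothetical index settings: run ICU normalization (from the
# analysis-icu plugin) as a char_filter before tokenizing, so that
# NFC and NFD input produce identical tokens at index and query time.
settings = {
    "analysis": {
        "char_filter": {
            "icu_nfkc": {              # hypothetical name
                "type": "icu_normalizer",
                "name": "nfkc",        # normalization form
                "mode": "compose",
            }
        },
        "analyzer": {
            "name_normalized": {       # hypothetical name
                "type": "custom",
                "char_filter": ["icu_nfkc"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

# Body for e.g. PUT /some_index
print(json.dumps({"settings": settings}, indent=2))
```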
Sorry, I made a mistake in my example (which does not change the outcome, fortunately). The second example should have been:

`Chambe\u0301ry, France`

including the "e" before the combining acute accent. You are right, I am comparing "LATIN SMALL LETTER E WITH ACUTE" (U+00E9) to its decomposition "LATIN SMALL LETTER E" (U+0065) + "COMBINING ACUTE ACCENT" (U+0301). And if you want to do that at the ES level, I believe using the icu_normalizer plugin is the way to go.
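A quick way to see this comparison is to dump each codepoint with its official Unicode name; a small Python sketch using the standard `unicodedata` module:

```python
import unicodedata

def dump(s: str) -> None:
    # Print every codepoint in s with its official Unicode name.
    print("  ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s))

dump("\u00e9")    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
dump("e\u0301")   # U+0065 LATIN SMALL LETTER E  U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal, which is
# exactly the composed-vs-decomposed mismatch discussed here.
print("\u00e9" == "e\u0301")   # False
```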
Great, no problem, I thought it might have been a typo; I've created a ticket to schedule the work. To answer your other question: unfortunately, until that work is complete these entities will only be retrievable when the user enters the same form (composed/decomposed) as was provided in the source data. Thanks for reporting.
hi @pmezard this code has now been merged and is making its way to the production environment. I hope to have a staging build to test on Monday, and if all goes well this could be in production by the middle of next week.
merged
Sorry for the late response, I wanted to check the integration of the icu plugin in the vagrant env first. Thank you for working on this!
Yes, this is suspect; you should cross-check that at the byte level if you can. I saw a text rendering bug this week at my job, in Firefox, caused by a font that did not include a combining accent character. But that bug displayed the accents as blanks; it did not combine them weirdly with another character.
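For the byte-level cross-check suggested here, comparing the UTF-8 bytes makes the difference unambiguous; a short Python sketch:

```python
a = "Chamb\u00e9ry"   # NFC
b = "Chambe\u0301ry"  # NFD

print(a == b)                   # False
print(a.encode("utf-8").hex())  # 4368616d62c3a97279
print(b.encode("utf-8").hex())  # 4368616d6265cc817279
# "c3a9" is é in UTF-8; "65" + "cc81" is e followed by the combining
# acute accent: identical on screen, different on the wire.
```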
Those screenshots above are from Firefox (on Linux) and there are definitely differences in the rendering. Hopefully that's not very relevant, but the changing accent location is interesting.
Hello,
Take the text input "Chambéry, France". Assuming mapzen search supports unicode and that query strings must be UTF-8, it can be represented in two ways (ignoring compatibility forms):
1- NFC form: "Chamb\u00e9ry, France", or "Chamb%c3%a9ry%2c%20France" in a URL
2- NFD form: "Chambe\u0301ry, France", or "Chambe%cc%81ry%2c%20France" in a URL
They do not return the same results when passed to the /search API. I would expect the endpoint to perform unicode normalization itself instead of delegating it to clients (or for ES, or whatever backend is used, to be configured to deal with it). Alternatively, if one form is preferred over the other (here NFC), that should be clearly documented.
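To illustrate the two request forms, and the kind of server-side normalization suggested here, a short Python sketch; `normalize_query` is a hypothetical pre-processing step, not Pelias code:

```python
import unicodedata
from urllib.parse import quote

text = "Chambéry, France"
nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

print(quote(nfc))   # Chamb%C3%A9ry%2C%20France
print(quote(nfd))   # Chambe%CC%81ry%2C%20France

def normalize_query(q: str) -> str:
    # Hypothetical endpoint pre-processing: pick one canonical form
    # so clients may send either representation.
    return unicodedata.normalize("NFC", q)

assert normalize_query(nfd) == normalize_query(nfc)
```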
Any thoughts?