Input unicode normalization affects search results #600
hi @pmezard see pelias/schema#146
Thank you for the link, @missinglink. I suspect both issues are related, but fixing mine looks less debatable: in general, unicode normalization does not lose information, and when it does, the impact can be ignored for a service such as mapzen search (it does not try to preserve the original document content). The kind of normalization mentioned in that issue is not reversible, which is what causes the problem of searching Ö vs O vs Oe. But I admit I am guessing wildly at this point; I have no idea how the backend data is processed/mapped.
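To make the reversibility point concrete, here is a minimal Python sketch (standard library only) contrasting canonical normalization, which round-trips losslessly, with a folding such as ö → oe, which cannot be undone. The strings come from the example in this issue; the "Köln" folding at the end is only an illustration of the pelias/schema#146 kind of folding, not Pelias code:

```python
import unicodedata

nfc = "Chamb\u00e9ry"    # precomposed: contains U+00E9 (é)
nfd = "Chambe\u0301ry"   # decomposed: U+0065 (e) followed by U+0301 (combining acute)

# Canonical normalization is lossless: each form converts exactly
# into the other, and NFC -> NFD -> NFC round-trips to the original.
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", unicodedata.normalize("NFD", nfc)) == nfc

# By contrast, a folding like "Köln" -> "Koeln" (or "Koln") discards
# information: nothing in "Koeln" says whether the source was "Köln".
```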
Maybe you can help me understand the issue better? I can try to answer some of your questions about the system internals first.

Pelias stores the original names verbatim as entered in the source data set, and the original name is returned to the user in the results. If the source data contained the decomposed form then we would return the decomposed form.

Pelias is based on an inverted index, so at index time we tokenize the name and attempt to expand any contracted forms of the language; we try our best to expand abbreviations and can also expand accented characters.

I think you are correct in saying this is a different issue. We currently use the ascii folding functionality in elasticsearch, which seems to not have proper unicode normalization support. When I decoded the name you provided above, I got two different sequences of codepoints.
Looking over the wikipedia page it seems the official name is the first form; I didn't see any reference to the second form. So am I correct in saying that it is also commonly written in the second form? Also, my French is not very good; is it common for the name to be spelled that way?

Are you referring to combining characters, i.e. in this case the combining acute accent? It looks like in order to correctly normalize combining characters we will need to use the icu_normalizer, which in previous versions of elasticsearch required each installation to include an additional plugin; it might possibly be included in the core engine now.
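For reference, the analysis-icu plugin exposes this normalizer as a character filter. A minimal sketch of index settings wiring it into a custom analyzer might look like the following; the filter and analyzer names here are hypothetical, and the exact setup depends on the elasticsearch version and plugin installation:

```python
import json

# Hypothetical index settings: run ICU normalization (from the
# analysis-icu plugin) as a char_filter before tokenizing, so that
# NFC and NFD input produce identical tokens at index and query time.
settings = {
    "analysis": {
        "char_filter": {
            "icu_nfkc": {              # hypothetical name
                "type": "icu_normalizer",
                "name": "nfkc",        # normalization form
                "mode": "compose",
            }
        },
        "analyzer": {
            "name_normalized": {       # hypothetical name
                "type": "custom",
                "char_filter": ["icu_nfkc"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

# Body for e.g. PUT /some_index
print(json.dumps({"settings": settings}, indent=2))
```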
Sorry, I made a mistake in my example (which does not change the outcome, fortunately). The second example should have been:

`Chambe\u0301ry, France`

including the "e" before the combining acute accent. You are right, I am comparing "LATIN SMALL LETTER E WITH ACUTE" (U+00E9) to its decomposition "LATIN SMALL LETTER E" (U+0065) + "COMBINING ACUTE ACCENT" (U+0301). And if you want to do that at the ES level, I believe using the icu_normalizer plugin is the way to go.
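A quick way to see this comparison is to dump each codepoint with its official Unicode name; a small Python sketch using the standard `unicodedata` module:

```python
import unicodedata

def dump(s: str) -> None:
    # Print every codepoint in s with its official Unicode name.
    print("  ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in s))

dump("\u00e9")    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
dump("e\u0301")   # U+0065 LATIN SMALL LETTER E  U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal, which is
# exactly the composed-vs-decomposed mismatch discussed here.
print("\u00e9" == "e\u0301")   # False
```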
Great, no problem, I thought it might have been a typo; I've created a ticket to schedule the work. To answer your other question: unfortunately, until that work is complete these entities will only be retrievable when the user enters the same form (composed/decomposed) as was provided in the source data. Thanks for reporting.
hi @pmezard this code has now been merged and is making its way to the production environment. I hope to have a staging build to test on Monday, and if all goes well this could be in production by the middle of next week.
merged
Sorry for the late response, I wanted to check the integration of the icu plugin in the vagrant env first. Thank you for working on this!
Yes, this is suspect; you should cross-check that at the byte level if you can. I saw a text rendering bug this week at my job, in Firefox, caused by a font that did not include a combining accent character. But that bug displayed the accents as blanks; it did not combine them weirdly with another character.
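For the byte-level cross-check suggested here, comparing the UTF-8 bytes makes the difference unambiguous; a short Python sketch:

```python
a = "Chamb\u00e9ry"   # NFC
b = "Chambe\u0301ry"  # NFD

print(a == b)                   # False
print(a.encode("utf-8").hex())  # 4368616d62c3a97279
print(b.encode("utf-8").hex())  # 4368616d6265cc817279
# "c3a9" is é in UTF-8; "65" + "cc81" is e followed by the combining
# acute accent: identical on screen, different on the wire.
```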
Those screenshots above are from Firefox (on Linux) and there are definitely differences in the rendering. Hopefully that's not very relevant, but the changing accent location is interesting.
Hello,
Take the text input "Chambéry, France". Assuming mapzen search supports unicode and that query strings must be UTF-8, it can be represented in two ways (ignoring compatibility forms):
1- NFC form: "Chamb\u00e9ry, France", or "Chamb%c3%a9ry%2c%20France" in a URL
2- NFD form: "Chambe\u0301ry, France", or "Chambe%cc%81ry%2c%20France" in a URL
They do not return the same results when passed to the /search API. I would expect the endpoint to perform unicode normalization itself instead of delegating it to clients (or for ES, or whatever backend is used, to be configured to deal with it). Alternatively, if one form is preferred over the other (here NFC), that should be clearly documented.
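To illustrate the two request forms, and the kind of server-side normalization suggested here, a short Python sketch; `normalize_query` is a hypothetical pre-processing step, not Pelias code:

```python
import unicodedata
from urllib.parse import quote

text = "Chambéry, France"
nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

print(quote(nfc))   # Chamb%C3%A9ry%2C%20France
print(quote(nfd))   # Chambe%CC%81ry%2C%20France

def normalize_query(q: str) -> str:
    # Hypothetical endpoint pre-processing: pick one canonical form
    # so clients may send either representation.
    return unicodedata.normalize("NFC", q)

assert normalize_query(nfd) == normalize_query(nfc)
```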
Any thoughts?