Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial libpostal integration #625

Merged
merged 82 commits into from
Oct 4, 2016
Merged

Conversation

trescube
Copy link
Contributor

@trescube trescube commented Aug 5, 2016

This is the big PR to incorporate libpostal functionality. It incorporates several large changes of functionality:

  • query/search.js - disables all scoring temporarily, switches to FallbackQuery/GeodisambiguationQuery
  • middleware/trimByGranularity.js - choose most granular results when FallbackQuery was used

Because addressIt is still needed for the /autocomplete endpoint, a separate query/texetparser.js was added just for it. For a discussion on what FallbackQuery and Geodisambiguation, see:

@@ -2,7 +2,7 @@
var cluster = require('cluster'),
app = require('./app'),
port = ( process.env.PORT || 3100 ),
multicore = true;
multicore = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally yes but waiting on #601

@orangejulius
Copy link
Member

Looks like the final commit, a merge of a branch on github that you had probably rebased locally (2744244), both duplicated commit history and added a few unwanted changes (like adding stats-lite as a dependency twice). We can fix it whenever you have a few minutes

@orangejulius orangejulius force-pushed the initial-libpostal-integration branch from 1f15da3 to 91804e8 Compare August 8, 2016 22:48
@vesameskanen
Copy link

I did a quick test with this branch, and found out that house number analysis is now working quite well, and we no longer get streets from wrong cities. However, I am quite concerned about the character conversion which libpostal does: ä -> ae etc. This breaks big part of searches. Any ideas how to fix this?

@albarrentine
Copy link

@vesameskanen the libpostal models need to do those normalizations at some level so länderstraße, laenderstrasse, and landerstrasse can share statistical parameters.

I'm not sure if Pelias is handling umlaut transliteration now (looks like it mostly does accent stripping, "Muenchen" for instance does not return Munich but "Munchen" does), but it should be possible to use libpostal without breaking any such searches by either indexing the transliterated forms from libpostal, or using the compound analyzer approach detailed here or here. From what I understand, Dutch and the Nordic languages usually drop the umlaut when writing on an ASCII keyboard, etc. with two exceptions in Swedish : ü => y and å => aa, but these are a little less frequently used.

There has been some discussion in the libpostal repo about returning only segments of the original input without any lowercasing or transliteration (or only performing those normalizations on some internal representation) for people looking to display the parser results in some way. It's a little complicated to do that in the current implementation of the transliterator, and not a huge priority for search, but it's on the radar.

@vesameskanen
Copy link

vesameskanen commented Aug 10, 2016

Hi @thatdatabaseguy, many thanks for your reply and also for the great libpostal tool, which seems to parse also Finnish addresses amazingly well.
I was already suspecting that the consequences of normalization need to be addressed on Pelias side. @trescube probably already has a plan how to handle this :) We are running a Pelias instance with Finnish data here, and try to keep our code base as well synced with upstream as possible, because Pelias is developing fast. Poking the actual ES queries Pelias creates is a bit challenging at this point.

@vesameskanen
Copy link

vesameskanen commented Aug 10, 2016

OK, found out the following related pull requests:

#pelias/schema#146
#pelias/schema#149

I guess a solution is under way :)

@dianashk dianashk added this to the Address Parsing milestone Aug 18, 2016
@trescube trescube force-pushed the initial-libpostal-integration branch from 3470a61 to 10b75f2 Compare August 22, 2016 01:48
@trescube trescube force-pushed the initial-libpostal-integration branch from 1e03840 to adb15ed Compare September 9, 2016 17:45
@trescube trescube force-pushed the initial-libpostal-integration branch from 1ef2f70 to 768843b Compare September 15, 2016 14:02
@trescube trescube merged commit 15593c2 into master Oct 4, 2016
@trescube trescube removed the in review label Oct 4, 2016
@trescube trescube deleted the initial-libpostal-integration branch November 18, 2016 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants