Initial libpostal integration #625
@@ -2,7 +2,7 @@
 var cluster = require('cluster'),
     app = require('./app'),
     port = ( process.env.PORT || 3100 ),
-    multicore = true;
+    multicore = false;
this should be true?
Ideally yes but waiting on #601
Looks like the final commit, a merge of a branch on github that you had probably rebased locally (2744244), both duplicated commit history and added a few unwanted changes (like adding stats-lite as a dependency twice). We can fix it whenever you have a few minutes.
1f15da3 to 91804e8
I did a quick test with this branch and found that house number analysis is now working quite well, and we no longer get streets from wrong cities. However, I am quite concerned about the character conversion that libpostal does: ä -> ae etc. This breaks a large share of searches. Any ideas how to fix this?
@vesameskanen the libpostal models need to do those normalizations at some level so länderstraße, laenderstrasse, and landerstrasse can share statistical parameters. I'm not sure if Pelias is handling umlaut transliteration now (it looks like it mostly does accent stripping: "Muenchen", for instance, does not return Munich but "Munchen" does), but it should be possible to use libpostal without breaking any such searches by either indexing the transliterated forms from libpostal, or using the compound analyzer approach detailed here or here. From what I understand, Dutch and the Nordic languages usually drop the umlaut when writing on an ASCII keyboard, with two exceptions in Swedish: ü => y and å => aa, but these are a little less frequently used.
There has been some discussion in the libpostal repo about returning only segments of the original input without any lowercasing or transliteration (or only performing those normalizations on some internal representation) for people looking to display the parser results in some way. It's a little complicated to do that in the current implementation of the transliterator, and not a huge priority for search, but it's on the radar.
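To make the "index the transliterated forms" idea concrete, here is a minimal sketch in plain JavaScript. It is not code from Pelias or libpostal: `asciiVariants`, `expandDigraphs`, and `stripAccents` are hypothetical helpers, and the digraph table only covers the German-style cases mentioned above (ä -> ae, ö -> oe, ü -> ue, ß -> ss). The assumption is that indexing every variant of a name would let both "Muenchen" and "Munchen" match "München".

```javascript
// Hypothetical helpers for illustration only; not part of Pelias or libpostal.

// German-style digraph expansion (assumed table, deliberately incomplete).
const DIGRAPHS = { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss' };

// Replace each umlaut / sharp s with its two-letter ASCII spelling.
function expandDigraphs(text) {
  return text.replace(/[äöüß]/g, (ch) => DIGRAPHS[ch]);
}

// Plain accent stripping: decompose, drop combining marks, spell ß as ss.
function stripAccents(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .replace(/ß/g, 'ss');
}

// Return the lowercased input plus its distinct ASCII variants.
function asciiVariants(text) {
  const lower = text.toLowerCase().normalize('NFC');
  return Array.from(new Set([lower, expandDigraphs(lower), stripAccents(lower)]));
}

console.log(asciiVariants('Länderstraße'));
// [ 'länderstraße', 'laenderstrasse', 'landerstrasse' ]
console.log(asciiVariants('München'));
// [ 'münchen', 'muenchen', 'munchen' ]
```

The `Set` removes duplicates, so plain ASCII input such as "Oslo" yields a single entry. Language-specific rules (for example the Swedish exceptions mentioned above) would need their own tables.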
Hi @thatdatabaseguy, many thanks for your reply and also for the great libpostal tool, which also seems to parse Finnish addresses amazingly well.
OK, I found the following related pull request: pelias/schema#146. I guess a solution is under way :)
3470a61 to 10b75f2
1e03840 to adb15ed
this will make it easier to sort thru results from FallbackQuery by knowing which query was called
proxyquire is now used because the text_analyzer package requires node_postal which isn't guaranteed to be available
1ef2f70 to 768843b
Add street to trim by granularity
limited fallbackQuery usage to analysis with `street`
Confidence score fix
Add accuracy property to results
This is the big PR to incorporate libpostal functionality. It includes several large changes in functionality:
Because addressIt is still needed for the /autocomplete endpoint, a separate query/texetparser.js was added just for it. For a discussion of what FallbackQuery and Geodisambiguation are, see: