Releases: pelias/api
v3.1.2
v3.1.1
v3.1.0
v3.0.2
v3.0.1
Address Parsing
Address parsing is huge for an address geocoder, and this release takes a first crack at it using the AddressIt module. AddressIt is a freeform street address parser designed to take a piece of text and convert it into a structured address that can be processed in different systems.
> var addressit = require('addressit')
> addressit('123 main st new york ny 10010 usa')
{ text: '123 main st new york ny 10010 usa',
parts: [],
unit: undefined,
number: 123,
street: 'main st',
state: 'NY',
country: 'USA',
postalcode: 10010,
regions: [ 'new york' ] }
Before the Pelias API calls addressit for address parsing, it runs some basic checks on the query so that we don't slow things down drastically when parsing is unnecessary. For example, we don't need address parsing in the following cases:
- input=a or input=au or input=aus
  - if the input has 3 or fewer characters, we can assume it's not a fully formed address. In fact, we can go one step further and target only admin layers, because results such as austin, australia etc. should be relevant but, more importantly, fast.
- input=boston or input=frankfurt or input=somereallybigname or input=new york
  - if the input is just one or even two tokens and does not contain a number, we can get away with targeting just the admin and poi layers.
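The checks above can be sketched roughly as follows. This is a minimal illustration, not the code in helper/query_parser.js: the function name chooseStrategy, the shape of the returned object, and the use of null to mean "no layer restriction" are all assumptions.

```javascript
// Hypothetical helper: decide whether to run the address parser and which
// layers to target. Names and return shape are illustrative assumptions.
function chooseStrategy(input) {
  const text = input.trim();
  const tokens = text.split(/\s+/).filter(Boolean);
  const hasNumber = /\d/.test(text);

  // 3 or fewer characters: not a fully formed address, admin layers only.
  if (text.length <= 3) {
    return { parseAddress: false, layers: ['admin'] };
  }

  // One or two tokens without a number: admin and poi layers are enough.
  if (tokens.length <= 2 && !hasNumber) {
    return { parseAddress: false, layers: ['admin', 'poi'] };
  }

  // Everything else goes through the address parser (assumed: all layers).
  return { parseAddress: true, layers: null };
}
```

For example, chooseStrategy('new york') skips parsing and targets admin and poi, while chooseStrategy('123 main st new york ny 10010 usa') asks for a full parse.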
In all other cases, we do address parsing and use the address parts to query the ES index. Here's a sample mapping:
number + street -> name.default
number -> address.number
street -> address.street
postalcode -> address.zip
state -> admin1_abbr
country -> alpha3
regions -> admin2
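As a rough sketch, the mapping above could be applied to an addressit result like this. The field names come from the mapping; the function name toQueryFields and the choice to join multiple regions with a space are assumptions made for illustration.

```javascript
// Hypothetical helper: translate a parsed address into index field values.
// Only parts that are present in the parsed result are emitted.
function toQueryFields(parsed) {
  const fields = {};
  const hasNumber = parsed.number !== undefined;

  if (hasNumber && parsed.street) {
    fields['name.default'] = parsed.number + ' ' + parsed.street;
  }
  if (hasNumber) fields['address.number'] = parsed.number;
  if (parsed.street) fields['address.street'] = parsed.street;
  if (parsed.postalcode !== undefined) fields['address.zip'] = parsed.postalcode;
  if (parsed.state) fields['admin1_abbr'] = parsed.state;
  if (parsed.country) fields['alpha3'] = parsed.country;

  // Assumption: several regions are joined into a single admin2 value.
  if (parsed.regions && parsed.regions.length > 0) {
    fields['admin2'] = parsed.regions.join(' ');
  }
  return fields;
}
```

Passing in the parsed result for '123 main st new york ny 10010 usa' shown earlier would give name.default the value '123 main st' and admin2 the value 'new york'.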
Sometimes, the address parser comes back empty-handed:
> addressit('123 chelsea, london')
{ text: '123 chelsea, london',
parts: [],
unit: undefined,
state: undefined,
country: undefined,
postalcode: undefined,
regions: [ '123 chelsea', 'london' ] }
In this case, we fall back to the naive approach we implemented months ago: we split the address on a comma, assume everything that follows the comma is an admin part, and add a match block to the should array. So, we query name.default with 123 chelsea, and the should array in the query tries to match london against all 5 admin fields:
- admin0
- admin1
- admin1_abbr
- admin2
- alpha3
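The comma fallback might look something like the sketch below. The bool/must/should/match structure is standard Elasticsearch query DSL, but placing the name match in a must clause, the function name buildFallbackQuery, and the overall query shape are assumptions; the real query is in query/search.js.

```javascript
// The five admin fields listed above.
const ADMIN_FIELDS = ['admin0', 'admin1', 'admin1_abbr', 'admin2', 'alpha3'];

// Hypothetical helper: split on the first comma and build a bool query.
function buildFallbackQuery(text) {
  const commaIndex = text.indexOf(',');
  const name = (commaIndex === -1 ? text : text.slice(0, commaIndex)).trim();
  const admin = commaIndex === -1 ? '' : text.slice(commaIndex + 1).trim();

  const query = {
    bool: {
      // Assumption: the part before the comma must match name.default.
      must: [{ match: { 'name.default': name } }],
      should: []
    }
  };

  // Everything after the comma is treated as an admin part and tried
  // against each admin field as an optional (should) match.
  if (admin) {
    for (const field of ADMIN_FIELDS) {
      query.bool.should.push({ match: { [field]: admin } });
    }
  }
  return query;
}
```

With '123 chelsea, london' as input, this produces one name.default match for '123 chelsea' and five should clauses matching 'london'.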
All of this logic lives in helper/query_parser.js and is well documented with inline comments. The query changes can be seen in query/search.js.
An additional 104 test cases were written to cover the logic described above and the query building, bringing the grand total of unit tests for the API to 708!
Deleting code is so much fun
Code cleanup - deleted all suggester related code (843 deletions) FTW!
Tech Debt - Better 408/500 error handling
Minor cleanup -> minor speedup
Minor cleanup, minor speedup and minor performance improvement - brought to you by:
- removed exact_match script
- increased search radius to 500 km
NGRAMS
This release is a big one: we are using ngrams to analyze/tokenize and are officially moving away from the context suggester, which is memory intensive and wasn't letting us build an autocomplete suggester on a global scale.
Some major features:
- partial matching using the ngrams approach ftw! https://www.elastic.co/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html
- better support for geohashes https://github.com/pelias/schema/blob/ngram/mappings/partial/centroid.js
- explicit definitions of how field data is to be stored
- improved punctuation https://github.com/pelias/schema/blob/ngram/punctuation.js
- improved synonyms: https://github.com/pelias/schema/blob/ngram/street_suffix.js
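To give a feel for why ngrams enable partial matching, here is a small sketch that generates prefix (edge) ngrams for a token. It is only an illustration of the general idea: whether the schema uses edge ngrams or full ngrams, and with which minimum and maximum lengths, is not stated here, so the function name edgeNgrams and its defaults are assumptions.

```javascript
// Hypothetical illustration: produce the prefixes of a token, from minGram
// up to maxGram characters (capped at the token length).
function edgeNgrams(token, minGram = 1, maxGram = 10) {
  const grams = [];
  const limit = Math.min(maxGram, token.length);
  for (let n = minGram; n <= limit; n++) {
    grams.push(token.slice(0, n));
  }
  return grams;
}

// If the indexed grams contain the typed text, the partial input matches.
function matchesPartial(indexedToken, typed) {
  return edgeNgrams(indexedToken, 1, indexedToken.length).includes(typed);
}
```

Here edgeNgrams('boston', 1, 4) yields 'b', 'bo', 'bos' and 'bost', so a user who has typed only 'bos' can already match a record named 'boston' with a plain term lookup rather than a memory-hungry suggester structure.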