
Address Parsing

@hkrishna hkrishna released this 23 Jul 19:34
· 2689 commits to master since this release

Address parsing is a major piece of any address geocoder, and this release takes a first crack at it using the AddressIt module. AddressIt is a freeform street address parser designed to take a piece of text and convert it into a structured address that can be processed in different systems.

> var addressit = require('addressit')

> addressit('123 main st new york ny 10010 usa')
{ text: '123 main st new york ny 10010 usa',
  parts: [],
  unit: undefined,
  number: 123,
  street: 'main st',
  state: 'NY',
  country: 'USA',
  postalcode: 10010,
  regions: [ 'new york' ] }

Before the Pelias API calls addressit for address parsing, it runs some basic checks on the query so that we don't slow things down drastically when parsing is unnecessary. For example, we don't need address parsing in the following cases:

  • input=a or input=au or input=aus: if the input has 3 or fewer characters, we can assume it's not a fully formed address. In fact, we can go one step further and target only the admin layers, because results such as austin or australia should be relevant and, more importantly, fast.
  • input=boston or input=frankfurt or input=somereallybigname or input=new york: if the input is just one or two tokens and does not contain a number, we can get away with targeting only the admin and poi layers.
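
The two shortcuts above can be sketched roughly as follows. This is a minimal illustration, not the code in helper/query_parser.js: the function name chooseStrategy, the returned object shape, and the layer labels 'admin' and 'poi' are assumptions made for the example.

```javascript
// Hypothetical helper (not the actual Pelias code): decide whether the
// address parser is worth calling and which layers to target.
function chooseStrategy(input) {
  const text = input.trim();
  const tokens = text.split(/\s+/).filter(Boolean);
  const hasNumber = /\d/.test(text);

  // 3 or fewer characters: too short to be a full address, admin layers only.
  if (text.length <= 3) {
    return { parseAddress: false, layers: ['admin'] };
  }

  // One or two tokens and no digit: admin and poi layers are enough.
  if (tokens.length <= 2 && !hasNumber) {
    return { parseAddress: false, layers: ['admin', 'poi'] };
  }

  // Everything else goes through the address parser.
  // Assumption: null here means "no layer restriction".
  return { parseAddress: true, layers: null };
}
```

With this sketch, chooseStrategy('aus') targets only the admin layers, chooseStrategy('new york') targets admin and poi, and chooseStrategy('123 main st new york ny 10010 usa') is handed to the parser.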

In all other cases, we parse the address and use the resulting parts to query the ES index. Here's a sample mapping:

number + street -> name.default
number -> address.number
street -> address.street
postalcode -> address.zip
state -> admin1_abbr
country -> alpha3
regions -> admin2
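
As a rough illustration of this mapping, the sketch below turns an addressit result into a flat object keyed by index field. The function name toFieldMap is hypothetical, and joining multiple regions with a space is an assumption; the actual handling lives in the API's query code.

```javascript
// Hypothetical sketch: map addressit output onto the index fields listed above.
function toFieldMap(parsed) {
  const fields = {};

  // number + street -> name.default
  if (parsed.number !== undefined && parsed.street) {
    fields['name.default'] = parsed.number + ' ' + parsed.street;
  }
  if (parsed.number !== undefined) fields['address.number'] = parsed.number;
  if (parsed.street) fields['address.street'] = parsed.street;
  if (parsed.postalcode !== undefined) fields['address.zip'] = parsed.postalcode;
  if (parsed.state) fields['admin1_abbr'] = parsed.state;
  if (parsed.country) fields['alpha3'] = parsed.country;

  // Assumption: multiple regions are joined into one string for admin2.
  if (parsed.regions && parsed.regions.length > 0) {
    fields['admin2'] = parsed.regions.join(' ');
  }
  return fields;
}
```

Feeding it the parsed result for '123 main st new york ny 10010 usa' shown earlier yields name.default set to '123 main st', admin1_abbr set to 'NY', alpha3 set to 'USA', and admin2 set to 'new york'.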

Sometimes the address parser comes back empty-handed:

> addressit('123 chelsea, london')
{ text: '123 chelsea, london',
  parts: [],
  unit: undefined,
  state: undefined,
  country: undefined,
  postalcode: undefined,
  regions: [ '123 chelsea', 'london' ] }

In this case, we fall back to the naive approach we implemented months ago: we split the address on the comma, assume everything that follows the comma is an admin part, and add a match block for it to the should array. So we query name.default with 123 chelsea, and the should array in the query tries to match london against all 5 admin fields:

  • admin0
  • admin1
  • admin1_abbr
  • admin2
  • alpha3
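
The comma fallback can be sketched as below. The function name buildFallbackQuery is hypothetical, and placing the name match in a must clause is an assumption; only the name.default match and the should array over the five admin fields come from the description above. See query/search.js for the real query.

```javascript
// The five admin fields listed above.
const ADMIN_FIELDS = ['admin0', 'admin1', 'admin1_abbr', 'admin2', 'alpha3'];

// Hypothetical sketch of the naive fallback: split on the first comma and
// treat everything after it as an admin part.
function buildFallbackQuery(input) {
  const commaAt = input.indexOf(',');
  const name = (commaAt === -1 ? input : input.slice(0, commaAt)).trim();
  const admin = commaAt === -1 ? '' : input.slice(commaAt + 1).trim();

  const query = {
    bool: {
      // Assumption: the name part is a required match.
      must: [{ match: { 'name.default': name } }],
      should: []
    }
  };

  // One optional match block per admin field.
  if (admin) {
    for (const field of ADMIN_FIELDS) {
      query.bool.should.push({ match: { [field]: admin } });
    }
  }
  return query;
}
```

For '123 chelsea, london' this produces a name.default match on '123 chelsea' and five should clauses, each matching 'london' against one admin field.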

All of this logic lives in helper/query_parser.js and is well documented with inline comments. The query changes can be seen in query/search.js.

An additional 104 test cases were written to cover the logic described above and the query building, bringing the grand total of unit tests for the API to 708!