Including city name in forward geocoding text search not working as expected. #107

Open
gagandeepsingh1105 opened this issue May 21, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@gagandeepsingh1105

Hi there,

I am an engineer at the Public Health Agency of Canada. We have a use case for which we are looking to deploy an instance of the Pelias Geocoder.
For this use case, we have custom input data (a CSV file) of Canadian locations only, and we want to use Pelias Geocoder's forward geocoding to convert text addresses to longitudes and latitudes.
For this reason we are trying to deploy csv-importer. Below is a snapshot of the input data that we have ingested into our Elasticsearch instance:
image

When using forward geocoding, if we supply the street number, street name and province, the API returns a response with confidence level = 1 and source = custom:

API request: https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr nl"&sources=custom
image

But if we also include the city name in the input text, the confidence level drops to 0.6 and the match type changes to fallback. As you may have noticed, our input data does have a column named 'city', but csv-importer does not appear to read it and the result falls back to the whosonfirst data source.

We have tried a couple of things at our end to resolve this issue:

  1. In the pelias.json configuration file, we added a "docs" key to map the columns in the CSV file to those in the Pelias schema, but got the following error:

image

Snapshot of pelias.json file:
"csv": {
  "datapath": "/data/csv-importer-files",
  "files": ["NLFD_test_changed.csv"],
  "docs": [
    {
      "name": "LAT",
      "type": "number",
      "required": true
    },
    {
      "name": "LON",
      "type": "number",
      "required": true
    },
    {
      "name": "SOURCE",
      "type": "number",
      "required": true
    },
    {
      "name": "LAYER",
      "type": "number",
      "required": true
    },
    {
      "name": "NUMBER",
      "type": "string",
      "required": false,
      "es_field": "address.number"
    },
    {
      "name": "STREET",
      "type": "string",
      "required": false,
      "es_field": "address.street"
    },
    {
      "name": "CITY",
      "type": "string",
      "required": false,
      "es_field": "address.city"
    },
    {
      "name": "NAME",
      "type": "string",
      "required": false,
      "es_field": "address.name"
    },
    {
      "name": "MAIL_PROV_ABVN",
      "type": "string",
      "required": false,
      "es_field": "address.region"
    },
    {
      "name": "POSTALCODE",
      "type": "string",
      "required": false,
      "es_field": "address.postalcode"
    }
  ],
  "download": []
}

  2. We also tried supplying the column mapping in a separate file, but that did not work either and produced the same error:

image

Snapshot of pelias.json file
{
  "imports": {
    "csv": {
      "datapath": "/data",
      "files": [
        "canada-locations.csv"
      ],
      "mappings": "/code/csv_mapping.json"
    }
  }
}

and then defined the column mappings in a separate file:
{
  "mappings": {
    "id": "id",
    "latitude": "latitude",
    "longitude": "longitude",
    "number": "house_number",
    "street": "street",
    "city": "city",
    "region": "region",
    "province": "province",
    "country": "country",
    "postalcode": "postalcode",
    "category": "category",
    "name": "name",
    "layer": "address"
  }
}

Steps to Reproduce

  1. Deploy an instance of Pelias Geocoder with csv-importer running
  2. Make the above-mentioned configuration changes in the pelias.json file.
  3. Try the following API calls:
    https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr nl"&sources=custom
    https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr st john's nl"&sources=custom
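The two calls can be compared side by side with a small script. This is a minimal sketch, not part of the original report: the helper names `build_search_url` and `summarize_top_hit` are hypothetical, and it assumes the response is the usual Pelias GeoJSON with `source`, `layer`, `confidence` and `match_type` under each feature's `properties`.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the issue; any Pelias instance exposes the same path.
BASE_URL = "https://geocoder.alpha.phac.gc.ca/v1/search"


def build_search_url(text, sources=None, layers=None, debug=False, base_url=BASE_URL):
    """Build a percent-encoded forward geocoding URL."""
    params = {"text": text}
    if sources:
        params["sources"] = sources
    if layers:
        params["layers"] = layers
    if debug:
        params["debug"] = "1"
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base_url}?{urlencode(params, quote_via=quote)}"


def summarize_top_hit(response):
    """Return the fields of interest from the first feature, or None if there is none."""
    features = response.get("features", [])
    if not features:
        return None
    props = features[0].get("properties", {})
    return {key: props.get(key) for key in ("source", "layer", "confidence", "match_type")}


if __name__ == "__main__":
    for query in ("283 prince philip dr nl", "283 prince philip dr st john's nl"):
        print(build_search_url(query, sources="custom"))
```

Fetching each URL and passing the decoded JSON to `summarize_top_hit` makes the difference in `confidence` and `match_type` between the two queries easy to see.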

Expected behavior
Including the city name in the search text should also give confidence=1 and source=custom.

Environment (please complete the following information):
We are currently running an instance of Pelias Geocoder on a Kubernetes cluster on Google Cloud Platform.

Please do let us know in case you require any additional information to debug this issue.
Thanks in advance.

@missinglink
Member

Hi @gagandeepsingh1105, the 'administrative hierarchy' (ie. the city/province/country) of each record in Pelias is sourced exclusively from the WhosOnFirst dataset through point-in-polygon lookups at index time.

@missinglink
Member

I believe this is a duplicate of #74

@missinglink
Member

missinglink commented May 24, 2024

I'm not against adding this option to custom builds, the issue is that currently all administrative regions are composed of a source, id and term (with an optional abbreviation).

We could use 'custom' as the source, but each admin region would need to have a unique id in order to correctly generate the _gid field.

An autoincrement value could work here but would have the disadvantage that two places in the same area would have differing parent IDs.
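The autoincrement trade-off can be illustrated with a short sketch. It assumes the gid is composed as `source:layer:id`; the function names and the row shape (a `city` key) are hypothetical, and the keyed variant is only one possible way to keep parent IDs stable, not something the project has adopted.

```python
from itertools import count


def make_gid(source, layer, id_):
    """Compose a gid string in the assumed source:layer:id form."""
    return f"{source}:{layer}:{id_}"


def per_record_ids(rows):
    """One fresh autoincrement ID per CSV row: rows in the same city get different parents."""
    counter = count(1)
    return [make_gid("custom", "locality", next(counter)) for _ in rows]


def keyed_ids(rows):
    """Hypothetical alternative: reuse one ID per normalized city name."""
    seen = {}
    gids = []
    for row in rows:
        key = row["city"].strip().lower()
        if key not in seen:
            seen[key] = len(seen) + 1
        gids.append(make_gid("custom", "locality", seen[key]))
    return gids


if __name__ == "__main__":
    rows = [{"city": "St. John's"}, {"city": "st. john's "}]
    print(per_record_ids(rows))
    print(keyed_ids(rows))
```

With per-record IDs the two rows end up under different parent gids even though they are in the same city; keying on the name avoids that but would conflate distinct places that share a name.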

@missinglink
Member

missinglink commented May 24, 2024

It's possible to have multiple associated 'parents' for a single layer, so for example a record can have multiple 'region' records associated.

The issue would be that we only return one (ie. the first one), so it would either need to be decided (or configurable) whether the record from the CSV file was returned, or the WOF one, in the case where both data sources returned a match.
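A configurable preference could look roughly like the sketch below. The function `pick_parent`, the `preferred_sources` setting and the shape of the parent records are all hypothetical; nothing like this exists in Pelias today.

```python
def pick_parent(parents, preferred_sources=("custom", "whosonfirst")):
    """Return the parent from the most preferred source, else the first one, else None."""
    for source in preferred_sources:
        for parent in parents:
            if parent["source"] == source:
                return parent
    return parents[0] if parents else None


if __name__ == "__main__":
    region_parents = [
        {"source": "whosonfirst", "id": "85682123", "name": "Newfoundland and Labrador"},
        {"source": "custom", "id": "1", "name": "NL"},
    ]
    print(pick_parent(region_parents))
    print(pick_parent(region_parents, preferred_sources=("whosonfirst", "custom")))
```

Reversing `preferred_sources` reproduces the current "first one wins" outcome whenever the WOF record is listed first.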

@the-epeecurean

Hello,

I am a developer on the original poster's team. I think this is an issue of how WOF is passed back as the first record returned, or how readily it is searched for a 'fallback' match, if a locality name is present despite a focus on a more granular location.

I performed the same two searches in the original post excluding the "sources=custom" filter from the API call and encountered the same behaviour. A search for "283 Prince Philip dr NL" (https://geocoder.alpha.phac.gc.ca/api/search?text="283%20prince%20philip%20dr%20NL") resulted in a match from the custom source with confidence 1.0.

However, a search for "283 Prince Philip dr St. John's NL" results in a match from WOF, and seemingly ignores a filter on the address layer type:
https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22
OR
https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22&layers=address

We'd like to use the custom data source in performing batch forward geocoding, and it is useful to pass an 'address, city, province' search term where the inclusion of the city helps refine the search. As identified in the original issue, this does not appear to be what is happening due to the inclusion of the city name.

We understand that WOF is the exclusive source for administrative hierarchy in Pelias, but the inclusion of the place name shouldn't cue the fallback behaviour when an accurate match to the desired layer granularity (street address) is available. In this scenario a street address supplemented by a city name should refine the area for a search, but it seems that it prompts a fallback match instead. It also seems to ignore a layer search filter in the API call when the city name is included, triggering the returned fallback result from WOF.

Thank you for your help!

@missinglink
Member

missinglink commented May 31, 2024

The debug query param displays a bunch more info:
https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22&layers=address&debug=1

You can see that the Placeholder service ran and found a matching locality:

{
  "controller:placeholder": [
    {
      "id": 890456615,
      "name": "St. John's",
      "placetype": "locality",
      "population": 99182,
      "lineage": [
        {
          "country": {
            "id": 85633041,
            "name": "Canada",
            "abbr": "CAN",
            "languageDefaulted": false
          },
          "county": {
            "id": 1158869009,
            "name": "Division No. 1",
            "languageDefaulted": false
          },
          "locality": {
            "id": 890456615,
            "name": "St. John's",
            "languageDefaulted": false
          },
          "region": {
            "id": 85682123,
            "name": "Newfoundland and Labrador",
            "abbr": "NL",
            "languageDefaulted": false
          }
        }
      ],
      "geom": {
        "bbox": "-52.72931,47.54494,-52.68931,47.58494",
        "lat": 47.56494,
        "lon": -52.70931
      },
      "languageDefaulted": false
    }
  ]
}

Then when the Elasticsearch query is run, the ID of the locality matched above is added as a Filter condition (ie. mandatory condition):

{
  "filter": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "terms": {
            "parent.locality_id": [
              "890456615"
            ]
          }
        }
      ],
      "must": [
        {
          "terms": {
            "layer": [
              "address"
            ]
          }
        }
      ]
    }
  }
}
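To make the effect of that filter concrete, here is a minimal sketch that rebuilds the same structure and evaluates it against a document. The helper names are hypothetical, documents are modelled as flat dicts with dotted keys, and the matcher only imitates the `terms`, `must` and `minimum_should_match` semantics needed here; it is not Elasticsearch.

```python
def build_address_filter(locality_ids, layers):
    """Rebuild the filter shown in the debug output for the given IDs and layers."""
    return {
        "filter": {
            "bool": {
                "minimum_should_match": 1,
                "should": [{"terms": {"parent.locality_id": [str(i) for i in locality_ids]}}],
                "must": [{"terms": {"layer": list(layers)}}],
            }
        }
    }


def _terms_match(doc, clause):
    """True if any value of the field is in the clause's allowed list."""
    ((field, allowed),) = clause["terms"].items()
    values = doc.get(field, [])
    if not isinstance(values, list):
        values = [values]
    return any(str(value) in allowed for value in values)


def doc_passes(doc, query):
    """Apply every must clause and the minimum_should_match count."""
    bool_query = query["filter"]["bool"]
    must_ok = all(_terms_match(doc, c) for c in bool_query.get("must", []))
    should_hits = sum(_terms_match(doc, c) for c in bool_query.get("should", []))
    return must_ok and should_hits >= bool_query.get("minimum_should_match", 0)


if __name__ == "__main__":
    query = build_address_filter([890456615], ["address"])
    # A custom address whose point-in-polygon lookup found no locality parent.
    print(doc_passes({"layer": "address", "parent.locality_id": []}, query))
```

An address record with no `parent.locality_id` of 890456615 fails the `should` clause, so it is filtered out even though its layer matches.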

Of course this results in 0 hits:

{
  "controller:search": {
    "queryType": {
      "address_search_using_ids": {
        "es_took": 36,
        "response_time": 42,
        "retries": 0,
        "es_hits": 0,
        "es_result_count": 0
      }
    }
  }
}

At this point there are zero matches. I forget the exact workflow here, but I believe it falls back to a legacy search method which was more lenient.

I don't like that the request specifies only address layers but returns other layers, this is likely a bug, but one which doesn't often occur outside of custom installations such as this.

@missinglink
Member

missinglink commented May 31, 2024

The geometry of 890456615 St. John's is of type Point, which explains why the address wasn't associated via the PIP service. (the address must lie inside the boundary)
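A minimal sketch of why a Point geometry can never be the parent of an address: a ray-casting containment test needs an outer ring, and a Point has none. The function names are hypothetical, polygon holes are ignored, the polygon is simply the `bbox` from the debug output used as a stand-in boundary, and the address coordinate is made up for illustration. This is not the Pelias PIP service.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: count edge crossings of a ray going east from the point."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def contains(geometry, lon, lat):
    """GeoJSON-style containment; only Polygon outer rings are handled."""
    if geometry["type"] == "Polygon":
        return point_in_ring(lon, lat, geometry["coordinates"][0])
    return False  # a Point has no interior, so it contains nothing


# Centroid of St. John's as reported by Placeholder.
POINT_GEOM = {"type": "Point", "coordinates": [-52.70931, 47.56494]}
# Stand-in boundary built from the reported bbox (not the real city limits).
BBOX_POLYGON = {
    "type": "Polygon",
    "coordinates": [[
        (-52.72931, 47.54494), (-52.68931, 47.54494),
        (-52.68931, 47.58494), (-52.72931, 47.58494),
        (-52.72931, 47.54494),
    ]],
}

if __name__ == "__main__":
    lon, lat = -52.70, 47.57  # hypothetical address coordinate
    print(contains(POINT_GEOM, lon, lat), contains(BBOX_POLYGON, lon, lat))
```

With the Point geometry no address is ever assigned `parent.locality_id` 890456615, which is why the filtered query returns zero hits.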

@missinglink
Member

missinglink commented May 31, 2024

Maybe for your use case you can disable the Placeholder service, or possibly not add any data to it?
I haven't tested it, but it might prevent the filter condition being added to the elasticsearch query, which sounds like what you want.

@missinglink
Member

@the-epeecurean are there better open geo data for that region?

The only one I can find is points only. Does the CA govt publish something better than this? https://opendata.gov.nl.ca/public/opendata/page/?page-id=datasetdetails&id=265

@the-epeecurean

@missinglink There are ... Statistics Canada publishes a hierarchy of delineated boundaries. I've just been evaluating some cherry-picked WOF 'fallback' results we've been seeing in testing.

Here's a link to an open REST point for the collected Cartographic Boundary files published by Statistics Canada:
https://geo.statcan.gc.ca/geo_wa/rest/services/2021/Cartographic_boundary_files/MapServer

And a reference to descriptions of the Cartographic Boundary files made available (at the bottom under "1. Spatial information products"):
https://www150.statcan.gc.ca/n1/pub/92-196-x/92-196-x2021001-eng.htm

A polygon for the example cited in the Issue above (St. John's NL) appears at the CSD (census subdivision) and CMA (census metropolitan area) levels.
However, some smaller localities (within a larger CMA, e.g., Halifax, NS) show up as polygons in the DPL (designated place) boundary file.
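For anyone wanting to inspect those polygons, a query URL against the MapServer linked above can be assembled as in this sketch. The layer index and the `CSDNAME` field in the usage line are assumptions (the service's layer list would need to be checked), and it assumes the server accepts `f=geojson` on the standard ArcGIS REST `query` operation.

```python
from urllib.parse import urlencode

# REST endpoint from the comment above.
MAPSERVER_URL = (
    "https://geo.statcan.gc.ca/geo_wa/rest/services/2021/"
    "Cartographic_boundary_files/MapServer"
)


def build_boundary_query_url(layer_id, where, out_fields="*", base_url=MAPSERVER_URL):
    """Build a layer query URL; layer_id must be looked up in the service's layer list."""
    params = {
        "where": where,
        "outFields": out_fields,
        "returnGeometry": "true",
        "f": "geojson",  # assumes the server supports GeoJSON output
    }
    return f"{base_url}/{layer_id}/query?{urlencode(params)}"


if __name__ == "__main__":
    # Layer 0 and the CSDNAME field are placeholders, not verified against the service.
    print(build_boundary_query_url(0, "CSDNAME = 'St. John''s'"))
```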

If there is any way we could help get this spatial information included in WOF, please let us know. It would help our use case greatly to see more localities in Canada represented as polygons.

@nvkelso

nvkelso commented Jun 3, 2024

Adding an issue upstream in Who's On First to help facilitate this work:

tl;dr the new 2021 cartographic boundary files from Stats Canada look great and we'd love to import them!
