Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute an ngram field for all street names #364

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented May 20, 2019

This PR adds one new field called address_parts.street.ngram

It takes advantage of the fields mapping to generate ngrams for street names.

The ngram field is tokenized using a new tokenizers called peliasIndexStreetOneEdgeGram which is mostly the same as the peliasIndexOneEdgeGram analyzer except using different synonyms files, so as to produce prefix-ngrams which can be used for autocomplete.

The motivation here is to be able to, quite simply and efficiently, improve autocomplete queries which contain street names.
We currently do autocomplete on the name.default field, which mixes names and addresses, using this field with a multi_match will have benefits over the single-field approach:

  • Allow per-field analysis including synonym substitutions which are specific to streets
  • Allow for the fields to be included or excluded at query-time using a multi_match query

I suspect the changes required for the queries will be minimal.

Its likely that the index size will be increased after merging this PR because the street names will be indexed using an edge ngram filter twice, once for name.default and once for street.ngram.

The plan is, following this PR to remove the street names from the name.default field, once this has been done the index will return to the previous size (or a very similar on-disk size).

The analysis that I've configured in this PR is likely not perfect, but it mirrors what we already have, so the integration will be easier.
Once we've merge this PR and switched the queries we will be much freer to improve individual fields and analysis.

related: #347
related: #359

@missinglink
Copy link
Member Author

This is the plan to roll this out without breaking backwards compatibility:

  1. Update schema to add street.ngram field
  2. Update queries to use multi_match
  3. Queries should now work with street data either in name.default or street.ngram (backwards and forwards compatible)
  4. Check the codebase to ensure that Documents with no name are considered valid
  5. Remove osm and oa code which concats the housenumber and street name together.
  6. Remove name.default from multi_match where applicable (remove backwards compatibility)

@missinglink
Copy link
Member Author

This is a really cool feature but I didn't realise how much work would be involved in order to change the way name.default works.
In particular, there would need to be changed to how search works and also how labels are generated.

Let's leave this open for discussion, I'd still very much like to merge this one day because it's a big step forward for modernizing our schema design based on learnings over the last few years.

@missinglink
Copy link
Member Author

missinglink commented Jun 3, 2019

The increase in disk space for the whole planet was <20GB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant