dedupe is too strong #65

Open
mapsam opened this issue Mar 1, 2018 · 3 comments
Comments

@mapsam
Contributor

mapsam commented Mar 1, 2018

Deduplication of features is too strong right now. Consider two buildings (polygons) that are distinct features but have few (or no) properties and no ID. For all intents and purposes these should be treated as two unique features, so that we avoid removing important data.

Perhaps it's best to only dedupe based on IDs for now, while we think about other ways to best dedupe with properties.
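As a rough illustration of the ID-only idea, here is a minimal Python sketch. The feature shape (a dict with `id`, `properties`, and `distance` keys) and the function name `dedupe_by_id` are assumptions made for this example, not the library's actual data model or API.

```python
from typing import Any, Dict, List

# Hypothetical feature shape: {"id": ..., "properties": {...}, "distance": ...}
Feature = Dict[str, Any]


def dedupe_by_id(features: List[Feature]) -> List[Feature]:
    """Keep the closest feature per ID; features without an ID are never merged."""
    seen_ids = set()
    kept = []
    # Visit the closest features first so the nearest copy of each ID wins.
    for feature in sorted(features, key=lambda f: f["distance"]):
        fid = feature.get("id")
        if fid is not None:
            if fid in seen_ids:
                continue  # a closer feature with this ID was already kept
            seen_ids.add(fid)
        kept.append(feature)
    return kept
```

With this approach, two ID-less buildings that share the same properties both survive, at the cost of letting tile-boundary duplicates without IDs through.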

cc @flippmoke

@mapsam
Contributor Author

mapsam commented Mar 1, 2018

Just a few extra examples of where deduping is working and not working.

Two buildings have the same properties, and only the closest shows up in the results. They likely don't have IDs, so we are using the properties to dedupe.

vtquery-diff-features

Two parks/baseball pitches have exactly the same properties but are deemed different features, likely because their IDs are unique.

vtquery-same-features

This makes me think we should only compare properties of features across tiles, not features and properties in the same tile. This still doesn't handle the case where two distinct buildings in different tiles have the same properties and would be considered duplicates. It would continue to solve tile boundary duplicates, though.
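To make the cross-tile idea concrete, here is a small Python sketch. It assumes each feature is a dict carrying a hypothetical `tile` key (for example a `"z/x/y"` string) alongside `properties` and `distance`; none of these names come from the actual code base.

```python
from typing import Any, Dict, List, Set, Tuple

Feature = Dict[str, Any]  # hypothetical: {"tile": ..., "properties": {...}, "distance": ...}


def dedupe_across_tiles(features: List[Feature]) -> List[Feature]:
    """Drop a feature only when a matching one was already kept from a different tile."""
    tiles_by_props: Dict[Tuple, Set[Any]] = {}
    kept = []
    for feature in sorted(features, key=lambda f: f["distance"]):
        # repr() keeps the key hashable even if a property value is a list or dict.
        key = tuple(sorted((k, repr(v)) for k, v in feature["properties"].items()))
        seen_tiles = tiles_by_props.setdefault(key, set())
        if any(tile != feature["tile"] for tile in seen_tiles):
            continue  # same properties already kept from another tile
        seen_tiles.add(feature["tile"])
        kept.append(feature)
    return kept
```

Two identical-looking buildings in the same tile are both kept, while a copy from a neighboring tile is dropped. As noted above, this also drops a genuinely distinct building in another tile that happens to share the same properties.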

@flippmoke
Member

I am not sure there is a clear answer to the "right" way to do deduping. I think part of this is that it really depends on the type of data that exists:

> Deduplication of features is too strong right now.

If you wanted to find the one closest building in OSM right now, it would be ideal to dedupe. If you wanted to find all the closest buildings, I feel that deduping might not be correct. The problems you have seen with multiple tiles do in fact make the results appear strange, and I think it is something we should heavily consider.

The problem comes down to the vast variety of data that we can have in vector tiles. If you are attempting to find a specific rubber ducky that is closest to you, it can be quite complex. You could have a standard-sized rubber ducky that fits quite well into a single tile, and it may be the only rubber ducky around.

image

However, you might also have a jumbo rubber ducky that spreads across multiple tiles and has false edges on it from the other tiles you query. In this case deduping is very good.

image

Additionally, there might be a set of rubber duckies in your tile and you want to know all the rubber duckies in your area. In this case deduping might be too aggressive because it would think all rubber duckies are the same, because their properties are the same.

image

If all our rubber duckies have unique ids on them, then we do want to dedupe:

image

However, if they do not -- then we might be overwhelmed by the number of rubber duckies if we do not enable deduping.

image

Very simply put, it is not always smooth sailing when you are looking for rubber duckies:

image

Therefore, I suggest that we allow users to decide if they want to dedupe or not. We could even set a flag for what type of deduping occurs.

@mapsam
Contributor Author

mapsam commented Mar 1, 2018

@flippmoke 🦆 ❤️

Totally agree. There's no perfect solution (unless we start unioning geometries, which isn't out of the question, but is out of scope of this issue).

I like the idea of providing options to the user, and think we can do a good job at keeping it simple in the code base, especially since we have the logic written already.

Examples of dedupe options (not saying we have to implement them all):

  • none: don't perform any deduplication at all
  • id: dedupe ONLY on id
  • id+properties: dedupe on IDs when they are present; otherwise dedupe on properties
  • tiles: only dedupe across tiles

Maybe another way of breaking it out is:

  • none: no dedupe
  • soft: only dedupe on IDs
  • strong: dedupe on IDs, plus properties, and only across tiles
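A user-facing option along the lines of the second breakdown could be sketched as follows in Python. Everything here is hypothetical: the `DedupeMode` enum, the `dedupe` function, the default mode, and the feature keys (`id`, `properties`, `tile`, `distance`) are illustration only. The sketch also reads "strong" as "dedupe on IDs where present, and fall back to comparing properties across tiles for features without IDs", which is one possible interpretation of the list above.

```python
from enum import Enum
from typing import Any, Dict, List, Set, Tuple

Feature = Dict[str, Any]  # hypothetical feature shape


class DedupeMode(Enum):
    NONE = "none"      # no dedupe
    SOFT = "soft"      # only dedupe on IDs
    STRONG = "strong"  # IDs, plus properties across tiles for ID-less features


def dedupe(features: List[Feature], mode: DedupeMode = DedupeMode.SOFT) -> List[Feature]:
    if mode is DedupeMode.NONE:
        return list(features)
    seen_ids: Set[Any] = set()
    tiles_by_props: Dict[Tuple, Set[Any]] = {}
    kept = []
    # Closest features are visited first, so the nearest duplicate is the one kept.
    for feature in sorted(features, key=lambda f: f["distance"]):
        fid = feature.get("id")
        if fid is not None:
            if fid in seen_ids:
                continue
            seen_ids.add(fid)
        elif mode is DedupeMode.STRONG:
            key = tuple(sorted((k, repr(v)) for k, v in feature["properties"].items()))
            tiles = tiles_by_props.setdefault(key, set())
            if any(t != feature["tile"] for t in tiles):
                continue  # same properties already kept from another tile
            tiles.add(feature["tile"])
        kept.append(feature)
    return kept
```

Keeping the mode as a single enum argument means the existing comparison logic can stay in one place, with each mode only deciding which checks are applied.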

@mapsam mapsam added this to the v0.2.0 milestone Mar 1, 2018
@mapsam mapsam removed this from the v0.2.0 milestone Apr 13, 2018

2 participants