What is geography: intersections #4836

Closed
dustymc opened this issue Jul 18, 2022 · 22 comments
Labels
Accessibility Issue is related to Arctos accessibility. CodeTableCleanup Our bad data leads to more bad data. Fix it! Help wanted I have a question on how to use Arctos Priority-Critical (Arctos is broken) Critical because it is breaking functionality.

Comments

@dustymc
Contributor

dustymc commented Jul 18, 2022

Is your feature request related to a problem? Please describe.

I'm trying to add spatial data to geography. Some geography is intersections, and I'm not sure how "geog-like" that is or why this has happened. (Because it's useful, or because things are sorted by e.g. quad? In the latter case those data might be just as useful elsewhere: locality attributes or part attribute location or ????)

Describe what you're trying to accomplish

Awesomeify data without wasting time on things that should be dealt with elsewhere.

Describe the solution you'd like

Magic, but I can't find any.

Describe alternatives you've considered

By way of example:

  1. I found geography for https://arctos.database.museum/place.cfm?action=detail&geog_auth_rec_id=10006854 (at https://www.naturalearthdata.com)
  2. https://arctos.database.museum/place.cfm?action=detail&geog_auth_rec_id=215 already had spatial data.
  3. I made data for https://arctos.database.museum/place.cfm?action=detail&geog_auth_rec_id=216 via the ST_Intersection of those two things, but that's a completely manual process with a LOT of room for error. (Feature Request - parent geography #3108 might have changed that, but it didn't).
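A minimal sketch of the idea behind step 3, assuming both shapes can be approximated by axis-aligned bounding boxes. The real shapes are arbitrary polygons and the real operation was PostGIS ST_Intersection; the `Box` type, the `intersect` function, and all coordinates here are made up for illustration.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned bounding box in degrees (hypothetical stand-in for a polygon)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlap of two boxes, or None if they share no area."""
    min_lon = max(a.min_lon, b.min_lon)
    min_lat = max(a.min_lat, b.min_lat)
    max_lon = min(a.max_lon, b.max_lon)
    max_lat = min(a.max_lat, b.max_lat)
    if min_lon >= max_lon or min_lat >= max_lat:
        return None
    return Box(min_lon, min_lat, max_lon, max_lat)


# Made-up shapes: a one-degree "county" and a 7.5 minute "quad" that
# hangs over the county's north-east corner.
county = Box(-106.0, 35.0, -105.0, 36.0)
quad = Box(-105.0625, 35.9375, -104.9375, 36.0625)

# Only the part of the quad inside the county survives.
county_quad = intersect(county, quad)
print(county_quad)
```

Even in this toy form, the result depends entirely on having correct shapes for both inputs, which is where the room for error comes from.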

Additional context

Something like 70% of the records using this particular (arbitrary) geography aren't contained by it, although the vast majority do intersect. Others may be better or worse, but I suspect that this is fairly typical. That leads me to believe these wouldn't be great 'best fit' candidates (if we ever get tired of manually fixing stuff and asserting things that can't be true and etc.), but I'm happy to figure this out if there's some use case.
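To make the contained-versus-intersects distinction concrete, here is a rough sketch that treats a georeference as a point plus an error radius and the geography as a rectangle, in flat planar units. Everything here (the `classify` function, the rectangle, the numbers) is an assumption for illustration; real checks would run against actual shapes with spatial tools.

```python
import math


def classify(cx: float, cy: float, r: float, box: tuple) -> str:
    """Classify a circle (centre cx, cy; radius r) against a rectangle.

    box is (min_x, min_y, max_x, max_y). Returns "contained" when the whole
    circle lies inside the rectangle, "intersects" when it only partly
    overlaps, and "disjoint" when it does not touch the rectangle at all.
    """
    min_x, min_y, max_x, max_y = box
    if cx - r >= min_x and cx + r <= max_x and cy - r >= min_y and cy + r <= max_y:
        return "contained"
    # Distance from the centre to the nearest point of the rectangle.
    dx = max(min_x - cx, 0.0, cx - max_x)
    dy = max(min_y - cy, 0.0, cy - max_y)
    if math.hypot(dx, dy) <= r:
        return "intersects"
    return "disjoint"


geography = (0.0, 0.0, 10.0, 10.0)
print(classify(5.0, 5.0, 1.0, geography))   # well inside
print(classify(9.5, 5.0, 1.0, geography))   # error circle spills over the edge
print(classify(20.0, 5.0, 1.0, geography))  # nowhere near
```

A record near an edge with any meaningful uncertainty lands in the "intersects" bucket, which is one way a large share of records could intersect a small geography without being contained by it.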

@Nicole-Ridgwell-NMMNHS your perspective would be useful here - what's the point of all those quads (which overlap with counties)?

Priority

Functional impact largely depends where #4834 leads, but ANY sort of decision regarding what is or isn't "geography" would be absolutely amazing.

@dustymc dustymc added Enhancement I think this would make Arctos even awesomer! Help wanted I have a question on how to use Arctos labels Jul 18, 2022
@dustymc dustymc added this to the Needs Discussion milestone Jul 18, 2022
@Nicole-Ridgwell-NMMNHS

what's the point of all those quads

Here is how we got there for the smaller quads: #2229

That was pre-locality attributes. I don't mind moving them over to locality attributes as long as we set up a code table. I was going to say that we should also transfer over all the map links that are in geography remark, but the links are dead 😢

If we did leave 7.5 and 15 minute quads in geography, we should be able to get spatial data from something like this: https://catalog.data.gov/dataset/usgs-map-indices-overlay-map-service-from-the-national-map

@dustymc
Contributor Author

dustymc commented Jul 19, 2022

locality attributes as long as we set up a code table.

Makes sense, we could do that if we go there.

links are dead

Noice....

get spatial data

I can probably do something with that, but it almost certainly doesn't include the intersections that lead me here - eg https://arctos.database.museum/place.cfm?sch=geog&higher_geog=North%20Lake%207.5%20minute

I think there are three possibilities

  • move quads (maybe excepting AK, where they're nice county proxies) to localities (or "elsewhere")
  • keep quads, keep counties, but ban both on the same record - would be workable (ish, probably), but would leave two "valid" fairly-precise ways to say "there" and would mess with automagic best-fit operations and etc. I think I hate it, but maybe not any more than the rest of the jumble.
  • manually calculate the intersections. There are 688 of them involving quads, some of those are AK (as above), not impossible but doesn't sound all that much fun.

Thanks, very helpful, we still need to have a big-picture "what is geography" discussion....

I think https://arctos.database.museum/place.cfm?sch=geog&quad=Seward is the quad-weird winner with two spelling variations and 42 sub-quad-thingees.

-- duplicate quads
 select quad,count(*) from geog_auth_rec where quad is not null group by quad having count(*) > 1;

@Jegelewicz
Member

maybe excepting AK, where they're nice county proxies

Honestly, this is what started the whole "quad" problem - if AK can have them why can't NM? Could we just put the AK quads in the "county" field to save having that argument with everyone?

Maybe even consider the following?

Now    | Use instead
State  | Division 1
County | Division 2

So that we aren't cramming provinces and krugs into "state" and "county"?

We are still going to run into issues of what is Division 1 and what is Division 2 for some countries....

@Nicole-Ridgwell-NMMNHS

Thinking about this a little bit, I wonder if this could be solved by restructuring geography a bit - we could have the hierarchical, not intersecting geography as one thing, and have non hierarchical, potentially intersecting geography as a separate thing. The latter would be "things that it is useful to have spatial data for that don't fit neatly into the main geography" - things like quads and features. For example, it would be helpful to have National Park spatial data in Arctos, and even if you were not aware your locality was in a National Park, something would pop up - hey this locality is within current national park boundaries.

@dustymc
Contributor Author

dustymc commented Jul 19, 2022

why can't NM?

2 differences - AK quads are 1:250k while the "small quads" seem to be a bit of anything that ever got printed on a map, and NM has normal-sized counties. (AK now has county-like-thingees too, but they're very recent and one's bigger than UT so not all that useful.) I'm not really arguing if anything should exist or not, but the spatiopolitical landscapes don't line up very well and we probably have to acknowledge that.

Maybe even consider t

#2876 (comment)

New geography model:

first_level_geog_term (continent, ocean, maybe even region - huge natural things)
second_level_geog (island group, sea, drainage - big natural things)
third_level_geog (island, waterbody, bay - small(er) natural things)
first_level_political_term (country)
second_level_political (state)
third_level_political (county)
special_political_thingee (national park, marine reserve)

and maybe quad, because it's pretty integrated into various workflows

I'm (obviously, I hope) willing to consider just about anything, but there are dozens if not hundreds of these conversations and none of them have led anywhere so we're flopping around here in the middle. Can we fix that - somehow just come together as a Community, decide what is and isn't geography, and make that work?

issues of what is Division 1 and what is Division 2 for some countries

I think our "pick something and stick with it for the country, at least" approach works fairly well - we don't have to have a perfect global solution to have something significantly more usable than what we have now.

@dustymc
Contributor Author

dustymc commented Jul 19, 2022

hierarchical, not intersecting geography as one thing, and have non hierarchical, potentially intersecting geography as a separate thing.

Huh, neat.

Would my half-baked proposal plus a trigger that allows only one "category" be functionally equivalent? (I'm not sure if that's a simplification or unnecessary complication).

And I suppose we'd want to allow multiple (just two?) geographies per locality?
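A rough sketch of what such a one-category check might look like, written as plain Python rather than a database trigger. The field names come from the proposed model listed earlier in this thread; the grouping into "natural" and "political" categories (including where special_political_thingee belongs) and the function names are assumptions for illustration, not anything that exists in Arctos.

```python
# Assumed grouping of the proposed fields into two categories.
NATURAL = ("first_level_geog_term", "second_level_geog", "third_level_geog")
POLITICAL = (
    "first_level_political_term",
    "second_level_political",
    "third_level_political",
    "special_political_thingee",
)


def categories_used(row: dict) -> set:
    """Return the set of categories that have at least one non-empty field."""
    used = set()
    if any(row.get(field) for field in NATURAL):
        used.add("natural")
    if any(row.get(field) for field in POLITICAL):
        used.add("political")
    return used


def check_single_category(row: dict) -> str:
    """Return the single category a row uses, or raise if it uses none or both."""
    used = categories_used(row)
    if len(used) != 1:
        raise ValueError(f"expected exactly one category, got {sorted(used)}")
    return used.pop()


# Illustrative rows only.
political = {"first_level_political_term": "United States", "second_level_political": "Hawaii"}
natural = {"second_level_geog": "Hawaiian Islands", "third_level_geog": "Hawaii"}
mixed = {"first_level_political_term": "United States", "third_level_geog": "Hawaii"}

print(check_single_category(political))
print(check_single_category(natural))
try:
    check_single_category(mixed)
except ValueError as err:
    print("rejected:", err)
```

Under this reading, a locality that wants both a political and a natural assertion would need two geographies, each of which passes the check on its own.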

something would pop up

See also just_use_best_match - given coordinates I can magic WHATEVER, maybe there's some viable 'only asserts coordinates' model out there waiting to be found.

current

#3018

@dustymc
Contributor Author

dustymc commented Jul 20, 2022

I think this idea of separating political and geography should be elevated to a full proposal.

Given political stuff (US/HI) I can usually figure out the intent (state of Hawaii) and find appropriate spatial data.

Given geographical stuff (HI/HI) I can usually figure out the intent (archipelago, island) and find appropriate spatial data.

Given a random mix of those, everything conflicts with everything and nobody - especially us! - can figure out what we're talking about. (And spatial data isn't readily available for things like "European France" - there are practical reasons to do what the sources of spatial data have done, which generally involves not mixing concepts.)

If someone wants to assert both then they can do so via two localities.

If only one is asserted, I can use spatial tools to make everything discoverable anyway.

Can that be forged into a workable model?

Does anyone else do anything like this? (I think not - they just deal with strings and pretend that "Hawaii" means whatever's convenient at the moment.) I don't know what will work well, but I'm increasingly certain that our mishmash can not be fully supported by spatial data, at least not without something like hiring a full-time GIS person (which doesn't seem remotely realistic).

@sharpphyl I think most of your collection would be right at the border of those two things (where all the interesting stuff happens!) - your thoughts (proposals for a better model, whatever) would be very appreciated.

@Jegelewicz
Member

separating political and geography should be elevated to a full proposal.

This makes sense to me.

assert both then they can do so via two localities.

In a single event? That would be nice - this thing was collected on this day in this political place and geographic place (one may be inside the other or they may overlap, providing a more granular "higher geography")?

I like the idea of being able to select, one, the other, or both.

@dustymc
Contributor Author

dustymc commented Jul 20, 2022

In a single event?

Not what I had in mind, but I suppose I'm up for anything. I really don't think there's much reason to do that, if we can come up with a model which better fits reality I should be able to use the spatial attributes to move back and forth across spaces.

And I'm a little paranoid about the whole "Kenya's moved all the borders but reused half the names, again" thing at the moment so isolating that seems like a Really Good Idea. (Wild guess: 20% of our 15K current geog entries carry some sort of temporally-involved ambiguity.)

@Jegelewicz
Member

Wild guess: 20% of our 15K current geog entries carry some sort of temporally-involved ambiguity.

That's #3018

@dustymc
Contributor Author

dustymc commented Jul 21, 2022

That's #3018

That's the technical bit, but the whole picture also involves management. I'm staring up at an overwhelming mass of evidence which suggests that just never(ish) happens. Keeping current could technically be "part of Arctos," but from the social side of things I can see no safe way to do that while we're also allowing self-conflicting data. (And I can't see how that might be resolved at this point.)

@dustymc
Contributor Author

dustymc commented Jul 26, 2022

Drainages are a large part of the intersectional data.

Suggest removing drainage from the geography model in some way:

  1. Pull spatial drainage data (https://data.ca.gov/dataset/watershed-boundary-dataset-wbd)
  2. Dynamically add drainage to derived data, either as standard practice or on demand

Or, alternatively:

  • Move Drainage to locality attributes. (I have a list, controlling spelling would be trivial.)

(Or even more alternatively, propose some model in which drainage as geography makes sense. 'Drainage requires all other fields to be NULL' would do it.)

The first choice could add no work, and could not be self-conflicting (to the extent the underlying data are accurate). It would not be available to records which aren't georeferenced, but that's a very low bar in Arctos (click one button or ask me).

The second would allow 'verbatim assertions' and work with data of any quality.
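For the first option, deriving drainage from coordinates comes down to a point-in-polygon lookup. Here is a minimal sketch using the standard ray-casting test on flat coordinates; the drainage names and square shapes are made up, and a real implementation would use the actual watershed polygons and database spatial functions rather than this hand-rolled check.

```python
from typing import Dict, List, Optional, Tuple

Ring = List[Tuple[float, float]]


def point_in_polygon(lon: float, lat: float, ring: Ring) -> bool:
    """Ray-casting test: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def derive_drainage(lon: float, lat: float, drainages: Dict[str, Ring]) -> Optional[str]:
    """Return the first drainage whose polygon contains the point, else None."""
    for name, ring in drainages.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None


# Hypothetical drainages drawn as two adjacent squares.
DRAINAGES = {
    "Drainage A": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "Drainage B": [(4.0, 0.0), (8.0, 0.0), (8.0, 4.0), (4.0, 4.0)],
}

print(derive_drainage(1.0, 1.0, DRAINAGES))
print(derive_drainage(6.0, 2.0, DRAINAGES))
print(derive_drainage(20.0, 20.0, DRAINAGES))
```

Because the drainage is computed from the coordinates rather than typed in, it cannot disagree with them, but a record with no coordinates gets nothing, which is the trade-off described above.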

@dustymc
Contributor Author

dustymc commented Jul 26, 2022

A short hopefully-functional proposed solution to the quad problem: quads can only be accompanied by (continent_ocean,country,state_prov).

This finds things which conflict:

select higher_geog from geog_auth_rec where quad is not null and (
     county is not null or 
     feature is not null or 
     island is not null or 
     island_group is not null or 
     drainage is not null or 
     sea is not null 
)

currently (3611 rows), about 150 involving Alaska.

That is not incompatible with moving quads (all or some) to locality attributes, from where they could be mixed with anything as desired.
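For anyone who wants to poke at the rule without touching production, here is a small sketch that runs the same conflict query against an in-memory SQLite table. The table and column names come from the query above; the sample rows, the values placed in each column, the added `order by`, and the `find_conflicts` helper are assumptions for illustration only.

```python
import sqlite3

# Same condition as the query in this comment, with an ORDER BY added so
# the output order is predictable.
CONFLICT_QUERY = """
select higher_geog from geog_auth_rec where quad is not null and (
     county is not null or
     feature is not null or
     island is not null or
     island_group is not null or
     drainage is not null or
     sea is not null
) order by higher_geog
"""


def find_conflicts(rows):
    """Load rows into a throwaway table and return the conflicting higher_geog strings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table geog_auth_rec ("
        "higher_geog text, quad text, county text, feature text, "
        "island text, island_group text, drainage text, sea text)"
    )
    conn.executemany("insert into geog_auth_rec values (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    result = [row[0] for row in conn.execute(CONFLICT_QUERY)]
    conn.close()
    return result


# Columns: higher_geog, quad, county, feature, island, island_group, drainage, sea
SAMPLE_ROWS = [
    # Quad alone: allowed under the proposed rule.
    ("North America, United States, Alaska, Petersburg Quad",
     "Petersburg", None, None, None, None, None, None),
    # Quad plus feature: flagged.
    ("North America, United States, Alaska, Petersburg Quad, Tongass National Forest",
     "Petersburg", None, "Tongass National Forest", None, None, None, None),
    # County plus drainage but no quad: this query ignores it.
    ("North America, United States, Utah, San Juan County, San Juan River",
     None, "San Juan County", None, None, None, "San Juan River", None),
]

conflicts = find_conflicts(SAMPLE_ROWS)
print(conflicts)
```

Note that the query only looks at rows with a quad, so other kinds of mixing (county plus drainage, for example) pass through untouched.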

@Jegelewicz
Member

The second would allow 'verbatim assertions' and work with data of any quality.

I like this idea even though it is more work. But those who will be using it need to speak up. @campmlc is Emily on Github? Who else? @mvzhuang

@Jegelewicz
Member

quads can only be accompanied by (continent_ocean,country,state_prov).

BUT in the lower 48, one can narrow the spatial footprint with county + quad as some quads extend over two or more counties.

@dustymc
Contributor Author

dustymc commented Jul 26, 2022

as some quads extend over two or more counties

The simple solution is "it ain't geography unless it's accompanied by spatial data" which would work for anyone who wants to calculate that overlap, but there's absolutely no way we're going to accomplish that and retain "legacy" geography, and so far nobody's stepped up offering to calculate those intersections.

narrow the spatial footprint

I don't think that's a direction The Community is comfortable with (or capable of) going - #4289 - and geography is certainly not a necessary component of "narrowing."

@mvzhuang

Not that important for us to have quads and drainages in higher geography (it doesn't seem to make sense to me there anyway), but locality attributes would be useful. Carl doesn't think it's useful in higher geog either. Verbatim assertions could work. Just want it in some way and easy to search by.

@dustymc
Contributor Author

dustymc commented Aug 3, 2022

Intersecting geographies have been assigned single-component spatial data in production (probably when we first got WKT for quads). #4863 (reports) is trying to put https://arctos.database.museum/guid/UAM:Herb:109929 (Canada) in https://arctos.database.museum/place.cfm?action=detail&geog_auth_rec_id=4838 (claims Alaska, mostly maps to Canada).

[Image: Screen Shot 2022-08-03 at 7 23 08 AM]

The inappropriate spatial data should be removed.

@sharpphyl

@sharpphyl I think most of your collection would be right at the border of those two things (where all the interesting stuff happens!) - your thoughts (proposals for a better model, whatever) would be very appreciated.

I think the two things you're referencing are geographical and political "stuff." I think of most of our borders as wet and dry - that is, in a water body (e.g., Gulf of Mexico) but needing to be linked to the dry land it's offshore of (e.g., Egmont Key, Florida). While it's interesting, I'm not sure resolving quads will do much for our marine records, but maybe I'm misinterpreting the intersection.

@dustymc
Contributor Author

dustymc commented Aug 23, 2022

I'm bumping this up for AWG discussion, and will attempt to summarize here.

We have access to cool data which supports cool tools, we need to modernize our geography model (and viewpoint) to access them.

The slightly more technical core of the issue is mostly discussed above, the problematic geography largely consists of different "categories" overlapping each other. We can support 1:250K quads, we can support Counties, we can support Islands, mixing them all up results in placenames for which I cannot get spatial data, and so we cannot know if georeferences to those places are appropriate.

Here are a few examples from the most-used current nonspatial geography and how I'd handle them.

  • North America, Bering Sea, United States, Alaska, Rat Islands Quad, Aleutian Islands (33543 records)
    • drop island group (perhaps from the model, I don't think it ever clarifies when we're discussing spatial data)
  • North America, Bering Sea, United States, Alaska, St. Lawrence Quad, Saint Lawrence Island (32542 records)
    • we have data for the quad, @mkoo is getting islands for me, "both, but not spatially" is possible via locality attributes as discussed above, pick a target that works for your collection (or Arctos, if possible) and migrate. (More on "Dirt, but saltwater" below)
  • North America, United States, Utah, San Juan County, San Juan River (31547 records)
    • We have no drainage spatial data, and I don't think it's quite the same THING as other geography data; if it is geography then it probably can make sense only by itself (or maybe with continent, "North America, Colorado River Drainage")
  • North America, United States, Alaska, Petersburg Quad, Tongass National Forest (8862 records)
    • we have quads, we can (at least in theory) get "features," the intersections are problematic.
  • North America, Bristol Bay, United States, Alaska, Port Moller Quad (6513 records)
    • This is just a wet (ha!) mess from a spatial perspective. If it's North America then it can't also be Bristol Bay, states don't (usually, perhaps??) administer saltwater so Alaska and Bristol Bay don't mix - except in this case Alaska was elevated to country for the "Union EEZ and country" spatial data (from https://www.marineregions.org/downloads.php, I think), and WHATEVER might happen with the bay parts isn't going to mesh well with the "bits of the quad that hang off the edges of the land" intersection. And of course we can't control or even check anything we don't have data for, so the points map way up across the Kvichak from King Salmon for some reason, the wrong side of the Aleutians to be Bristol, off in the dirt, etc. - this is typical, I see the same pattern with many similar geography strings, I think users get to "Port Moller" or "Bristol Bay" or whatever's on their datasheet, then stop reading and pick something; I think these things reduce the quality of the data more than can easily be quantified.

Those 5 example geographies represent over 100,000 records excluded from many (most?) analyses, with another 7,000-plus geography records having the same limitations.

I don't think we need to do anything radical to the model, we just need to constrain ourselves to using the pieces of the model which represent accompanying spatial data. Under that view, someone who wants to bring a shape for the 5 square feet of some (quad+county+feature+drainage) overlap would be welcome to do so; the answer to "what is geography?" would be "something with spatial data." (Radical refactoring of the model might result in more consistent data and such, I certainly am not opposed to a complete rethink, but I'll settle for low-hanging fruit at the moment. And I don't think any interim more-spatial thing we might do could much conflict with any sort of radical refactor.)

Failing to do anything seems within the realm of possibilities, if we end up there I want my promised 48 point flashing red warnings - this seems like 100% social problem to me, we can get spatial data and confine georeferences to it, if we do anything less it's because we've chosen to, and users deserve to know that Arctos has the technical capability to produce and recognize Research Grade place-data.

(Semi-related, #4916 should be seen in the same vein - I've already georeferenced most everything, choosing not to expose and see how it fits with other stuff also seems like something we should tell users, loudly.)

EDIT: response to some comments from @sharpphyl in #4894

without a distinguishing Region or Province

There are two distinguishing features that should make these completely unambiguous

  1. The wiki link, and
  2. The spatial data

name the Mecca Province "Makka" so we should a

The actual name is مِنْطَقَة مَكَّة, https://handbook.arctosdb.org/documentation/higher-geography.html#guidelines-for-geographic-terms-in-arctos requires ASCII, whatever we do is a semi-arbitrary transcription, one's about as good as another. (My spatial data has one Makka and one Makkah...)

@dustymc dustymc added Priority-Critical (Arctos is broken) Critical because it is breaking functionality. CodeTableCleanup Our bad data leads to more bad data. Fix it! Accessibility Issue is related to Arctos accessibility. and removed Enhancement I think this would make Arctos even awesomer! labels Aug 23, 2022
This was referenced Aug 23, 2022
@dustymc
Contributor Author

dustymc commented Sep 1, 2022

AWG discussed, consensus is to move in a spatial direction, it is understood that doing so will require a lot of conversion and cleanup.

I will be opening individual Issues to address specific problems as they're encountered.

@mkoo is working on munging island data into something I can use, which will allow disentangling lots of localities.

It was noted that Geography Shape Name provides a way to search by unasserted places; removing (island, continent, whatever) from "combo geography" does not affect discoverability.

At the moment I don't think any model adjustments are necessary, but how we view/use the model does need to change (and the documentation needs to reflect that). Essentially, I think that means asserting only what's really required instead of filling in all of the blanks; adding continent to Russia just reduces accessibility, so don't.

@dustymc
Copy link
Contributor Author

dustymc commented Oct 6, 2022

Merge-->#5138

@dustymc dustymc closed this as completed Oct 6, 2022