geography proposal: do whatever GADM (et al.) does #5076

Closed
dustymc opened this issue Sep 20, 2022 · 10 comments

@dustymc
Contributor

dustymc commented Sep 20, 2022

Refs #5063, #4928, #5022, etc, all aimed at solving/avoiding #4836

Let's elevate #5063 (comment) to an actual proposal:


If I get to pick I'd go with the "just use GADM (and such)" approach because I don't see anything else that looks viable. That would work out a lot like the Kuwait issue [in which the country was created without a continent].

  1. We as a community would decide
    • shall we use {source} as geography, and if so
    • how exactly would we map {source} to our geography (That wouldn't be much of a discussion for things like GADM - it's clearly a widely-accepted source, I can't imagine any possible reason we'd not want to use it, and the mapping is clean.)
  2. I would create geography (maybe preemptively) according to (1). (So GADM-based stuff would never have island because GADM doesn't include islands. Other sources would have other data/mappings.)
  3. Clean up, somehow move existing mishmash data to source-based spatial data (mostly my problem, but I'd need some guidance from time to time)
  4. WOOHOO problem solved!

There's no particular limit to the sources - eg @mkoo is working on island data (and I have no idea what it'll look like), the only real "rule" is that we would not try to mix-n-match. (A guideline might be that we try to stick to accepted sources when possible, but can see no real problems with creating our own if someone wants to make that investment.)

(Or maybe we CAN mix-n-match, but that seems to inevitably lead to inconsistency - eg, we can add a continent to US, Ohio but then it's structured differently than US, Hawaii. I suggest there is value in consistency, and there is very little - sometimes even negative - value in 'filling in the blanks.')

There would be many details to work out, I suggest we ignore them all for the moment and just decide if we can agree in principle to follow things like GADM as our "geographic authority" data.

Some examples of the data I have available right now, and how it might be mapped.

Continents - hopefully these wouldn't much be used for cataloging, but they can be used for things like spatial search - it's possible to find all the things that map to eg Africa no matter what the geography assertion might be. The mapping to geog_auth_rec would be to continent.


select geog_string from external_gis_data where source='fs_continents' limit 10;
  geog_string  
---------------
 Africa
 Australia
 North America
 Oceania
 South America
 Antarctica
 Europe
 Asia

Seas


arctosprod@arctos>> select geog_string from external_gis_data where source='iho_world_seas' limit 10;
   geog_string   
-----------------
 Gulf of Alaska
 Bering Sea
 Chukchi Sea
 Beaufort Sea
 Labrador Sea
 Hudson Strait
 Davis Strait
 Baffin Bay
 Lincoln Sea
 Bristol Channel

"Marine areas" (some of which are inland) - lots of overlap with seas which might result in eg "Beaufort Sea" (mapped to geog_auth_rec.sea) + "ARCTIC OCEAN|BEAUFORT SEA" (mapped to ocean + sea) existing (and they probably have slightly different shapes, some of these seem to be arbitrarily drawn by hand). Not ideal, but seems like something we can work with, one way or another. Note also that some of these do not seem to be 'natural' (regulatory areas, perhaps??) and mapping those to our model would require discussion.

arctosprod@arctos>> select geog_string from external_gis_data where source='seavox_areas' limit 10;
                                          geog_string                                          
-----------------------------------------------------------------------------------------------
 ATLANTIC OCEAN|BAY OF FUNDY|NORTH ATLANTIC OCEAN|NORTHWEST ATLANTIC OCEAN (40W)|GULF OF MAINE
 NORTH AMERICA MAINLAND|BAY OF QUINTE|LAURENTIAN GREAT LAKES|LAKE ONTARIO
 ARCTIC OCEAN|BEAUFORT SEA
 MEDITERRANEAN REGION|ADRIATIC SEA|MEDITERRANEAN SEA|MEDITERRANEAN SEA, EASTERN BASIN
 MEDITERRANEAN REGION|AEGEAN SEA|MEDITERRANEAN SEA|MEDITERRANEAN SEA, EASTERN BASIN
 MEDITERRANEAN REGION|ALBORAN SEA|MEDITERRANEAN SEA|MEDITERRANEAN SEA, WESTERN BASIN
 SOUTHERN OCEAN|AMUNDSEN SEA
 PACIFIC OCEAN|ANADYRSKIY ZALIV|NORTH PACIFIC OCEAN|NORTHWEST PACIFIC OCEAN (180W)|BERING SEA
 INDIAN OCEAN|ANDAMAN SEA
 INDIAN OCEAN|ARABIAN SEA

Parks - surely there's something better, but I do have data for some (very arbitrary) public lands. Note that it's just the park in these data, but as with continents, nonasserted geography could still be used to search. (That is, one could find some California things by searching 'Point Reyes NS' and all Point Reyes NS by searching California.)


arctosprod@arctos>> select geog_string from external_gis_data where source='naturalearthdata parks' limit 10;
        geog_string         
----------------------------
 Hawai'i Volcanoes NP
 Canaveral NS
 El Malpais NM
 Santa Monica Mountains NRA
 Channel Islands NP
 Point Reyes NS
 Redwood NP
 Yellowstone NP
 Olympic NP
 Mojave N PRES

EEZ+Land - these are a weird mishmash of sovereign (country), sorta-sovereign (Guam), and spatially convenient (Alaska) plus the (or bits of the??) associated EEZ. Mapping to geog_auth_rec would need to be discussed - https://arctos.database.museum/place.cfm?action=detail&geog_auth_rec_id=10016359 exists, and I would not want to defend it.

select geog_string from external_gis_data where source='eez_land_union' limit 10;
                                                                           geog_string                                                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Estonia|Estonia|Estonia|Union EEZ and country
 Samoa|Samoa|Samoa|Union EEZ and country
 Tokelau|Tokelau|New Zealand|Union EEZ and country
 Overlapping claim Qatar / Saudi Arabia / United Arab Emirates|Qatar|Qatar|Saudi Arabia|Saudi Arabia|United Arab Emirates|United Arab Emirates|Overlapping claim
 Cameroon|Cameroon|Cameroon|Union EEZ and country
 Finland|Finland|Finland|Union EEZ and country
 Bassas da India|Bassas da India|France|Union EEZ and country
 Faeroe|Faeroe|Denmark|Union EEZ and country
 Gilbert Islands|Gilbert Islands|Kiribati|Union EEZ and country
 Overlapping claim: Venezuela / Colombia / Dominican Republic|Colombia|Colombia|Dominican Republic|Dominican Republic|Venezuela|Venezuela|Overlapping claim

And last but certainly not least, GADM. I've imported this as three levels, which would map to country (gadm0), country+state_prov (gadm1), and country+state_prov+county (gadm2).



arctosprod@arctos>> select geog_string from external_gis_data where source='gadm0' limit 10;
  geog_string  
---------------
 Spain
 Indonesia
 Netherlands
 Colombia
 Côte d'Ivoire
 Morocco
 Peru
 Taiwan
 Philippines
 Haiti



arctosprod@arctos>> select SPLIT_PART(geog_string,'|',1) as country, SPLIT_PART(geog_string,'|',2) as state_prov from external_gis_data where source='gadm1' limit 10;
 country |         state_prov         
---------+----------------------------
 Spain   | Andalucía
 Spain   | Aragón
 Spain   | Cantabria
 Spain   | Castilla-La Mancha
 Spain   | Castilla y León
 Spain   | Cataluña
 Spain   | Ceuta y Melilla
 Spain   | Comunidad de Madrid
 Spain   | Comunidad Foral de Navarra
 Spain   | Comunidad Valenciana


arctosprod@arctos>> select SPLIT_PART(geog_string,'|',1) as country, SPLIT_PART(geog_string,'|',2) as state_prov , SPLIT_PART(geog_string,'|',3) as county from external_gis_data where source='gadm2' limit 10;
 country | state_prov | county  
---------+------------+---------
 Spain   | Andalucía  | Almería
 Spain   | Andalucía  | Cádiz
 Spain   | Andalucía  | Córdoba
 Spain   | Andalucía  | Granada
 Spain   | Andalucía  | Huelva
 Spain   | Andalucía  | Jaén
 Spain   | Andalucía  | Málaga
 Spain   | Andalucía  | Sevilla
 Spain   | Aragón     | Huesca
 Spain   | Aragón     | Teruel
(10 rows)
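
The three-level mapping shown in these queries can also be sketched outside the database. This is a minimal Python illustration, assuming only the pipe-delimited geog_string layout visible in the output above; the function name `split_gadm` and the dictionary keys are hypothetical and not part of Arctos.

```python
from typing import Optional

# Field names follow the geog_auth_rec columns mentioned in the thread.
GADM_FIELDS = ("country", "state_prov", "county")


def split_gadm(geog_string: str) -> dict[str, Optional[str]]:
    """Split a pipe-delimited GADM geog_string into country/state_prov/county.

    Levels that are absent (gadm0 and gadm1 rows) come back as None.
    """
    parts = [p.strip() for p in geog_string.split("|")]
    if len(parts) > len(GADM_FIELDS):
        raise ValueError(f"more than {len(GADM_FIELDS)} levels: {geog_string!r}")
    # Pad short rows so every result has all three keys
    padded = parts + [None] * (len(GADM_FIELDS) - len(parts))
    return {field: (value or None) for field, value in zip(GADM_FIELDS, padded)}


print(split_gadm("Spain|Andalucía|Almería"))
print(split_gadm("Spain|Aragón"))
print(split_gadm("Spain"))
```

Unlike SPLIT_PART, which returns an empty string for a missing part, this sketch returns None so that a missing level cannot be mistaken for an empty name.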

I believe @sharpphyl intends to discuss this at the next Arctos Office Hours, and I think at this point things like "this [ does | does not ] seem like a horrible idea" are very useful.

This looks like a workable idea to me, and that seems to be a hard bar to cross. I'm not sure I have an opinion beyond that, other than wanting to somehow end up in a situation where "geography authority" data all has a spatial representation (and I believe we all agreed to that in an AWG discussion).

Do note that there are no model changes proposed in here, simply guidelines for how we use (or, mostly, do not use) the long-existing model.

Alternative ideas which lead to spatial views of the world are of course most welcome.

Help!

@dustymc
Contributor Author

dustymc commented Sep 20, 2022

From @sharpphyl in #5063 (comment):

I did stumble over Natural Earth in looking at GADM. It includes coastlines, oceanic areas, lakes, etc. Would it contribute anything to the discussion?

If it has things that can be made into postgis::geography, it can at least be discussed. (Naturalearth tends to be fairly low-resolution, which doesn't necessarily mean it's not useful.)

@dustymc
Contributor Author

dustymc commented Sep 20, 2022

One additional consideration, illustrated nicely by the screenshot maps in #4857: The data strongly suggest that some significant portion of the time, users choosing geography see what they're looking for and click. The label says Russia, they click Russia, and for whatever reason fail to realize (care??) that there are in fact three Russia[+continent] options (two of which are appropriate for any given point, but not necessarily shape, in Russia - the odds are in their favor!). This leads to many thousands of situations where the geography data disagree with the locality - https://arctos.database.museum/guid/MVZ:Bird:46708 for instance: the geography claims Asia, the locality is clearly not in Asia, what should I believe?

I suspect users will be much more likely to choose what they intend in an environment where we're not filling in the blanks because they exist and therefore only have one "Russia."

@tucotuco

If you haven't yet considered GeoBoundaries as a source, have a look at it.

@dustymc
Contributor Author

dustymc commented Sep 21, 2022

Well I got it, but I feel like I'm missing something fundamental - there's no parentage. I can figure it out until I get to ADM2 (municipalities), and then I run into....

[Screenshot: Screen Shot 2022-09-20 at 5 07 09 PM]

seven things called 'Benito Juárez' because the Mexicans forgot to put a unique key on state.municipality_name. (Arctos claims there are 33 Washington Counties in the US, let no one think MX has something special going on.) I can sorta-usually-probably figure it out by comparing shapes, but that's a hell of a bug entry point. Surely I'm missing something?? (Maybe it's how I download - I grab the biggest possible chunks because I'm going to use primitive and clunky tools to stuff it into postgis, I think maybe everything would make sense if I use the API in some more iterative fashion, maybe from the real GIS system I don't have??)
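
One way to picture "figure it out by comparing shapes" is to test which first-level polygon contains a representative point of each second-level unit. The sketch below is an illustration only: it uses a plain ray-casting test on toy squares, every name and coordinate in it is made up, and with real data this would more likely be done as a spatial join in PostGIS.

```python
from collections import Counter

Point = tuple[float, float]
Polygon = list[Point]


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray-casting test: count edge crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through pt?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_parents(children, parents):
    """Pair each (name, point) child with the names of parents containing the point."""
    return [
        (name, [p_name for p_name, poly in parents.items() if point_in_polygon(pt, poly)])
        for name, pt in children
    ]


# Hypothetical toy data: two square "states" and three "municipalities"
states = {
    "State A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "State B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
municipalities = [
    ("Benito Juárez", (3.0, 4.0)),
    ("Benito Juárez", (15.0, 5.0)),
    ("Elsewhere", (30.0, 5.0)),
]

# Names that are not unique on their own and therefore need a parent
duplicates = [n for n, c in Counter(n for n, _ in municipalities).items() if c > 1]
print(duplicates)
for name, found in assign_parents(municipalities, states):
    print(name, found)
```

Any child that ends up with zero parents or more than one is exactly the kind of case that would still need a human to look at it.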

I grabbed MX because GADM has a lot of gaps around Oaxaca, geoBoundaries has so far had what I need, thanks @tucotuco!

@mkoo you have any magic to offer??

Also, maybe even if we decide to reject this proposal: geoBoundaries requires attribution. I can probably stuff something into remarks and make them happy, but that seems wrong on a few levels - suggest we consider some more-formal 'source of the shape' addition to the model.

@dustymc
Contributor Author

dustymc commented Sep 22, 2022

From #5059 (comment) - perhaps change the idea of this to "follow the pattern of" rather than outright "do what they do" - what some of they do would not be very usable in a system such as Arctos, and requires a little massaging.

"do what geoboundaries does" (or what they do that I can access with my primitive tools, maybe) just isn't going to work - along with the necessity to manually figure out which "state" a "county" goes with, they've dumped everything in one string so I end up with things like Morocco:Province de Khémisset إقليم الخميسات and I just don't think I'm prepared to dump that into county (and these switch direction in the middle which breaks most every UI I can find, most especially Excel). We are going to have to standardize to some extent; we can still avoid mishmashing, but it's not going to be quite as simple as I'd hoped.
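
Names that switch direction mid-string can at least be found mechanically before anyone decides what to do with them. A minimal sketch, assuming Python's standard `unicodedata` module and treating only the strong bidirectional classes (`L`, `R`, `AL`) as significant; the function name is hypothetical.

```python
import unicodedata

# Bidirectional classes of strong right-to-left and left-to-right characters
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def has_mixed_direction(text: str) -> bool:
    """True if text contains both strong LTR and strong RTL characters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & RTL_CLASSES) and bool(classes & LTR_CLASSES)


names = [
    "Morocco:Province de Khémisset إقليم الخميسات",
    "Spain|Andalucía|Almería",
]
for name in names:
    print(has_mixed_direction(name), name)
```

This only flags candidates; splitting such a string into separate Latin-script and Arabic-script names would still be a standardization decision.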

@dustymc
Contributor Author

dustymc commented Sep 22, 2022

In #5084 (comment) I proposed creating through gadm2 (county-level) by default. Having now been through much of the data for Vietnam, and enough bits and pieces of other places to suspect it's relatively representative, I'd like to retract that and propose the opposite for the following reasons:

There are at least dozens of VN wiki pages similar to https://en.wikipedia.org/wiki/H%E1%BB%93ng_Ng%E1%BB%B1_district:

On December 23, 2008, Hồng Ngự township, the communes of An Bình A, An Bình B, Bình Thạnh, Tân Hội and a portion of Thường Lạc commune were separated from the district to form the new district-level town of Hồng Ngự.

There are dozens more "stubs" - things that Wikipedia acknowledges to exist, but doesn't have much information on. We would not discover problems/changes to them.

GADM data tends to be a few years old and generally isn't qualified; it would take significant research to figure out what 'Hồng Ngự' might refer to. (Other sources have other quirks, but AFAIK nothing at all is much more than a shape with some sorta-ambiguous names attached.)

Nobody (perhaps except the local politician) seems to much care about these subdivisions. (If there's much information on Wikipedia, it almost always relates to cultural and historical aspects, often the city or feature after which the administrative unit was named, not the unit itself.)

At least in VN, the local divisions are physically small. Hồng Ngự district is about 80km^2 (vs. 2500 for Sacramento County as a point of reference).

I don't think there are significant functionality differences between geography Vietnam, Dong Thap, Hong Ngu + specific locality 'some description' and geography Vietnam, Dong Thap + specific locality 'Hong Ngu, some description.'

Essentially, I don't think we have the resources to adequately (much less properly!) manage 700+ second-level divisions of Vietnam, and I've come to believe that that pattern holds generally. I also don't think that pushing those data to locality has significant functional implications. I therefore propose to create GADM2-equivalent geography only when there is a particular reason and associated resources to do so: someone familiar with the area is willing to help manage geography, for example.

@sharpphyl your thoughts on this would be very appreciated.

@Jegelewicz
Member

@dustymc I haven't read this, but it sounds like it might be helpful?
Geographic Name Resolution Service: A tool for the standardization and indexing of world political division names, with applications to species distribution modeling

@sharpphyl

So GADM-based stuff would never have island because GADM doesn't include islands.

Can you clarify? I see the counties (which are islands) in Hawaii and Philippines such as Batangas.

GADM doesn't include islands that aren't an administrative subdivision (county etc.) of the larger area.

@dustymc
Contributor Author

dustymc commented Sep 26, 2022

might be helpful?

That's all strings so not very useful to me.

clarify

The part of GADM we're interested in involves

  • Country-level equivalents
  • State-level equivalents
  • County-level equivalents

I'm sure that lots of those coincide with all sorts of other things, but that isn't recognized by GADM and so won't be recognized in our data.

https://gadm.org/maps/USA/hawaii_2.html does not coincide with an island, but https://gadm.org/maps/USA/hawaii/hawaii.html does. (And the former probably coincides with an island group, but that's even more of a mess in our 'legacy' data and I'm wondering if it's ever useful or spatially represented in anything.)

https://gadm.org/maps/PHL/batangas.html also doesn't correspond to an island - there are (or were, I think I cleaned this one up) lots of "island is most of the state-like-thing" with the island also listed in our data, and a fair number of them have records which map to the smaller islands. From here so far:

  1. Following spatially-supported things is easy, and
  2. Filling in the blanks leads to spatially-conflicting data (probably because nobody - the creators or the users - has been thinking spatially, and they generally have little way of knowing those small islands exist)

Some of that's detectable (by humans, probably) in the data

select island,spec_locality
from locality
inner join geog_auth_rec on locality.geog_auth_rec_id=geog_auth_rec.geog_auth_rec_id
where island is not null and spec_locality ilike '%island%' order by island,spec_locality

finds (8685 rows)

Yapen Island | Irian Jaya, Japen Island, Ambai Island
Vancouver Island | La Perouse Banks, west of Vancouver Island
Sulawesi | Desa Peleng, Peleng Island
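
Rows like these can be pre-sorted before a human looks at them. The sketch below is a rough illustration rather than anything Arctos runs: it pulls "Something Island" phrases out of spec_locality with a simple regular expression (an assumption that misses non-English forms and names starting with non-ASCII capitals) and reports those that differ from the asserted island. The function names are hypothetical.

```python
import re

# Matches one or more capitalized words followed by "Island", e.g. "Peleng Island"
ISLAND_RE = re.compile(r"\b((?:[A-Z][\w'-]*\s)+Island)\b")


def normalize(name: str) -> str:
    """Lower-case a name and drop a trailing ' island' for comparison."""
    return name.casefold().removesuffix(" island").strip()


def other_islands(island: str, spec_locality: str) -> list[str]:
    """Islands mentioned in spec_locality that differ from the asserted island."""
    asserted = normalize(island)
    return [m for m in ISLAND_RE.findall(spec_locality) if normalize(m) != asserted]


rows = [
    ("Yapen Island", "Irian Jaya, Japen Island, Ambai Island"),
    ("Vancouver Island", "La Perouse Banks, west of Vancouver Island"),
    ("Sulawesi", "Desa Peleng, Peleng Island"),
]
for island, spec_locality in rows:
    print(island, "|", other_islands(island, spec_locality))
```

Flagged rows are only candidates: a plain string comparison cannot tell a spelling variant of the asserted island from a genuinely different, smaller island, so each hit still needs review.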

@dustymc
Contributor Author

dustymc commented Oct 17, 2022

Accepted as #5138

@dustymc dustymc closed this as completed Oct 17, 2022