wmgeolab / geoBoundaries

geoBoundaries : A Political Administrative Boundaries Dataset (www.geoboundaries.org)

[FEATURE REQUEST] Persistent administrative entity identifiers #3672

Open jacobwhall opened 8 months ago

jacobwhall commented 8 months ago

TL;DR: I would like to associate geoBoundaries data with other datasets. This is difficult to do because geoBoundaries does not persist boundary identifiers across versions. I suggest that geoBoundaries introduce persistent identifiers for administrative entities.

The 2020 geoBoundaries paper states in its opening paragraph:

The database is standardized using ISO 3166-1 alpha-3 encoding, and every boundary has a globally unique ID, allowing for integration with large-scale computational workflows.

The "globally unique ID" for each shape is described in this table:

The boundary ID, followed by the letter ‘B’ and a unique integer for each shape which is a member of that boundary.

...which glosses over the volatility of shape identifiers:

| geoBoundaries version | shapeID for Richmond, VA |
| --- | --- |
| v3.0.0 | USA-ADM2-3_0_0-B672 |
| v4.0.0 | USA-ADM2-92793851B43358342 |
| v5.0.0 | 52423323B78509502983349 |
| v6.0.0 | 52423323B61032845323419 |

There is no (documented) way to reliably link geoBoundaries entities with data from other sources, or even with those from previous versions of geoBoundaries. Matching boundaries based on shapeName is bound to run into difficulties with regard to formatting, language differences, and formal name changes.

I will continue using Richmond as an example. There are many databases that catalog administrative entities; here are a few:

Many more are listed on Richmond's Wikidata item page, which itself has the permanent reference Q43421.

If Richmond annexes more of Chesterfield, its identifier in the above databases is unlikely to change. I understand that many boundaries tracked in geoBoundaries are ever-changing, yet there remains a need for persistent identifiers. Persistent identifiers would make it much easier to link shapes in geoBoundaries with the corresponding entities in the above databases.

I believe there are two options for accomplishing this:

  1. Create new, persistent identifiers for each administrative boundary geoBoundaries tracks for external datasets to reference
  2. Reference an external dataset's identifiers in the metadata of each boundary in geoBoundaries

The first option might be the easiest to implement. The persistent identifiers could be added to Wikidata for example, enabling cross-dataset queries. This would allow for complex metadata to be associated with geoBoundaries.
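To illustrate what either option would enable, here is a minimal sketch of a crosswalk that maps the four version-specific shapeIDs from the table above to one stable identifier, using Richmond's Wikidata item Q43421. The `CROSSWALK` dictionary and the `same_entity` helper are hypothetical names for illustration; geoBoundaries does not currently publish such a mapping.

```python
# Hypothetical crosswalk: version-specific shapeID -> stable external identifier.
# The shapeIDs and the Wikidata item Q43421 are the ones quoted in this issue.
CROSSWALK = {
    "USA-ADM2-3_0_0-B672": "Q43421",         # v3.0.0
    "USA-ADM2-92793851B43358342": "Q43421",  # v4.0.0
    "52423323B78509502983349": "Q43421",     # v5.0.0
    "52423323B61032845323419": "Q43421",     # v6.0.0
}


def same_entity(shape_id_a, shape_id_b, crosswalk=CROSSWALK):
    """Return True when both shapeIDs resolve to the same stable identifier."""
    a = crosswalk.get(shape_id_a)
    b = crosswalk.get(shape_id_b)
    # Unknown shapeIDs never match, even against each other.
    return a is not None and a == b


# Shapes from v3.0.0 and v6.0.0 can be recognized as the same entity.
print(same_entity("USA-ADM2-3_0_0-B672", "52423323B61032845323419"))
```

With a persistent identifier stored alongside each shape, a lookup like this would replace name-based matching when comparing releases or joining to outside data.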

Thank you for your consideration!

DanRunfola commented 8 months ago

This is a really hard problem, because we want to ensure that unique geometries have unique codes - i.e., if you have the same geoboundaries ID, then you should be able to assume that the underlying geometry has not changed. Today, we actually hash the geometry itself to make the code, which is why you see changes - a change in our ID means that the geometry has changed. The problem here is that, of course, most changes are fairly small, which really just results in an ID system that is highly unstable (which is also undesirable).
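As a rough illustration of why a geometry hash is so volatile, the sketch below hashes a serialized coordinate ring. This is not the actual geoBoundaries hashing scheme: the SHA-256 digest, the coordinate serialization, and the `geometry_hash_id` name are all assumptions made for the example.

```python
import hashlib


def geometry_hash_id(coords, precision=6):
    """Return a short hex digest derived from a ring of (lon, lat) pairs."""
    # Format every coordinate at a fixed precision so the serialized text,
    # and therefore the hash, is deterministic for identical geometry.
    text = ";".join(f"{lon:.{precision}f},{lat:.{precision}f}" for lon, lat in coords)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
# Same square with one vertex moved by a millionth of a degree.
nudged = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.000001), (0.0, 1.0), (0.0, 0.0)]

print(geometry_hash_id(square) == geometry_hash_id(square))  # identical geometry
print(geometry_hash_id(square) == geometry_hash_id(nudged))  # tiny edit
```

The first comparison holds and the second does not: any edit that survives rounding yields an unrelated ID, which is the instability described above.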

We've discussed this a bunch with a range of actors, and what we're currently thinking is something like (lots of details that need to be figured out):

1) Create a grid across the globe for each administrative level, at a resolution fine enough that it guarantees no two administrative units would overlap in the grid their centroid falls into (possibly a dynamic resolution implementation, where we start coarse and split as needed).
2) Identify what grid cell the centroid of a given unit falls into.
3) Create a persistent ID based on the combination of (A) the ISO code, (B) the ADM level, and (C) the grid ID. So an identifier would be something like "USA-ADM2-209948". The only case in which that would change is if the geometry changes enough that the grid cell its centroid falls into changes, which would hopefully be a valid reason to change things up.
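The three steps above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: a fixed 0.5-degree grid instead of the dynamic resolution mentioned, a planar centroid computed directly on lon/lat coordinates, and a row-major cell number. The function names and the example ring are hypothetical, not part of geoBoundaries.

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a closed ring (planar shoelace formula)."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross              # accumulates twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def grid_cell(lon, lat, cell_deg=0.5):
    """Row-major index of the fixed-size grid cell containing (lon, lat)."""
    cols = int(round(360 / cell_deg))
    rows = int(round(180 / cell_deg))
    # Clamp so lon=180 or lat=90 land in the last column or row.
    col = min(int((lon + 180.0) // cell_deg), cols - 1)
    row = min(int((lat + 90.0) // cell_deg), rows - 1)
    return row * cols + col


def persistent_id(iso3, adm_level, ring, cell_deg=0.5):
    """Combine ISO code, ADM level, and the centroid's grid cell into an ID."""
    lon, lat = polygon_centroid(ring)
    return f"{iso3}-ADM{adm_level}-{grid_cell(lon, lat, cell_deg)}"


# Hypothetical unit: a small box, a slightly edited copy, and a shifted copy.
ring = [(-77.6, 37.4), (-77.3, 37.4), (-77.3, 37.7), (-77.6, 37.7), (-77.6, 37.4)]
edited = [(-77.6, 37.4), (-77.3, 37.4), (-77.29, 37.71), (-77.6, 37.7), (-77.6, 37.4)]
moved = [(lon + 1.0, lat + 1.0) for lon, lat in ring]

print(persistent_id("USA", 2, ring) == persistent_id("USA", 2, edited))  # small edit
print(persistent_id("USA", 2, ring) == persistent_id("USA", 2, moved))   # large shift
```

In this sketch the small edit keeps the same ID while the one-degree shift produces a new one. A fixed grid does not guarantee one unit per cell, and a centroid sitting near a cell edge can still flip cells after a small edit, which is where the finer or dynamic resolution would come in.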

This would also allow us to provide a geometric-based join to other cases (i.e., UN SALB or P-Codes from OCHA) through a similar matching process to their datasets.
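One way such a geometric join might look is to pair records from two datasets whose centroids fall in the same grid cell. The sketch below assumes a fixed 0.5-degree grid; the unit names, the `EXT-…` codes, and the coordinates are made-up placeholders, not real UN SALB or OCHA P-Code values.

```python
def cell_of(lon, lat, cell_deg=0.5):
    """Return the (row, col) grid cell containing a centroid."""
    return (int((lat + 90.0) // cell_deg), int((lon + 180.0) // cell_deg))


def join_by_cell(ours, theirs, cell_deg=0.5):
    """Pair records from two datasets whose centroids share a grid cell."""
    # Index the external dataset by grid cell first.
    index = {}
    for ext_id, (lon, lat) in theirs.items():
        index.setdefault(cell_of(lon, lat, cell_deg), []).append(ext_id)
    # Look up each of our units in that index; no match gives an empty list.
    return {
        our_id: index.get(cell_of(lon, lat, cell_deg), [])
        for our_id, (lon, lat) in ours.items()
    }


# Placeholder centroids as (lon, lat); none of these are real records.
ours = {"unit-a": (-77.45, 37.55), "unit-b": (-76.45, 38.55)}
theirs = {"EXT-001": (-77.44, 37.56), "EXT-002": (10.2, 50.3)}

print(join_by_cell(ours, theirs))
```

Here `unit-a` pairs with `EXT-001` and `unit-b` finds no match. Two centroids on opposite sides of a cell edge would be missed by an exact-cell lookup, so a real join would probably also need to check neighbouring cells.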

Basically: a "coarser resolution" version of what we do now, which would result in more stability at the cost of IDs not changing with every geometric shift.

Edit: Also, keep in mind that for much of the world we do not have place-names (or they are highly uncertain / unstable). So the ID has to be generated without text-based metadata, which is where the challenge comes in.

This sounds like a good dissertation chapter, by the way :)

jacobwhall commented 8 months ago

Thanks for your response @DanRunfola

we want to ensure that unique geometries have unique codes - i.e., if you have the same geoboundaries ID, then you should be able to assume that the underlying geometry has not changed

This is an excellent idea, and geoBoundaries should continue to create geometry-specific identifiers. If a shape changes even a little bit, I think it is valuable for data consumers to see this change reflected in that shape's identifier.

Create a persistent ID based on the combination of (A) the ISO code, (B) the ADM level, and (C) the grid ID. So an identifier would be something like "USA-ADM2-209948". The only case in which that would change is if the geometry changes enough that the grid cell its centroid falls into changes, which would hopefully be a valid reason to change things up.

This would also allow us to provide a geometric-based join to other cases (i.e., UN SALB or P-Codes from OCHA) through a similar matching process to their datasets.

Basically: a "coarser resolution" version of what we do now, which would result in more stability at the cost of IDs not changing with every geometric shift.

I think this approach could work as a supplement to the geometry hashes (or UUIDs) you established a need for above. This would provide sufficient persistence, making it worth the time to join geoBoundaries with a dataset like Wikidata. However, I wonder why a geometry-based approach is the best solution here. Why would geoBoundaries avoid direct relations with well-known administrative entity identifiers such as those I listed above?

Also, keep in mind that for much of the world we do not have place-names (or they are highly uncertain / unstable). So the ID has to be generated without text-based metadata, which is where the challenge comes in.

I understand that it may not always be possible to provide an administrative entity identifier. However, I suspect a vast majority of geoBoundaries data could be directly linked to existing entries in the databases I listed above. Am I underestimating how many places have uncertain names?

This sounds like a good dissertation chapter, by the way :)

Haha, I'd be happy to write a paper on this.