wmgeolab / geoBoundaries

geoBoundaries : A Political Administrative Boundaries Dataset (www.geoboundaries.org)
http://www.geoboundaries.org
Other
282 stars 49 forks source link

[BOUNDARY ERRATA] Invalid GeoJSONs from v4 build #1735

Closed sgoodm closed 3 years ago

sgoodm commented 3 years ago

Several GeoJSONs from v4 data are broken: JPN_ADM0, NOR_ADM0, NZL_ADM0, PHL_ADM1

Seems to be tied to the format of the MultiPolygon geometries - I believe the Polygons within the MultiPolygon in the broken GeoJSONs are wrapped in one too many lists.

When reading the GeoJSONs using spatial software (fiona, geopandas, QGIS) it will fail due to the bad geometry, but you can read the file as raw JSON to debug.

p2 is the JSON object in the below Python code/output

Trying to recreate a MultiPolygon from the GeoJSON geometry field fails

>>> MultiPolygon(p2['features'][0]['geometry'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/userv/anaconda3/envs/deval/lib/python3.9/site-packages/shapely/geometry/multipolygon.py", line 62, in __init__
    self._geom, self._ndim = geos_multipolygon_from_polygons(polygons)
  File "/home/userv/anaconda3/envs/deval/lib/python3.9/site-packages/shapely/geometry/multipolygon.py", line 185, in geos_multipolygon_from_polygons
    assert N == 2 or N == 3
AssertionError

And if you try to build a MultiPolygon from the underlying elements it also fails:

>>> MultiPolygon([Polygon(p2['features'][0]['geometry']['coordinates'][i]) for i in range(len(p2['features'][0]['geometry']['coordinates']))]).is_valid
Traceback (most recent call last):
  File "shapely/speedups/_speedups.pyx", line 247, in shapely.speedups._speedups.geos_linearring_from_py
AttributeError: 'list' object has no attribute '__array_interface__'

Yet if we remove the outermost list of each element it works:

>>> MultiPolygon([Polygon(p2['features'][0]['geometry']['coordinates'][i][0]) for i in range(len(p2['features'][0]['geometry']['coordinates']))]).is_valid
True

Likely the most relevant line of code: https://github.com/wmgeolab/geoBoundaryBot/blob/main/gbBuild.py#L414

However it is not clear why this is not impacting more boundaries. That likely depends more on the source data and other boundary prep that I am not familiar with.

Additional symptom of this is also that metadata spatial stats are broken. See the nan values below, as well as the unit count being 0 (yet still showing non-zero vertices count oddly).

{"boundaryID": "JPN-ADM0-39117424", "boundaryName": "Japan", "boundaryISO": "JPN", "boundaryYearRepresented": "2017.0", "boundaryType": "ADM0", "boundaryCanonical": "Nippon-koku", "boundarySource-1": "OpenStreetMap", "boundarySource-2": "Wambacher", "boundaryLicense": "Open Data Commons Open Database License 1.0", "licenseDetail": "Open Data Commons Open Database License 1.0", "licenseSource": "www.openstreetmap.org/copyright", "boundarySourceURL": "wambachers-osm.website/boundaries/", "sourceDataUpdateDate": "Jul 27, 2021", "buildUpdateDate": "Aug 26, 2021", "Continent": "Asia", "UNSDG-region": "Eastern and South-Eastern Asia", "UNSDG-subregion": "Eastern Asia", "worldBankIncomeGroup": "High-income Countries", "apiURL": "https://www.geoboundaries.org/api/gbID/JPN-ADM0-39117424", "admUnitCount": "0", "meanVertices": "1688103.0", "minVertices": "1688103", "maxVertices": "1688103", "meanPerimeterLengthKM": "nan", "minPerimeterLengthKM": "nan", "maxPerimeterLengthKM": "nan", "meanAreaSqKM": "nan", "minAreaSqKM": "nan", "maxAreaSqKM": "nan", "downloadURL": "https://github.com/wmgeolab/geoBoundaries/raw/e5bd8b2f946574fdfb4d5b948ebc6ff90e359bb3/releaseData/gbOpen/JPN/ADM0/geoBoundaries-JPN-ADM0-all.zip", "gjDownloadURL": "https://github.com/wmgeolab/geoBoundaries/raw/e5bd8b2f946574fdfb4d5b948ebc6ff90e359bb3/releaseData/gbOpen/JPN/ADM0/geoBoundaries-JPN-ADM0.geojson", "imagePreview": "https://github.com/wmgeolab/geoBoundaries/raw/e5bd8b2f946574fdfb4d5b948ebc6ff90e359bb3/releaseData/gbOpen/JPN/ADM0/geoBoundaries-JPN-ADM0-PREVIEW.png", "simplifiedGeometryGeoJSON": "https://github.com/wmgeolab/geoBoundaries/raw/e5bd8b2f946574fdfb4d5b948ebc6ff90e359bb3/releaseData/gbOpen/JPN/ADM0/geoBoundaries-JPN-ADM0_simplified.geojson"}
sgoodm commented 3 years ago

Not sure if the MultiPolygon info above is actually driving this issue. A check with ogrinfo results in:

ERROR 1: GeoJSON object too complex, please see the OGR_GEOJSON_MAX_OBJ_SIZE environment option

sgoodm commented 3 years ago

See https://github.com/Toblerity/Fiona/issues/990

sgoodm commented 3 years ago

The following code works, indicating the file is in fact valid, but requires adjusting a common default environment variable in most applications.

>>> with fiona.Env(OGR_GEOJSON_MAX_OBJ_SIZE="2000MB"):
...     p4 = gpd.read_file('/home/userv/Downloads/geoBoundaries-JPN-ADM0-all/geoBoundaries-JPN-ADM0.geojson')
... 
>>> p4
  shapeName Level shapeISO                     shapeID shapeGroup shapeType                                           geometry
0     Japan  ADM0      JPN  JPN-ADM0-39117424B90657763        JPN      ADM0  MULTIPOLYGON (((123.78700 24.07181, 123.78694 ...
sgoodm commented 3 years ago

@DanRunfola I suggest just adding a note about dealing with this issue when using GeoJSONs. It doesn't seem to be relevant for other file formats, but will impact most folks using these few boundaries as GeoJSONs.

sgoodm commented 3 years ago

Example snippet:


import fiona
import geopandas as gpd 

path  = '/home/userv/Downloads/geoBoundaries-JPN-ADM0-all/geoBoundaries-JPN-ADM0.geojson'

# fails (silently - still shows correct bounds using fiona but has no features. With geopandas the bounds are nans)

with fiona.open(path, 'r') as vector:
    print(vector.bounds)
    print(len(vector))

with fiona.Env():
    vector = gpd.read_file(path)
    print(vector.total_bounds)
    print(len(vector))

# succeeds

with fiona.Env(OGR_GEOJSON_MAX_OBJ_SIZE="100000MB"):
    with fiona.open(path, 'r') as vector:
        print(vector.bounds)
        print(len(vector))

with fiona.Env(OGR_GEOJSON_MAX_OBJ_SIZE="2000MB"):
    vector = gpd.read_file(path)
    print(vector.total_bounds)
    print(len(vector))

May be worth creating an issue with fiona as this probably should not fail silently.

DanRunfola commented 3 years ago

Closing this with an addition to our wiki: https://github.com/wmgeolab/geoBoundaries/wiki/1.-Technical-Usage-Notes#technical-challenges