wmgeolab / geoBoundaries

geoBoundaries : A Political Administrative Boundaries Dataset (www.geoboundaries.org)

Documentation/bug related to ADM0 simplification #2326

Closed chrowe closed 2 years ago

chrowe commented 2 years ago

Describe the Error

I would expect ADM0 and ADM1 to have the same country boundaries, but this does not seem to be the case (at least for IND and USA, which I tested). It looks like the main ADM0 boundaries are the same as the simplified ones. Maybe this is intentional, but if so it would be good to have some documentation as to why.

Screenshots

image

geoBoundaries-comparison.pdf

link (doesn't seem to maintain the settings I select)

Additional context

I was also comparing the files on GitHub.

chrowe commented 2 years ago

For contrast I was looking at https://dataverse.harvard.edu/file.xhtml?fileId=5614858&version=2.0 and these are much higher resolution.

DanRunfola commented 2 years ago

This is a known issue in the database - i.e., we don't have hierarchical matches across layers. This has been an ongoing conversation for some time, and on our list of activities to tackle in the future. We did a prioritization a year or two ago based on user feedback, and ended up pushing for a more uniform license (CC-BY) before tackling the hierarchical standardization issue.

In the interim, it's very helpful to have cases where the hierarchy is not standardized raised as issues, as that gives us a future roadmap of problems to tackle.

chrowe commented 2 years ago

Is there any public documentation on the hierarchical matches across layers issue? Is this due to actual differences or just due to the source data and/or simplifications?

Also, is there any documentation on the difference between what can be found on GitHub vs. Harvard Dataverse? For example, the two files below seem to be quite different, and since they are both ADM0 I would expect them to be very similar if not identical. E.g., the shapefiles are 34 KB and 3.9 MB respectively.

https://github.com/wmgeolab/geoBoundaries/blob/main/releaseData/gbOpen/IND/ADM0/geoBoundaries-IND-ADM0-all.zip https://dataverse.harvard.edu/file.xhtml?fileId=5614858&version=2.0

DanRunfola commented 2 years ago

Hi Chris,

On your first question - source data. We essentially don't standardize to any ADM level right now, and are just taking the best we can find for each administrative layer independently.

On your second, Harvard Dataverse has the most recent data as of the upload date; we just did a new upload there last year. So any changes made in our database on GitHub after that upload (which I think was circa Nov?) would not yet be reflected on the Harvard Dataverse. The version is the best way to tell the difference there - i.e., we have 4.0 on the Dataverse, which is located on GitHub under our releases (https://github.com/wmgeolab/geoBoundaries/releases/tag/v4.0.0). That 4.0.0 should match the files available on the Harvard Dataverse nearly identically, with a few minor exceptions due to a lag between when we actually did the upload and when 4.0.0 came out. So, imperfect but close.

In the near future we hope to actually hook into the Harvard Dataverse API to get this all working properly (just listed a job for someone to help with this full time), so that disconnect should be nearly entirely corrected. I also hope to move to a monthly release cadence, with pushes to the dataverse and HDX monthly to ensure everything is always lined up. Just not quite there yet.

chrowe commented 2 years ago

From what I am seeing, these are not close at all. It almost looks like whatever you have on GitHub is the simplified version. I'm not sure if there is a better way to explain or show the issue I am seeing. It seems like something different than the general issue you are describing.

chrowe commented 2 years ago
image
DanRunfola commented 2 years ago

Well, let's re-open this one. That is definitely simplified, or something similar, but I don't have an immediate answer as to why or how that's possible.

Was the file you retrieved from the Dataverse marked as 4.0.* or something similar?

chrowe commented 2 years ago

I just downloaded the zip files from my previous post.

But it is the GitHub version that is simplified.

chrowe commented 2 years ago

It looks like the source file is simplified as well https://github.com/wmgeolab/geoBoundaries/blob/main/sourceData/gbOpen/IND_ADM0.zip

chrowe commented 2 years ago

In the metadata, Dataverse says boundarySourceURL : https//geonode.pathwaysdata.com/layers/geonode, but GitHub says boundarySourceURL : https//commons.wikimedia.org/wiki/File
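A mismatch like this can be spotted by diffing the metadata of the two downloads field by field. The following is only a minimal sketch: the helper `diff_metadata` is hypothetical, the dictionaries are typed in by hand rather than read from the actual metadata files, and the `boundaryISO` and `boundaryType` fields are assumed for illustration. The `boundarySourceURL` values are copied as quoted above, truncation included.

```python
def diff_metadata(left: dict, right: dict) -> dict:
    """Return {field: (left_value, right_value)} for every field that differs."""
    fields = sorted(set(left) | set(right))
    return {
        field: (left.get(field), right.get(field))
        for field in fields
        if left.get(field) != right.get(field)
    }


# Hand-typed stand-ins for the two metadata records (assumed field set).
dataverse_meta = {
    "boundaryISO": "IND",
    "boundaryType": "ADM0",
    "boundarySourceURL": "https//geonode.pathwaysdata.com/layers/geonode",
}
github_meta = {
    "boundaryISO": "IND",
    "boundaryType": "ADM0",
    "boundarySourceURL": "https//commons.wikimedia.org/wiki/File",
}

differences = diff_metadata(dataverse_meta, github_meta)
for field, (dataverse_value, github_value) in differences.items():
    print(f"{field}: Dataverse={dataverse_value!r} GitHub={github_value!r}")
```

Only fields whose values differ are reported, so a differing source URL stands out even when the ISO code and admin level agree.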

DanRunfola commented 2 years ago

Oi, ok, I finally tracked this down.

We uploaded geoBoundaries to the Harvard Dataverse at the end of last year (circa December of 2021). In March of 2022, we then updated the ADM0 boundary layer of India because the licensing of the old layer (ODbL) was not compatible with CC-BY.

The old file is still here - https://github.com/wmgeolab/geoBoundaries/blob/a02161c47e471363c6899f4727afc1c2ab0dd612/sourceData/gbOpen/IND_ADM0.zip

So, we ended up losing a better layer because of the licensing issue. I don't really like that, but as we move towards CC-BY only (we don't want share-alike, which is the big challenge of ODbL), we're constrained somewhat.

The followup here is whether we can identify an ADM0 for IND that is both CC-BY compliant and has the higher resolution we'd like.

leeberryman commented 2 years ago

@DanRunfola I apologize. IND was my first contribution, and I was still feeling out this community and how it all works, so we submitted our IND data under ODbL. We could upload more granular data and release it under CC-BY 4.0. We can help with this, but I really think @justinelliotmeyers has the best data for IND; I don't know if it's ready to be contributed here yet.

maxmalynowsky commented 2 years ago

I see that there were a few issues talked about in this thread which I've done some work around.

First, I've documented and implemented hierarchical matches across layers, since I needed that feature when using geoBoundaries as a data source in my own project at FieldMaps. I started by downloading the entire gbOpen 4.0 dataset and loading each set of ADM boundaries together in QGIS. For each country, I checked whether there were major topology issues that prevented matching, and documented the maximum usable admin levels here: https://github.com/fieldmaps/admin-boundaries/blob/main/inputs/meta.csv.

In that sheet, geoboundaries_lvl refers to the admin level I ended up using, which for the vast majority of cases was all of them. If there was an issue that prevented me from using higher levels because of geometry errors or overly disjoint overlaps like for Switzerland, I made note of the higher unused levels in geoboundaries_lvl_max, which effectively becomes a list of datasets to look into.

For the implementation itself, I have a Python / PostGIS pipeline that matches each lower ADM polygon with a higher one based on largest overlapping area. I then re-create all the higher admin levels through dissolving by attribute values, so an ADM0 and an ADM3 both share the exact same outlines, fixing the issues described above regarding USA and IND in particular. I have the outputs of this pipeline publicly available here: https://fieldmaps.io/data/geoboundaries. I'm not sure what the best way of feeding this back into geoBoundaries would be, but feel free to take advantage of this work I've already done. It'll be easy for me to re-run this with 5.0 data when that comes out too.
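The matching and dissolving steps described above can be sketched in a few lines. This is not the FieldMaps pipeline itself (which uses Python and PostGIS on real polygons); it is a toy illustration that assumes each boundary has been rasterized into a set of grid cells, so overlap area becomes a cell count and dissolving becomes a set union. All names (`largest_overlap_parent`, `dissolve_by_parent`, `block`) and the sample data are hypothetical.

```python
from collections import defaultdict


def largest_overlap_parent(child_cells, parents):
    """Return the ID of the parent sharing the most cells with the child, or None."""
    best_id, best_area = None, 0
    for parent_id, parent_cells in parents.items():
        area = len(child_cells & parent_cells)  # overlap area as a cell count
        if area > best_area:
            best_id, best_area = parent_id, area
    return best_id


def dissolve_by_parent(children, parents):
    """Rebuild each parent as the union of the children matched to it."""
    dissolved = defaultdict(set)
    for cells in children.values():
        parent_id = largest_overlap_parent(cells, parents)
        if parent_id is not None:  # children overlapping no parent are skipped
            dissolved[parent_id] |= cells
    return dict(dissolved)


def block(x0, y0, x1, y1):
    """Helper: the set of grid cells covering x0 <= x < x1 and y0 <= y < y1."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Toy data: two ADM0 units and three ADM1 units whose edges do not line up.
adm0 = {"A": block(0, 0, 4, 4), "B": block(4, 0, 8, 4)}
adm1 = {
    "A-1": block(0, 0, 2, 4),
    "A-2": block(2, 0, 5, 4),  # spills one column of cells into B
    "B-1": block(5, 0, 8, 4),
}

rebuilt_adm0 = dissolve_by_parent(adm1, adm0)
for unit_id, cells in sorted(rebuilt_adm0.items()):
    print(unit_id, len(cells))
```

In this toy case, unit A-2 mostly overlaps A, so it is assigned to A, and the rebuilt A takes over the column that originally belonged to B. The rebuilt ADM0 outlines are therefore exactly the union of the ADM1 outlines, which is the property the dissolve step is meant to guarantee.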

If you do eventually go the route of hierarchy matching and dissolving to get lower levels for gbOpen, I don't think ADM0s would be needed at all, since they could all be derived. Also, for the IND dataset, I saw that it had some geometry errors that are very difficult to fix, with only ADM1 able to be matched using the hierarchy pipeline I described above, and I made some notes on alternatives here: https://github.com/wmgeolab/geoBoundaries/issues/2354.

DanRunfola commented 2 years ago

In an attempt to clear out issues, I am going to close this but with a few followups.

First, we have a followup on India here: https://github.com/wmgeolab/geoBoundaries/issues/2354. Second, I want to start a discussion on hierarchy here: https://github.com/wmgeolab/geoBoundaries/discussions/2483. And third, I created a new issue to look into doing a better job of syncing with the Dataverse here: https://github.com/wmgeolab/geoBoundaries/issues/2485