wmgeolab / geoBoundaries

geoBoundaries : A Political Administrative Boundaries Dataset (www.geoboundaries.org)

[FEATURE REQUEST] Add Population Statistics #2551

Closed DanRunfola closed 1 year ago

DanRunfola commented 2 years ago

An interesting request came through to add population statistics to our boundaries - this should be very achievable, and something we could integrate into a second part of our build process fairly easily (i.e., generate a *.csv that joins cleanly as a second step after boundary creation).

https://sedac.ciesin.columbia.edu/data/collection/gpw-v4/population-estimation-service would be an easy-to-use API that would enable this.
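For reference, a rough sketch of what that second build step could look like, using zonal stats over a population-count raster and writing a CSV keyed on shapeID. The file names, raster, and column layout below are placeholders, not the actual pipeline:

```python
# Hypothetical second build step: join a population column onto existing boundaries.
# Assumes a population-count raster (e.g., GPW v4 or WorldPop) is available locally;
# paths, file names, and the output CSV layout are illustrative only.
import csv

import geopandas as gpd
from rasterstats import zonal_stats

BOUNDARY_FILE = "geoBoundaries-USA-ADM2.geojson"          # assumed input
POPULATION_RASTER = "gpw_v4_population_count_2020.tif"    # assumed raster

boundaries = gpd.read_file(BOUNDARY_FILE)

# Sum population pixels falling inside each polygon.
stats = zonal_stats(boundaries, POPULATION_RASTER, stats=["sum"])

with open("population_by_shape.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["shapeID", "population_sum"])
    for row, stat in zip(boundaries.itertuples(), stats):
        writer.writerow([row.shapeID, stat["sum"]])
```

The resulting CSV would join back onto any boundary release on shapeID, keeping the population step fully decoupled from boundary creation.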

leeberryman commented 1 year ago

@DanRunfola our team has done a lot with this. We have been using WorldPop data to build population pyramids from the 2020 UN-adjusted estimates.

DanRunfola commented 1 year ago

@leeberryman do you know what you've seen in terms of memory footprint requirements / time to run the zonal stats operations?

I'm mostly concerned about being able to do this with cloud-based infrastructure. Right now the only thing we have to do offline is the CGAZ build, and it lags way behind because of that.

leeberryman commented 1 year ago

@DanRunfola I have it running inline in CI/CD on Azure AKS with FastAPI, so I can play around with large shapes like RUS ADM0 and come back with a good idea of the minimum requirements. Our prototyping was all done on a server with a big GPU and lots of RAM, but I think we have a pretty efficient process now.
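Not our actual service, but a minimal sketch of what a FastAPI wrapper around a zonal-stats call could look like; the route, request shape, and raster path are assumptions for illustration:

```python
# Hypothetical FastAPI endpoint wrapping a zonal-stats call, along the lines
# described above; route name, payload shape, and raster path are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
from rasterstats import zonal_stats

app = FastAPI()

POPULATION_RASTER = "worldpop_2020_un_adjusted.tif"  # assumed local or /vsicurl/ path


class PolygonRequest(BaseModel):
    # A single GeoJSON geometry (e.g., the Polygon/MultiPolygon of one admin unit).
    geometry: dict


@app.post("/population")
def population(req: PolygonRequest):
    # Sum the population pixels that fall inside the submitted geometry.
    stats = zonal_stats(req.geometry, POPULATION_RASTER, stats=["sum"])
    return {"population_sum": stats[0]["sum"]}
```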

DanRunfola commented 1 year ago

Just for context, if I recall correctly the GitHub / Azure runners we're using now cap out at 8 GB of memory.

Not to say we couldn't use other runners, but I prefer to keep everything in one pipe for sustainability purposes. I'm vaguely considering moving the runners themselves in-house one day soon to give more flexibility, but that will require some pretty serious on-prem engineering before I can click go.

leeberryman commented 1 year ago

@DanRunfola thanks for the context. My two node sizes in Kubernetes right now are 4 GB and 8 GB of RAM, so I'll play around with specs within those constraints and get back to you on area thresholds and time to process that would be comparable to GitHub Actions.

alex-translation commented 1 year ago

Hi @DanRunfola, I've worked with @leeberryman for the last year on population, and I think 8 GB should be more than adequate. The only time we ran into memory issues was when we tried to load all of RUS ADM0; everything else worked just fine.

I don't remember the memory requirements we had, but we estimated we could get population for the ~4.5M polygons we had (every ISO/admin combination) in about a week. We used some fairly high-end hardware, purpose-built for data science. @leeberryman, do you remember the specifics?

leeberryman commented 1 year ago

@DanRunfola We have been working with GDAL on this most recently and it's become very fast. I got access to GitHub's public beta for large runners and am going to experiment with processing in GitHub Actions.

DanRunfola commented 1 year ago

In parallel to this, I am also working on replicating the gB build scripts on our local cluster (which would allow nightly builds / rasters on boxes with up to 512 GB :) ).

leeberryman commented 1 year ago

@DanRunfola we have taken WorldPop's rasters, converted them to COGs (Cloud-Optimized GeoTIFFs), and are hosting them in blob storage. This means downloading the data ahead of time isn't needed; you just point the zonal stats call at the URL and it does the analysis on the fly. It completes within seconds in almost every scenario.
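For anyone following along, this is roughly what that pattern looks like using GDAL's remote-file handling through rasterstats; the blob URL and input file below are placeholders, not the real locations:

```python
# Hypothetical zonal stats run directly against a COG in blob storage.
# GDAL reads only the HTTP byte ranges it needs, so no download step is required.
# The URL and boundary file below are placeholders, not the actual locations.
import geopandas as gpd
from rasterstats import zonal_stats

COG_URL = "/vsicurl/https://example.blob.core.windows.net/worldpop/ppp_2020_UNadj.tif"

boundaries = gpd.read_file("geoBoundaries-KEN-ADM1.geojson")  # assumed input
stats = zonal_stats(boundaries, COG_URL, stats=["sum"])

for row, stat in zip(boundaries.itertuples(), stats):
    print(row.shapeName, stat["sum"])
```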

DanRunfola commented 1 year ago

Where do you store them for access via GH Actions? Azure?

leeberryman commented 1 year ago

Yes, we are using Azure Blob Storage.