Review/population parquet

zacharyDez commented 3 months ago

Creating a draft PR so we can discuss the population parquet dataset. Open questions noted in the notebook:

Should we rename attributes in the dataset to f'{aggregation_method}_{attribute_name}' to keep the datasets explicit when a dataset has multiple attributes to aggregate and avoid introspection of dataset name for subsequent processing steps?
Should we use another convention, NaN, or null instead of -1 for no data values when numerical values? What about other types of values?
Should we generate metadata for all the partitions of each dataset, the eventual combined dataset with all partitions, or should we wait until we have all the datasets integrated together?
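
To make the first question concrete, here is a minimal sketch of the renaming convention. The column names (`hex_id`, `pop`, `pop_flood`) and the helper `prefix_columns` are hypothetical, chosen only for illustration; they are not taken from the notebook.

```python
def prefix_columns(columns, aggregation_method, key_columns=("hex_id",)):
    """Map each column name to f'{aggregation_method}_{name}'.

    Key columns (hypothetical: 'hex_id') keep their original name so
    they can still be used for joins.
    """
    return {
        name: name if name in key_columns else f"{aggregation_method}_{name}"
        for name in columns
    }


# Hypothetical columns of a dataset aggregated with "sum".
columns = ["hex_id", "pop", "pop_flood"]
mapping = prefix_columns(columns, "sum")
print(mapping)
```

With names like these, a later processing step can read the aggregation method from the column itself instead of inspecting the dataset name.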

zacharyDez commented 3 months ago

Do you have some thoughts on the questions listed in the draft PR? I saw your approval, but it was more intended to get a conversation going. The target branch is your own with the dataset generation, so happy to merge that if it's helpful.

I talked today with @vincentsarago concerning the metadata approach. We will plan a follow-up conversation next week. Some details that were worth sharing from the call with Vincent today:

For the metadata, it would be great to have a STAC collection containing all of our datasets' metadata. We could have a STAC Item for each dataset-year pairing. Within these items, we could define an asset pointing to the folder with the parquet files, which matches your current structure (e.g. Space2Stats/parquet/GLOBAL/NTL_VIIRS_LEN/2012/01/, Space2Stats/parquet/GLOBAL/WorldPop_2020_Demographics). In terms of guidelines for generating the metadata, the most important aspect is defining the attributes available. The queryable extension functions at the collection level, so we could either define our own extension at the item level or just define the attributes within the properties of the STAC item.
We may not need to specify the attribute name for datasets that have a single variable (e.g. the population dataset), but I believe it would cause less friction to always define the attribute. For example, when joining two single-variable datasets that both have the columns min, max, and sum, a user would need to carefully rename the columns to reflect each dataset's title.
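
As a rough illustration of the item layout described above, the sketch below builds one STAC Item per dataset-year pairing as a plain dictionary. Several things here are assumptions rather than decisions from the call: the `attributes` key under `properties` is a made-up field (not an existing STAC extension), the attribute names are invented, the year 2020 is inferred from the folder name, the global bounding box is a placeholder, and the media type may need to change.

```python
import json


def build_item(dataset, year, href, attributes):
    """Build a minimal STAC Item dict for one dataset-year pairing."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": f"{dataset}_{year}",
        # Placeholder global footprint; real items may use a tighter extent.
        "bbox": [-180.0, -90.0, 180.0, 90.0],
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [-180.0, -90.0], [180.0, -90.0], [180.0, 90.0],
                [-180.0, 90.0], [-180.0, -90.0],
            ]],
        },
        "properties": {
            "datetime": f"{year}-01-01T00:00:00Z",
            # Hypothetical custom field listing the available columns.
            "attributes": attributes,
        },
        "links": [],
        "assets": {
            "parquet": {
                "href": href,  # folder holding the parquet partitions
                "type": "application/vnd.apache.parquet",  # assumed media type
                "roles": ["data"],
            }
        },
    }


item = build_item(
    dataset="WorldPop_2020_Demographics",
    year=2020,  # assumed from the folder name
    href="Space2Stats/parquet/GLOBAL/WorldPop_2020_Demographics",
    attributes=["sum_pop"],  # invented attribute name
)
print(json.dumps(item, indent=2))
```

Keeping the attribute list in `properties` is the simpler of the two options mentioned; a dedicated item-level extension would add a schema on top of the same information.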

kylebarron commented 3 months ago

Should we use another convention, NaN, or null instead of -1 for no data values when numerical values? What about other types of values?

Parquet has separate bitmasks for nullability, so you can use null for all numeric types (separate from a float NaN).
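
A small plain-Python illustration of why this matters. The values are invented, and it assumes nulls surface as `None` once loaded, as they do when pyarrow converts a column to Python objects.

```python
from statistics import mean

# The same four cells, two of which have no data.
sentinel_values = [120, -1, 45, -1]      # -1 used as the no-data marker
nullable_values = [120, None, 45, None]  # explicit nulls

# A consumer who forgets about the sentinel silently averages in the -1s.
naive_mean = mean(sentinel_values)

# With nulls, the missing cells have to be handled explicitly.
valid = [v for v in nullable_values if v is not None]
null_aware_mean = mean(valid)

print(naive_mean, null_aware_mean)
```

A sentinel is also ambiguous for attributes where -1 could be a legitimate value, whereas a null never collides with real data.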

zacharyDez commented 3 months ago

Should we use another convention, NaN, or null instead of -1 for no data values when numerical values? What about other types of values?

Parquet has separate bitmasks for nullability so that you can use null for all numeric types (separate from a float NaN).

@bpstewar, this answers the second top-level question. Are there any considerations around using -1 that we aren't thinking of? It seems to mainly impact the data pipeline end of things on your side. This comment remains relevant if we transition to the S2 approach.

zacharyDez commented 2 months ago

@bpstewar: I made a short script to transform the population parquet into a single file with all the attributes so we can start the API design milestones. There definitely can be some optimizations, but it seemed like a one-off.
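
For readers without the script at hand, the general idea can be sketched in plain Python. This is not the script from this PR: the `hex_id` key, the `value` field, and the attribute names are all hypothetical, and the per-attribute tables are assumed to be already loaded as lists of row dictionaries.

```python
def combine_attributes(tables, key="hex_id"):
    """Merge per-attribute tables into one wide table keyed by `key`.

    tables maps an attribute name to rows shaped like
    {key: <id>, "value": <number>}. Cells with no row become None.
    """
    combined = {}
    for attribute, rows in tables.items():
        for row in rows:
            record = combined.setdefault(row[key], {})
            record[attribute] = row["value"]

    attributes = list(tables)
    return [
        {key: cell_id, **{a: record.get(a) for a in attributes}}
        for cell_id, record in sorted(combined.items())
    ]


# Invented data: two attributes, one of them missing for cell "b".
tables = {
    "sum_pop": [{"hex_id": "a", "value": 10}, {"hex_id": "b", "value": 20}],
    "sum_pop_f": [{"hex_id": "a", "value": 4}],
}
wide = combine_attributes(tables)
print(wide)
```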

zacharyDez commented 6 days ago

@bpstewar, this is mainly meant to serve as an example of how to process the data. I'll close it for now, but it can be a reference for how you implement your processing steps to get the geoparquet.

worldbank / DECAT_Space2Stats

Review/population parquet #7