Closed zacharyDez closed 6 days ago
Do you have some thoughts on the questions listed in the draft PR? I saw your approval, but it was more intended to get a conversation going. The target branch is your own with the dataset generation, so happy to merge that if it's helpful.
I talked today with @vincentsarago concerning the metadata approach. We will plan a follow-up conversation next week. Some details that were worth sharing from the call with Vincent today:
`Space2Stats/parquet/GLOBAL/NTL_VIIRS_LEN/2012/01/`, `Space2Stats/parquet/GLOBAL/WorldPop_2020_Demographics`). In terms of guidelines for generating the metadata, the most important aspect is defining the available attributes. The queryable extension functions at the collection level, so we could either define our own extension at the item level or just define the attributes within the properties of the STAC item.
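To make the second option concrete, here is a minimal sketch of listing a dataset's attributes directly in the `properties` of a STAC item. The key `s2s:attributes`, the attribute names, the item id, the global bounding box, and the `datetime` value are all hypothetical placeholders, not something agreed on in this thread or defined by a published STAC extension.

```python
import json


def build_item(item_id, parquet_href, attributes):
    """Return a minimal STAC-Item-shaped dict that lists dataset attributes."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        # Placeholder global footprint; a real item would use the dataset extent.
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]
            ],
        },
        "bbox": [-180, -90, 180, 90],
        "properties": {
            # Assumed timestamp, for illustration only.
            "datetime": "2020-01-01T00:00:00Z",
            # Hypothetical key: not part of any published STAC extension.
            "s2s:attributes": attributes,
        },
        "links": [],
        "assets": {"data": {"href": parquet_href}},
    }


item = build_item(
    item_id="worldpop-2020-demographics",  # hypothetical id
    parquet_href="Space2Stats/parquet/GLOBAL/WorldPop_2020_Demographics",
    attributes=[
        {"name": "sum_pop_2020", "type": "int64"},  # hypothetical attribute
        {"name": "mean_pop_2020", "type": "float64"},  # hypothetical attribute
    ],
)

print(json.dumps(item["properties"], indent=2))
```

Keeping the list under `properties` avoids writing a custom extension up front; it could be promoted to a proper item-level extension later if the shape stabilizes.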
- Should we use another convention, `NaN` or `null`, instead of `-1` for no-data values when the values are numerical? What about other types of values?

Parquet has separate bitmasks for nullability, so you can use `null` for all numeric types (separate from a float NaN).
@bpstewar, this answers the second top-level question. Are there any considerations around using `-1` that we aren't thinking of? It seems to mainly impact the data pipeline end of things on your side. This comment remains relevant if we transition to the S2 approach.
@bpstewar: I made a short script to transform the population parquet into a single file with all the attributes so we can start the API design milestones. There definitely can be some optimizations, but it seemed like a one-off.
@bpstewar, this is mainly meant to serve as an example of how to process the data. I'll close for now, but it can be a reference for how you implement your processing steps to get the geoparquet.
Creating a draft PR so we can discuss the population parquet dataset. Open questions noted in the notebook:
- `f'{aggregation_method}_{attribute_name}'` to keep the datasets explicit when a dataset has multiple attributes to aggregate and avoid introspection of the dataset name in subsequent processing steps?
- Should we use another convention, `NaN` or `null`, instead of `-1` for no-data values when the values are numerical? What about other types of values?
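As a rough illustration of the naming question, the sketch below builds column names with the `f'{aggregation_method}_{attribute_name}'` pattern and splits them back apart. The method list and the attribute name `pop_2020` are hypothetical, and the split assumes aggregation method names never contain an underscore.

```python
# Assumed set of aggregation methods; none may contain an underscore.
KNOWN_METHODS = ("sum", "mean", "min", "max")


def column_name(aggregation_method, attribute_name):
    """Build an explicit column name from the method and the attribute."""
    return f"{aggregation_method}_{attribute_name}"


def parse_column_name(column):
    """Split a column name back into (aggregation_method, attribute_name)."""
    method, sep, attribute = column.partition("_")
    if not sep or method not in KNOWN_METHODS:
        raise ValueError(f"unrecognized column name: {column!r}")
    return method, attribute


columns = [column_name(m, "pop_2020") for m in ("sum", "mean")]
parsed = [parse_column_name(c) for c in columns]

print(columns)
print(parsed)
```

With this convention, later processing steps can recover the aggregation method from the column itself rather than from the dataset name.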