worldbank / DECAT_Space2Stats

https://worldbank.github.io/DECAT_Space2Stats/

Review/population parquet #7

Closed zacharyDez closed 6 days ago

zacharyDez commented 3 months ago

Creating a draft PR so we can discuss the population parquet dataset. Open questions noted in the notebook:

zacharyDez commented 3 months ago

Do you have any thoughts on the questions listed in the draft PR? I saw your approval, but the PR was intended more to get a conversation going. The target branch is your own branch with the dataset generation, so I'm happy to merge if that's helpful.

I talked today with @vincentsarago concerning the metadata approach. We will plan a follow-up conversation next week. Some details that were worth sharing from the call with Vincent today:

kylebarron commented 3 months ago
  • Should we use another convention, such as NaN or null, instead of -1 for no-data values in numeric columns? What about other types of values?

Parquet has separate bitmasks for nullability, so you can use null for all numeric types (separate from a float NaN).

zacharyDez commented 3 months ago
  • Should we use another convention, such as NaN or null, instead of -1 for no-data values in numeric columns? What about other types of values?

Parquet has separate bitmasks for nullability so that you can use null for all numeric types (separate from a float NaN).

@bpstewar, this answers the second top-level question. Are there any considerations around using -1 that we aren't thinking of? It seems to mainly affect the data pipeline on your side. This comment remains relevant if we transition to the S2 approach.
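
To make the trade-off concrete, here is a minimal sketch in plain Python (no Parquet library involved) of how the three no-data conventions behave under a simple aggregate. The population values are made up for illustration, and `total_skipping_nulls` is a hypothetical helper, not something from the notebook.

```python
import math

# Hypothetical population counts for five cells; the third cell has no data.
with_sentinel = [120.0, 45.0, -1.0, 300.0, 0.0]     # -1 marks "no data"
with_nan = [120.0, 45.0, math.nan, 300.0, 0.0]      # float NaN marks "no data"
with_null = [120.0, 45.0, None, 300.0, 0.0]         # null marks "no data"

# A naive sum silently treats the sentinel as a real value.
sentinel_total = sum(with_sentinel)

# NaN propagates through the sum, so the gap is at least visible.
nan_total = sum(with_nan)


def total_skipping_nulls(values):
    """Sum only the values that are present, ignoring nulls."""
    return sum(v for v in values if v is not None)


null_total = total_skipping_nulls(with_null)
null_count = sum(v is None for v in with_null)

print(sentinel_total)  # off by one from the true total
print(nan_total)
print(null_total, null_count)
```

The sentinel version is off by one without any warning, while the null version keeps the missing cell countable. NaN also only exists for floating-point types, whereas a null can mark missing integers or strings as well.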

zacharyDez commented 2 months ago

@bpstewar: I made a short script to transform the population parquet into a single file with all the attributes so we can start the API design milestones. It could certainly be optimized, but it seemed like a one-off.
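
For readers without the script at hand, the general shape of such a merge can be sketched as below. This is not the actual script: the input layout (one mapping per attribute, keyed by a cell ID), the attribute names, and the `merge_attributes` helper are all assumptions made for illustration.

```python
# Hypothetical per-attribute inputs: each maps a cell ID to a single value.
# The names and numbers are illustrative, not taken from the real dataset.
attribute_tables = {
    "pop_total": {"cell_a": 120, "cell_b": 45},
    "pop_female": {"cell_a": 61, "cell_c": 10},
}


def merge_attributes(tables):
    """Combine per-attribute tables into one row per cell ID.

    A cell that is missing from a table gets None (null) for that attribute.
    """
    columns = sorted(tables)
    cell_ids = sorted({cid for table in tables.values() for cid in table})
    return [
        {"cell_id": cid, **{col: tables[col].get(cid) for col in columns}}
        for cid in cell_ids
    ]


rows = merge_attributes(attribute_tables)
for row in rows:
    print(row)
```

Filling absent attributes with None rather than -1 keeps the merged table consistent with the null convention discussed earlier in the thread.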

zacharyDez commented 6 days ago

@bpstewar: this is mainly meant to serve as an example of how to process the data. I'll close it for now, but it can be a reference for how you implement your processing steps to get the geoparquet.