wildlife-dynamics / ecoscope-workflows

An extensible task specification and compiler for local and distributed workflows.
BSD 3-Clause "New" or "Revised" License

Task: ETL - Explode Column #41

Open walljcg opened 1 week ago

walljcg commented 1 week ago

We need an ETL task to explode the contents of a dataframe column. There are two scenarios to consider: the column contains JSON/dict values, or it contains arrays.

The task params are:

  1. input column: string
  2. column_names: array (optional)

Expected Behaviour:

  1. A dataframe column contains JSON/dict values: in this case we can use the pandas.json_normalize() function. The dict keys become the new column names and the values become the new column values. We already do this within the ecoscope.io.earthranger._normalize_column() function; maybe it should be moved out of ecoscope.io.earthranger and into ecoscope.io.utils to be more generally available.
  2. A dataframe column contains an array: in this case we should create a new set of DF columns with user-supplied names, or val1, val2, val3, ... if no names are supplied. The new columns receive the array values.
  3. If the length of the column_names param doesn't match the number of dict keys (dict type) or the array length (array type), we should give a warning but proceed. If too few names are supplied, name the extra columns by position ('val6', 'val7', etc.). If too many names are supplied, truncate the list of names and give a warning.
  4. There may be cases with nested arrays or dicts. In that case a user may have to apply the task again to surface the values needed.
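
To make the expected behaviour concrete, here is a rough sketch of how the task could work. It is not the ecoscope implementation: the names explode_column and resolve_names are hypothetical, the column is assumed to hold only dicts or only arrays with no missing values, and the reading of the length-mismatch rule (pad with valN when names are short, truncate when names are long) is one interpretation of point 3. Passing max_level=0 to pandas.json_normalize() leaves nested dicts in place, so a second application of the task can surface them as described in point 4.

```python
import warnings
from typing import Optional, Sequence

import pandas as pd


def resolve_names(n_values: int, column_names: Optional[Sequence[str]] = None) -> list:
    """Return exactly n_values column names, warning on a length mismatch."""
    names = list(column_names or [])
    if column_names is not None and len(names) != n_values:
        warnings.warn(
            f"column_names has {len(names)} entries but {n_values} are needed",
            stacklevel=2,
        )
    # Too many names: truncate. Too few: pad with val<N> (1-based position).
    names = names[:n_values]
    names += [f"val{i}" for i in range(len(names) + 1, n_values + 1)]
    return names


def explode_column(
    df: pd.DataFrame,
    input_column: str,
    column_names: Optional[Sequence[str]] = None,
) -> pd.DataFrame:
    """Replace input_column with one new column per dict key or array element."""
    values = df[input_column]
    # Assumption: the column is homogeneous, so the first value decides the case.
    if isinstance(values.iloc[0], dict):
        # max_level=0 keeps nested dicts as values instead of flattening them.
        expanded = pd.json_normalize(values.tolist(), max_level=0)
        expanded.index = df.index
        if column_names is not None:
            expanded.columns = resolve_names(expanded.shape[1], column_names)
    else:
        # Arrays of unequal length are padded with missing values by pandas.
        expanded = pd.DataFrame(values.tolist(), index=df.index)
        expanded.columns = resolve_names(expanded.shape[1], column_names)
    # Note: join raises if a new name collides with an existing column.
    return df.drop(columns=[input_column]).join(expanded)


df = pd.DataFrame(
    {
        "id": [1, 2],
        "point": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "meta": [{"a": 1, "b": {"c": 2}}, {"a": 3, "b": {"c": 4}}],
    }
)

# Array case with no names supplied: columns val1, val2, val3.
out_points = explode_column(df, "point")

# Dict case: keys "a" and "b" become columns; "b" still holds nested dicts.
out_meta = explode_column(df, "meta")
```

Keeping the naming rule in its own helper means the dict and array branches share the same warning and padding behaviour, and the rule can be changed in one place once the intended semantics of point 3 are settled.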