yuvipanda / pangeo-forge-cmr

pangeo-forge integration with CMR
Apache License 2.0

Retrieve provenance information from CMR #4

Open doug-newman-nasa opened 1 year ago

doug-newman-nasa commented 1 year ago
  1. Populate the meta.yaml file of a recipe with provenance information from the collection record in CMR.
  2. Retrieve the short name and version id from meta.yaml (if present) and use that to query CMR. This way, the provenance of the input for a recipe is discernible from the configuration in a known location, rather than the code.
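As a rough illustration of the second point, a collection lookup keyed on short name and version could be sketched as below. The endpoint is CMR's public collection search; the helper name `collection_query_url` is hypothetical and not part of pangeo-forge-cmr, and the sketch only builds the URL rather than making a request.

```python
from urllib.parse import urlencode

# CMR's public collection search endpoint, JSON response format.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def collection_query_url(short_name: str, version: str) -> str:
    """Build a CMR collection search URL for one short name / version pair.

    Hypothetical helper for illustration; it performs no network access.
    """
    params = {"short_name": short_name, "version": version}
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


url = collection_query_url("MOD02QKM", "6.1")
```

Fetching that URL should return the matching collection record, from which provenance fields could then be copied into meta.yaml.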

Given the above, I'm leaning toward an implementation where a class constructor figures out the input collection and/or retrieves the provenance information for a collection and populates the YAML file. Users then invoke the get-files method. For example:

instance = PangeoForgeCMR('MOD02QKM', '6.1')
file_pattern = instance.files_from_cmr(nitems_per_file=1,
        concat_dim='time')

Alternatively, if the short name and version id are to be extracted from meta.yaml:

instance = PangeoForgeCMR()
file_pattern = instance.files_from_cmr(nitems_per_file=1,
        concat_dim='time')

where the YAML file contains:

provenance:
  ...
  source_data:
    cmr:
      short_name: GPM_3IMERGDL
      version: '06'
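To make the two constructor forms concrete, here is a minimal sketch of the fallback logic, assuming meta.yaml has already been parsed into a dict (for example with a YAML library). The `meta` argument and the helper `cmr_source_from_meta` are assumptions for illustration, and `files_from_cmr` is left out because its behaviour is not specified here.

```python
from typing import Optional, Tuple


def cmr_source_from_meta(meta: dict) -> Optional[Tuple[str, str]]:
    """Return (short_name, version) from a parsed meta.yaml, or None if absent."""
    provenance = meta.get("provenance") or {}
    source_data = provenance.get("source_data") or {}
    cmr = source_data.get("cmr")
    if not cmr:
        return None
    # Version ids such as '06' must stay strings so leading zeros survive.
    return cmr["short_name"], str(cmr["version"])


class PangeoForgeCMR:
    """Hypothetical sketch of the proposed class; only the constructor is shown."""

    def __init__(self, short_name=None, version=None, meta=None):
        if short_name is None or version is None:
            # Fall back to the identifiers recorded in meta.yaml.
            found = cmr_source_from_meta(meta or {})
            if found is None:
                raise ValueError("no short_name/version given and none in meta.yaml")
            short_name, version = found
        self.short_name = short_name
        self.version = version


# Explicit identifiers, as in the first form.
explicit = PangeoForgeCMR("MOD02QKM", "6.1")

# Identifiers taken from the parsed YAML, as in the second form.
parsed_meta = {
    "provenance": {
        "source_data": {"cmr": {"short_name": "GPM_3IMERGDL", "version": "06"}}
    }
}
from_meta = PangeoForgeCMR(meta=parsed_meta)
```

Explicit arguments take precedence in this sketch; whether they should also be written back to meta.yaml is a design question left open above.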

Please give your thoughts on this API suggestion before I implement!

yuvipanda commented 1 year ago

Thanks for opening this up! /cc @cisaacstern to see what he thinks. I'll try to come back and provide a fuller response in a day or so!

cisaacstern commented 1 year ago

Thanks for thinking through this @doug-newman-nasa, and for the tag @yuvipanda. Some thoughts:

In this paradigm, a create-cmr-feedstock initializer could be pretty opinionated, generating:

# meta.yaml
provenance:
  ...
  source_data:
    cmr:
      short_name: GPM_3IMERGDL
      version: '06'

and

from pangeo_forge_cmr import files_from_cmr

file_pattern = files_from_cmr(
    short_name='GPM_3IMERGDL',
    version='06',
    nitems_per_file=1,
    concat_dim='time',
)

The only real difference from the initial proposal is the layer: this generation is handled not in pangeo_forge_cmr (which extends -recipes) but in a separate, higher create-cmr-feedstock initializer layer (which extends -runner).
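A rough sketch of what such an initializer might do is shown below: given a short name and version, it renders both files from templates. The function name `render_feedstock`, the output file name `recipe.py`, and the templates are hypothetical; the meta.yaml template contains only the `cmr` entry, since the other provenance fields are not spelled out here.

```python
# Templates mirror the two generated artifacts shown above.
META_TEMPLATE = """\
provenance:
  source_data:
    cmr:
      short_name: {short_name}
      version: '{version}'
"""

RECIPE_TEMPLATE = """\
from pangeo_forge_cmr import files_from_cmr

file_pattern = files_from_cmr(
    short_name={short_name!r},
    version={version!r},
    nitems_per_file=1,
    concat_dim='time',
)
"""


def render_feedstock(short_name: str, version: str) -> dict:
    """Return a mapping of file name to rendered file content.

    Hypothetical initializer step: nothing is written to disk.
    """
    fields = {"short_name": short_name, "version": version}
    return {
        "meta.yaml": META_TEMPLATE.format(**fields),
        "recipe.py": RECIPE_TEMPLATE.format(**fields),
    }


files = render_feedstock("GPM_3IMERGDL", "06")
```

Because both files come from the same inputs, the identifiers in meta.yaml and in the recipe cannot drift apart at generation time.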

Thoughts? xref https://github.com/pangeo-forge/pangeo-forge-runner/issues/94