## Context

I think there would be value in building and sharing a cookiecutter template for virtualizing datasets, to incentivize open and accessible VirtualiZarr workflows. We could also use cruft so that generated workflows can be updated for upstream changes.
There are shared steps between most virtualization workflows:
- Generate a list of input files
- Generate a virtual dataset for each input file, with optional pre- or post-processing at this step
- Concatenate the virtual datasets into a single virtual dataset
- Write the virtual dataset to a virtual Icechunk store or Kerchunk reference file
- (Optional) Apply the above workflow to multiple datasets
- (Optional) Generate a catalog (e.g., STAC) for the resulting virtual datasets
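The steps above can be sketched as a small pipeline skeleton. This is a hypothetical sketch, not an existing API: the `open_virtual`, `concatenate`, and `write_refs` callables are injected, so with VirtualiZarr they would typically be `open_virtual_dataset`, `xarray.concat`, and one of the `.virtualize.to_kerchunk` / `.virtualize.to_icechunk` writers.

```python
from pathlib import Path
from typing import Callable

def run_virtualization_pipeline(
    input_root: str,
    pattern: str,
    open_virtual: Callable,    # e.g. virtualizarr.open_virtual_dataset
    concatenate: Callable,     # e.g. lambda vds: xr.concat(vds, dim="time")
    write_refs: Callable,      # e.g. lambda vds: vds.virtualize.to_kerchunk("refs.json")
    preprocess: Callable = lambda path: path,  # optional per-file pre-processing hook
    postprocess: Callable = lambda vds: vds,   # optional per-dataset post-processing hook
):
    """Run the shared four-step workflow: list, virtualize, concatenate, write."""
    files = sorted(Path(input_root).glob(pattern))                       # step 1
    virtual = [postprocess(open_virtual(preprocess(f))) for f in files]  # step 2
    combined = concatenate(virtual)                                      # step 3
    return write_refs(combined)                                          # step 4
```

Keeping the hooks injectable is what makes a template practical: dataset-specific quirks live in `preprocess`/`postprocess` while the skeleton stays generic.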
There are many other boilerplate components:
- Typing
- Documentation
- Licensing
- CI/CD
- Environment management
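A cookiecutter template surfaces these boilerplate choices as prompts in its `cookiecutter.json`. A hypothetical sketch, with illustrative option names and values rather than ones from any existing template:

```json
{
  "project_name": "my-virtual-dataset",
  "license": ["Apache-2.0", "MIT", "BSD-3-Clause"],
  "docs": ["mkdocs", "sphinx", "none"],
  "ci_provider": ["github-actions", "none"],
  "use_typing": ["yes", "no"],
  "reference_format": ["icechunk", "kerchunk"]
}
```

cruft records the template commit in `.cruft.json` when a project is generated, so `cruft update` can later re-apply upstream template changes on top of these choices.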
Lastly, there are parallelization, orchestration, and execution tools that could enhance virtualization workflows, with options including:
- Dask
- Flyte
- Lithops
- Modal
- Coiled
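The per-file virtualization step is embarrassingly parallel, so any of these tools can map it over the file list. A minimal sketch using only the standard library (`virtualize_one` is a stand-in for a real `open_virtual_dataset` call; Dask's `client.map` and Lithops' `FunctionExecutor.map` follow the same map-over-inputs shape):

```python
from concurrent.futures import ThreadPoolExecutor

def virtualize_one(path: str) -> str:
    # Stand-in for the real per-file step, e.g. calling
    # virtualizarr.open_virtual_dataset(path); here we just tag the path.
    return f"virtual:{path}"

def virtualize_all(paths, max_workers=4):
    # Map the per-file step over the inputs. Executor.map preserves the
    # order of `paths` in its results, which matters for concatenation.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(virtualize_one, paths))
```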
This template would enable people to follow best practices and avoid spending time on boilerplate components.
## Suggested task components