## Context

I think there would be value in building and sharing a cookiecutter template for virtualizing datasets, to incentivize open and accessible VirtualiZarr workflows. We could also use cruft so that generated workflows can be updated for upstream changes.
There are shared steps between most virtualization workflows:
- Generate a list of input files
- Generate a virtual dataset for each input file, with optional pre- or post-processing at this step
- Concatenate the virtual datasets into a single virtual dataset
- Write the virtual dataset to a virtual Icechunk store or Kerchunk reference file
- (Optional) Apply the above workflow to multiple datasets
- (Optional) Generate a catalog (e.g., STAC) for the resulting virtual datasets
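The steps above can be sketched as a small pipeline skeleton. This is a hypothetical sketch, not an existing API: the `open_virtual`, `concatenate`, and `write_refs` callables are injected, so with VirtualiZarr they would typically be `open_virtual_dataset`, `xarray.concat`, and one of the `.virtualize.to_kerchunk` / `.virtualize.to_icechunk` writers.

```python
from pathlib import Path
from typing import Callable

def run_virtualization_pipeline(
    input_root: str,
    pattern: str,
    open_virtual: Callable,    # e.g. virtualizarr.open_virtual_dataset
    concatenate: Callable,     # e.g. lambda vds: xr.concat(vds, dim="time")
    write_refs: Callable,      # e.g. lambda vds: vds.virtualize.to_kerchunk("refs.json")
    preprocess: Callable = lambda path: path,  # optional per-file pre-processing hook
    postprocess: Callable = lambda vds: vds,   # optional per-dataset post-processing hook
):
    """Run the shared four-step workflow: list, virtualize, concatenate, write."""
    files = sorted(Path(input_root).glob(pattern))                       # step 1
    virtual = [postprocess(open_virtual(preprocess(f))) for f in files]  # step 2
    combined = concatenate(virtual)                                      # step 3
    return write_refs(combined)                                          # step 4
```

Keeping the hooks injectable is what makes a template practical: dataset-specific quirks live in `preprocess`/`postprocess` while the skeleton stays generic.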
There are many other boilerplate components:
- Typing
- Documentation
- Licensing
- CI/CD
- Environment management
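A cookiecutter template surfaces these boilerplate choices as prompts in its `cookiecutter.json`. A hypothetical sketch, with illustrative option names and values rather than ones from any existing template:

```json
{
  "project_name": "my-virtual-dataset",
  "license": ["Apache-2.0", "MIT", "BSD-3-Clause"],
  "docs": ["mkdocs", "sphinx", "none"],
  "ci_provider": ["github-actions", "none"],
  "use_typing": ["yes", "no"],
  "reference_format": ["icechunk", "kerchunk"]
}
```

cruft records the template commit in `.cruft.json` when a project is generated, so `cruft update` can later re-apply upstream template changes on top of these choices.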
Lastly, there are parallelization, orchestration, and execution tools that could enhance virtualization workflows, with options including:
- Dask
- Flyte
- Lithops
- Modal
- Coiled
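The per-file virtualization step is embarrassingly parallel, so any of these tools can map it over the file list. A minimal sketch using only the standard library (`virtualize_one` is a stand-in for a real `open_virtual_dataset` call; Dask's `client.map` and Lithops' `FunctionExecutor.map` follow the same map-over-inputs shape):

```python
from concurrent.futures import ThreadPoolExecutor

def virtualize_one(path: str) -> str:
    # Stand-in for the real per-file step, e.g. calling
    # virtualizarr.open_virtual_dataset(path); here we just tag the path.
    return f"virtual:{path}"

def virtualize_all(paths, max_workers=4):
    # Map the per-file step over the inputs. Executor.map preserves the
    # order of `paths` in its results, which matters for concatenation.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(virtualize_one, paths))
```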
This template would enable people to follow best practices and avoid spending time on boilerplate components.
## Suggested task components