zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model #308

Open maxrjones opened 3 days ago

maxrjones commented 3 days ago

I'm wondering how we can build virtualization pipelines as a map rather than a map-reduce process. The map paradigm would follow the structure below from Earthmover's excellent blog post on serverless datacube pipelines, where the virtual dataset from each worker would be written directly to an Icechunk virtual store rather than transferred back to the coordination node for a concatenation step. As in their post, parallelization across workers during virtualization could use a serverless approach like lithops or Coiled, while Dask concurrency could speed up virtualization within each worker. I think the main feature needed for this to work is an analogue to xr.Dataset.to_zarr(region="..."), complementary to xr.Dataset.to_zarr(append_dim="...") (documented in https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) (xref https://github.com/zarr-developers/VirtualiZarr/issues/21, https://github.com/zarr-developers/VirtualiZarr/pull/272).
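For concreteness, here is a rough sketch of what the worker-side "map" step could look like, assuming a hypothetical region= argument on dataset_to_icechunk, a pre-initialized Icechunk store whose shape along the concatenation dimension is known up front, and placeholder file URIs; this is not an existing API, it's the feature being requested:

```python
# Hypothetical sketch only: `region=` on dataset_to_icechunk does not exist yet.
# File URIs, the timesteps-per-file count, and the store setup are placeholders.
import lithops
from virtualizarr import open_virtual_dataset
from virtualizarr.writers.icechunk import dataset_to_icechunk  # import path assumed

urls = [f"s3://my-bucket/archival/file_{i:04d}.nc" for i in range(1000)]  # placeholder
steps_per_file = 24  # e.g. timesteps per file, assumed known in advance

def virtualize_and_write(task):
    i, url = task
    # Each worker virtualizes exactly one archival file.
    vds = open_virtual_dataset(url, indexes={})
    store = ...  # open the shared, pre-initialized Icechunk store here (API elided)
    # Proposed analogue of xr.Dataset.to_zarr(region=...): write this worker's
    # references into a known slice of the store, with no reduce step.
    dataset_to_icechunk(  # hypothetical signature
        vds,
        store,
        region={"time": slice(i * steps_per_file, (i + 1) * steps_per_file)},
    )

executor = lithops.FunctionExecutor()
futures = executor.map(virtualize_and_write, list(enumerate(urls)))
executor.wait(futures)
```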

[Image: pipeline diagram from the Earthmover serverless datacube blog post]

I tried to check that this feature request doesn't already exist, but apologies if I missed something.

TomNicholas commented 3 days ago

This is an extremely good question that I have also thought about a bit.

I think a big difference is that this "map" approach assumes more knowledge of the structure of the dataset up front. To use zarr regions you have to know how big the regions are in advance, in order to create an empty zarr array that will then be filled, whereas in the "map-reduce" approach you can concatenate files whose shapes you don't know until you open them. Whether that actually matters much in practice is another question.
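For reference, a minimal sketch of the existing Xarray region-write pattern being alluded to here (paths, shapes, and chunk sizes are placeholders); note that the full shape and chunking have to be decided before any region can be written:

```python
import dask.array as da
import xarray as xr

# Template dataset: the full shape and chunking are fixed up front.
template = xr.Dataset(
    {"tas": (("time", "lat", "lon"), da.zeros((1000, 180, 360), chunks=(10, 180, 360)))}
)

# Write metadata only; no chunk data is computed or uploaded yet.
template.to_zarr("output.zarr", compute=False)

# Each worker later fills in a known slice of the store.
def write_region(i: int, piece: xr.Dataset) -> None:
    # Drop any already-written coordinate variables before the region write.
    piece.drop_vars(["lat", "lon"], errors="ignore").to_zarr(
        "output.zarr", region={"time": slice(i * 10, (i + 1) * 10)}
    )
```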

One thing to note is that IIUC having each worker write a region of virtual references would mean one icechunk commit per region, whereas transferring everything back to one node means one commit per logical version of the dataset, which seems a little nicer.

I think the main feature needed for this to work is an analogue to xr.Dataset.to_zarr(region="...")

I agree.

norlandrhagen commented 3 days ago

I really like the idea behind this approach @maxrjones! Being able to skip transferring all the references back to a 'reduce' step seems great. I'm also wondering how all the commits from different workers would work with icechunk.

norlandrhagen commented 3 days ago

This also seems like a good way to protect against a single failed reference killing the entire job.

dcherian commented 2 days ago

I think a big difference is that this "map" approach assumes more knowledge the structure of the dataset up front.

:+1:

IIUC having each worker write a region of virtual references would mean one icechunk commit per region,

It does not, once https://github.com/earth-mover/icechunk/pull/357 is in. Though in practice, that also does map-reduce on changesets which will contain virtual references in this case...

maxrjones commented 2 days ago

IIUC having each worker write a region of virtual references would mean one icechunk commit per region,

It does not, once https://github.com/earth-mover/icechunk/pull/357 is in. Though in practice, that also does map-reduce on changesets which will contain virtual references in this case...

Thanks for this information! Do you think it's accurate that map-reduce on changesets would still be more performant than map-reduce on virtual datasets when operating on thousands of files or more, such that this feature would be helpful despite not truly being a map operation? Also, do you anticipate any other prerequisites, in addition to your PR, before VirtualiZarr could have something like dataset_to_icechunk(region="...")?