zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Parallelization via dask #7

Open TomNicholas opened 4 months ago

TomNicholas commented 4 months ago

There are two places where we could use xarray's parallelization machinery to potentially speed up the generation of references.

1) Using parallel=True in xr.open_mfdataset, which would then use dask.delayed to parallelize the generation of the byte ranges from each file. This could be a big speedup, as it would parallelize the opening of the legacy files.

2) In theory we could also wrap the ManifestArray objects with dask.Array, then use dask's tree-reduce to perform the concatenation. I think this is roughly what kerchunk.combine.auto_dask approximates. However, I'm not totally confident that (a) dask.array is set up to support this right now, or (b) concatenation is actually a performance bottleneck in practice.
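To make option (1) concrete, here is a minimal sketch of the dask.delayed pattern that parallel=True relies on. The generate_references function below is a purely illustrative stand-in for the expensive per-file step (scanning an archival file to extract chunk byte ranges); its name and return value are assumptions, not VirtualiZarr's actual API:

```python
import dask


@dask.delayed
def generate_references(path):
    # Stand-in for the expensive per-file step: scanning an archival
    # file to extract chunk byte ranges. In open_mfdataset(parallel=True)
    # this would be the per-file open call, wrapped in dask.delayed.
    return {"source": path, "nchunks": 4}


paths = [f"file_{i}.nc" for i in range(8)]
tasks = [generate_references(p) for p in paths]

# A single compute() call submits all the per-file scans at once, so
# they run in parallel on whatever dask scheduler is configured.
results = dask.compute(*tasks)
print(len(results))  # 8
```

The key point is that nothing runs until compute() is called, so the scheduler is free to distribute the file-opening work across workers.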

TomNicholas commented 3 weeks ago

earthaccess uses dask.delayed on kerchunk SingleHdf5ToZarr calls

https://github.com/nsidc/earthaccess/blob/main/earthaccess/kerchunk.py#L47

@dcherian instead used dask.bag to do a tree-reduce inside xr.open_mfdataset without shipping all the datasets back to the head node, which might be exactly what we need:

https://github.com/pydata/xarray/issues/8523
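As a rough illustration of the tree-reduce idea (not the actual xarray or VirtualiZarr code): dask.bag's fold combines partitions pairwise on the workers, so intermediate results are merged in a tree rather than all being gathered on the client first. Here the "datasets" are stand-in dicts of chunk references, and combine plays the role that xr.concat would play in the real pipeline:

```python
import dask.bag as db


def combine(d1, d2):
    # Stand-in for xr.concat: merge two partial reference sets (dicts
    # mapping chunk keys to byte ranges) into one.
    return {**d1, **d2}


# Eight fake per-file reference dicts, e.g. chunk_3 -> (offset, length).
refs = [{f"chunk_{i}": (i * 100, 100)} for i in range(8)]

bag = db.from_sequence(refs, npartitions=4)
# fold performs a tree reduction: binop combines items within a
# partition, combine merges partition results pairwise on the workers.
combined = bag.fold(binop=combine, combine=combine, initial={}).compute()
print(len(combined))  # 8
```

Because the merges happen worker-side, only the final combined reference set reaches the client, which is the property that matters when each per-file dataset is large.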