zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Parallelization via dask #7

Open TomNicholas opened 4 months ago

TomNicholas commented 4 months ago

There are two places where we could use xarray's parallelization machinery to potentially speed up the generation of references.

1) Using parallel=True in xr.open_mfdataset, which would then use dask.delayed to parallelize the generation of the byte ranges from each file. This could be a big speedup, as it would parallelize the opening of the legacy files.

2) In theory we could also wrap the ManifestArray objects with dask.Array, then use dask's tree-reduce to perform the concatenation. I think this is roughly what kerchunk.combine.auto_dask approximates. However, I'm not totally confident that (a) dask.array is set up to support this right now, or (b) concatenation is actually a performance bottleneck in practice.
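To make option (1) concrete, here is a minimal sketch of the dask.delayed pattern that parallel=True relies on. The generate_references function below is a purely illustrative stand-in for the expensive per-file step (scanning an archival file to extract chunk byte ranges); its name and return value are assumptions, not VirtualiZarr's actual API:

```python
import dask


@dask.delayed
def generate_references(path):
    # Stand-in for the expensive per-file step: scanning an archival
    # file to extract chunk byte ranges. In open_mfdataset(parallel=True)
    # this would be the per-file open call, wrapped in dask.delayed.
    return {"source": path, "nchunks": 4}


paths = [f"file_{i}.nc" for i in range(8)]
tasks = [generate_references(p) for p in paths]

# A single compute() call submits all the per-file scans at once, so
# they run in parallel on whatever dask scheduler is configured.
results = dask.compute(*tasks)
print(len(results))  # 8
```

The key point is that nothing runs until compute() is called, so the scheduler is free to distribute the file-opening work across workers.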

TomNicholas commented 3 weeks ago

earthaccess uses dask.delayed on kerchunk SingleHdf5ToZarr calls

https://github.com/nsidc/earthaccess/blob/main/earthaccess/kerchunk.py#L47

@dcherian instead used dask.bag to do a tree-reduce inside xr.open_mfdataset without shipping all the datasets back to the head node, which might be exactly what we need:

https://github.com/pydata/xarray/issues/8523
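As a rough illustration of the tree-reduce idea (not the actual xarray or VirtualiZarr code): dask.bag's fold combines partitions pairwise on the workers, so intermediate results are merged in a tree rather than all being gathered on the client first. Here the "datasets" are stand-in dicts of chunk references, and combine plays the role that xr.concat would play in the real pipeline:

```python
import dask.bag as db


def combine(d1, d2):
    # Stand-in for xr.concat: merge two partial reference sets (dicts
    # mapping chunk keys to byte ranges) into one.
    return {**d1, **d2}


# Eight fake per-file reference dicts, e.g. chunk_3 -> (offset, length).
refs = [{f"chunk_{i}": (i * 100, 100)} for i in range(8)]

bag = db.from_sequence(refs, npartitions=4)
# fold performs a tree reduction: binop combines items within a
# partition, combine merges partition results pairwise on the workers.
combined = bag.fold(binop=combine, combine=combine, initial={}).compute()
print(len(combined))  # 8
```

Because the merges happen worker-side, only the final combined reference set reaches the client, which is the property that matters when each per-file dataset is large.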