zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

datatree backend for opening grib files #11

Open TomNicholas opened 8 months ago

TomNicholas commented 8 months ago

Recently a way of kerchunking grib data as a DataTree object was added in https://github.com/fsspec/kerchunk/pull/399. Since the ongoing xarray-datatree integration is adding an open_datatree method to xarray's BackendEntrypoint classes, we could likely write an open_datatree method that understands how to read a grib file and returns a datatree containing ManifestArray objects.

TomNicholas commented 7 months ago

We actually don't need to wait for anything upstream in xarray before making something useful here. We could simply create a new virtualizarr.open_virtual_datatree function, which would detect the filetype, loop over the groups, and use open_virtual_dataset (or kerchunk directly if necessary) to first create the virtual xr.Dataset objects, then put them all into a datatree.DataTree to return. This function could be modelled after how datatree.open_datatree currently works.
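A rough sketch of that loop, to make the shape concrete. The filetype detection and the per-group opener are passed in as callables because neither exists yet; build_virtual_group_dict, list_groups and open_group are hypothetical names for illustration, not part of virtualizarr or datatree. The stand-in callables at the bottom only exist so the sketch runs without a real file.

```python
from typing import Any, Callable, Dict, Iterable


def build_virtual_group_dict(
    filepath: str,
    list_groups: Callable[[str], Iterable[str]],
    open_group: Callable[[str, str], Any],
) -> Dict[str, Any]:
    """Open every group in `filepath` and key the results by group path.

    `list_groups` and `open_group` are hypothetical stand-ins: in practice
    they would wrap filetype detection and a per-group virtual dataset
    opener (or kerchunk directly).
    """
    virtual_datasets: Dict[str, Any] = {}
    for group in list_groups(filepath):
        # Normalise to an absolute, slash-separated path such as "/a/b";
        # the root group becomes "/".
        path = "/" + group.strip("/")
        virtual_datasets[path] = open_group(filepath, group)
    return virtual_datasets


# Stand-in callables so the sketch runs without any real file.
def fake_list_groups(filepath: str) -> Iterable[str]:
    return ["/", "surface", "levels/pressure"]


def fake_open_group(filepath: str, group: str) -> Any:
    return {"file": filepath, "group": group}


groups = build_virtual_group_dict("example.grib", fake_list_groups, fake_open_group)
```

A dict keyed by group path like this is the shape that DataTree.from_dict takes, so the last step of the real function would presumably be to hand it over to that constructor.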

At that point you would have a datatree.DataTree object wrapping lots of ManifestArray objects (let's call it vdt1 for "virtual datatree 1"). Given a second such tree vdt2 with the same structure, you could concatenate the two using:

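import xarray as xr
from datatree import map_over_subtree

# map_over_subtree wraps a Dataset function so that it runs on each pair of matching nodes
concat_trees = map_over_subtree(lambda ds1, ds2: xr.concat([ds1, ds2], dim="time"))
combined_virtual_tree = concat_trees(vdt1, vdt2)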

(cc @maxrjones, who asked about doing something similar but for nested HDF5 files)