zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Loading data from ManifestArrays without saving references to disk first #124

Open ayushnag opened 3 months ago

ayushnag commented 3 months ago

I am working on a feature in virtualizarr to read dmrpp metadata files and create a virtual xr.Dataset containing ManifestArrays, which can then be written out as kerchunk references. This is the current workflow:

vdatasets = parser.parse(dmrs)
# vdatasets are xr.Datasets containing ManifestArrays
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
# write the combined references to disk, then re-open them
mds.virtualize.to_kerchunk(filepath=outfile, format=outformat)
ds = xr.open_dataset(outfile, engine="kerchunk", ...)
ds.time.values

However, the chunk manifest, encoding, attrs, etc. are already in mds, so is it possible to read data directly from this dataset? My understanding is that once the "chunk manifest" ZEP is approved and the zarr-python reader in xarray is updated, this should be possible. The xarray reader for kerchunk can accept either a file or the reference JSON object produced directly by kerchunk's SingleHdf5ToZarr and MultiZarrToZarr. So, similarly, can we extract the refs from mds and pass them to xr.open_dataset() directly?
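For reference, the existing kerchunk pattern I mean looks roughly like this, using fsspec's reference filesystem (the exact backend kwargs follow the kerchunk docs and may vary by version):

import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

refs = SingleHdf5ToZarr("data.nc").translate()  # in-memory reference dict
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "storage_options": {"fo": refs},  # "fo" accepts the dict or a JSON path
        "consolidated": False,
    },
)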

There would probably still need to be a function that extracts the refs, so that xarray can make a new Dataset object with all the indexes, cf_time handling, and open_dataset checks:

mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
refs = mds.virtualize()
ds = xr.open_dataset(refs, engine="virtualizarr", ...)

Even reading directly from the ManifestArray dataset might be possible, but I'm not sure how the new dataset object with numpy arrays and indexes would be kept separate from the original dataset:

mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
mds.time.values
TomNicholas commented 1 month ago

Thinking about this more, once zarr-python Array objects support the manifest storage transformer, we should be able to write a new method on ManifestArray objects which constructs the zarr.Array directly, i.e.

def to_zarr_array(self: ManifestArray) -> zarr.Array:
    ...
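Fleshed out a little, a rough sketch might look like the following, where ManifestStore is hypothetical (the manifest storage transformer doesn't exist in zarr-python yet):

import zarr

def to_zarr_array(self: "ManifestArray") -> zarr.Array:
    # Hypothetical ManifestStore: a read-only store that serves the array
    # metadata from self.zarray and resolves each chunk key ("0.0", "1.0", ...)
    # to the (path, offset, length) byte range recorded in the chunk manifest.
    store = ManifestStore(metadata=self.zarray, manifest=self.manifest)
    return zarr.Array(store=store, read_only=True)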

This opens up some interesting possibilities. Currently, calling .compute on a virtual dataset raises a NotImplementedError, but with this we could change the behaviour to instead (see the sketch after this list):

  1. Turn the ManifestArray into a zarr.Array,
  2. Use xarray's zarr backend machinery to open that zarr array the same way it normally would during xr.open_zarr, which includes wrapping it in xarray's lazy indexing classes,
  3. Call the .compute behaviour that xarray would normally use.
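A minimal sketch of those steps, assuming the to_zarr_array method above exists (it takes a shortcut by reading chunks eagerly rather than going through xarray's backend machinery):

import xarray as xr

def compute_virtual(vds: xr.Dataset) -> xr.Dataset:
    # Sketch only: materialize each ManifestArray-backed variable by
    # round-tripping through zarr. A real implementation would go through
    # xarray's zarr backend so lazy indexing and decoding match open_zarr.
    loaded = {}
    for name, var in vds.variables.items():
        zarr_arr = var.data.to_zarr_array()  # step 1: ManifestArray -> zarr.Array
        data = zarr_arr[...]                 # steps 2-3: read all chunks eagerly
        loaded[name] = xr.Variable(var.dims, data, var.attrs, var.encoding)
    ds = xr.Dataset(loaded, attrs=vds.attrs)
    return ds.set_coords([name for name in vds.coords if name in ds])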

The result would be that a user could actually treat a "virtual" xarray Dataset as a normal xarray Dataset, because if they tried to .compute it, it would transform itself into one under the hood!

Then you could open any data format that virtualizarr understands via vz.open_virtual_dataset (or maybe eventually xr.open_dataset(engine='virtualizarr')), and if you want to treat it like an in-memory xarray Dataset from that point on you can, but if you prefer to manipulate it and save it out as a virtual zarr store on disk you can do that too!
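Concretely, the user experience might look something like this (the .compute path is the hypothetical part; to_kerchunk already works today, and the file and variable names are placeholders):

import virtualizarr as vz

vds = vz.open_virtual_dataset("data.nc")

# Hypothetical: treat it as a normal dataset; .compute would turn the
# ManifestArrays into zarr.Arrays under the hood and load the values
air = vds["air"].compute()

# Already possible today: keep it virtual and persist the references
vds.virtualize.to_kerchunk("refs.json", format="json")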

I still need to think through some of the details, but this could potentially be a neat alternative approach to https://github.com/pydata/xarray/issues/9281, and not actually require any upstream changes to xarray!

cc @d-v-b

TomNicholas commented 1 month ago

(One subtlety I'm not sure about here would be around indexes. I think you would probably want to have a solution for loading indexes as laid out in https://github.com/zarr-developers/VirtualiZarr/issues/18, and then have the indexes understand how they can be loaded.)
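(For concreteness, the sort of API that issue points towards might look like the following; treat the loadable_variables name as an assumption borrowed from that discussion:)

import virtualizarr as vz

# Load the coordinate variables eagerly so real indexes get built,
# while everything else stays as virtual ManifestArrays
vds = vz.open_virtual_dataset("data.nc", loadable_variables=["time", "lat", "lon"])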

TomNicholas commented 1 month ago

Another subtlety to consider is when the CF decoding should happen. With this approach you would effectively have done open_dataset in a very roundabout way, and we need to make sure not to forget the CF decoding step somewhere in there.
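For reference, xarray exposes that step as xr.decode_cf, so the roundabout path would need an equivalent call at the end, something like:

import xarray as xr

def finish_open(ds_raw: xr.Dataset) -> xr.Dataset:
    # Apply the decoding that open_dataset would normally do:
    # scale_factor/add_offset, _FillValue masking, CF time decoding, etc.
    return xr.decode_cf(ds_raw)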