ayushnag opened this issue 3 months ago
Thinking about this more, once zarr-python `Array` objects support the manifest storage transformer, we should be able to write a new method on `ManifestArray` objects which constructs the `zarr.Array` directly, i.e.

```python
def to_zarr_array(self: ManifestArray) -> zarr.Array:
    ...
```
This opens up some interesting possibilities. Currently when you call `.compute` on a virtual dataset you get a `NotImplementedError`, but with this we could change the behaviour to instead:

- turn each `ManifestArray` into a `zarr.Array`,
- open those arrays (as `xr.open_zarr` would),
- then fall back to the `.compute` behaviour that xarray would normally use.

The result would be that a user could actually treat a "virtual" xarray `Dataset` as a normal xarray `Dataset`, because if they tried to `.compute` it, it should transform itself into one under the hood!
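To make the idea concrete, here is a minimal, self-contained sketch of that "materialize on compute" pattern. `VirtualArray` and `to_concrete` are hypothetical stand-ins for `ManifestArray` and the proposed `to_zarr_array()` plus read step; this is not VirtualiZarr's real API.

```python
class VirtualArray:
    """Knows *where* the bytes live (a chunk manifest), not the bytes themselves."""

    def __init__(self, shape, fill_value):
        self.shape = shape
        self._fill_value = fill_value  # pretend this came from the chunk manifest

    def to_concrete(self):
        # The real method would build a zarr.Array from the chunk manifest
        # and fetch the referenced bytes; here we just fabricate values.
        nrows, ncols = self.shape
        return [[self._fill_value] * ncols for _ in range(nrows)]


def compute(arr):
    # Today: a virtual dataset raises NotImplementedError on .compute.
    # Proposed: transform the virtual array into a concrete one under the
    # hood, then proceed with the normal compute path.
    if isinstance(arr, VirtualArray):
        arr = arr.to_concrete()
    return arr


data = compute(VirtualArray(shape=(2, 3), fill_value=1.0))
```

The point of the sketch is only the control flow: `.compute` dispatches on "is this still virtual?" and silently upgrades, so callers never see the virtual/concrete distinction.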
Then you could open any data format that VirtualiZarr understands via `vz.open_virtual_dataset` (or maybe eventually `xr.open_dataset(engine='virtualizarr')`), and if you want to treat it like an in-memory xarray `Dataset` from that point on then you can, but if you prefer to manipulate it and save it out as a virtual zarr store on disk you can also do that!
I still need to think through some of the details, but this could potentially be a neat alternative approach to https://github.com/pydata/xarray/issues/9281, and not actually require any upstream changes to xarray!
cc @d-v-b
(One subtlety I'm not sure about here is indexes. I think you would probably want a solution for loading indexes as laid out in https://github.com/zarr-developers/VirtualiZarr/issues/18, and then have the indexes understand how they can be loaded.)
Another subtlety to consider is when the CF decoding should happen. You would then have effectively done `open_dataset` in a very roundabout way, and we need to make sure not to forget the CF decoding step in there somewhere.
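For readers unfamiliar with what that step does: xarray's `decode_cf` turns raw numbers plus CF attributes into real datetimes (among other things). The hand-rolled sketch below, which supports only a couple of unit strings and is not xarray's implementation, shows what the roundabout `open_dataset` path would silently lose if the step were forgotten.

```python
from datetime import datetime, timedelta


def decode_cf_time(values, units):
    # Parse a CF-style units string, e.g. "days since 2000-01-01".
    # Only "days"/"hours" and ISO-format epochs are handled in this sketch.
    unit, _, epoch = units.partition(" since ")
    origin = datetime.fromisoformat(epoch)
    step = {"days": timedelta(days=1), "hours": timedelta(hours=1)}[unit]
    return [origin + v * step for v in values]


times = decode_cf_time([0, 1, 2], units="days since 2000-01-01")
```

Without this decoding, the user would see the integers `[0, 1, 2]` as their time coordinate instead of actual dates.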
I am working on a feature in `virtualizarr` to read dmrpp metadata files and create a virtual `xr.Dataset` containing manifest arrays that can then be virtualized. This is the current workflow:

However the chunk manifest, encoding, attrs, etc. are already in `mds`, so is it possible to read data directly from this dataset? My understanding is that once the "chunk manifest" ZEP is approved and the `zarr-python` reader in `xarray` is updated, this should be possible. The `xarray` reader for `kerchunk` can accept a file or the reference json object directly from kerchunk's `SingleHdf5ToZarr` and `MultiZarrToZarr`. So similarly, can we extract the refs from `mds` and pass them to `xr.open_dataset()` directly?

There probably still needs to be a function that extracts the refs so that xarray can make a new `Dataset` object with all the indexes, cf_time handling, and `open_dataset` checks. Even reading directly from the `ManifestArray` dataset is possible, but I'm not sure how the new dataset object with numpy arrays and indexes would be kept separate from the original dataset.
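For context, this is roughly the shape of the kerchunk reference format alluded to above: a dict mapping zarr store keys to either inline metadata (JSON strings) or `[url, offset, length]` triples pointing into the original file. A hypothetical "extract the refs from `mds`" function would emit something like this; the path and byte ranges below are made up for illustration.

```python
refs = {
    "version": 1,
    "refs": {
        # group/array metadata is stored inline as JSON strings
        ".zgroup": '{"zarr_format": 2}',
        "air/.zarray": '{"shape": [2, 3], "chunks": [2, 3], "dtype": "<f4"}',
        # one entry per chunk: [url, byte offset, byte length]
        "air/0.0": ["s3://bucket/file.nc", 20000, 7300],
    },
}
```

Since a `ManifestArray`'s chunk manifest already holds exactly these (path, offset, length) triples per chunk, the extraction function would mostly be a re-keying exercise.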