zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0
123 stars 24 forks source link

Use xarray's encode_cf / decode_cf functions to handle CF conventions #157

Open TomNicholas opened 5 months ago

TomNicholas commented 5 months ago

We might be able to replace some logic (especially that implemented in #156) with a call to xarray.decode_cf. This function can accept a Dataset and return a new dataset with different attributes and encoding etc. I think it's used internally in some of xarray's backends.

Also there is an idea to expose a corresponding xarray.encode_cf function in xarray, which we might also be able to use (see https://github.com/pydata/xarray/issues/4412).

TomNicholas commented 3 months ago

I've been looking at this more, in the context of fixing #189.

I think we could probably almost use xarray.decode_cf out of the box, with a few subtleties:

Aligning what we're doing in virtualizarr with this would be another reason to implement a virtualizarr xarray backend engine, see #35.

cc @ayushnag

ayushnag commented 3 months ago

I think the idea of an virtualizarr backend is appealing. One idea for how it can be implemented is by loading the actual data file and then also creating and storing the byte references in the virtual dataset. This way the dataset structure creation and data loading is handled by xarray, then the bytes are just an add on. This all hinges on if data loading and reference creation both doesn't take much extra time compared to doing just one.

As a side note, this might make the reference creation process (https://github.com/zarr-developers/VirtualiZarr/pull/87) simpler as well. Instead of searching for attrs, encoding, dimension names, etc the "chunk reader" only needs to create a low level chunk manifest (bytes, offset, path). The rest of the information is retrieved from the netcdf by xarray. Not sure if that is an actually time consuming part of reference creation however.

It would also add the ability to easily inline data (https://github.com/zarr-developers/VirtualiZarr/issues/62)

Basically the idea is that xr.open_dataset("data.nc", engine="virtualizarr") loads the netcdf file normally but then also reads byte ranges and creates ManifestArrays. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays will be replaced with ManifestArrays unless the top level params ask for data such as loadable_variables and cftime_variables

TomNicholas commented 3 months ago

Thanks @ayushnag . I think your suggestion is almost the opposite of what I'm suggesting doing in this issue, so I moved it to #221 to discuss there.

(Opposite in that I'm suggesting calling just part of xarray's open_dataset decoding logic from within open_virtual_dataset, whereas IIUC you are suggesting basically using all of xr.open_dataset and making it virtual as a final step.)

ayushnag commented 3 months ago

I think your suggestion makes sense and yes my suggestion was more related to the backend issue.

My only concern was that if the solution is to add this to open_virtual_dataset it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes. That's why I was suggesting opening the file and then adding references.

Although if decode_cf is the main portion left then the reimplementations are limited and your approach definitely makes more sense. Is the plan to add a modified copy of this function to virtualizarr?

TomNicholas commented 3 months ago

it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes.

I completely agree, which is why I've been looking into which parts of xarray we need. I think it's just decode_cf though (and actually only very limited parts of that).

Is the plan to add a modified copy of this function to virtualizarr?

Either that or make some small modifications to xarray upstream so that we can import and use xr.decode_cf directly. See also https://github.com/pydata/xarray/issues/4490, which talks about making xarray's interface for decoding clearer.