Open TomNicholas opened 5 months ago
I've been looking at this more, in the context of fixing #189.
I think we could probably almost use xarray.decode_cf
out of the box, with a few subtleties:
decode_cf
can be turned off via flags, which is what we want, except for the BooleanCoder usage heredecode_cf
includes a bunch of more complicated logic for deciding which variables are coords
, and we really need that to be the same as in xarray to guard against issues like #189.LazilyIndexedArray
hereAligning what we're doing in virtualizarr with this would be another reason to implement a virtualizarr xarray backend engine, see #35.
cc @ayushnag
I think the idea of an virtualizarr backend is appealing. One idea for how it can be implemented is by loading the actual data file and then also creating and storing the byte references in the virtual dataset. This way the dataset structure creation and data loading is handled by xarray, then the bytes are just an add on. This all hinges on if data loading and reference creation both doesn't take much extra time compared to doing just one.
As a side note, this might make the reference creation process (https://github.com/zarr-developers/VirtualiZarr/pull/87) simpler as well. Instead of searching for attrs, encoding, dimension names, etc the "chunk reader" only needs to create a low level chunk manifest (bytes, offset, path). The rest of the information is retrieved from the netcdf by xarray. Not sure if that is an actually time consuming part of reference creation however.
It would also add the ability to easily inline data (https://github.com/zarr-developers/VirtualiZarr/issues/62)
Basically the idea is that xr.open_dataset("data.nc", engine="virtualizarr")
loads the netcdf file normally but then also reads byte ranges and creates ManifestArrays
. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays will be replaced with ManifestArrays
unless the top level params ask for data such as loadable_variables
and cftime_variables
Thanks @ayushnag . I think your suggestion is almost the opposite of what I'm suggesting doing in this issue, so I moved it to #221 to discuss there.
(Opposite in that I'm suggesting calling just part of xarray's open_dataset
decoding logic from within open_virtual_dataset
, whereas IIUC you are suggesting basically using all of xr.open_dataset
and making it virtual as a final step.)
I think your suggestion makes sense and yes my suggestion was more related to the backend issue.
My only concern was that if the solution is to add this to open_virtual_dataset
it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes. That's why I was suggesting opening the file and then adding references.
Although if decode_cf
is the main portion left then the reimplementations are limited and your approach definitely makes more sense. Is the plan to add a modified copy of this function to virtualizarr?
it can get tricky to essentially reimplement many parts of xarray and then also maintain them with upstream xarray changes.
I completely agree, which is why I've been looking into which parts of xarray we need. I think it's just decode_cf
though (and actually only very limited parts of that).
Is the plan to add a modified copy of this function to virtualizarr?
Either that or make some small modifications to xarray upstream so that we can import and use xr.decode_cf
directly. See also https://github.com/pydata/xarray/issues/4490, which talks about making xarray's interface for decoding clearer.
We might be able to replace some logic (especially that implemented in #156) with a call to
xarray.decode_cf
. This function can accept aDataset
and return a new dataset with different attributes and encoding etc. I think it's used internally in some of xarray's backends.Also there is an idea to expose a corresponding
xarray.encode_cf
function in xarray, which we might also be able to use (see https://github.com/pydata/xarray/issues/4412).