Open TomNicholas opened 1 month ago
Thanks for the writeup Tom, a big +1 from me on this effort.
One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?
From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs.
Does an ArrayArrayCodec know about the names of dimensions? Or metadata attributes (i.e. .zmetadata)? Because the VariableCoder has access to that information, as it is stored on the xarray.Variable object passed in.
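To make the question concrete, here is a minimal sketch of the two shapes of interface. It deliberately uses neither xarray nor zarr: ToyVariable, toy_variable_decode, ToyScaleOffsetCodec and codec_from_attrs are all hypothetical names invented for illustration. The assumption being illustrated is that a codec which only sees array values could still do CF-style scale/offset decoding if the relevant attributes were copied into the codec's configuration when the store is written.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVariable:
    """Hypothetical stand-in for xarray.Variable: dims, data and attrs travel together."""
    dims: tuple
    data: list
    attrs: dict = field(default_factory=dict)


def toy_variable_decode(var: ToyVariable) -> ToyVariable:
    """Coder-style decode: the parameters are read from the variable's own attrs."""
    attrs = dict(var.attrs)
    scale = attrs.pop("scale_factor", 1.0)
    offset = attrs.pop("add_offset", 0.0)
    return ToyVariable(var.dims, [x * scale + offset for x in var.data], attrs)


@dataclass(frozen=True)
class ToyScaleOffsetCodec:
    """Codec-style decode: only array values are visible, so the parameters
    have to be fixed in the codec's configuration up front."""
    scale_factor: float = 1.0
    add_offset: float = 0.0

    def decode(self, values: list) -> list:
        return [x * self.scale_factor + self.add_offset for x in values]

    def encode(self, values: list) -> list:
        return [round((x - self.add_offset) / self.scale_factor) for x in values]


def codec_from_attrs(attrs: dict) -> ToyScaleOffsetCodec:
    """Assumed translation step: lift CF attributes into codec configuration."""
    return ToyScaleOffsetCodec(attrs.get("scale_factor", 1.0), attrs.get("add_offset", 0.0))


packed = ToyVariable(
    ("time",), [0, 10, 20], {"scale_factor": 0.5, "add_offset": 100.0, "units": "K"}
)
via_coder = toy_variable_decode(packed)   # attrs come from the variable itself
codec = codec_from_attrs(packed.attrs)    # attrs baked into the codec config
via_codec = codec.decode(packed.data)
```

In this toy both routes give the same values, but note what the codec route cannot see: the dimension names and any attribute that was not lifted into its configuration.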
Idea: Use zarr readers to open and decode netCDF/HDF/etc. data without xarray by lifting xarray's decoding machinery out as new zarr codecs.
This was suggested by @sharkinsspatial in https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2197682388 and requires two components:
1) The chunk manifest storage transformer proposed in https://github.com/zarr-developers/zarr-specs/issues/287, which would allow zarr stores to redirect zarr readers to read byte ranges from inside arbitrary files, including legacy formats such as netCDF. We (particularly @abarciauskas-bgse, Sean and myself) are working on making this happen already, so that we can open netCDF data via zarr using xarray, effectively upstreaming kerchunk's references format as a zarr extension.
2) Decoding according to CF conventions via new Zarr codecs. This is currently done automatically and somewhat opaquely by xarray when reading a netCDF file directly, but it's still done by xarray even when we read a netCDF file via kerchunk/virtualizarr byte range references. This decoding step is well-factored out internally inside xarray but not really publicly exposed (at least not without the rest of xarray as a dependency). The suggestion (originally from @rabernat in https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2048034811) is to lift that code out of xarray as a set of CF-specific zarr codecs that get called when a zarr reader opens a store with a manifest pointing to a netCDF file.
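As a rough illustration of the first component, a chunk manifest can be thought of as a mapping from chunk keys to byte ranges inside some other file. The sketch below is a toy and not the proposed spec: the path/offset/length field names, the read_chunk helper and the fake file layout are assumptions made for the example.

```python
import os
import struct
import tempfile


def read_chunk(manifest: dict, chunk_key: str) -> bytes:
    """Fetch one chunk's raw bytes by following its manifest entry."""
    entry = manifest[chunk_key]
    with open(entry["path"], "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])


# Build a fake "legacy" file: a 16-byte header followed by two chunks,
# each holding four little-endian int16 values.
header = b"FAKEHEADER......"
chunk0 = struct.pack("<4h", 0, 1, 2, 3)
chunk1 = struct.pack("<4h", 4, 5, 6, 7)

with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(header + chunk0 + chunk1)
    path = tmp.name

# The manifest records where each chunk lives inside the file (hypothetical layout).
manifest = {
    "0": {"path": path, "offset": 16, "length": 8},
    "1": {"path": path, "offset": 24, "length": 8},
}

decoded = [struct.unpack("<4h", read_chunk(manifest, key)) for key in ("0", "1")]
os.remove(path)
```

The reader never parses the header or needs to understand the original format; it only follows byte ranges, which is what would let a zarr reader sit on top of a netCDF file.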
To be really useful this probably also requires variable-length chunking in zarr (i.e. ZEP003).
The advantages of this are: a) a clearer separation of concerns, with fewer "magic" steps hidden inside xarray, b) applications that can read zarr but don't want to use xarray could also read and fully decode netCDF data (i.e. pure-zarr users see the same data as xarray users), c) clearer steps towards generalizing to non-CF encoding conventions used in other domains of science, d) opening the door to zarr becoming a "universal reader" of any file format whose data can be expressed as a manifest of byte ranges and decoding steps can be expressed as zarr codecs.
Most of the work here would be on the xarray end - there is an ancient issue suggesting something similar in https://github.com/pydata/xarray/issues/155, and a nice explanation of how xarray currently does this step in https://github.com/pydata/xarray/issues/8548. Currently it looks essentially like this
where one of xarray's options for datastore is for zarr, and another is for netCDF (these are xarray's "backends"). I'm proposing something more like
where non-xarray users can still get all of
zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file
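Reading that chain from right to left, the decoding layer could be sketched as an ordered list of array-to-array steps applied to the bytes that come out of the manifest. This is only a toy under stated assumptions: bytes_to_int16, mask_fill_value, scale_offset and decode_chunk are hypothetical helpers rather than zarr or xarray APIs, and masked values are represented as None for simplicity.

```python
import struct
from typing import Callable, List

ArrayDecoder = Callable[[list], list]


def bytes_to_int16(raw: bytes) -> list:
    """Bytes-to-array step: interpret raw bytes as little-endian int16."""
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


def mask_fill_value(fill_value: int) -> ArrayDecoder:
    """CF-style _FillValue masking, with None standing in for a masked value."""
    def decode(values: list) -> list:
        return [None if v == fill_value else v for v in values]
    return decode


def scale_offset(scale: float, offset: float) -> ArrayDecoder:
    """CF-style unpacking: value * scale_factor + add_offset, skipping masked entries."""
    def decode(values: list) -> list:
        return [None if v is None else v * scale + offset for v in values]
    return decode


def decode_chunk(raw: bytes, array_decoders: List[ArrayDecoder]) -> list:
    """Apply the bytes-to-array step, then each array-to-array step in order."""
    values = bytes_to_int16(raw)
    for step in array_decoders:
        values = step(values)
    return values


raw = struct.pack("<4h", 0, 10, -9999, 20)
decoded = decode_chunk(raw, [mask_fill_value(-9999), scale_offset(0.5, 100.0)])
```

The order of the steps matters here: masking has to run before unpacking, otherwise the fill value would be scaled like real data.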
One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?