zarr-developers / zarr-specs

Zarr core protocol for storage and retrieval of N-dimensional typed arrays
https://zarr-specs.readthedocs.io/
Creative Commons Attribution 4.0 International

Zarr as a "universal reader" for netCDF etc., via new CF decoding codecs #303

Open TomNicholas opened 1 month ago

TomNicholas commented 1 month ago

Idea: Use zarr readers to open and decode netCDF/HDF/etc. data without xarray by lifting xarray's decoding machinery out as new zarr codecs.

This was suggested by @sharkinsspatial in https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2197682388 and requires two components:

1) The chunk manifest storage transformer proposed in https://github.com/zarr-developers/zarr-specs/issues/287, which would allow zarr stores to redirect zarr readers to read byte ranges from inside arbitrary files, including legacy formats such as netCDF. We (particularly @abarciauskas-bgse, Sean and myself) are working on making this happen already, so that we can open netCDF data via zarr using xarray, effectively upstreaming kerchunk's references format as a zarr extension.

2) Decoding according to CF conventions via new Zarr codecs. This is currently done automatically and somewhat opaquely by xarray when reading a netCDF file directly, but it's still done by xarray even when we read a netCDF file via kerchunk/virtualizarr byte range references. This decoding step is well-factored out internally inside xarray but not really publicly exposed (at least not without the rest of xarray as a dependency). The suggestion (originally from @rabernat in https://github.com/zarr-developers/VirtualiZarr/issues/68#issuecomment-2048034811) is to lift that code out of xarray as a set of CF-specific zarr codecs that get called when a zarr reader opens a store with a manifest pointing to a netCDF file.
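To make component 1 concrete, here is a minimal sketch of what "redirecting a reader to byte ranges" could look like. The manifest layout (`path`, `offset`, `length` keyed by chunk index) and the `read_chunk` helper are illustrative assumptions, not the format proposed in zarr-specs #287, and the "legacy file" is just a temporary file with a fake header.

```python
import os
import tempfile

# Hypothetical manifest layout: chunk key -> where that chunk's bytes live.
# Illustrative only; not the format proposed in zarr-specs #287.
def read_chunk(manifest: dict, chunk_key: str) -> bytes:
    """Return the raw bytes that a manifest entry points at."""
    entry = manifest[chunk_key]
    with open(entry["path"], "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])

# Stand-in for a legacy file: an 8-byte header followed by two 4-byte "chunks".
with tempfile.NamedTemporaryFile(delete=False, suffix=".nc") as tmp:
    tmp.write(b"HEADER--" + b"AAAA" + b"BBBB")
    legacy_path = tmp.name

manifest = {
    "0.0": {"path": legacy_path, "offset": 8, "length": 4},
    "0.1": {"path": legacy_path, "offset": 12, "length": 4},
}

first = read_chunk(manifest, "0.0")
second = read_chunk(manifest, "0.1")
os.remove(legacy_path)  # clean up the temporary file
```

The reader never parses the file's own header; it only follows the offsets, which is why any format whose chunks are contiguous byte ranges could in principle be addressed this way.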

To be really useful this probably also requires variable-length chunking in zarr (i.e. ZEP003).

The advantages of this are:

a) a clearer separation of concerns, with fewer "magic" steps hidden inside xarray,

b) applications that can read zarr but don't want to use xarray could also read and fully decode netCDF data (i.e. pure-zarr users see the same data as xarray users),

c) clearer steps towards generalizing to non-CF encoding conventions used in other domains of science,

d) opening the door to zarr becoming a "universal reader" of any file format whose data can be expressed as a manifest of byte ranges and whose decoding steps can be expressed as zarr codecs.

Most of the work here would be on the xarray end. There is an ancient issue suggesting something similar in https://github.com/pydata/xarray/issues/155, and a nice explanation of how xarray currently does this step in https://github.com/pydata/xarray/issues/8548. Currently it looks essentially like this:

xarray.Dataset < dask chunking < CF decoding (using xarray's VariableCoder) < opening via datastore < file

where one of xarray's options for datastore is for zarr, and another is for netCDF (these are xarray's "backends"). I'm proposing something more like:

xarray.Dataset < dask chunking < zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file

where non-xarray users can still get all of:

zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file
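Reading the diagram right to left, each layer is a function applied to the output of the previous one. The sketch below illustrates two of those layers in plain Python: a bytes-to-array step and a CF-style array-to-array step that masks `_FillValue` and applies `scale_factor` and `add_offset`. The function names are hypothetical and the data is made up; this is not xarray's or zarr's actual implementation.

```python
import math
import struct

def bytes_to_int16(raw: bytes) -> list:
    """Bytes -> array step: interpret raw bytes as little-endian int16."""
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}h", raw))

def cf_unpack(values, scale_factor=1.0, add_offset=0.0, fill_value=None):
    """Array -> array step: mask the fill value, then apply scale and offset."""
    return [
        math.nan if v == fill_value else v * scale_factor + add_offset
        for v in values
    ]

# Pretend these bytes came from a byte range inside a netCDF file.
raw = struct.pack("<4h", 0, 10, -9999, 20)

# Attributes that a CF-packed variable typically carries.
attrs = {"scale_factor": 0.5, "add_offset": 100.0, "_FillValue": -9999}

decoded = cf_unpack(
    bytes_to_int16(raw),
    scale_factor=attrs["scale_factor"],
    add_offset=attrs["add_offset"],
    fill_value=attrs["_FillValue"],
)
```

Note that the second step needs values from the variable's attributes, which is what makes the question of how a codec gets access to metadata relevant.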


One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?

d-v-b commented 1 month ago

Thanks for the writeup Tom, a big +1 from me on this effort.

> One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?

From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs

TomNicholas commented 1 month ago

> From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs

Does an ArrayArrayCodec know about the names of dimensions? Or metadata attributes (i.e. .zmetadata)? Because the VariableCoder has access to that information, as it is stored on the xarray.Variable object passed in.
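For illustration, if a codec cannot see the array's attributes at decode time, one possible workaround is to copy the attributes it needs into the codec's own configuration when the store is written. The sketch below does that for CF time units. `CFTimeCodecSketch` and `from_attrs` are hypothetical names rather than zarr or xarray API, the class does not subclass any real zarr codec base class, and it assumes a standard calendar and a unit that `datetime.timedelta` accepts as a keyword (e.g. `days`, `hours`, `seconds`).

```python
from datetime import datetime, timedelta

class CFTimeCodecSketch:
    """Hypothetical array -> array codec; not part of zarr or xarray."""

    def __init__(self, units: str):
        # units looks like "days since 2000-01-01"
        unit, _, reference = units.partition(" since ")
        self.step = timedelta(**{unit: 1})
        self.reference = datetime.fromisoformat(reference)

    @classmethod
    def from_attrs(cls, attrs: dict) -> "CFTimeCodecSketch":
        # Capture the needed attribute once, as codec configuration.
        return cls(units=attrs["units"])

    def decode(self, values):
        """Integer offsets -> datetimes."""
        return [self.reference + v * self.step for v in values]

    def encode(self, times):
        """Datetimes -> integer offsets (inverse of decode)."""
        return [(t - self.reference) // self.step for t in times]

codec = CFTimeCodecSketch.from_attrs({"units": "days since 2000-01-01"})
decoded = codec.decode([0, 1, 31])
roundtrip = codec.encode(decoded)
```

This sidesteps attribute access but duplicates information between the attributes and the codec configuration, and it does nothing for coders that need dimension names, so it is only a partial answer to the question above.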