zarr-developers / zarr-python

An implementation of chunked, compressed, N-dimensional arrays for Python.
https://zarr.readthedocs.io
MIT License

Backwards compatibility for reading Array.filters and Array.codecs #2194

Open TomAugspurger opened 5 days ago

TomAugspurger commented 5 days ago

Zarr version

v3

Numcodecs version

n/a

Python Version

n/a

Operating System

n/a

Installation

n/a

Description

As part of getting xarray ready for zarr v3, I'm looking at how to handle the codec and filter API.

The primary / first place this is accessed is https://github.com/pydata/xarray/blob/1c6300c415efebac15f5ee668a3ef6419dbeab63/xarray/backends/zarr.py#L555-L556, which just reads the values of .filters and .compressor to place them in the DataArray.encoding. A few questions:

  1. I'd like to add a .codecs property to the CodecPipeline ABC. This is fine for the BatchedCodecPipeline which AFAICT is the only actual codec pipeline. Does anyone foresee an issue with that? I'm not sure why that class is abstract and loadable through the config.
  2. Is it fair to say that filters is the same as array_array_codecs?
  3. Is it fair to say that compressor is the same as array_bytes_codecs?
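To make question 1 concrete, here is a rough sketch of what a .codecs property on the ABC might look like. Everything in it is a stand-in written for illustration, not zarr-python's actual code: PipelineSketch, BatchedPipelineSketch and NamedCodec are hypothetical names, and the three-field layout (array-to-array codecs, one array-to-bytes codec, bytes-to-bytes codecs) is an assumption about how the concrete pipeline stores its codecs.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedCodec:
    """Hypothetical stand-in for a codec; it only carries a name."""

    name: str


class PipelineSketch(ABC):
    """Hypothetical stand-in for the CodecPipeline ABC."""

    @property
    @abstractmethod
    def codecs(self) -> tuple[NamedCodec, ...]:
        """All codecs, in the order they are applied when encoding."""


@dataclass(frozen=True)
class BatchedPipelineSketch(PipelineSketch):
    """Hypothetical concrete pipeline; the field layout is an assumption."""

    array_array_codecs: tuple[NamedCodec, ...]
    array_bytes_codec: NamedCodec
    bytes_bytes_codecs: tuple[NamedCodec, ...]

    @property
    def codecs(self) -> tuple[NamedCodec, ...]:
        # Flatten the three groups into one tuple in encode order.
        return (
            *self.array_array_codecs,
            self.array_bytes_codec,
            *self.bytes_bytes_codecs,
        )


pipeline = BatchedPipelineSketch(
    array_array_codecs=(NamedCodec("transpose"),),
    array_bytes_codec=NamedCodec("bytes"),
    bytes_bytes_codecs=(NamedCodec("gzip"),),
)
print([codec.name for codec in pipeline.codecs])
```

With an abstract property, any other pipeline loaded through the config would have to provide .codecs too, which is the main compatibility cost of the change.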

There's also https://github.com/pydata/xarray/blob/1c6300c415efebac15f5ee668a3ef6419dbeab63/xarray/backends/zarr.py#L79, which accesses Codec.codec_id. I'm not sure yet how to handle that, but right now the best option is maybe .to_dict()["name"] (or we could have .to_dict() expose a codec_id)?
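A minimal sketch of the kind of compatibility shim this could turn into on the xarray side, assuming an object either has a string codec_id attribute or a to_dict() method returning a dict with a "name" key. codec_identifier, V2StyleCodec and V3StyleCodec are hypothetical names made up for this example, not part of zarr-python, numcodecs or xarray.

```python
from typing import Any


def codec_identifier(codec: Any) -> str:
    """Return a name for a codec, preferring codec_id when it exists.

    Hypothetical helper: falls back to to_dict()["name"] for objects
    that do not carry a string codec_id attribute.
    """
    codec_id = getattr(codec, "codec_id", None)
    if isinstance(codec_id, str):
        return codec_id
    return codec.to_dict()["name"]


class V2StyleCodec:
    """Stand-in for an object that exposes codec_id."""

    codec_id = "zlib"


class V3StyleCodec:
    """Stand-in for an object that only exposes to_dict()."""

    def to_dict(self) -> dict[str, Any]:
        return {"name": "gzip", "configuration": {"level": 5}}


print(codec_identifier(V2StyleCodec()))
print(codec_identifier(V3StyleCodec()))
```

Whether the two names line up for the same underlying codec is a separate question; this only papers over the attribute lookup.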

Steps to reproduce

n/a

Additional output

No response

d-v-b commented 5 days ago

Is it fair to say that filters is the same as array_array_codecs? Is it fair to say that compressor is the same as array_bytes_codecs?

Unfortunately, no. This came up in an earlier discussion: https://github.com/zarr-developers/zarr-python/pull/1944.

There's also https://github.com/pydata/xarray/blob/1c6300c415efebac15f5ee668a3ef6419dbeab63/xarray/backends/zarr.py#L79, which accesses Codec.codec_id. I'm not sure yet how to handle that, but right now the best option is maybe .to_dict()["name"] (or we could have .to_dict() expose a codec_id)?

Probably the place to look for this functionality would be the classes that adapt the v2 compressor / filters to the v3 codecs api: https://github.com/zarr-developers/zarr-python/blob/fbd1658f1f95e0956a6ac294cf6a0b654841fb1c/src/zarr/codecs/_v2.py#L69

In v3, filters / codecs are stored as dicts, but in #2179 I switch to storing instances of numcodecs.abc.Codec, which I think would permit re-using the old object inspection code?

d-v-b commented 5 days ago

(xref to the issue tracking the top-level codecs / filters / compressor api): https://github.com/zarr-developers/zarr-python/issues/1943

TomAugspurger commented 5 days ago

Thanks for those links. I'll try to digest them and will propose a plan that'll either be compatibility code, or (more likely?) a PR to the migration guide.