zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0
68 stars 10 forks source link

Replace VirtualiZarr.ZArray with zarr.array.Array #175

Open ghidalgo3 opened 4 days ago

ghidalgo3 commented 4 days ago

Before going further with these changes, I'd like to discuss if this direction is correct or not. Based on #17, I tried using Zarr-V3's Array definition instead of VirtualiZarr's ZArray definition. Keeping the names straight is hard, ZArray is the current code and Array is zarr's type.

The problems I ran into were:

  1. The set of allowed kwargs for Array are different between v2 and v3 for zarr, but VirtualiZarr.ZArray I believe is first created with the v2 properties and then transformed to v3 format when zarr_v3_array_metadata is called. After the replacement, anywhere VirtualiZarr creates a Array, one must choose v2 or v3 kwargs, or again introduce something that abstracts over the versions. Unless VirtualiZarr chooses to only support Zarr V3 only, which I believe is the intention.
  2. During concatenation, VirtualiZarr checks if the codecs between ZArrays all match, which was easier when ZArray declared the codecs. Now, if you support v2 and v3 then the code checking logic has to branch based on the type of zarr.array.metadata because the internal zarr property names have changed. Probably there will be other code-level incompatibilities from metadata property name differences, I think #94 ran into this too.

This is just from a few hours of trying this out.

If VirtualiZarr needs to support both Zarr V2 and V3 stores, then this change needs to handle all the little differences in Zarr's API between 2 and 3. But if only V3 is supported, this becomes easier and replacing ZArray with Array should not be too difficult.

Thoughts?

Incompatibilities:

rabernat commented 4 days ago

Just wanted to say that the Zarr V3 class behaviors are not set in stone yet. If you have feedback on the V3 API and potential incompatibilities between V2, please report them issues on Zarr-python: https://github.com/zarr-developers/zarr-python/issues

jsignell commented 4 days ago

My sense is that in terms of file format both zarr v2 and zarr v3 should be supported but in terms of library dependencies it would be fine to mandate that only zarr-python >= 3 is supported.

TomNicholas commented 4 days ago

Thanks for taking the initiative here @ghidalgo3 !

I tried using Zarr-V3's Array definition instead of VirtualiZarr's ZArray definition

I was actually hoping we could use some class from zarr-python that wasn't a fully-featured zarr.Array, instead using something that just represented the metadata for one array. This ArrayV3Metadata class might be what we need? Is that public API?

If VirtualiZarr needs to support both Zarr V2 and V3 stores, then this change needs to handle all the little differences in Zarr's API between 2 and 3. But if only V3 is supported, this becomes easier and replacing ZArray with Array should not be too difficult.

I see two ways to do this:

  1. Make VirtualiZarr internally agnostic to v2 vs v3. Then a ManifestArray can represent either a v2 array or a v3 array, distinguished via its metadata. For example the attributes of a ManifestArray might become

    from zarr.metadata import ArrayMetadata
    
    class ManifestArray:
        _chunkmanifest: ChunkManifest
        _zarray: ArrayMetadata

    where ArrayMetadata is an ABC that needs to be a concrete ArrayV2Metadata or ArrayV3Metadata.

  2. VirtualiZarr internally only uses v3 arrays, and whilst we support reading zarr and writing zarr v2 arrays, they are coerced at reading / writing time.

I'm not sure which of these two is the best approach.

I also completely agree with @jsignell - all of this should be supported through only a dependency on zarr>=3.0.0, as this PR adds.