xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0

Lack of resilience towards missing `_ARRAY_DIMENSIONS` xarray's special zarr attribute #280

Open eschalkargans opened 7 months ago

eschalkargans commented 7 months ago

Hello,

Bug Description

I am currently experimenting with datatree (xarray-datatree==0.0.13) to open a Zarr folder.

I assumed that datatree would be able to open any Zarr store. In practice, it seems that datatree can only open Zarr stores that were generated with xarray. When the _ARRAY_DIMENSIONS attribute is missing from the metadata in the .zmetadata file at the root of the store, datatree is unable to load it and raises KeyError: '_ARRAY_DIMENSIONS'.
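
To see up front which arrays would trigger the error, the consolidated metadata can be inspected with the standard library alone. This is a minimal sketch, assuming a Zarr v2 store with consolidated metadata; the helper name `find_arrays_without_dims` and the sample metadata are made up for illustration and are not part of xarray or datatree:

```python
def find_arrays_without_dims(zmetadata: dict) -> list[str]:
    """Return the paths of arrays whose .zattrs lack `_ARRAY_DIMENSIONS`."""
    entries = zmetadata["metadata"]
    missing = []
    for key in entries:
        # In consolidated Zarr v2 metadata, each array has a "<path>/.zarray" entry.
        if key == ".zarray" or key.endswith("/.zarray"):
            path = key[: -len(".zarray")].rstrip("/")
            attrs_key = f"{path}/.zattrs" if path else ".zattrs"
            attrs = entries.get(attrs_key, {})
            if "_ARRAY_DIMENSIONS" not in attrs:
                missing.append(path)
    return sorted(missing)


# Illustrative content of a .zmetadata file (hypothetical store, trimmed).
sample = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "z/.zarray": {"shape": [3]},
        "z/.zattrs": {"_ARRAY_DIMENSIONS": ["z"]},
        "my_xda/.zarray": {"shape": [2, 3]},
        "my_xda/.zattrs": {},
    },
}

print(find_arrays_without_dims(sample))  # prints ['my_xda']
```

For a real store, the dictionary would come from `json.load` on the .zmetadata file instead of the inline sample.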

Reproduce the Bug

The following gist contains a small Python script that reproduces the issue:

https://gist.github.com/eschalkargans/6c8708370ad6b7b58eebe95aa95084ab

Here is the sequence:

Discussion

> Because of these choices, Xarray cannot read arbitrary array data, but only Zarr data with valid _ARRAY_DIMENSIONS or NCZarr attributes on each array (NCZarr dimension names are defined in the .zarray file).

More information about _ARRAY_DIMENSIONS: Zarr Encoding Specification

The documentation explicitly states that Xarray cannot read arbitrary array data. So, this issue is more a feature request than a bug description. It is currently expected that such files are not readable.

However, developers may find themselves at one point or another with plain Zarr files that are incompatible with the current xarray implementation. So, I think there should be a way to open these Zarr files with no dimension-names. Perhaps users could supply a mapping for the missing dimensions themselves, e.g.

specifying only the missing attributes, so that the contents of .zmetadata are merged with the user-provided _array_dimensions:

open_datatree(zarr_path, engine="zarr", _array_dimensions={
    "z": "z"
})

or even a full mapping from each variable's path in the Zarr hierarchy to its list of dimension names:

open_datatree(zarr_path, engine="zarr", _array_dimensions={
    "z": "z", "label": "label", "my_xda": ["label", "z"]
})
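
The merge behaviour proposed here (user-provided names fill gaps but never override what the store already declares) could be sketched as below. Note that `_array_dimensions` is not an existing `open_datatree` argument; `merge_array_dimensions` and both input dictionaries are hypothetical and only illustrate the intended semantics:

```python
def merge_array_dimensions(read_dims: dict, user_dims: dict) -> dict:
    """Fill in dimension names only for variables that have none in the store."""
    merged = dict(read_dims)
    for path, dims in user_dims.items():
        # Accept a single name as shorthand for a one-dimensional variable.
        dims = [dims] if isinstance(dims, str) else list(dims)
        if merged.get(path) is None:
            merged[path] = dims
    return merged


# Dimension names as read from the store; None marks a missing _ARRAY_DIMENSIONS.
read_dims = {"z": ["z"], "label": None, "my_xda": None}
# Hypothetical user-provided mapping, in the shape proposed above.
user_dims = {"z": "depth", "label": "label", "my_xda": ["label", "z"]}

merged = merge_array_dimensions(read_dims, user_dims)
print(merged)
```

In this sketch the name stored for `z` is kept even though the user supplied a different one, which matches the "only missing attributes" idea; whether stored or user-provided names should win is a design choice.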

Or are you perhaps waiting for a future update of the Zarr specification that would fully incorporate named dimensions? In that case, what strategy would you recommend to datatree users for fixing their Zarr stores? Updating the .zmetadata directly?
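
For reference, patching the metadata by hand might look like the following. This is a rough sketch, not a recommended procedure: it assumes a local Zarr v2 directory store, the function name `add_dimension_names` is made up, and the store created in the temporary directory is a stand-in containing only the metadata files:

```python
import json
import tempfile
from pathlib import Path


def add_dimension_names(store: Path, dims_by_array: dict[str, list[str]]) -> None:
    """Write `_ARRAY_DIMENSIONS` into each array's .zattrs and into .zmetadata."""
    zmetadata_path = store / ".zmetadata"
    consolidated = (
        json.loads(zmetadata_path.read_text()) if zmetadata_path.exists() else None
    )
    for array, dims in dims_by_array.items():
        zattrs_path = store / array / ".zattrs"
        attrs = json.loads(zattrs_path.read_text()) if zattrs_path.exists() else {}
        # Keep any dimension names that are already stored.
        attrs.setdefault("_ARRAY_DIMENSIONS", list(dims))
        zattrs_path.write_text(json.dumps(attrs, indent=4))
        if consolidated is not None:
            # .zmetadata holds a copy of every .zattrs, so keep it in sync.
            consolidated["metadata"][f"{array}/.zattrs"] = attrs
    if consolidated is not None:
        zmetadata_path.write_text(json.dumps(consolidated, indent=4))


# Stand-in store: one array "my_xda" without any .zattrs file.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    (store / "my_xda").mkdir()
    zarray = {"shape": [2, 3]}
    (store / "my_xda" / ".zarray").write_text(json.dumps(zarray))
    (store / ".zmetadata").write_text(
        json.dumps(
            {
                "zarr_consolidated_format": 1,
                "metadata": {"my_xda/.zarray": zarray},
            }
        )
    )

    add_dimension_names(store, {"my_xda": ["label", "z"]})
    patched = json.loads((store / ".zmetadata").read_text())

print(patched["metadata"]["my_xda/.zattrs"])
```

Both the per-array .zattrs and the consolidated copy are updated, since a reader may use either one. If zarr-python is installed, setting the attribute through the opened group and then calling `zarr.consolidate_metadata` should achieve the same result without editing JSON by hand.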


Thanks!

TomNicholas commented 7 months ago

This is definitely an xarray-level issue, not a datatree-specific issue. All datatree does is open each group of a zarr store using xarray.open_dataset and put them in a tree.

> However, developers may find themselves at one point or another with plain Zarr files that are incompatible with the current xarray implementation. So, I think there should be a way to open these Zarr files with no dimension-names.

I have some thoughts about this but I think you should re-raise it on the xarray issue tracker instead!