xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0

Creating DataTree from DataArrays #320

Open Sevans711 opened 4 months ago

Sevans711 commented 4 months ago

What are the best ways to create a DataTree from DataArrays?

I don't usually work with Dataset objects, but rather with DataArray objects. I want the option to re-use the (physics-motivated) arithmetic throughout my code while manipulating multiple DataArrays at once. I initially thought about combining those DataArrays into a Dataset, but then learned that this is not really a sufficient solution (more details here). DataTree seems like a better solution!

The only issue I am having right now is that DataTree seems to add more complexity than I need, because it converts everything into Datasets instead of allowing me to have a tree of DataArrays. This is not a deal-breaker for me, but it definitely increases the mental load of using DataTree. One thing that would help significantly would be an easy way to create a DataTree from a list or iterable of arrays.

Here are my suggestions. Please let me know if there are already easy ways to do this which I just didn't find yet!

1 - Improve DataTree.from_dict

DataTree.from_dict already exists, but it fails for unnamed DataArrays (when using datatree.__version__=='0.0.14'). For example:

import numpy as np
import xarray as xr
import datatree as xrd

# generate some 3D data
nx, ny, nz = (32, 32, 32)
dx, dy, dz = (0.1, 0.1, 0.1)
coords=dict(x=np.arange(nx)*dx, y=np.arange(ny)*dy, z=np.arange(nz)*dz)
array = xr.DataArray(np.random.random((nx,ny,nz)), coords=coords)

# take some 2D slices of the 3D data
arrx = array.isel(x=0)
arry = array.isel(y=0)
arrz = array.isel(z=0)

# combine those slices into a single object so we can apply operations to all arrays at once,
#    using the same interface & code for applying operations to a single DataArray!
tree = xrd.DataTree.from_dict(dict(x0=arrx, y0=arry, z0=arrz))

This fails, with: ValueError: unable to convert unnamed DataArray to a Dataset without providing an explicit name

However, it succeeds if I give names to the arrays first:

arrx.name = 'x0'
arry.name = 'y0'
arrz.name = 'z0'
tree = xrd.DataTree.from_dict(dict(x0=arrx, y0=arry, z0=arrz))

I suggest improving the DataTree.from_dict method so that it also succeeds when given unnamed DataArrays. In that case, it should construct the Dataset from each DataArray, using the key provided to from_dict as the DataArray.name for any DataArray without a name.
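Until something like this exists, the proposed behavior can be approximated outside of datatree. The following is only a minimal sketch, not part of the datatree API: name_unnamed is a hypothetical helper, it assumes flat keys such as 'x0' (no '/' path separators), and it assumes each value exposes .name and .rename(new_name) the way xarray.DataArray does.

```python
def name_unnamed(arrays_by_key):
    """Return a new dict in which every unnamed array is named after its key.

    Hypothetical helper, not part of datatree. Assumes each value has a
    ``name`` attribute and a ``rename(new_name)`` method that returns a
    renamed copy, as xarray.DataArray does.
    """
    named = {}
    for key, arr in arrays_by_key.items():
        if arr.name is None:
            # rename returns a new object, so the caller's array is untouched
            arr = arr.rename(key)
        named[key] = arr
    return named
```

With the unnamed arrays from the first example, xrd.DataTree.from_dict(name_unnamed(dict(x0=arrx, y0=arry, z0=arrz))) should then behave like the version with explicitly assigned names.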

2 - Add a DataTree.from_arrays method

Ideally, I would like to be able to do something like (using arrays from the example above):

tree = xrd.DataTree.from_arrays([arrx, arry, arrz])

This should work for any list of named arrays, and it can use the array names to infer the keys of the resulting DataTree. It will probably be easy to implement. I am thinking of something like this:

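@classmethod
def from_arrays(cls, arrays):
    # materialize first, so that generators can be iterated over twice
    arrays = list(arrays)
    # every array needs a name, since the names become the keys of the tree
    if any(getattr(arr, 'name', None) is None for arr in arrays):
        raise ValueError('from_arrays requires all arrays to have names!')
    d = {arr.name: arr for arr in arrays}
    return cls.from_dict(d)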
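
One detail the sketch above leaves open is what happens when two arrays share a name: the dict comprehension would silently keep only the last one. A possible variant, written as a standalone function so it can be tried without modifying datatree, is below. arrays_to_dict is a hypothetical name, and raising on duplicate names is an assumption about the desired behavior, not something datatree does.

```python
from collections import Counter


def arrays_to_dict(arrays):
    """Build a {name: array} dict that could be passed to DataTree.from_dict.

    Hypothetical helper, not part of datatree. Only relies on each array
    having a ``name`` attribute, as xarray.DataArray does.
    """
    arrays = list(arrays)  # allow generators, which can only be iterated once
    names = [getattr(arr, "name", None) for arr in arrays]
    if any(name is None for name in names):
        raise ValueError("all arrays must have names")
    # Counter keeps first-seen order, so duplicates are reported in input order
    duplicates = [n for n, count in Counter(names).items() if count > 1]
    if duplicates:
        raise ValueError(f"duplicate array names: {duplicates}")
    return dict(zip(names, arrays))
```

Once the arrays have names, xrd.DataTree.from_dict(arrays_to_dict([arrx, arry, arrz])) would give the same tree as the from_dict call in the first suggestion.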

3 - Allow DataTree data to be DataArrays instead of requiring that they are Datasets

This would be excellent for me as an end-user, but I am guessing it would be really difficult to implement. I would understand if this is not feasible!