xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0
161 stars 43 forks source link

Tree linkage + dataset in one object nonideal for UI tree view #303

Closed marcel-goldschen-ohm closed 5 months ago

marcel-goldschen-ohm commented 5 months ago

First, let me thank you for your efforts. I love the idea of a tree of xarray datasets, so kudos! However, I'm encountering a pretty annoying issue that I believe stems from 1) the design choice of having the tree linkage and dataset all wrapped in the same DataTree object, and 2) the current lack of support in xarray for renaming vars/coords "inplace". The issue is that (as far as I can tell) the only way to rename a var/coord is to create a new DataTree object (i.e., dt2 = dt1.rename_vars({'x': 'y'}) just as for datasets. This isn't really a problem for writing simple analysis scripts. However, when working with a UI such as a tree model/view to interface a datatree things get annoying. Imagine that your UI tree has items for both the datasets and the vars/coords within the datasets. Now the end user would like to edit the name of one of the variables. To do that you have to create a new DataTree node which then has to be updated for every item in the tree UI that references that node (i.e., every var/coord in that node that is displayed in the UI tree). This is a real pain which could be avoided by decoupling the dataset from the tree linkage, e.g., a tree node class which holds a dataset attribute so that the node references can persist during any renaming. The other advantage of this approach is that the dataset is then just a plain array dataset, so one does not have to think about compatibility or anything. You could still implement path selection, etc., via the tree node class. I know, this is a major design change, so I get it if it is unlikely to be implemented. However, a graphical tree UI for interfacing this sort of data structure is pretty desirable, and the current design is not ideal for that (basically, node references should persist for ease of use with the UI tree, which current renaming convention for xarray datasets and DataTree objects doesn't allow). Of course, it is possible that I just don't know the proper way to do this, so please correct me if I am off base here. Cheers!

TomNicholas commented 5 months ago

Hi @marcel-goldschen-ohm , thank you so much for your thoughtful feedback!

the design choice of having the tree linkage and dataset all wrapped in the same DataTree object

Original context for this decision is here https://github.com/xarray-contrib/datatree/issues/2#issuecomment-924217495

the current lack of support in xarray for renaming vars/coords "inplace"

I don't think this should be a huge barrier. Conceptually xarray Datasets are intended to be mostly treatable like dictionaries. I think we should be able to find some other solution that allows you to treat the variables in the way you want.

(as far as I can tell) the only way to rename a var/coord is to create a new DataTree object (i.e., dt2 = dt1.rename_vars({'x': 'y'}) just as for datasets

With a dataset you can also just do

In [3]: ds = xr.Dataset({'x': 1})

In [4]: ds
Out[4]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    x        int64 1

In [5]: ds.rename_vars({'x': 'y'})  # one approach
Out[5]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    y        int64 1

In [6]: ds['y'] = ds['x']  # but we can also just treat the ds like a dict

In [7]: ds
Out[7]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    x        int64 1
    y        int64 1

In [8]: ds.drop_vars('x')
Out[8]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    y        int64 1

In [9]: del ds['x']  # this also works

In [10]: ds
Out[10]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    y        int64 1

Is there a reason why that approach can't work in your case?

when working with a UI such as a tree model/view to interface a datatree things get annoying. Imagine that your UI tree has items for both the datasets and the vars/coords within the datasets.

Are you wrapping the DataTree class? If so then the code the UI calls can do anything to the tree object right?

marcel-goldschen-ohm commented 5 months ago

Thanks for the helpful suggestions @TomNicholas. The rename_vars and drop_vars won't work as they return new Dataset objects. However, I did not realize I could use the dict approach ds['y'] = ds['x'] and del ds['x'], which seems like it should do what I want. Thanks!

I haven't fully thought this through yet, but I question as to whether a dict approach can handle all of the possible operations with Datasets, because it seems like a lot of them return new Dataset objects. e.g., merge, combine, etc. So while your solution for renaming vars should suffice, I'm not fully convinced that it solves the root issue entirely.

I am indeed using a UI tree node class which wraps references to the DataTree. My issue is not that I cannot deal with this at all, it is simply that having to rebuild the entire UI tree (or at least a portion of it) every time an operation changes a DataTree ref in a way that should intuitively not require a UI tree rebuild (rename, align, merge?, etc.) is annoying. However, the dict rename solution will go a long way to mitigating this, so maybe not as painful as I'd first thought. We'll see how it pans out. Thanks again for the suggestion.

TomNicholas commented 5 months ago

I question as to whether a dict approach can handle all of the possible operations with Datasets, because it seems like a lot of them return new Dataset objects. e.g., merge, combine, etc

The reason xarray does almost everything through methods that return new objects is (a) to avoid unintuitive copying behavior under-the-hood with in-place operations and (b) to enable method chaining. I actually just now found this rather interesting blog post discussing this rationale in pandas (which inspired xarray).

So while your solution for renaming vars should suffice, I'm not fully convinced that it solves the root issue entirely.

You know you can turn the data in a datatree node into a standalone Dataset, do whatever you want to that, and then attach it back to the same node of the same tree, right? i.e.

ds = dt['pick/a/node'].to_dataset()

new_ds = rename_whatever(ds)

dt['pick/a/node'].ds = new_ds

No new tree is created when you do this, you just assign new variables to the specific node of the tree. With that pattern I would say this becomes completely an xarray usage question, not a datatree usage question.

marcel-goldschen-ohm commented 5 months ago

Oh wow, this is exactly what I wanted. Nice!

Perhaps I just did not read the docs closely enough, but I thought DataTree.ds returned an immutable view, so it was not obvious to me that I could assign to it. This is perfect.

TomNicholas commented 5 months ago

Perhaps I just did not read the docs closely enough

It's mentioned in the docs, but not documented super well...

I thought DataTree.ds returned an immutable view, so it was not obvious to me that I could assign to it.

It does return an immutable view, but you can still assign to the .ds property... You need to use DataTree.to_dataset() to get something mutable first. Notice the difference in types:

In [3]: ds = xr.Dataset({'x': 1})

In [4]: dt = DataTree(data=ds)

In [5]: dt
Out[5]: 
DataTree('None', parent=None)
    Dimensions:  ()
    Data variables:
        x        int64 1

In [6]: type(dt.ds)
Out[6]: datatree.datatree.DatasetView

In [7]: type(dt.to_dataset())
Out[7]: xarray.core.dataset.Dataset

In [8]: ds_view = dt.ds

In [9]: ds_copy = dt.to_dataset()

In [10]: ds_copy['y'] = ds['x']  # works fine because the copy is a mutable xr.Dataset

In [11]: ds_view['y'] = ds['x']  # forbidden
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[11], line 1
----> 1 ds_view['y'] = ds['x']

File ~/Documents/Work/Code/datatree/datatree/datatree.py:157, in DatasetView.__setitem__(self, key, val)
    156 def __setitem__(self, key, val) -> None:
--> 157     raise AttributeError(
    158         "Mutation of the DatasetView is not allowed, please use `.__setitem__` on the wrapping DataTree node, "
    159         "or use `dt.to_dataset()` if you want a mutable dataset. If calling this from within `map_over_subtree`,"
    160         "use `.copy()` first to get a mutable version of the input dataset."
    161     )

AttributeError: Mutation of the DatasetView is not allowed, please use `.__setitem__` on the wrapping DataTree node, or use `dt.to_dataset()` if you want a mutable dataset. If calling this from within `map_over_subtree`,use `.copy()` first to get a mutable version of the input dataset.

In [12]: dt['y'] = ds['x']  # you can do this though

I went for this design because most of the time the distinction between the copy and the view doesn't matter, and an immutable view is a lot easier to make safe (in the sense of not accidentally creating linked objects with inconsistent state, see https://github.com/xarray-contrib/datatree/issues/38#issuecomment-912899051 and https://github.com/xarray-contrib/datatree/pull/99 if you're interested). Your use case happens to be one where the distinction very much matters.

This is perfect.

Great! I will close this now then.

TomNicholas commented 5 months ago

Actually you're right, I don't know if the docs currently mention anywhere that assigning to .ds is allowed!