Open TomNicholas opened 1 year ago
When using datatree objects for analysis of hierarchical data, one sometimes wants to apply operations to only parts of a tree, subtrees, or all trees matching a condition. These operations can be achieved in a granular fashion using __getitem__
, .subtree
, .filter
etc. (see also #79), but for large trees sometimes users want binary functions to "broadcast" over parts of the tree automatically.
@jbusecke's work using datatree to analyse CMIP6 models provides multiple examples of use cases, for example calculating climate anomaly by subtracting a model-specific historical bias from an ensemble of many models, scenarios, and parameters.
Currently I made tree-tree operations error in all but a couple of cases.
1) dt * ds
When you multiply (I'll use multiplication as a stand-in for any binary operation) a tree by a dataset then currently the dataset is used to multiply every single non-empty node in the tree. The same thing also happens with a scalar or single array, which I think is intuitive.
(Note that whilst this operation should be commutative, until #146 is fixed then ds * dt
will actually not return a tree like it should. This requires upstream changes to xarray to fix.)
2) dt * dt
with isomorphic trees
This currently multiplies the datasets contained in each node node-wise. This only makes sense if the two trees are isomorphic. Otherwise you don't have an obvious one-to-one correspondence between pairs of nodes across the two trees.
This generalises to dt * dt * dt ...
, so long as all the trees are mutually isomorphic.
xr.Dataset
. Operations on multiple nodes from different trees should be equivalent to operations on multiple xr.Dataset
objects.dt * dt
doesn't leave space to specify these options as flags.assert_isomorphic
]()) but we have found that unless you assume the names of the nodes are the same then you can't get very far.node.ds
.ds * dt
operation is an example of this, but you can imagine doing a similar mapping to just a part of a larger tree..variables
. We have a convenience .has_data
property for checking this.dt.leaves
.I discussed this algorithm design problem at length with @jbusecke , @cmdupuis3, and my friends Peter (a graph theory math PhD student) and Galen (a sociologist who works with graphs). If anyone else has input it would be appreciated.
We are still in progress, but we think we concluded a few things:
1) Align trees, then apply operation node-wise 2) To align trees, first do either tree union or intersection 3) Broadcasting involves copying data onto nodes that were created by union 4) Broadcasting is much easier to reason about for hollow trees 5) Hollow trees can cover quite a lot of use cases 6) Complicated algorithms admit counter-intuitive (or inconsistent) behaviour
@shoyer suggested that as any automatic tree-broadcasting method should be composed of well-defined individual steps, it would be wise to expose public API for those steps first. Then we can see if (a) those are sufficient for people's needs, and (b) if users generally agree that tree broadcasting should work in one specific way, or whether opinions diverge.
A specific example would be making functions for union
and intersection
. These would be analogous to xr.merge
, accepting multiple trees and returning the same number of trees, but with their node structure altered to match.
intersection
is actually quite simple - there is only one way to implement it.
union
is harder because you have to decide what to add at the positions of the new nodes. That brings you to the concept of "broadcasting" mentioned above, and you might imagine different options for "pushing" data to nodes, or "splitting" existing nodes.
I think that's similar to what I was saying, along the lines of separating your graph operations from your data operations. If we're just looking at the primitives (rather than going for "convenience"), we don't have to worry about what kind of algebra the user is expecting. You'd probably want those primitives for a more advanced API anyway.
From that POV, a graph union would be simple also, you just combine all the nodes. From there, you could select what data operations you want (copying, etc.)
Currently you can perform arithmetic with datatrees, e.g.
dt + dt
. (In fact the current implementation lets you apply arbitrary operations on n trees that return 1 to n new trees, seemap_over_subtree
.)However currently these trees must have the same structure of nodes (i.e. be "isomorphic").
It would be useful to generalise tree operations to handle trees of different structure. I'm going to call this "tree broadcasting" (not to be confused with array broadcasting).
I think this is the biggest unsolved design question with datatree.