Tree "broadcasting" - Githubissues

TomNicholas commented 1 year ago

Currently you can perform arithmetic with datatrees, e.g. dt + dt. (In fact the current implementation lets you apply arbitrary operations on n trees that return 1 to n new trees, see map_over_subtree.)

However currently these trees must have the same structure of nodes (i.e. be "isomorphic").

It would be useful to generalise tree operations to handle trees of different structure. I'm going to call this "tree broadcasting" (not to be confused with array broadcasting).

I think this is the biggest unsolved design question with datatree.

TomNicholas commented 1 year ago

Motivation:

When using datatree objects for analysis of hierarchical data, one sometimes wants to apply operations to only parts of a tree, subtrees, or all trees matching a condition. These operations can be achieved in a granular fashion using __getitem__, .subtree, .filter etc. (see also #79), but for large trees sometimes users want binary functions to "broadcast" over parts of the tree automatically.

@jbusecke's work using datatree to analyse CMIP6 models provides multiple examples of use cases, for example calculating climate anomaly by subtracting a model-specific historical bias from an ensemble of many models, scenarios, and parameters.

Current behaviour:

Currently I made tree-tree operations error in all but a couple of cases.

1) dt * ds

When you multiply (I'll use multiplication as a stand-in for any binary operation) a tree by a dataset then currently the dataset is used to multiply every single non-empty node in the tree. The same thing also happens with a scalar or single array, which I think is intuitive.

(Note that whilst this operation should be commutative, until #146 is fixed then ds * dt will actually not return a tree like it should. This requires upstream changes to xarray to fix.)

2) dt * dt with isomorphic trees

This currently multiplies the datasets contained in each node node-wise. This only makes sense if the two trees are isomorphic. Otherwise you don't have an obvious one-to-one correspondence between pairs of nodes across the two trees.

This generalises to dt * dt * dt ..., so long as all the trees are mutually isomorphic.

Constraints / requirements:

Tree operations reduce to dataset operations - The design model of DataTree makes the contents of one node equivalent to one xr.Dataset. Operations on multiple nodes from different trees should be equivalent to operations on multiple xr.Dataset objects.
Respect commutativity
Work for n trees
Be intuitive!
Don't make silent choices for user - if operating with the two trees is ambiguous in some way, we should prefer throwing an error over silently making a choice for the user.
Don't have multiple options / flags - Would could imagine various options (e.g. analogous to inner or outer joins) but the intuitive syntax dt * dt doesn't leave space to specify these options as flags.
Follow behaviour (1) and (2) above - I think those are both intuitive and should be kept if possible.
Facilitate specific use cases - in #198 @jbusecke has begun writing out specific cases which we want tree broadcasting to enable automatically. We should collect these and write them out in this issue though.

Non-goals:

Doesn't need to support all tree structures - It's fine if for certain types of tree structure the broadcasting raises an error. We would just want the cases in which errors are raised to be well-defined and foreseeable.

TomNicholas commented 1 year ago

Some useful concepts:

Isomorphic trees - A set of trees for which all corresponding nodes have the same number of children. Basically trees with the same structure of nodes, but potentially different data in each node. Can be defined with or without checking that corresponding nodes have the same name ([see the docstring of assert_isomorphic]()) but we have found that unless you assume the names of the nodes are the same then you can't get very far.
Alignment of trees - Adding new nodes (probably by duplication) to two trees until they become isomorphic. Analogous to broadcasting two arrays.
Unions / intersections of trees - Aligning trees against one another by computing the union/intersection of the set of node paths in the two trees.
Node-wise function application - Given two isomorphic trees, applying a given function to corresponding pairs (or triplets etc..) of node.ds.
Broadcasting over subtrees - Mapping a single node's ds over all the nodes in a subtree. The ds * dt operation is an example of this, but you can imagine doing a similar mapping to just a part of a larger tree.
Empty nodes - A node which contains no .variables. We have a convenience .has_data property for checking this.
Leaves - Nodes which have no children. Can get a list of them via dt.leaves.
Hollow trees - Trees where only leaf nodes contain data. Easier to reason about than "non-hollow" trees. A little bit like databases in that usually a set of tags (components of paths) point intuitively to a single dataset.

TomNicholas commented 1 year ago

Ideas:

I discussed this algorithm design problem at length with @jbusecke , @cmdupuis3, and my friends Peter (a graph theory math PhD student) and Galen (a sociologist who works with graphs). If anyone else has input it would be appreciated.

We are still in progress, but we think we concluded a few things:

1) Align trees, then apply operation node-wise 2) To align trees, first do either tree union or intersection 3) Broadcasting involves copying data onto nodes that were created by union 4) Broadcasting is much easier to reason about for hollow trees 5) Hollow trees can cover quite a lot of use cases 6) Complicated algorithms admit counter-intuitive (or inconsistent) behaviour

TomNicholas commented 1 year ago

@shoyer suggested that as any automatic tree-broadcasting method should be composed of well-defined individual steps, it would be wise to expose public API for those steps first. Then we can see if (a) those are sufficient for people's needs, and (b) if users generally agree that tree broadcasting should work in one specific way, or whether opinions diverge.

A specific example would be making functions for union and intersection. These would be analogous to xr.merge, accepting multiple trees and returning the same number of trees, but with their node structure altered to match.

intersection is actually quite simple - there is only one way to implement it.

union is harder because you have to decide what to add at the positions of the new nodes. That brings you to the concept of "broadcasting" mentioned above, and you might imagine different options for "pushing" data to nodes, or "splitting" existing nodes.

cmdupuis3 commented 1 year ago

I think that's similar to what I was saying, along the lines of separating your graph operations from your data operations. If we're just looking at the primitives (rather than going for "convenience"), we don't have to worry about what kind of algebra the user is expecting. You'd probably want those primitives for a more advanced API anyway.

From that POV, a graph union would be simple also, you just combine all the nodes. From there, you could select what data operations you want (copying, etc.)

xarray-contrib / datatree

Tree "broadcasting" #199

Motivation:

Current behaviour:

Constraints / requirements:

Non-goals:

Some useful concepts:

Ideas: