xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0
164 stars 43 forks source link

Tree "broadcasting" #199

Open TomNicholas opened 1 year ago

TomNicholas commented 1 year ago

Currently you can perform arithmetic with datatrees, e.g. dt + dt. (In fact the current implementation lets you apply arbitrary operations on n trees that return 1 to n new trees, see map_over_subtree.)

However currently these trees must have the same structure of nodes (i.e. be "isomorphic").

It would be useful to generalise tree operations to handle trees of different structure. I'm going to call this "tree broadcasting" (not to be confused with array broadcasting).

I think this is the biggest unsolved design question with datatree.

TomNicholas commented 1 year ago

Motivation:

When using datatree objects for analysis of hierarchical data, one sometimes wants to apply operations to only parts of a tree, subtrees, or all trees matching a condition. These operations can be achieved in a granular fashion using __getitem__, .subtree, .filter etc. (see also #79), but for large trees sometimes users want binary functions to "broadcast" over parts of the tree automatically.

@jbusecke's work using datatree to analyse CMIP6 models provides multiple examples of use cases, for example calculating climate anomaly by subtracting a model-specific historical bias from an ensemble of many models, scenarios, and parameters.

Current behaviour:

Currently I made tree-tree operations error in all but a couple of cases.

1) dt * ds

When you multiply (I'll use multiplication as a stand-in for any binary operation) a tree by a dataset then currently the dataset is used to multiply every single non-empty node in the tree. The same thing also happens with a scalar or single array, which I think is intuitive.

(Note that whilst this operation should be commutative, until #146 is fixed then ds * dt will actually not return a tree like it should. This requires upstream changes to xarray to fix.)

2) dt * dt with isomorphic trees

This currently multiplies the datasets contained in each node node-wise. This only makes sense if the two trees are isomorphic. Otherwise you don't have an obvious one-to-one correspondence between pairs of nodes across the two trees.

This generalises to dt * dt * dt ..., so long as all the trees are mutually isomorphic.

Constraints / requirements:

Non-goals:

TomNicholas commented 1 year ago

Some useful concepts:

TomNicholas commented 1 year ago

Ideas:

I discussed this algorithm design problem at length with @jbusecke , @cmdupuis3, and my friends Peter (a graph theory math PhD student) and Galen (a sociologist who works with graphs). If anyone else has input it would be appreciated.

We are still in progress, but we think we concluded a few things:

1) Align trees, then apply operation node-wise 2) To align trees, first do either tree union or intersection 3) Broadcasting involves copying data onto nodes that were created by union 4) Broadcasting is much easier to reason about for hollow trees 5) Hollow trees can cover quite a lot of use cases 6) Complicated algorithms admit counter-intuitive (or inconsistent) behaviour

TomNicholas commented 1 year ago

@shoyer suggested that as any automatic tree-broadcasting method should be composed of well-defined individual steps, it would be wise to expose public API for those steps first. Then we can see if (a) those are sufficient for people's needs, and (b) if users generally agree that tree broadcasting should work in one specific way, or whether opinions diverge.


A specific example would be making functions for union and intersection. These would be analogous to xr.merge, accepting multiple trees and returning the same number of trees, but with their node structure altered to match.

intersection is actually quite simple - there is only one way to implement it.

union is harder because you have to decide what to add at the positions of the new nodes. That brings you to the concept of "broadcasting" mentioned above, and you might imagine different options for "pushing" data to nodes, or "splitting" existing nodes.

cmdupuis3 commented 1 year ago

I think that's similar to what I was saying, along the lines of separating your graph operations from your data operations. If we're just looking at the primitives (rather than going for "convenience"), we don't have to worry about what kind of algebra the user is expecting. You'd probably want those primitives for a more advanced API anyway.

From that POV, a graph union would be simple also, you just combine all the nodes. From there, you could select what data operations you want (copying, etc.)