Tree-aware dataset handling/selection

observingClouds commented 1 year ago

I'm looking for a good way to apply a function to a subset of nodes that share some common characteristics encoded in the subtree path.

Imagine the following data tree

import xarray as xr
import datatree
from datatree import map_over_subtree

dt = datatree.DataTree.from_dict({
    'control/lowRes' : xr.Dataset({'z':(('x'),[0,1,2])}),
    'control/highRes' : xr.Dataset({'z':(('x'),[0,1,2,3,4,5])}),
    'plus4K/lowRes' : xr.Dataset({'z':(('x'),[0,1,2])}),
    'plus4K/highRes' : xr.Dataset({'z':(('x'),[0,1,2,3,4,5])})
})

To apply a function to all control or all plus4K nodes is straight forward by just selecting the specific subtree, e.g. dt['control']. However, in case all lowRes dataset should be manipulated this becomes more elaborative and I wonder what the best approach would be.

dt['control/lowRes','plus4K/lowRes'] is not yet implemented and would also be complex for large data trees
dt['*/lowRes'] could be one idea to make the subtree selection more straight forward, where * is a wildcard
dt.search(regex) could make this even more general

Currently, I use the @map_over_subtree decorator, which also has some limitations as the function does not know its tree origin (as noted in the code) and it needs to be inferred from the dataset itself, which is sometimes possible (here the length of the dataset) but does not need to be always the case.

@map_over_subtree
def resolution_specific_func(ds):
    if len(ds.x) == 3:
        ds = ds.z*2
    elif len(ds.x) == 6:
        ds = ds.z*4
    return ds

z= resolution_specific_func(dt)

I do not know how the tree information could be passed through the decorator, but maybe it is okay if the DatasetView class has an additional property (e.g. _path) that could be filled with dt.path during the call of DatasetView._from_node()?. This would lead to

@map_over_subtree
def resolution_specific_func(ds):
    if 'lowRes' in ds._path:
        ds = ds.z*2
    if 'highRes' in ds._path:
        ds = ds.z*4
    return ds

and would allow for tree-aware manipulation of the datasets.

What do you think? Happy to open a PR if this makes sense.

dcherian commented 1 year ago

I like this idea: dt['*/lowRes']
I have an extract_leaf utility function that I like so far: dt.dc.extract_leaf('lowRes') This returns a new tree with two nodes'control' and 'plus4K'.
I've also wondered if a reorganize_nodes would be useful. This would take input in symbolic form so dt.reorganize_nodes('a/b -> b/a')would reorganize so
- lowRes
  - Control
  - plus4K
- highRes
- Control
- plus4K

observingClouds commented 1 year ago

Thanks for your comments! I also thought about the restructuring of the datatree similar to your point 3 and think it would be a nice addition. For this particular case I like 1 a bit more because it also allows for data trees with different depths.

TomNicholas commented 11 months ago

I have been thinking about adding similar functions for a while, thanks to both of you for laying out your ideas clearly! In fact your two-level-combinatoric tree structure @observingClouds is the same thing @jbusecke and I discussed in #186.

I like this idea: dt['*/lowRes']

I agree this could be useful, but it's a shortcut for the more obvious dt.search(regex), so I think I should add that method first and see how people get on with it.

I have an extract_leaf utility function that I like so far: dt.dc.extract_leaf('lowRes') This returns a new tree with two nodes 'control' and 'plus4K'.

I can also see this being useful, but I'm not totally sure what exact behavior you're after from your description. Does it return {/control/lowRes/, /plus4k/lowRes/}? Or just a shallower tree like {/control/, /plus4k/}? The idea is conceptually similar to using dt.leaves.search('lowRes'), but the types wouldn't currently work out if you tried that.

take input in symbolic form so dt.reorganize_nodes('a/b -> b/a')

@dcherian I think this suggestion of taking input in symbolic form might solve the ambuigities @jbusecke and I discussed in #186. The main questions I would have is what to do with nodes that don't have exactly 2 levels, i.e. a/ or a/b/c?

@observingClouds your DatasetView._path idea is a really interesting suggestion - I've opened a new issue for it in #266.

TomNicholas commented 11 months ago

I agree this could be useful, but it's a shortcut for the more obvious dt.search(regex), so I think I should add that method first and see how people get on with it.

Actually the simplest "searching" utility would be use glob, not regex. Added in #267.

dcherian commented 11 months ago

I would have is what to do with nodes that don't have exactly 2 levels, i.e. a/ or a/b/c?

Raise and error! there's an implied minimum depth necessary to do this?

TomNicholas commented 11 months ago

Raise and error! there's an implied minimum depth necessary to do this?

Right, but what if some nodes have this depth and others don't? I'm thinking of something heterogenous like:

demographics/ weather/highres/model1 weather/highres/model2 weather/lowres/model1 weather/lowres/model1

Actually I guess in this particular case the user would just need to apply the operation to the weather/* subtree... Is there a scenario in which some leaves are deeper than others? :thinking:

Thinking aloud more, @jbusecke 's CMIP6 case would look like:

model1/historical model1/scenario1 model1/scenario2 model2/historical model2/scenario1 model2/scenario2

which after dt.reorganize_nodes('a/b -> b/a') becomes

historical/model1 historical/model2 scenario1/model1 scenario1/model2 scenario2/model1 scenario2/model2

which seems fine

dcherian commented 11 months ago

Is there a scenario in which some leaves are deeper than others?

deeper than the symbolic depth is fine AFAICT. shallower than the symbolic depth should raise an error.

TomNicholas commented 11 months ago

deeper than the symbolic depth is fine AFAICT.

Is it? If I have /1/2/3/, and I ask for 'a/b -> b/a', isn't it ambiguous whether the result should be /1/3/2/ or /2/1/3/?

You could imagine maybe supporting something like '*/a/b -> */b/a', but it's getting more complex.

EDIT: Or maybe you call the method .reorganize_leaves and have it always give /1/3/2

EDIT2: Or maybe use a flag: dt.reorganize('a/b -> b/a', from_root=True)

dcherian commented 11 months ago

I have /1/2/3/, and I ask for 'a/b -> b/a', isn't it ambiguous whether the result should be /1/3/2/ or /2/1/3/?

Personally i'd just expect /2/1/3

TomNicholas commented 11 months ago

Personally i'd just expect /2/1/3

Well I personally would have expected the opposite :rofl:

Or maybe use a flag: dt.reorganize('a/b -> b/a', from_root=True)

I think this could be a good solution: 1) It will do something consistent by default, and if you don't like that default you can change it. 2) It's the minimum extra complexity to remove the root-first/leaves-first ambiguity. 3) It avoids using asterisks at all, which is good because they suggest string matching, which we are not doing here. 4) Avoiding asterisks prevents you from even expressing ambiguous requests, such as '*/a/b/* -> */b/a/*' 5) The fail cases are obvious: it will work for anything with a long enough path.

TomNicholas commented 11 months ago

Another idea: use preceding slashes to distinguish the two cases, i.e. '/a/b -> /b/a' would result in 2/1/3, but 'a/b -> b/a' would result in 1/3/2.

This has all the advatages of the the previous suggestion, but is more filesystem-like, doesn't require an extra kwarg, and allows moving nodes from root to leaves (e.g. '/a/b -> b/a').

dcherian commented 11 months ago

I would suggest '...a/b -> ...b/a' Otherwise it's really hard to see at first glance.

dcherian commented 11 months ago

I can also see this being useful, but I'm not totally sure what exact behavior you're after from your description.

dt.dc.extract_leaf('lowRes') returns a tree with two nodes Control and plus4K. This new tree is shallower, which then lets me pass it to another function that can't handle a deeper tree.

Here's my list of helper utilities: https://github.com/dcherian/dcpy/blob/master/dcpy/datatree.py

I found them all quite useful, but obviously there's little error checking :) and it does assume trees with a nicely balanced structure,.

TomNicholas commented 11 months ago

Here's my list of helper utilities

Thanks @dcherian !

dt.dc.extract_leaf('lowRes') returns a tree with two nodes Control and plus4K. This new tree is shallower, which then lets me pass it to another function that can't handle a deeper tree.

I think I get it now - that does seem potentially useful!

I found them all quite useful, but obviously there's little error checking

What's the input type for order is to reorder_nodes? Is it a list of integer indices?

it does assume trees with a nicely balanced structure,.

I suspect you might be talking about a concept I've been referring to as a "hollow tree". I'll raise a separate issue for that.

dcherian commented 11 months ago

What's the input type for order is to reorder_nodes? Is it a list of integer indices?

it expects a list of node names : e.g. tree.reorder(["TAO", ...]) reorders so that "TAO" is the first node. https://github.com/dcherian/dcpy/blob/7fb24b50410422ab9fc3257721c2727a075a34cb/dcpy/datatree.py#L24-L27

keewis commented 1 month ago

closed in favor of pydata/xarray#9346

xarray-contrib / datatree

Tree-aware dataset handling/selection #254