Closed TomNicholas closed 1 week ago
This is an interesting idea, and would actually be fairly easy to do - I would just have to change the type passed to the user function to be a DatasetView
and then I can always have a DatasetView know its .path
, as it's always created from a node. (This would re-open #188, but I think there is a better solution for that anyway.)
One reason not to implement this feature is that it might encourage people to write dataset-processing functions that only work when mapped over a tree, and not over normal Dataset
objects (because they call .path
).
To be clear, an alternative to your suggestion @observingclouds would be requiring the user to pass a function with a different signature to map_over_subtree
, i.e.
def my_func(path, ds):
...
rather than
def my_func(ds):
path = ds.path
...
Thanks for your additional thoughts @TomNicholas! Would it maybe be best to integrate the path into the map_over_subtree
decorator and forward it there to the function? This way the path is only available when calling the decorator function and not when accessing the DatasetView
potentially limiting the "misuse" of the path information.
Yes perhaps... It would help prevent anti-patterns to have .ds.path
only available from inside map_over_subtree
. But how would I even do that? An attribute that is public but only ever used in one scenario? A different type of DatasetView
object whose only difference is the extra property? Both of those feel kind of gross to me... :confused:
Well, how would you implement:
def my_func(ds, path):
...
One way to do it could be to add a keyword argument to the map_over_subtree
decorator, like include_path
.
Not sure if _handle_errors_with_path_context is the best place to ingest the path information but that could be a possibility:
def _handle_errors_with_path_context(path, include_path=False):
"""Wraps given function so that if it fails it also raises path to node on which it failed."""
def decorator(func):
def wrapper(*args, **kwargs):
+ if include_path: kwargs['path'] = path
try:
return func(*args, **kwargs)
except Exception as e:
There are probably nicer places to do this though.
how would you implement:
def my_func(ds, path): ...
I was going to say that I would simply change map_over_subtree
to require this signature - I don't care about backwards-compatibility until I've merged this upstream in xarray :laughing:
But actually there is a better reason not to do it like that, which is that map_over_subtree
can handle functions which act on multiple datasets. There is no requirement that the paths be the same for each dataset passed to the function, only that their container trees are isomorphic. For example you can do this:
dt1 = DataTree.from_dict({'model1/highres': xr.Dataset({'temp': ...})})
dt2 = DataTree.from_dict({'model2/highres': xr.Dataset({'temp': ...})})
dt1 * dt2 # will happily multiply `model1/temp` by `model2/temp`, despite different paths
We don't want a signature like
def my_func(path1, ds1, path2, ds2, ...):
...
The function you point to above (_handle_errors_with_path_context
) actually completely ignores this detail, just because it seemed like overkill for an error message.
The function you point to above (
_handle_errors_with_path_context
) actually completely ignores this detail, just because it seemed like overkill for an error message.
On reflection it's not overkill - imagine the function I was mapping was xr.concat
(a distinct possibility in https://github.com/xarray-contrib/datatree/issues/192), then I would really want to know exactly which two nodes had incompatible datasets.
Closing in favour of https://github.com/pydata/xarray/issues/9346
Raised by @observingClouds in https://github.com/xarray-contrib/datatree/issues/254#issue-1835784457
Currently, I use the @map_over_subtree decorator, which also has some limitations as the function does not know its tree origin (as noted in the code) and it needs to be inferred from the dataset itself, which is sometimes possible (here the length of the dataset) but does not need to be always the case.
I do not know how the tree information could be passed through the decorator, but maybe it is okay if the
DatasetView
class has an additional property (e.g._path
) that could be filled withdt.path
during the call of DatasetView._from_node()?. This would lead toand would allow for tree-aware manipulation of the datasets.