Open Illviljan opened 11 months ago
Good idea @Illviljan!
I'm a little unsure how to get the parallel-argument down to map_over_subtree though?
Do you actually need to pass it through at all? Couldn't you just do this:
def map_over_subtree(func: Callable, parallel=False) -> Callable:
    @functools.wraps(func)
    def _map_over_subtree(*args, **kwargs) -> DataTree | Tuple[DataTree, ...]:
        from .datatree import DataTree

        if parallel:
            import dask
            # wrap the per-node calls in dask.delayed and compute them together
        ...
Or ideally, just do this optimization automatically (if dask is installed, I guess)?
I'm wondering how xarray normally does this optimization when you apply an operation to every data variable in a Dataset, for instance. Is it related to #196?
I tried a version with parallel as an argument, but it isn't passed correctly via the normal methods: dt.interp(time=new_time, parallel=True) errors because it thinks parallel is a coordinate.
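That error makes sense given how interp works: new coordinates are passed as keyword arguments (**coords_kwargs), so an extra option like parallel gets swallowed and treated as a coordinate to interpolate over. A small illustration with a plain Dataset (the variable and coordinate names are made up for the example):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"a": ("time", np.arange(4.0))}, coords={"time": np.arange(4)})
    new_time = np.linspace(0, 3, 7)

    ds.interp(time=new_time)                 # fine: time is an existing coordinate
    ds.interp(time=new_time, parallel=True)  # errors: "parallel" is collected into
                                             # **coords_kwargs and interpreted as a
                                             # coordinate to interpolate over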
Maybe we could always use this optimization. Dask usually adds some overhead though, and I just haven't played around enough to know where that threshold is or if it is significant.
I'm wondering how xarray normally does this optimization when you apply an operation to every data variable in a Dataset, for instance. Is it related to #196?
I think the only place this trick is used is xr.open_mfdataset. Not sure why though; maybe most xarray methods predate dask.delayed?
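For reference, the parallel option in xr.open_mfdataset is exactly this pattern: the per-file open (and preprocess) calls are wrapped in dask.delayed and computed together before the datasets are combined. Usage looks roughly like this (the file glob is just a placeholder):

    import xarray as xr

    # With parallel=True, the open and preprocess steps for each file are wrapped
    # in dask.delayed and computed in one go before the datasets are combined.
    ds = xr.open_mfdataset("data/*.nc", parallel=True)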
I also have a feeling my datasets with 2000+ variables are not the normal setup for most xarray users, so there's probably not been a need to optimize in the variable direction.
I don't fully understand all the changes in #196; I see that one as being about triggering computation of all the dask arrays inside the DataArrays. My suggestion is earlier in that chain: setting up those chunked DataArrays in parallel.
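Roughly, the distinction is between constructing the lazy result, which is a serial Python loop over variables or nodes and is the part I'd like to parallelize, and executing the dask graph, which dask already parallelizes. A sketch with made-up sizes:

    import numpy as np
    import xarray as xr

    # Many small variables: graph construction loops over them one by one in Python.
    ds = xr.Dataset(
        {f"var_{i}": ("time", np.arange(1_000.0)) for i in range(2_000)}
    ).chunk({"time": 100})

    lazy = ds.mean("time")     # building the lazy result: serial per-variable setup
    result = lazy.compute()    # executing the graph: dask parallelizes this part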
You have real datasets with 2000+ variables?!?
Now that I understand that this is not about triggering computation of dask arrays but about building the dask arrays in parallel, I'm less sure that this is a good idea.
I guess one way to look at it is through consistency: DataTree.map_over_subtree is very much a generalization of xarray's Dataset.map, just mapping over nested dictionaries of data variables instead of a single-level dict of data variables. As such, I think we should be consistent in how we treat these two implementations - either it makes sense to apply this optimization in both Dataset.map and DataTree.map_over_subtree, or to neither of them, because it's out-of-scope/too much overhead in both cases.
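To make the analogy concrete, a small sketch (the tree layout and variable names are invented for illustration):

    import numpy as np
    import xarray as xr
    from datatree import DataTree

    ds = xr.Dataset({"a": ("x", np.arange(3.0)), "b": ("x", -np.arange(3.0))})
    dt = DataTree.from_dict({"/group1": ds, "/group2": ds * 10})

    # Dataset.map applies a function to every data variable of one Dataset...
    ds.map(np.fabs)

    # ...while map_over_subtree applies a function to the Dataset at every node.
    dt.map_over_subtree(lambda node_ds: node_ds.map(np.fabs))

In both cases the mapping itself is currently a plain serial Python loop over its targets.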
Yes, the example code is quite realistic. That's the kind of dataset I work with, and there's still always something missing...
Dataset.map looks very lightweight compared to Dataset.interp, and DataTree.map_over_subtree handles both. Some functions are heavier and need to be treated differently, so it's good to have the option of parallelization.
Dataset.map looks very lightweight compared to Dataset.interp and DataTree.map_over_subtree handles both.
Are you saying that we already do some parallelization like this within Dataset.interp?
We discussed this briefly in the xarray dev call today. Stephan had a few comments, chiefly that he would be surprised if this gave a significant speedup in most cases because of restrictions imposed by the GIL. Possibly once Python removes the GIL, we might want to revisit this question for all of xarray.
I think there are some good opportunities to run map_over_subtree in parallel using dask.delayed. Consider this example data:
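Something along these lines, a tree whose nodes each hold a dataset with a large number of variables, captures the shape of the data (all names and sizes below are illustrative):

    import numpy as np
    import xarray as xr
    from datatree import DataTree

    def make_dataset(n_vars=2_000, n_times=100):
        # Many data variables per node, so per-variable setup work adds up.
        time = np.arange(n_times)
        return xr.Dataset(
            {f"var_{i}": ("time", np.random.rand(n_times)) for i in range(n_vars)},
            coords={"time": time},
        )

    dt = DataTree.from_dict({f"/station_{j}": make_dataset() for j in range(4)})
    new_time = np.linspace(0, 99, 500)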
Here's my modded map_over_subtree:
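The gist of the modification, sketched here as a simplified standalone helper rather than the actual patch (the name map_over_subtree_parallel is invented, and the real map_over_subtree additionally handles multiple tree arguments, empty nodes, and rebuilding parent/child relationships):

    import dask
    from datatree import DataTree

    def map_over_subtree_parallel(func, tree: DataTree, parallel: bool = False) -> DataTree:
        # Collect the dataset at every node, keyed by its path in the tree.
        nodes = {}
        for node in tree.subtree:
            ds = node.to_dataset()
            if ds.data_vars:  # skip empty container nodes (e.g. a bare root)
                nodes[node.path] = ds

        if parallel:
            # Wrap each per-node call in dask.delayed and compute them together,
            # instead of looping over the nodes serially.
            delayed = {path: dask.delayed(func)(ds) for path, ds in nodes.items()}
            (results,) = dask.compute(delayed)
        else:
            results = {path: func(ds) for path, ds in nodes.items()}

        # Rebuild a tree with the same structure from the mapped results.
        return DataTree.from_dict(results)

With the example data above, something like map_over_subtree_parallel(lambda ds: ds.interp(time=new_time), dt, parallel=True) builds every node's interpolated dataset via dask.delayed before reassembling the tree.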
I'm a little unsure how to get the parallel-argument down to map_over_subtree though?