Open TomNicholas opened 7 months ago
One thing I realized about this is that concatenating multiple DataTree objects is currently a little awkward. I don't know if this is actually a common pattern, but imagine you had two netCDF files, each with groups, and you wanted to concatenate group1 in file1 with group1 in file2, etc.
Adding open_virtual_datatree would allow you to open the files like this:
vdt1 = open_virtual_datatree('file1.nc')
vdt2 = open_virtual_datatree('file2.nc')
but currently you can't do
combined_vdt = xr.concat([vdt1, vdt2], dim='time')
because xr.concat doesn't understand DataTree objects. To get around this you should be able to do
from datatree import map_over_subtree
concat_datatrees = map_over_subtree(xr.concat)
combined_vdt = concat_datatrees([vdt1, vdt2], dim='time')
but it raises the question of whether the xarray DataTree upstream integration should include generalizing xr.concat.
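For illustration, the group-by-group concatenation described above can be sketched with plain xarray alone. This is only a sketch under stated assumptions: each "file" is represented by an ordinary dict mapping a group path to an xr.Dataset (a stand-in for a tree, not a DataTree), and concat_by_group is a hypothetical helper name, not part of xarray, datatree, or VirtualiZarr.

```python
import xarray as xr


def concat_by_group(trees, dim):
    """Concatenate matching groups across several {path: Dataset} mappings.

    Hypothetical helper: plain dicts stand in for tree objects here.
    """
    paths = list(trees[0])
    for tree in trees[1:]:
        if set(tree) != set(paths):
            raise ValueError("all inputs must contain the same group paths")
    # One xr.concat call per group path, pairing up same-named groups.
    return {
        path: xr.concat([tree[path] for tree in trees], dim=dim)
        for path in paths
    }


# Two made-up "files", each with one group holding a single time step.
file1 = {"/group1": xr.Dataset({"a": ("time", [1.0])}, coords={"time": [0]})}
file2 = {"/group1": xr.Dataset({"a": ("time", [2.0])}, coords={"time": [1]})}

combined = concat_by_group([file1, file2], dim="time")
print(combined["/group1"])
```

The helper refuses inputs whose group paths differ, since there is no obvious way to concatenate a group that exists in only one file.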
Hi @TomNicholas
I often have the problem that I want to concat different datatrees into a single xr.Dataset again. I came across your code above and tried it, but I get an error.
Generate some sample data:
import datatree
import xarray as xr
ds1 = xr.Dataset(
data_vars=dict(a=("x", [11, 22, 33])),
coords=dict(x=[1,2,3])
)
ds2 = xr.Dataset(
data_vars=dict(a=("x", [111, 222, 333])),
coords=dict(x=[1,2,3])
)
mytree = datatree.DataTree.from_dict({"two_digits": ds1, "three_digits": ds2})
print(mytree)
output:
DataTree('None', parent=None)
├── DataTree('two_digits')
│       Dimensions:  (x: 3)
│       Coordinates:
│         * x        (x) int64 24B 1 2 3
│       Data variables:
│           a        (x) int64 24B 11 22 33
└── DataTree('three_digits')
        Dimensions:  (x: 3)
        Coordinates:
          * x        (x) int64 24B 1 2 3
        Data variables:
            a        (x) int64 24B 111 222 333
I tried:
# fails
concat_datatrees = datatree.map_over_subtree(xr.concat)
ds_concatenated = concat_datatrees(mytree, dim='digits')
# also fails:
# ds_concatenated = concat_datatrees([mytree.two_digits, mytree.three_digits], dim='digits')
output:
TypeError: can only concatenate xarray Dataset and DataArray objects, got <class 'str'>
Raised whilst mapping function over node with path /two_digits
This works, but it is not very elegant:
# works
ds_concatenated = xr.concat([mytree[subtree].ds for subtree in mytree], dim="digits")
print(ds_concatenated)
output:
<xarray.Dataset> Size: 72B
Dimensions:  (digits: 2, x: 3)
Coordinates:
  * x        (x) int64 24B 1 2 3
Dimensions without coordinates: digits
Data variables:
    a        (digits, x) int64 48B 11 22 33 111 222 333
Do you have a suggestion of how to deal with such situations? Thanks!
Thanks for trying this @jonas-spaeth ! I now realise that I didn't think hard enough before making this suggestion 😅
This is an issue with xarray-datatree, not with virtualizarr at all, so I will re-raise this on the xarray repo instead and we can continue discussion there.
# fails
concat_datatrees = datatree.map_over_subtree(xr.concat)
ds_concatenated = concat_datatrees(mytree, dim='digits')
I think your first attempt is just an incorrect use of the map_over_subtree decorator, as calling xr.concat(mytree.ds, dim='digits') wouldn't work either (notice the lack of square brackets around mytree.ds - xr.concat expects a list).
# also fails:
# ds_concatenated = concat_datatrees([mytree.two_digits, mytree.three_digits], dim='digits')
This is more troubling, but I've realised why it doesn't work. Basically map_over_subtree currently looks for positional and keyword arguments that are DataTree objects, and iterates over the .ds in them.
But I never thought to make map_over_subtree understand lists of DataTree objects... In general perhaps we should support Iterable[DataTree]? Then this example should work.
We might imagine changing xr.concat to automatically handle the map_over_subtree part - I'll raise an issue for that too.
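To make the argument-detection point concrete, here is a toy model in pure Python. It is not datatree's actual implementation: ToyNode and toy_map_over_subtree are hypothetical names, the payloads are plain lists rather than datasets, and only positional arguments are inspected. It is meant only to show why trees passed as separate arguments get unwrapped node by node, while trees hidden inside a list do not.

```python
from dataclasses import dataclass, field
from functools import wraps


@dataclass
class ToyNode:
    """Hypothetical stand-in for a tree node: a payload plus named children."""

    ds: object = None
    children: dict = field(default_factory=dict)

    def walk(self, path=""):
        """Yield (path, node) pairs for this node and all of its descendants."""
        yield (path or "/", self)
        for name, child in self.children.items():
            yield from child.walk(f"{path}/{name}")


def toy_map_over_subtree(func):
    """Call func once per node, unwrapping only top-level ToyNode arguments."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        trees = [a for a in args if isinstance(a, ToyNode)]
        if not trees:
            # A list of trees lands here: the list itself is not a ToyNode.
            raise TypeError("no tree found among the top-level arguments")
        results = {}
        for path, _ in trees[0].walk():
            # Swap each tree argument for its payload at this path.
            node_args = [
                dict(a.walk())[path].ds if isinstance(a, ToyNode) else a
                for a in args
            ]
            results[path] = func(*node_args, **kwargs)
        return results

    return wrapper


@toy_map_over_subtree
def join(first, second):
    return first + second


tree1 = ToyNode(ds=[0], children={"group1": ToyNode(ds=[1, 2])})
tree2 = ToyNode(ds=[9], children={"group1": ToyNode(ds=[3, 4])})

joined = join(tree1, tree2)  # trees passed as separate arguments are unwrapped

try:
    toy_map_over_subtree(lambda items: items)([tree1, tree2])
except TypeError as err:
    list_error = str(err)  # trees hidden inside a list are never seen
```

In this toy, supporting an iterable of trees would mean extending the isinstance check to look inside list arguments as well, which is roughly the generalization suggested above.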
# works
ds_concatenated = xr.concat([mytree[subtree].ds for subtree in mytree], dim="digits")
For now I think this is your only option.
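A small variation on that workaround, sketched here with plain xarray and pandas only, labels the new dimension with the node names so that each slice can be selected by name afterwards. The assumption is that a dict of node name to Dataset stands in for the tree's children; with a real tree the datasets would come from mytree[subtree].ds as in the snippet above.

```python
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the tree's children: node name -> Dataset.
nodes = {
    "two_digits": xr.Dataset(
        {"a": ("x", [11, 22, 33])}, coords={"x": [1, 2, 3]}
    ),
    "three_digits": xr.Dataset(
        {"a": ("x", [111, 222, 333])}, coords={"x": [1, 2, 3]}
    ),
}

# Passing a pandas Index as `dim` names the new dimension and supplies
# its coordinate labels in one step.
labels = pd.Index(list(nodes), name="digits")
ds_labelled = xr.concat(list(nodes.values()), dim=labels)

# Select one original node's data back out by name.
print(ds_labelled.sel(digits="three_digits")["a"].values)
```

Compared with dim="digits", this keeps track of which slice came from which node, at the cost of one extra line.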
Note that https://github.com/pydata/xarray/issues/9077 might affect this - if datatree becomes slightly less general then there could in theory be some netCDF files that cannot be opened as a DataTree containing ManifestArray objects. You could still open the groups individually and combine them, and any virtual datatree that would become forbidden is also one that you wouldn't be able to open later for analysis as a normal loadable datatree.
Also this issue wouldn't really be closed until we also have the ability to do dt.to_kerchunk/.to_zarr - i.e. we need a VirtualDataTreeAccessor, in addition to our existing VirtualDatasetAccessor, that can write out the references for all the groups at once into one kerchunk file/zarr store.
EDIT: Serialization of DataTree objects is tracked separately in #244
We should support generating references from files containing multiple groups in the same way that xr.open_dataset and datatree.open_datatree work.
So we should add a new open_virtual_datatree function, and a new (optional) group kwarg to open_virtual_dataset.
This can be done right now using the datatree package (as an optional dependency imported inside open_virtual_datatree) but once that gets merged into xarray main (which is happening right now) we can get rid of that dependency.
See https://github.com/TomNicholas/VirtualiZarr/issues/78#issuecomment-2059737479 and #11.
cc @sharkinsspatial