xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0
162 stars 43 forks source link

Slicing/selecting datree nodes using coordinates similar to xarray.dataset.sel (?) #244

Open aladinor opened 1 year ago

aladinor commented 1 year ago

Hi everyone,

I want to share with you an idea I am working on. I have a collection of tasks/sweeps of weather radar data that are not homogenous in size/shape. Radar data structure depends on radar parameters such as maximum range, angle elevation of the antenna, and bin size, among others. Unstructured data is complex to archive, especially when we want to concatenate measurements over time, as matrices are different in shape data that cannot be stacked. Therefore, a tree-like hierarchical structure is a reliable/possible way to handle this unstructured data.

I've created a xarray datatree object where the parent is the radar name, and every node corresponds to a volume coverage pattern (VCP). A VCP is shaped of multiple tasks/sweeps, and each sweep has different radar data, such as radar reflectivity, Kdp, and Zdr, among other variables. The xarray datatree object looks like this

DataTree('None', parent=None)
│   Dimensions:   (time: 6)
│   Coordinates:
│     * time      (time) datetime64[ns] 2023-04-07T03:00:28.846000 ... 2023-04-07...
│   Data variables:
│       vcp_time  (time) <U12 ...
├── DataTree('202304070300')
│   ├── DataTree('PRECA')
│   │   │   Dimensions:              ()
│   │   │   Data variables:
│   │   │       altitude             float64 ...
│   │   │       instrument_type      <U5 ...
│   │   │       latitude             float64 ...
│   │   │       longitude            float64 ...
│   │   │       platform_type        <U5 ...
│   │   │       time_coverage_end    <U20 ...
│   │   │       time_coverage_start  <U20 ...
│   │   │       volume_number        int64 ...
│   │   │   Attributes:
│   │   │       Conventions:      None
│   │   │       comment:          im/exported using xradar
│   │   │       history:          None
│   │   │       institution:      None
│   │   │       instrument_name:  None
│   │   │       references:       None
│   │   │       source:           None
│   │   │       title:            None
│   │   │       version:          None
│   │   ├── DataTree('sweep_0')
│   │   │       Dimensions:            (azimuth: 360, range: 747)
│   │   │       Coordinates:
│   │   │           altitude           float64 ...
│   │   │         * azimuth            (azimuth) float64 0.05493 1.057 2.038 ... 358.1 359.0
│   │   │           latitude           float64 ...
│   │   │           longitude          float64 ...
│   │   │         * range              (range) float32 1e+03 1.3e+03 ... 2.245e+05 2.248e+05
│   │   │           spatial_ref        int64 ...
│   │   │           x                  (azimuth, range) float64 ...
│   │   │           y                  (azimuth, range) float64 ...
│   │   │           z                  (azimuth, range) float64 ...
│   │   │       Data variables: (12/19)
│   │   │           DBTH               (azimuth, range) float32 ...
│   │   │           DBZH               (azimuth, range) float32 ...
│   │   │           DB_DBTE8           (azimuth, range) int16 ...
│   │   │           DB_DBZE8           (azimuth, range) int16 ...
│   │   │           DB_HCLASS          (azimuth, range) int16 ...
│   │   │           KDP                (azimuth, range) float32 ...
│   │   │           ...                 ...

If we look in a more condensed way, the groups within the xarray datratree object look like this,

['/',
 '/202304070300',
 '/202304070300/SURVP',
 '/202304070300/SURVP/sweep_0',
 '/202304070300/PRECA',
 '/202304070300/PRECA/sweep_0',
 '/202304070300/PRECA/sweep_1',
 '/202304070300/PRECA/sweep_2',
 '/202304070300/PRECA/sweep_3',
 '/202304070300/PRECB',
 '/202304070300/PRECB/sweep_0',
 '/202304070300/PRECB/sweep_1',
 '/202304070300/PRECC',
 '/202304070300/PRECC/sweep_0',
 '/202304070300/PRECC/sweep_1',
 '/202304070300/PRECC/sweep_2',
 '/202304070305',
 '/202304070305/SURVP',
 '/202304070305/SURVP/sweep_0',
 '/202304070305/PRECA',
 '/202304070305/PRECA/sweep_0',
 '/202304070305/PRECA/sweep_1',
 '/202304070305/PRECA/sweep_2',
 '/202304070305/PRECA/sweep_3',
 '/202304070305/PRECB',
 '/202304070305/PRECB/sweep_0',
 '/202304070305/PRECB/sweep_1',
 '/202304070305/PRECC',
 '/202304070305/PRECC/sweep_0',
 '/202304070305/PRECC/sweep_1',
 '/202304070305/PRECC/sweep_2'
]

Usually, a radar scan strategy or VCP doesn't take longer than 5 minutes; however, it might change depending on the institution/radar operator. The idea under this xarray datatree object is to efficiently access radar data by naming each node with a timestamp that represents the 5-min VCP. Then, each node contains multiple tasks, sweeps, and variables. More detailed information can be found here

As we can see, the datatree object has time coordinates. However, when I try dt.isel(time=0) I got the following error

ValueError                                Traceback (most recent call last)
Cell In[64], line 1
----> 1 dt.isel(time=0)

File ~/miniconda3/envs/xradar/lib/python3.9/site-packages/datatree/mapping.py:208, in map_over_subtree.<locals>._map_over_subtree(*args, **kwargs)
    196 node_kwargs_as_datasets = dict(
    197     zip(
    198         [k for k in kwargs_as_tree_length_iterables.keys()],
   (...)
    203     )
    204 )
    206 # Now we can call func on the data in this particular set of corresponding nodes
    207 results = (
--> 208     func(*node_args_as_datasets, **node_kwargs_as_datasets)
    209     if not node_of_first_tree.is_empty
    210     else None
    211 )
    213 # TODO implement mapping over multiple trees in-place using if conditions from here on?
    214 out_data_objects[node_of_first_tree.path] = results

File ~/miniconda3/envs/xradar/lib/python3.9/site-packages/xarray/core/dataset.py:2431, in Dataset.isel(self, indexers, drop, missing_dims, **indexers_kwargs)
   2427     return self._isel_fancy(indexers, drop=drop, missing_dims=missing_dims)
   2429 # Much faster algorithm for when all indexers are ints, slices, one-dimensional
   2430 # lists, or zero or one-dimensional np.ndarray's
-> 2431 indexers = drop_dims_from_indexers(indexers, self.dims, missing_dims)
   2433 variables = {}
   2434 dims: dict[Hashable, int] = {}

File ~/miniconda3/envs/xradar/lib/python3.9/site-packages/xarray/core/utils.py:859, in drop_dims_from_indexers(indexers, dims, missing_dims)
    857     invalid = indexers.keys() - set(dims)
    858     if invalid:
--> 859         raise ValueError(
    860             f"Dimensions {invalid} do not exist. Expected one or more of {dims}"
    861         )
    863     return indexers
    865 elif missing_dims == "warn":
    866     # don't modify input

ValueError: Dimensions {'time'} do not exist. Expected one or more of Frozen({})

I wonder if xarray datatree objects can be sliced by using similar methods as in Xarray dataset, e.g., dt.sel(time=slice('202304070300', '202304070400'), such as it will return the nodes/subtrees I am interested in.

Please let me know your thoughts and comments.

Cheers,

Alfonso

TomNicholas commented 12 months ago

Hi @aladinor - sorry for missing this!

Am I correct in saying that this is the same issue as https://github.com/xarray-contrib/datatree/issues/67?