xarray-contrib / datatree

WIP implementation of a tree-like hierarchical data structure for xarray.
https://xarray-datatree.readthedocs.io
Apache License 2.0

`open_datatree` performance #330

aladinor opened this issue 2 months ago

aladinor commented 2 months ago

Hi everyone,

Today, I would like to open a discussion about the `open_datatree` method for opening files stored in Zarr.

First, I would like to give some background on what we (@kmuehlbauer and @mgrover1) are working on. We propose a hierarchical, tree-like data model to store weather radar data following the FAIR principles (Findable, Accessible, Interoperable, Reusable). A Radar Volume Scan (RVS), comprising data collected through multiple cone-like sweeps at various elevation angles, is produced every 5 to 10 minutes, often exceeds several megabytes in size, and is usually stored as an individual file. Radar data storage also involves proprietary formats that demand extensive input-output (I/O) operations, leading to prolonged computation times and high hardware requirements. In response, our study introduces a novel data model designed to address these challenges. Leveraging the FM301 hierarchical tree structure, a Climate and Forecast (CF) conventions-based standard endorsed by the World Meteorological Organization (WMO), together with Analysis-Ready Cloud-Optimized (ARCO) formats, we aim to develop an open data model to arrange, manage, and store radar data efficiently in cloud-storage buckets.

Thus, the idea is to create a hierarchical tree as follows:

The root corresponds to the radar name, with one node per sweep within the RVS and an xarray.Dataset attached to each sweep node. Our datatree now contains not only all sweeps for each RVS, with the azimuth and range dimensions, but also a third dimension, vcp_time, which allows us to build a time series.
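As a rough illustration of this layout, here is a minimal sketch that assembles such a tree with `DataTree.from_dict`; the variable name, elevation angles, and dimension sizes are placeholders, not the actual dataset:

```python
import numpy as np
import xarray as xr
from datatree import DataTree

# Hypothetical sweep dataset with the (vcp_time, azimuth, range) layout
# described above; variable and coordinate names are placeholders.
def make_sweep(elevation, n_times=4, n_az=360, n_rng=100):
    return xr.Dataset(
        {"DBZH": (("vcp_time", "azimuth", "range"),
                  np.zeros((n_times, n_az, n_rng)))},
        coords={"sweep_fixed_angle": elevation},
    )

# Root named after the radar, one child node per sweep, plus a radar_parameter group.
tree = DataTree.from_dict(
    {
        "/radar_parameter": xr.Dataset(),
        **{f"/sweep_{i}": make_sweep(angle)
           for i, angle in enumerate([0.5, 1.3, 2.4])},
    },
    name="radar_name",  # placeholder for the actual radar name
)
print(tree)
```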

We used data from the Colombian radar network, which is available on this S3 bucket, and put it all together in this repo. Our xarray DataTree object has 11 nodes: one for the radar_parameter group and ten for the different sweep elevations, ranging from 0 to 20 degrees.

Looking in detail at sweep_0 (0.5-degree elevation angle), we can see that we now have around two thousand 5-min measurements along the vcp_time dimension. We have time series!!!

However, we found that as we append datasets along the vcp_time dimension, the opening/loading time grows. For example, in our case, two thousand measurements, corresponding to around ten days of consecutive 5-min scans, took around 46.4 seconds to open.
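For reference, a timing like the one quoted above can be reproduced along these lines; the store path is a placeholder and the Zarr engine is an assumption:

```python
import time

from datatree import open_datatree

# Placeholder path: point this at the Zarr store that holds the radar hierarchy.
store = "s3://example-bucket/radar_volume.zarr"

t0 = time.perf_counter()
dt = open_datatree(store, engine="zarr")
elapsed = time.perf_counter() - t0

print(f"Opened {len(dt.children)} child nodes in {elapsed:.1f} s")
print(dt["sweep_0"].ds.sizes)  # e.g. vcp_time, azimuth, range
```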

Now, if we scale this to longer archives of radar data (e.g., 10+ years), this issue will become a weak spot in our data model.

We consider this data model revolutionary for radar data analysis, since we can now perform sizeable computational analyses quickly. For example, we computed a Quasi-Vertical Profile (QVP) analysis in a few lines of code and about 5 seconds of execution time.
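For context, a QVP is essentially an azimuthal average of a higher-elevation sweep, leaving a profile over range (height) and time. A minimal sketch, assuming a reflectivity variable named DBZH and the node layout above (both names are placeholders):

```python
# Average a (hypothetical) reflectivity field over azimuth for one of the
# higher-elevation sweeps, leaving a (vcp_time, range) quasi-vertical profile.
qvp = dt["sweep_9"].ds["DBZH"].mean(dim="azimuth")
qvp.plot(x="vcp_time")  # requires matplotlib
```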

We looked into the open_datatree method and found that it loops through each node, opening/loading the Dataset at every one. This sequential looping is a likely reason for the slow opening times. It could be improved, and we would be happy to lend a hand.

https://github.com/xarray-contrib/datatree/blob/0afaa6cc1d6800987d8b9c37a604dc0a8c68aeaa/datatree/io.py#L84C1-L94C36
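Paraphrasing the linked helper (not the exact source), it opens the root dataset and then calls open_dataset once per group, one after the other. One possible direction, sketched below under the assumption that the per-group opens are independent and the backend tolerates concurrent read-only access, would be to issue those opens in parallel, e.g. with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

import xarray as xr
from datatree import DataTree


def open_groups_concurrently(store, groups, engine="zarr", max_workers=8):
    """Sketch: open each group of ``store`` concurrently and assemble a DataTree.

    ``groups`` is an iterable of group paths such as ["sweep_0", "sweep_1", ...].
    This is only an illustration of the idea, not the library's implementation.
    """
    def _open(group):
        return f"/{group}", xr.open_dataset(store, group=group, engine=engine)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        datasets = dict(pool.map(_open, groups))

    return DataTree.from_dict(datasets)
```

Whether threads, asyncio, or lazier group discovery is the right approach is of course up for discussion; the point is only that the per-node work need not be strictly sequential.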

Please let us know your thoughts.

kmuehlbauer commented 2 months ago

Thanks @aladinor for initiating this discussion.

As datatree is currently being merged into xarray, spearheaded by @TomNicholas, @flamingbear, @owenlittlejohns, and @eni-awowale, it would be great to hear their thoughts on this particular topic.

If necessary, we can arrange to join the weekly datatree meeting to discuss this.

Anyway, any guidance and suggestions on how to move forward with the data layout we presented are greatly appreciated.

Thanks a bunch!

TomNicholas commented 2 months ago

Hi guys! Thanks for your interest!

@kmuehlbauer is right, we're merging datatree upstream into xarray, and I believe that what you're asking for is essentially covered by the "Generalize backends to support groups" bullet point already listed in https://github.com/pydata/xarray/issues/8572.

Whilst the others tagged above are working through the tasks in #8572, no one has been assigned to that item yet, so if anyone here would like to help, that would be amazing. You're welcome to come to the datatree meeting next week!

> We looked into the open_datatree method and found that it loops through each node, opening/loading the Dataset at every one.

Yep, this implementation was only ever intended as a proof of principle; it's not optimized at all.