Reading multiple ICESat-2 ATL11 point cloud data nicely via Zarr

weiji14 commented 4 years ago

Gathering some notes on how best to read multiple ICESat-2 ATL11 data (basically a point cloud) in a user friendly way, with metadata preserved!

TLDR: Be able to do xr.open_mfdataset("ATL11_*.h5", engine="zarr", ...).

Inspired by the blog post "Cloud-Performant NetCDF4/HDF5 Reading with the Zarr Library". Zarr is an amazing project, and I really like the .zmetadata json file which can be opened with a text editor and tell you stuff about the data. The dream would be to read HDF5 files in an out-of-core manner with Zarr like speed/abilities (through the .zmetadata pointer).

Jupyter notebook demo can be found at https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/master/coawst_3ways.ipynb. See also discussion thread at https://github.com/zarr-developers/zarr-python/issues/535 on "Using the Zarr library to read HDF5".

Main hurdles to get through, dependent on upstream, there's two 'separate' parts:

Reading a single HDF5 dataset via Zarr
- [x] On xarray - Need chunk_store argument to use Zarr to read HDF5 - wait for https://github.com/pydata/xarray/pull/3804
- [ ] On zarr (though it's partly a dependency problem for me to work out)
- [x] Upgrading numcodecs from 0.6.3 to 0.6.4 (#11 :heavy_check_mark:) and to 0.7.2 (#160 :heavy_check_mark:) required to use Zarr 2.4.0 or newer (24b691794761899891a1f18383eacbc6c5f03c52 :heavy_check_mark:), but fails due to compilation issues - wait for https://github.com/zarr-developers/numcodecs/pull/224 or use conda-forge's numcodecs package.
- [ ] Specifically, use this hdf5 branch of Zarr to read HDF5 files via Zarr (TODO!).
Reading multiple Zarr-like files via xarray/intake:
- [x] On xarray - Streamline opening multiple zarr files via xr.open_mfdataset - wait for https://github.com/pydata/xarray/pull/4187 / https://github.com/pydata/xarray/pull/4003
- [ ] On intake-xarray - intake.open_ndzarr will break with the above :point_up: - wait for https://github.com/intake/intake-xarray/issues/70

Current situation in that I do HDF5 -> Zarr conversion, and read from that. It would be nice to stick to the original HDF5 data source (though I might need to flatten the nested ICESat-2 ATL11 data structure). Note that I'm not necessarily after raw speed, I just prefer readability (i.e. having xarray's wonderful annotated metadata).

Other open Issues/Pull Requests:

https://github.com/zarr-developers/zarr-python/issues/321 - Provide offset for memory mapping / contiguous layout
https://github.com/zarr-developers/zarr-python/pull/377 - RFC: Optionally support memory-mapping DirectoryStore values
https://github.com/zarr-developers/zarr-python/issues/556 - File Chunk Store
https://github.com/zarr-developers/zarr-python/pull/546 - POC: Add FSStore

Blog posts:

You can tell I had way too many tabs open on my browser :laughing:

weiji14 commented 1 year ago

Putting down some notes on a potential HDF5 -> pandas.DataFrame direct conversion (that skips the intermediate xarray format) using the code at https://github.com/MAAP-Project/gedi-subsetter (thanks @chuckwondo for the pointer!).

H5DataFrame class at https://github.com/MAAP-Project/gedi-subsetter/blob/0.6.0/src/gedi_subset/h5frame.py#L10 which is subclassed from pandas.DataFrame, and is able to hold a single HDF5 group. After reading multiple groups, these DataFrames can then be concatenated row-wise (check?)
subset_hdf5 function at https://github.com/MAAP-Project/gedi-subsetter/blob/0.6.0/src/gedi_subset/gedi_utils.py#L139, which can subset a HDF5 file based on a geopandas.GeoDataFrame Area of Interest. See example usage.

Just some things to play with once I get some free time :slightly_smiling_face:

chuckwondo commented 1 year ago

Awesome! Regarding the subset_hdf5 function, that's specific to the structure of GEDI data files (in particular, in relation to the BEAM* top-level groups), so you wouldn't want to use it for non-GEDI data files. For non-GEDI data files, you can directly use H5DataFrame.

weiji14 commented 1 year ago

H5DataFrame works for ICESat-2 ATL03 - https://github.com/ICESAT-2HackWeek/h5cloud/pull/5 :tada: There are some small quirks (e.g. the need to access groups/variable via df["group/variable"] to get at the data), but it should work for ATL11 too :crossed_fingers:

We're actually working on some benchmarks over in that repo (e.g. https://github.com/ICESAT-2HackWeek/h5cloud/pull/9), and the H5DataFrame read method is looking to be ~4x faster than xarray's h5netcdf (and that's without considering the conversion from xarray.Dataset -> pd.DataFrame), so looking real promising!

weiji14 / deepicedrain

Reading multiple ICESat-2 ATL11 point cloud data nicely via Zarr #100