zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0
123 stars 24 forks source link

dmrpp root and nested group parsing fix #265

Closed ayushnag closed 2 weeks ago

ayushnag commented 1 month ago
ayushnag commented 3 weeks ago

@TomNicholas this is ready to be reviewed/merged. (cc: @danielfromearth)

One oddity is that I cannot test the dmrpp reader with real world NASA files because of this line in the kerchunk ZArray logic which changes fill_val for a float dtype array. The dmrpp reader sets fill_val to None by default as the ZArray constructor also has. However since the kerchunk code has extra logic, otherwise identical virtual datasets have a different fill_val when opened with virtualizarr netcdf vs dmrpp. Perhaps that logic should be in the ZArray constructor or only added when writing to kerchunk

TomNicholas commented 2 weeks ago

One oddity is that I cannot test the dmrpp reader with real world NASA files because of this line in the kerchunk ZArray logic which changes fill_val for a float dtype array. The dmrpp reader sets fill_val to None by default as the ZArray constructor also has. However since the kerchunk code has extra logic, otherwise identical virtual datasets have a different fill_val when opened with virtualizarr netcdf vs dmrpp. Perhaps that logic should be in the ZArray constructor or only added when writing to kerchunk

Thanks for flagging that - I can't remember exactly what the idea was but it seems like a bug or at least some workaround specific to kerchunk. Would you mind raising a new issue to track this oddity?

danielfromearth commented 2 weeks ago

@ayushnag, this is awesome, and the results are looking great for TEMPO!

temp

Future roadmap is to incorporate this with DataTree too, right, so subgroups are loaded together? :)

ayushnag commented 2 weeks ago

Glad to see it!

Yes, DataTree integration is on the roadmap. In fact, the function _split_groups() currently returns dict[Path, ET.Element] which matches up to DataTree.from_dict() (each ET.Element can be converted to an xr.Dataset)

@ayushnag, this is awesome, and the results are looking great for TEMPO!

temp

Future roadmap is to incorporate this with DataTree too, right, so subgroups are loaded together? :)

TomNicholas commented 2 weeks ago

Excellent thank you @ayushnag !

betolink commented 2 weeks ago

Great work @ayushnag!! I'm just coming back from PTO. I was going to ask, should we try to schedule a meeting with the OPeNDAP people? thanks for reviewing this @TomNicholas!