zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

NotImplementedError: Unrecognised file extension: .nc4 (and other common ones .h5, .hdf, .tif) #142

Closed scottyhq closed 1 week ago

scottyhq commented 3 weeks ago

I was just trying out VirtualiZarr with a file from NASA that has a .nc4 suffix. From experience, .h5, .hdf, .nc4 are all pretty common, and xr.open_dataset() handles these without explicitly pointing to any engine.

open_virtual_dataset('./data_raw/MERRA2/MERRA2_400.statD_2d_slv_Nx.20240101.nc4') # filetype='netcdf4' required

https://github.com/zarr-developers/VirtualiZarr/blob/18f195a788cc8e6ce204e35a00e08ab0ccfb589c/virtualizarr/kerchunk.py#L124

Of course, at the end of the day the file name is arbitrary, but I could put in a PR to add a few more common ones for convenience. Something like this?

if file_extension in ['.nc', '.nc4', '.hdf', '.h5']:
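The extension check proposed above could be sketched roughly as follows. This is only an illustration: the names `HDF5_LIKE_EXTENSIONS` and `guess_filetype_from_extension` are hypothetical and are not taken from the VirtualiZarr codebase, and the returned string simply mirrors the `filetype='netcdf4'` value mentioned in the comment above.

```python
from pathlib import Path

# Hypothetical set of suffixes treated as netCDF-4/HDF5 candidates.
# The suffix is only a hint: '.nc' can also be netCDF v3 and '.hdf'
# can also be HDF4, so a content check is still needed afterwards.
HDF5_LIKE_EXTENSIONS = {".nc", ".nc4", ".hdf", ".h5"}


def guess_filetype_from_extension(filepath: str) -> str:
    """Return a filetype guess based purely on the file suffix."""
    file_extension = Path(filepath).suffix.lower()
    if file_extension in HDF5_LIKE_EXTENSIONS:
        return "netcdf4"
    # Same style of error as the one reported in this issue.
    raise NotImplementedError(f"Unrecognised file extension: {file_extension}")
```

With this sketch, the MERRA2 path from the example resolves to `"netcdf4"` without an explicit `filetype` argument, while a suffix such as `.tif` still raises.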

TomNicholas commented 3 weeks ago

These aren't actually different formats right? Just the same format with a different naming convention?

Currently open_virtual_dataset dispatches to kerchunk.SingleHdf5ToZarr, so as long as kerchunk can understand these then this sounds good.

scottyhq commented 3 weeks ago

I think the only thing you have to watch out for is older .nc files being netCDF v3 (which you already handle by checking the header bytes). Otherwise yes, .nc4 is a subset of HDF5:

Creating a netCDF-4/HDF5 file with netCDF-4 results in an HDF5 file. The features of netCDF-4 are a subset of the features of HDF5, so the resulting file can be used by any existing HDF5 application. (https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html)
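For readers unfamiliar with the header-byte check mentioned above, here is a minimal sketch of how netCDF v3 can be told apart from netCDF-4/HDF5 regardless of file suffix. It is not VirtualiZarr's actual implementation; the function names and return strings are made up for illustration. It relies on the published signatures: HDF5 files start with the 8 bytes `\x89HDF\r\n\x1a\n`, and classic netCDF files start with `CDF` followed by a version byte of 1, 2 or 5.

```python
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"            # HDF5 format signature
NETCDF3_VERSION_BYTES = (b"\x01", b"\x02", b"\x05")  # classic, 64-bit offset, CDF-5


def sniff_format(header: bytes) -> str:
    """Classify a file from its first bytes (hypothetical helper)."""
    if header.startswith(HDF5_MAGIC):
        return "netcdf4/hdf5"
    if header[:3] == b"CDF" and header[3:4] in NETCDF3_VERSION_BYTES:
        return "netcdf3"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a local file and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

One simplification to be aware of: the HDF5 signature may also sit at a later offset (512, 1024, ...) when a file has a user block, and this sketch only looks at offset 0.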