scottyhq closed this issue 1 week ago
These aren't actually different formats right? Just the same format with a different naming convention?
Currently `open_virtual_dataset` dispatches to `kerchunk.SingleHdf5ToZarr`, so as long as kerchunk can understand these then this sounds good.
I think the only thing you have to watch out for is an older .nc file being netCDF v3 (which you already handle by checking the header bytes). Otherwise yes, .nc4 is a subset of the HDF5 format:
> Creating a netCDF-4/HDF5 file with netCDF-4 results in an HDF5 file. The features of netCDF-4 are a subset of the features of HDF5, so the resulting file can be used by any existing HDF5 application.
(https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html)
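For illustration, the header-byte check mentioned above could look roughly like the sketch below. This is not VirtualiZarr's actual code: the function name `sniff_format` and the returned labels are hypothetical. It only assumes the published magic numbers: netCDF classic files start with `CDF` followed by a version byte, and HDF5 files start with an 8-byte signature.

```python
# Published magic numbers for the two format families.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"                   # HDF5 superblock signature
NETCDF_CLASSIC_MAGICS = (b"CDF\x01", b"CDF\x02", b"CDF\x05")  # classic, 64-bit offset, CDF-5


def sniff_format(path):
    """Hypothetical helper: classify a file by its first bytes, ignoring its name."""
    with open(path, "rb") as f:
        header = f.read(8)
    # Simplification: HDF5 also allows the signature at offsets 512, 1024, ...
    # when a user block is present; only offset 0 is checked here.
    if header.startswith(HDF5_SIGNATURE):
        return "hdf5"
    if header[:4] in NETCDF_CLASSIC_MAGICS:
        return "netcdf3"
    return "unknown"
```

With a check like this, a netCDF v3 file named `.nc` and a netCDF-4 file named `.nc` end up on different code paths regardless of the suffix.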
I was just trying out VirtualiZarr with a file from NASA that has a .nc4 suffix. From experience, `.h5`, `.hdf`, and `.nc4` are all pretty common, and `xr.open_dataset()` handles these without explicitly pointing to any engine.
open_virtual_dataset('./data_raw/MERRA2/MERRA2_400.statD_2d_slv_Nx.20240101.nc4')  # filetype='netcdf4' required
https://github.com/zarr-developers/VirtualiZarr/blob/18f195a788cc8e6ce204e35a00e08ab0ccfb589c/virtualizarr/kerchunk.py#L124
Of course, at the end of the day the file name is arbitrary, but I could put in a PR to add a few more common ones for convenience, e.g. `if file_extension in ['.nc', '.nc4', '.hdf', '.h5']:`?
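A minimal sketch of what such an extension check might look like as a standalone helper. The function name `guess_filetype_from_extension` and the constant are hypothetical, not part of VirtualiZarr; the `'netcdf4'` label is taken from the `filetype='netcdf4'` argument mentioned above.

```python
from pathlib import Path

# Extensions proposed in this issue. Note that .hdf is also used for HDF4
# files, so the suffix is only a hint and not proof of the on-disk format.
HDF5_LIKE_EXTENSIONS = {".nc", ".nc4", ".hdf", ".h5"}


def guess_filetype_from_extension(filepath):
    """Hypothetical helper: return 'netcdf4' for common HDF5-style suffixes, else None."""
    suffix = Path(filepath).suffix.lower()
    if suffix in HDF5_LIKE_EXTENSIONS:
        return "netcdf4"
    return None
```

For the MERRA2 file above, `Path(...).suffix` is `.nc4`, so the helper would return `'netcdf4'`. A `.nc` file may still be netCDF v3, so the result would need to be confirmed against the header bytes.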