zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

KeyError: '.zarray' with HDF5 data #159

Open scottyhq opened 1 week ago

scottyhq commented 1 week ago

I'm getting a KeyError Traceback for several different HDF5 files.

For example, trying to use this test file from kerchunk: https://github.com/fsspec/kerchunk/blob/ae692fead51a216691e4db9a67c99194c5ba8e14/kerchunk/tests/test_hdf.py#L307

import kerchunk.hdf
from virtualizarr import open_virtual_dataset

# (or local path)
url = 'https://github.com/fsspec/kerchunk/raw/ae692fead51a216691e4db9a67c99194c5ba8e14/kerchunk/tests/NEONDSTowerTemperatureData.hdf5'
kerchunk.hdf.SingleHdf5ToZarr(url).translate()

# side note: also not sure why `reader_options={}` is required to read from the URL
vds = open_virtual_dataset(url, reader_options={})
File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:114, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
    106 else:
    107     # this is the only place we actually always need to use kerchunk directly
    108     # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
    109     vds_refs = kerchunk.read_kerchunk_references_from_file(
    110         filepath=filepath,
    111         filetype=filetype,
    112         reader_options=reader_options,
    113     )
--> 114     virtual_vars = virtual_vars_from_kerchunk_refs(
    115         vds_refs,
    116         drop_variables=drop_variables + loadable_variables,
    117         virtual_array_class=virtual_array_class,
    118     )
    119     ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
    120     coord_names = ds_attrs.pop("coordinates", [])

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:247, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class)
    241     drop_variables = []
    242 var_names_to_keep = [
    243     var_name for var_name in var_names if var_name not in drop_variables
    244 ]
    246 vars = {
--> 247     var_name: variable_from_kerchunk_refs(refs, var_name, virtual_array_class)
    248     for var_name in var_names_to_keep
    249 }
    250 return vars

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:293, in variable_from_kerchunk_refs(refs, var_name, virtual_array_class)
    290 """Create a single xarray Variable by reading specific keys of a kerchunk references dict."""
    292 arr_refs = kerchunk.extract_array_refs(refs, var_name)
--> 293 chunk_dict, zarray, zattrs = kerchunk.parse_array_refs(arr_refs)
    295 manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)
    297 # we want to remove the _ARRAY_DIMENSIONS from the final variables' .attrs

File ~/GitHub/VirtualiZarr/virtualizarr/kerchunk.py:186, in parse_array_refs(arr_refs)
    183 def parse_array_refs(
    184     arr_refs: KerchunkArrRefs,
    185 ) -> tuple[dict, ZArray, ZAttrs]:
--> 186     zarray = ZArray.from_kerchunk_refs(arr_refs.pop(".zarray"))
    187     zattrs = arr_refs.pop(".zattrs", {})
    188     chunk_dict = arr_refs

KeyError: '.zarray'

TomNicholas commented 1 week ago

This file has nested groups, which we don't support yet. See #84.

The error is because the current code for parsing kerchunk references essentially assumes that each group is an array (i.e. that there is only one group - the root). It errors when it doesn't find the array metadata.
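To make the failure mode concrete, here is a minimal sketch in plain Python. The reference keys, the file name `file.h5`, and the helpers `top_level_names` and `extract_refs` are all made up for illustration. They only imitate the "every top-level name is an array" assumption described above and are not VirtualiZarr's actual code.

```python
import json

# Hand-written kerchunk-style references for a file with one nested group.
# Keys, shapes, and byte ranges are invented for illustration only.
refs = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "site/.zgroup": json.dumps({"zarr_format": 2}),
    "site/temperature/.zarray": json.dumps(
        {"shape": [4], "chunks": [4], "dtype": "<f8"}
    ),
    "site/temperature/0": ["file.h5", 2048, 32],
}


def top_level_names(refs):
    """Treat the first path component of every nested key as a variable name."""
    return sorted({key.split("/")[0] for key in refs if "/" in key})


def extract_refs(refs, name):
    """Collect the entries under `name`, keyed by the second path component."""
    return {
        key.split("/")[1]: value
        for key, value in refs.items()
        if "/" in key and key.split("/")[0] == name
    }


names = top_level_names(refs)          # "site" is a group, not an array
arr_refs = extract_refs(refs, "site")  # holds ".zgroup" and "temperature"

try:
    # Array metadata is expected here, but a group only carries ".zgroup".
    arr_refs.pop(".zarray")
except KeyError as err:
    error_message = f"KeyError: {err}"
    print(error_message)
```

Because `site` is a group, the array metadata lives one level deeper (`site/temperature/.zarray`), so the lookup at the group level finds nothing to pop.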

scottyhq commented 1 week ago

It's surprisingly hard to find HDF5 files out in the wild that don't have groups! If you know of any for testing, let me know @TomNicholas; otherwise it seems like a group kwarg, open_virtual_dataset(..., group='xyz'), would be a good solution. Often the groups are unrelated enough that you really only want to work with one anyway.

TomNicholas commented 1 week ago

Are you looking for HDF5 files that are not netCDF4 files? Because we could just point to the same file that we open as xarray's tutorial dataset?

We do need the group kwarg anyway. I think Ayush's DMR++ PR adds one for that particular backend.

On Tue, Jun 25, 2024, 6:16 PM Scott Henderson @.***> wrote:

It's surprisingly hard to find HDF5 files out in the wild that don't have groups! If you know of any for testing let me know @TomNicholas, otherwise it seems like the group kwarg open_virtual_dataset(..., group='xyz') would be a good solution. Often the groups are unrelated enough that you really only want to work with one anyways.


scottyhq commented 1 week ago

Are you looking for HDF5 files that are not netCDF4 files?

Correct, I was. From the netCDF Docs: "Some HDF5 features not supported in netCDF-4 format include non-hierarchical group structures, HDF5 reference types, multiple links to a data object, user-defined atomic data types, stored property lists, more permissive rules for data object names, the HDF5 date/time type, and attributes associated with user-defined types"

Although, taking a step back, I don't think it's worth seeking out examples of HDF5 files that are not netCDF4 :) I'd guess many of the .h5 files out there actually conform to the netCDF4 subset and could've just as easily been named .nc! Feel free to close this one as a duplicate of #84.

forrestfwilliams commented 1 week ago

Just wanted to +1 this issue. I ran into it today while also trying to use VirtualiZarr on a netCDF file with multiple groups.

TomNicholas commented 1 week ago

@forrestfwilliams I'm happy to review a PR! Adding a group kwarg to open_virtual_dataset should be pretty simple - you would just read all the references using kerchunk as it does now, then select out only the part of the nested references dict that corresponds to that group. Then use virtual_vars_from_kerchunk_refs (the function shown in the traceback above) on just that.
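A rough sketch of the "select out only that group" step, assuming the references are a flat dict whose keys are slash-separated paths (as in the inner `refs` mapping kerchunk produces). The helper name `select_group_refs`, the group names, and the sample references are hypothetical and not part of VirtualiZarr or kerchunk.

```python
import json


def select_group_refs(refs, group):
    """Return only the references under `group`, with the group prefix stripped.

    Hypothetical helper: after this, the group's arrays sit at the top level,
    which is the layout the existing per-variable parsing expects.
    """
    prefix = group.strip("/") + "/"
    selected = {
        key[len(prefix):]: value
        for key, value in refs.items()
        if key.startswith(prefix)
    }
    if not selected:
        raise KeyError(f"no references found under group {group!r}")
    return selected


# Invented references for a file with two sibling groups.
refs = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "site_a/.zgroup": json.dumps({"zarr_format": 2}),
    "site_a/temperature/.zarray": json.dumps(
        {"shape": [4], "chunks": [4], "dtype": "<f8"}
    ),
    "site_a/temperature/0": ["file.h5", 2048, 32],
    "site_b/.zgroup": json.dumps({"zarr_format": 2}),
    "site_b/pressure/.zarray": json.dumps(
        {"shape": [4], "chunks": [4], "dtype": "<f8"}
    ),
    "site_b/pressure/0": ["file.h5", 4096, 32],
}

group_refs = select_group_refs(refs, "site_a")
print(sorted(group_refs))
```

The trailing slash in the prefix keeps a group like `site` from accidentally matching `site_a`. Groups nested inside the selected group would still need separate handling, which this sketch does not attempt.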