zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

KeyError: 'refs' opening TIFF file #160

Closed scottyhq closed 1 week ago

scottyhq commented 1 week ago

import kerchunk.tiff
from virtualizarr import open_virtual_dataset

url = 'https://github.com/fsspec/kerchunk/raw/main/kerchunk/tests/lcmap_tiny_cog_2020.tif'
kerchunk.tiff.tiff_to_zarr(url)

vds = open_virtual_dataset(url, reader_options={})
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 vds = open_virtual_dataset(url, reader_options={})

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:114, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
    106 else:
    107     # this is the only place we actually always need to use kerchunk directly
    108     # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
    109     vds_refs = kerchunk.read_kerchunk_references_from_file(
    110         filepath=filepath,
    111         filetype=filetype,
    112         reader_options=reader_options,
    113     )
--> 114     virtual_vars = virtual_vars_from_kerchunk_refs(
    115         vds_refs,
    116         drop_variables=drop_variables + loadable_variables,
    117         virtual_array_class=virtual_array_class,
    118     )
    119     ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})
    120     coord_names = ds_attrs.pop("coordinates", [])

File ~/GitHub/VirtualiZarr/virtualizarr/xarray.py:239, in virtual_vars_from_kerchunk_refs(refs, drop_variables, virtual_array_class)
    222 def virtual_vars_from_kerchunk_refs(
    223     refs: KerchunkStoreRefs,
    224     drop_variables: list[str] | None = None,
    225     virtual_array_class=ManifestArray,
    226 ) -> Mapping[str, xr.Variable]:
    227     """
    228     Translate a store-level kerchunk reference dict into aaset of xarray Variables containing virtualized arrays.
    229 
   (...)
    236         Currently can only be ManifestArray, but once VirtualZarrArray is implemented the default should be changed to that.
    237     """
--> 239     var_names = kerchunk.find_var_names(refs)
    240     if drop_variables is None:
    241         drop_variables = []

File ~/GitHub/VirtualiZarr/virtualizarr/kerchunk.py:154, in find_var_names(ds_reference_dict)
    151 def find_var_names(ds_reference_dict: KerchunkStoreRefs) -> list[str]:
    152     """Find the names of zarr variables in this store/group."""
--> 154     refs = ds_reference_dict["refs"]
    155     found_var_names = [key.split("/")[0] for key in refs.keys() if "/" in key]
    156     return found_var_names

KeyError: 'refs'

TomNicholas commented 1 week ago

Thanks for raising this @scottyhq! I don't think anyone has actually tried to open a tiff file with virtualizarr before!

Running your example, the output of kerchunk.tiff.tiff_to_zarr(url) looks like

{
  '.zgroup': '{\n "zarr_format": 2\n}',
  '.zattrs': '{"multiscales":[{"datasets":[{"path":"0"},{"path":"1"},{"path":"2"}],"metadata":{},"name":"","version":"0.1"}],"OVR_RESAMPLING_ALG":"NEAREST","LAYOUT":"IFDS_BEFORE_DATA","BLOCK_ORDER":"ROW_MAJOR","BLOCK_LEADER":"SIZE_AS_UINT4","BLOCK_TRAILER":"LAST_4_BYTES_REPEATED","KNOWN_INCOMPATIBLE_EDITION":"NO","KeyDirectoryVersion":1,"KeyRevision":1,"KeyRevisionMinor":0,"GTModelTypeGeoKey":1,"GTRasterTypeGeoKey":1,"GTCitationGeoKey":"Albers","GeographicTypeGeoKey":4326,"GeogCitationGeoKey":"WGS 84","GeogAngularUnitsGeoKey":9102,"GeogSemiMajorAxisGeoKey":6378140.0,"GeogInvFlatteningGeoKey":298.256999999996,"ProjectedCSTypeGeoKey":32767,"ProjectionGeoKey":32767,"ProjCoordTransGeoKey":11,"ProjLinearUnitsGeoKey":9001,"ProjStdParallel1GeoKey":29.5,"ProjStdParallel2GeoKey":45.5,"ProjNatOriginLongGeoKey":-96.0,"ProjNatOriginLatGeoKey":23.0,"ProjFalseEastingGeoKey":0.0,"ProjFalseNorthingGeoKey":0.0,"ModelPixelScale":[30.0,30.0,0.0],"ModelTiepoint":[0.0,0.0,0.0,-1801185.0,2700405.0,0.0]}',
  '0/.zattrs': '{\n "_ARRAY_DIMENSIONS": [\n  "Y",\n  "X"\n ]\n}',
  '0/.zarray': '{\n "chunks": [\n  512,\n  512\n ],\n "compressor": {\n  "id": "zlib"\n },\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  2048,\n  2048\n ],\n "zarr_format": 2\n}',
  ...,
}

It looks like this is not the same structure that e.g. kerchunk.hdf.SingleHdf5ToZarr returns.

What virtualizarr is expecting (and what the kerchunk docs promise...) is that the keys of the outermost dictionary are 'refs' and 'version'. This kerchunk.tiff.tiff_to_zarr(url) function seems to have jumped straight to giving us the contents that would normally be underneath the 'refs' key.
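To make the mismatch concrete, here is a minimal sketch using hand-written stand-in dicts rather than real kerchunk output. `find_var_names` is copied from the traceback above, with the `KerchunkStoreRefs` annotation loosened to a plain `dict` so the snippet runs on its own; the names `flat` and `wrapped` are only for illustration.

```python
def find_var_names(ds_reference_dict: dict) -> list[str]:
    """Find the names of zarr variables in this store/group (as in the traceback)."""
    refs = ds_reference_dict["refs"]
    found_var_names = [key.split("/")[0] for key in refs.keys() if "/" in key]
    return found_var_names


# Stand-in for what tiff_to_zarr returned: the references sit at the top level.
flat = {".zgroup": "{}", "0/.zarray": "{}", "0/.zattrs": "{}"}

# Stand-in for the layout virtualizarr expects: the same mapping nested under 'refs'.
wrapped = {"version": 1, "refs": flat}

print(find_var_names(wrapped))  # ['0', '0']

try:
    find_var_names(flat)
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'refs'
```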

We could either fix this upstream in kerchunk, or just work around it here by special-casing tiffs to add that top-level {'refs': ...} ourselves. I vote for the latter.
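A rough sketch of what that workaround might look like. The helper name `ensure_refs_wrapper` is hypothetical and not part of virtualizarr or kerchunk, and the `"version": 1` value is an assumption based on the kerchunk reference format; it also assumes a flat dict never has a top-level key literally named `'refs'`.

```python
def ensure_refs_wrapper(refs: dict) -> dict:
    """Hypothetical helper: return references in the {'version', 'refs'} layout.

    Dicts that already have a top-level 'refs' key are returned unchanged;
    flat dicts (as seen from the TIFF reader) are nested under 'refs'.
    """
    if "refs" in refs:
        return refs
    # Assumption: version 1 of the kerchunk reference format.
    return {"version": 1, "refs": refs}


# Hand-written stand-in for the flat output shown above.
flat = {".zgroup": '{"zarr_format": 2}', "0/.zarray": "{}"}

fixed = ensure_refs_wrapper(flat)
print(sorted(fixed.keys()))  # ['refs', 'version']
```

Checking for the `'refs'` key, rather than special-casing on file type, would also cover any other reader that returns the flat layout, though special-casing TIFF explicitly is the narrower change.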

scottyhq commented 1 week ago

Makes sense @TomNicholas. It seems like, if someone is motivated, an upstream fix is the right approach. I just came across this while working on https://github.com/zarr-developers/VirtualiZarr/pull/143, so I figured I'd document it. For what it's worth, it's currently the same situation with FITS:

import kerchunk.fits

url = 'https://fits.gsfc.nasa.gov/samples/WFPC2u5780205r_c0fx.fits'
kerchunk.fits.process_file(url)

TomNicholas commented 1 week ago

Yeah thanks for documenting it!

We should raise an issue upstream to report it, but as long as there are no other differences in the structure then working around it here should be very simple.