zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

NetCDF files with many empty variables require large `drop_variables` declaration #194

Closed ghidalgo3 closed 1 month ago

ghidalgo3 commented 1 month ago

I'm not sure if this is a real issue or working as intended, but I wanted to get clarification on it. The GOES dataset's NetCDF files contain so many non-array variables that the call to `virtualizarr.open_virtual_dataset` needs a long `drop_variables` list to work. All of the dropped variables have a chunk_shape of `()` in common, because they're not arrays.

If `validate_chunk_shape` skipped validating empty dicts, it would be possible to open this dataset without passing such a long `drop_variables` list, and, more importantly, we could preserve variable metadata when writing a Zarr store for this file and others like it. Is there a reason VirtualiZarr shouldn't read empty variables like this?
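To illustrate the kind of check involved, here is a hypothetical sketch of a chunk-shape validator that tolerates scalar variables; the function name and signature are assumed and need not match VirtualiZarr's actual `validate_chunk_shape`:

```python
# Hypothetical sketch: a chunk-shape check that skips scalar variables,
# whose shape and chunk shape are both the empty tuple ().
def validate_chunk_shape(shape: tuple, chunks: tuple) -> None:
    if shape == () and chunks == ():
        return  # scalar variable: nothing to validate
    if len(chunks) != len(shape):
        raise ValueError(f"chunk rank {len(chunks)} != array rank {len(shape)}")
    if any(c > s for c, s in zip(chunks, shape)):
        raise ValueError(f"chunks {chunks} exceed shape {shape}")


validate_chunk_shape((), ())            # scalar: passes without complaint
validate_chunk_shape((5424, 5424), (1356, 1356))  # normal array: also fine
```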

Repro

import planetary_computer
import pystac_client
import virtualizarr as vz

catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1/")
search = catalog.search(
    collections=["goes-cmi"],
    datetime=["2019-08-01T12:00:00Z", "2019-08-01T13:00:00Z"],
    query={"goes:image-type": {"eq": "FULL DISK"}},
)
items = sorted(search.item_collection(), key=lambda x: x.datetime)
item = items[0]
vds = vz.open_virtual_dataset(
    planetary_computer.sign(item.assets["CMIP_C15-nc"].href),
    # drop_variables=[
    #     "processing_parm_version_container",
    #     "earth_sun_distance_anomaly_in_AU",
    #     "goes_imager_projection",
    #     "esun",
    #     "algorithm_product_version_container",
    #     "std_dev_brightness_temperature",
    #     "maximum_focal_plane_temperature",
    #     "focal_plane_temperature_threshold_increasing",
    #     "focal_plane_temperature_threshold_exceeded_count",
    #     "y_image",
    #     "nominal_satellite_height",
    #     "outlier_pixel_count",
    #     "focal_plane_temperature_threshold_decreasing",
    #     "min_brightness_temperature",
    #     "planck_bc2",
    #     "nominal_satellite_subpoint_lon",
    #     "nominal_satellite_subpoint_lat",
    #     "valid_pixel_count",
    #     "planck_bc1",
    #     "mean_brightness_temperature",
    #     "planck_fk2",
    #     "percent_uncorrectable_GRB_errors",
    #     "geospatial_lat_lon_extent",
    #     "kappa0",
    #     "percent_uncorrectable_L0_errors",
    #     "t",
    #     "max_brightness_temperature",
    #     "x_image",
    #     "planck_fk1",
    #     "algorithm_dynamic_input_data_container",
    #     "total_number_of_points"],
    reader_options={},
)
TomNicholas commented 1 month ago

By "non-array" you mean scalar right?

This is definitely an edge case I haven't thought through in great detail.

I think the main question is whether one can have valid Zarr arrays with a chunk shape of (). If not then we might want to automatically load these variables instead, i.e. add them to loadable_variables.

ghidalgo3 commented 1 month ago

By "non-array" you mean scalar right?

I think so? I'm not sure how to translate between the NetCDF concept and a Zarr array, I think ultimately Zarr only deals in Arrays or Groups. Scalar is a good word for it though.
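For reference, in NumPy terms such a variable corresponds to a zero-dimensional array: it has shape `()` and no dimensions, but still holds a single value alongside its attributes:

```python
import numpy as np

# A NetCDF "scalar" variable maps to a 0-d array: shape (), ndim 0, size 1.
a = np.array(42.0)
print(a.shape, a.ndim, a.size)  # () 0 1
print(float(a))                 # 42.0
```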

I think the main question is whether one can have valid Zarr arrays with a chunk shape of ().

You can in Zarr v2. This is allowed:

import zarr

arr = zarr.open("example.zarr", mode="w", shape=(), chunks=(), path="a")
arr.attrs["important attribute"] = 42

Which results in this store:

❯ tree example.zarr -a
example.zarr
├── .zgroup
└── a
    ├── .zarray
    └── .zattrs

.zgroup:

{
    "zarr_format": 2
}

.zarray of a:

{
    "chunks": [],
    "compressor": null,
    "dtype": "<f8",
    "fill_value": 0.0,
    "filters": null,
    "order": "C",
    "shape": [],
    "zarr_format": 2
}

.zattrs of a:

{
    "important attribute": 42
}

But this is not allowed in Zarr v3:

import zarr
zarr.open(store="example.zarr", mode="w", shape=(), chunk_shape=(), path="a")
File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/synchronous.py:45, in open(store, mode, zarr_version, zarr_format, path, **kwargs)
     36 def open(
     37     *,
     38     store: StoreLike | None = None,
   (...)
     43     **kwargs: Any,  # TODO: type kwargs as valid args to async_api.open
     44 ) -> Array | Group:
---> 45     obj = sync(
     46         async_api.open(
     47             store=store,
     48             mode=mode,
     49             zarr_version=zarr_version,
     50             zarr_format=zarr_format,
     51             path=path,
     52             **kwargs,
     53         )
     54     )
     55     if isinstance(obj, AsyncArray):
     56         return Array(obj)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/sync.py:92, in sync(coro, loop, timeout)
     89 return_result = next(iter(finished)).result()
     91 if isinstance(return_result, BaseException):
---> 92     raise return_result
     93 else:
     94     return return_result

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/sync.py:51, in _runner(coro)
     46 """
     47 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
     48 exception, the exception will be returned.
     49 """
     50 try:
---> 51     return await coro
     52 except Exception as ex:
     53     return ex

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/asynchronous.py:198, in open(store, mode, zarr_version, zarr_format, path, **kwargs)
    195     store_path = store_path / path
    197 try:
--> 198     return await open_array(store=store_path, zarr_format=zarr_format, **kwargs)
    199 except KeyError:
    200     return await open_group(store=store_path, zarr_format=zarr_format, **kwargs)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/asynchronous.py:868, in open_array(store, zarr_version, zarr_format, path, **kwargs)
    865         raise e
    867 # if array was not found, create it
--> 868 return await create(store=store, path=path, zarr_format=zarr_format, **kwargs)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/api/asynchronous.py:686, in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, dimension_separator, write_empty_chunks, zarr_version, zarr_format, meta_array, attributes, chunk_shape, chunk_key_encoding, codecs, dimension_names, **kwargs)
    683 if path is not None:
    684     store_path = store_path / path
--> 686 return await AsyncArray.create(
    687     store_path,
    688     shape=shape,
    689     chunks=chunks,
    690     dtype=dtype,
    691     compressor=compressor,
    692     fill_value=fill_value,
    693     exists_ok=overwrite,  # TODO: name change
    694     filters=filters,
    695     dimension_separator=dimension_separator,
    696     zarr_format=zarr_format,
    697     chunk_shape=chunk_shape,
    698     chunk_key_encoding=chunk_key_encoding,
    699     codecs=codecs,
    700     dimension_names=dimension_names,
    701     attributes=attributes,
    702     **kwargs,
    703 )

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/array.py:153, in AsyncArray.create(cls, store, shape, dtype, zarr_format, fill_value, attributes, chunk_shape, chunk_key_encoding, codecs, dimension_names, chunks, dimension_separator, order, filters, compressor, exists_ok)
    149     if compressor is not None:
    150         raise ValueError(
    151             "compressor cannot be used for arrays with version 3. Use bytes-to-bytes codecs instead."
    152         )
--> 153     return await cls._create_v3(
    154         store_path,
    155         shape=shape,
    156         dtype=dtype,
    157         chunk_shape=chunk_shape,
    158         fill_value=fill_value,
    159         chunk_key_encoding=chunk_key_encoding,
    160         codecs=codecs,
    161         dimension_names=dimension_names,
    162         attributes=attributes,
    163         exists_ok=exists_ok,
    164     )
    165 elif zarr_format == 2:
    166     if codecs is not None:

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/array.py:237, in AsyncArray._create_v3(cls, store_path, shape, dtype, chunk_shape, fill_value, chunk_key_encoding, codecs, dimension_names, attributes, exists_ok)
    227 if isinstance(chunk_key_encoding, tuple):
    228     chunk_key_encoding = (
    229         V2ChunkKeyEncoding(separator=chunk_key_encoding[1])
    230         if chunk_key_encoding[0] == "v2"
    231         else DefaultChunkKeyEncoding(separator=chunk_key_encoding[1])
    232     )
    234 metadata = ArrayV3Metadata(
    235     shape=shape,
    236     data_type=dtype,
--> 237     chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape),
    238     chunk_key_encoding=chunk_key_encoding,
    239     fill_value=fill_value,
    240     codecs=codecs,
    241     dimension_names=tuple(dimension_names) if dimension_names else None,
    242     attributes=attributes or {},
    243 )
    245 array = cls(metadata=metadata, store_path=store_path)
    247 await array._save_metadata(metadata)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/chunk_grids.py:51, in RegularChunkGrid.__init__(self, chunk_shape)
     50 def __init__(self, *, chunk_shape: ChunkCoordsLike) -> None:
---> 51     chunk_shape_parsed = parse_shapelike(chunk_shape)
     53     object.__setattr__(self, "chunk_shape", chunk_shape_parsed)

File ~/code/test/.venv/lib/python3.10/site-packages/zarr/common.py:145, in parse_shapelike(data)
    143 data_tuple = tuple(data)
    144 if len(data_tuple) == 0:
--> 145     raise ValueError("Expected at least one element. Got 0.")
    146 if not all(isinstance(v, int) for v in data_tuple):
    147     msg = f"Expected an iterable of integers. Got {type(data)} instead."

Maybe it will be fixed? https://github.com/zarr-developers/zarr-python/issues/1977


Setting these as loadable_variables did work; I think we can live with that for now.

Should VirtualiZarr automatically load scalar variables? I think it would be a nice feature for users, and it would remove the need to pass a long list of variables to the open_virtual_dataset call.
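A minimal sketch of how that auto-detection could look; the helper name and the shape-mapping input are hypothetical, standing in for whatever the reader learns from the NetCDF header:

```python
# Hypothetical helper: given variable names mapped to their shapes (as parsed
# from a NetCDF header), return the scalar ones to treat as loadable_variables.
def scalar_variables(shapes: dict) -> list:
    return [name for name, shape in shapes.items() if shape == ()]


shapes = {
    "CMI": (5424, 5424),
    "goes_imager_projection": (),  # scalar: attributes only, no chunks
    "t": (),
}
print(scalar_variables(shapes))  # ['goes_imager_projection', 't']
```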

TomNicholas commented 1 month ago

Oh interesting, thanks for finding that @ghidalgo3 .

Scalar is a good word for it though.

Are these scalar though? By scalar I meant something with a single element, but these have zero elements!

Should VirtualiZarr automatically load scalar variables?

I think ideally we would not automatically choose to do that and instead 0-size zarr arrays would be supported upstream, but I can see why it would be nice for this edge case to just automatically work.

If validate_chunk_shape skipped validating empty dicts

This suggestion seems okay though? It's in line with the behaviour of v2, and with what v3 is supposed to do.

t-mcneely commented 1 month ago

Hi @TomNicholas! Just adding onto @ghidalgo3's GOES example: an empty "scalar" can contain critical attributes for projections, as defined here. The variable of interest is goes_imager_projection, which is used to map latitude/longitude for some GOES files.

import fsspec
import xarray as xr

fs = fsspec.filesystem("https")
ds = xr.open_dataset(
    fs.open('https://goeseuwest.blob.core.windows.net/noaa-goes16/ABI-L2-MCMIPF/2019/244/12/OR_ABI-L2-MCMIPF-M6_G16_s20192441200194_e20192441209513_c20192441210005.nc')
)
ds['goes_imager_projection']

The projection information is stored in the attributes of this dummy variable. However, opening without including this in loadable_variables means those attributes are lost when the empty scalar is dropped.

As above, setting it as a loadable variable works fine:

import virtualizarr as vz

vds_loadable = vz.open_virtual_dataset(
    'https://goeseuwest.blob.core.windows.net/noaa-goes16/ABI-L2-MCMIPF/2019/244/12/OR_ABI-L2-MCMIPF-M6_G16_s20192441200194_e20192441209513_c20192441210005.nc',
    reader_options={},
    indexes={},
    loadable_variables=['goes_imager_projection'],
)
vds_loadable['goes_imager_projection']

If these aren't loaded, then downstream projection operations have to go back to the original NetCDF files to recover the dropped projection information. This was a befuddling issue I ran into when trying to work with the kerchunk stores written from the virtual dataset, so I wanted to add it as a concrete example where automatically loading such scalars (or at least warning that they're being ignored) might be a preferable behavior.

TomNicholas commented 1 month ago

Thanks for your input @t-mcneely! So this is another vote in favour of automatically loading "scalar" variables?

t-mcneely commented 1 month ago

That'd be my personal preference, but in lieu of that I think printing a warning or similar when variables are automatically dropped would be a helpful behavior.
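A sketch of what that warning could look like; this is a hypothetical helper illustrating the suggestion, not current VirtualiZarr behaviour:

```python
import warnings


# Hypothetical: called after the reader decides which variables it cannot
# virtualize, so the drop is no longer silent.
def warn_dropped_scalars(dropped):
    if dropped:
        warnings.warn(
            f"Dropping scalar variables {sorted(dropped)}; pass them via "
            "loadable_variables to preserve their attributes",
            UserWarning,
        )


warn_dropped_scalars(["goes_imager_projection"])  # emits a UserWarning
warn_dropped_scalars([])                          # silent when nothing dropped
```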

ghidalgo3 commented 1 month ago

Me too, and I'd be happy to implement this in a PR unless there is a good reason not to load scalar values automatically.

TomNicholas commented 1 month ago

unless there is a good reason not to load scalar values automatically.

I guess the only reason not to is that once https://github.com/zarr-developers/zarr-python/issues/1977 is fixed (as it should be, else it's a breaking change between v2 and v3), we will no longer need to load scalar values automatically, so we would be adding something to VirtualiZarr that we know we should remove later.

I think that's okay though - if you're happy to add a PR @ghidalgo3 we can think about what scenarios should cause a warning (or deprecation warning).

TomNicholas commented 1 month ago

Also unrelated but FYI: https://github.com/zarr-developers/VirtualiZarr/discussions/202

ghidalgo3 commented 1 month ago

Wish I could have attended, but I had a conflict :(. I'll try to join the next one.

I will tackle loading these scalars this week, because I think many of the NetCDF datasets on https://planetarycomputer.microsoft.com/ are going to have variables like this.