zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0
124 stars 24 forks

Store test datasets in repo #235

Open norlandrhagen opened 3 months ago

norlandrhagen commented 3 months ago

Adds a way to build test netCDF files and store them within the repo.

norlandrhagen commented 3 months ago

This test test_numpy_arrays_to_inlined_kerchunk_refs is failing.

refs["refs"]["lon/0"]
'base64:AABIQwCASkMAAE1DAIBPQwAAUkMAgFRDAABXQwCAWUMAAFxDAIBeQwAAYUMAgGNDAABmQwCAaEMAAGtDAIBtQwAAcEMAgHJDAAB1QwCAd0MAAHpDAIB8QwAAf0MAwIBDAACCQwBAg0MAgIRDAMCFQwAAh0MAQIhDAICJQwDAikMAAIxDAECNQwCAjkMAwI9DAACRQwBAkkMAgJNDAMCUQwAAlkMAQJdDAICYQwDAmUMAAJtDAECcQwCAnUMAwJ5DAACgQwBAoUMAgKJDAMCjQwAApUM='
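For readers unfamiliar with inlined references: the string above is the raw chunk data, base64-encoded and stored directly in the reference dict instead of pointing at a byte range in a file. A minimal sketch of decoding it, using only the first 16 characters of the payload and assuming the values are little-endian float32 (the dtype is an assumption, not stated in the output above):

```python
import base64
import struct

# First 16 characters of the payload shown above, with the "base64:" prefix removed.
prefix = "AABIQwCASkMAAE1D"

raw = base64.b64decode(prefix)  # 16 base64 characters decode to 12 bytes

# Assumption: the chunk holds little-endian float32 values (4 bytes each).
values = list(struct.unpack("<3f", raw))
print(values)  # [200.0, 202.5, 205.0]
```

The decoded numbers look like the start of a longitude coordinate, which is consistent with the lon/0 key.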

TomNicholas commented 3 months ago

Great idea! We could presumably make the file even smaller by downsampling spatially.

The failure shows that the variable is no longer being inlined when it previously was. This makes sense: which variables are inlined in the test is set by kerchunk's inline_threshold kwarg. These are then compared against the variables manually specified with loadable_variables. The parameter values (500.0 etc.) were just chosen to be bigger or smaller than certain variables in the Xarray test dataset. Now that you've changed those variables they will be different sizes, so some will be inlined that were not previously. It's a janky setup, but I wasn't sure how to do it more neatly because that's the only way kerchunk allows you to control inlining.
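To make the size dependence concrete, here is a rough sketch of threshold-based inlining. It is not kerchunk's implementation: the function name, the variable names, the byte sizes, and the strict "smaller than" comparison are all assumptions made for illustration.

```python
def inlined_variables(sizes_in_bytes, inline_threshold):
    """Return the names of variables small enough to be inlined.

    Assumption: a variable is inlined when its size is strictly
    below the threshold. The real rule in kerchunk may differ.
    """
    return sorted(
        name for name, size in sizes_in_bytes.items() if size < inline_threshold
    )


# Hypothetical sizes for an original and a shrunken test dataset.
original = {"small": 100, "medium": 450, "large": 10_000}
shrunken = {"small": 100, "medium": 450, "large": 400}

before = inlined_variables(original, inline_threshold=500)
after = inlined_variables(shrunken, inline_threshold=500)

print(before)  # ['medium', 'small']
print(after)   # ['large', 'medium', 'small']
```

With the threshold held fixed, changing the dataset alone changes which variables end up inlined, so a hard-coded threshold has to be re-tuned whenever the test data changes.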

norlandrhagen commented 3 months ago

Thanks! Although, I can't take credit for the idea haha.

Late night head scratcher, I can't seem to find the right inline_threshold to satisfy both the lat and the time assert statements. 😕

# for loadable_vars = ['lat','lon']
# lat comparison fails below 101 inline_threshold
# time comparison fails above 16 inline_threshold
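Read literally, the two comments above describe constraints that cannot both hold. A small sketch, assuming integer thresholds and treating both boundaries as inclusive (the exact boundary behaviour is an assumption):

```python
LAT_MIN = 101   # lat comparison fails below this threshold
TIME_MAX = 16   # time comparison fails above this threshold

# Scan an arbitrary range of candidate thresholds for one that passes both checks.
candidates = range(0, 1001)
passing = [t for t in candidates if LAT_MIN <= t <= TIME_MAX]

print(passing)  # []
```

Since the lower bound is above the upper bound, no single inline_threshold can satisfy both assertions for this dataset.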

norlandrhagen commented 1 month ago

mypy errors seem to be unrelated: https://github.com/zarr-developers/VirtualiZarr/issues/249

TomNicholas commented 1 month ago

Looking at this again, I think storing the datasets as files is great for roundtrip tests, but we should also strive to make other tests start from a point that doesn't require reading netCDF (and hence doesn't rely on kerchunk). That could be either in-memory kerchunk references, or on-disk kerchunk references that we then interpret using #251, or maybe even something simpler in some cases.
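As a rough illustration of what in-memory kerchunk references could look like as a test starting point, the sketch below builds a reference dict by hand using only the standard library. The layout follows my understanding of the version 1 kerchunk reference format with Zarr v2 metadata; the array contents and attribute values are made up for the example and do not come from the test suite.

```python
import base64
import json
import struct

# Hypothetical 3-element float32 coordinate.
values = [200.0, 202.5, 205.0]
raw = struct.pack("<3f", *values)

# Zarr v2 array metadata, stored as a JSON string in the references.
zarray = {
    "chunks": [3],
    "compressor": None,
    "dtype": "<f4",
    "fill_value": None,
    "filters": None,
    "order": "C",
    "shape": [3],
    "zarr_format": 2,
}

refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "lon/.zarray": json.dumps(zarray),
        "lon/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["lon"]}),
        # Inlined chunk: the data itself, base64-encoded, with no file involved.
        "lon/0": "base64:" + base64.b64encode(raw).decode("ascii"),
    },
}

print(refs["refs"]["lon/0"])  # base64:AABIQwCASkMAAE1D
```

A fixture along these lines never touches a netCDF file, so a test built on it would not need kerchunk to generate its input.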