zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

Generating references without kerchunk #78

Open TomNicholas opened 7 months ago

TomNicholas commented 7 months ago

VirtualiZarr + zarr chunk manifests re-implement so much of kerchunk that the only part left is kerchunk's backends - the part that actually generates the byte ranges from a given legacy file. It's interesting to imagine whether we could make virtualizarr work without using kerchunk or fsspec at all.

https://github.com/TomNicholas/VirtualiZarr/issues/61#issuecomment-2047826810 discusses how the rust object-store crate might allow us to read actual bytes from zarr v3 stores with chunk manifests over S3, without using fsspec.

The other place we use fsspec (+ kerchunk) is to generate the references in the first place. But can we imagine alternative implementations for generating that byte range information?
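For concreteness, the byte-range information in question is just a mapping from each zarr chunk key to a location inside an archival file. A minimal sketch (the filename, offsets, and lengths here are made up for illustration):

```python
# Sketch of the chunk-manifest idea: each zarr chunk key maps to a byte
# range inside some archival file. All values below are invented examples.
manifest = {
    "0.0": {"path": "s3://bucket/data.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/data.nc", "offset": 200, "length": 100},
    "1.0": {"path": "s3://bucket/data.nc", "offset": 300, "length": 100},
    "1.1": {"path": "s3://bucket/data.nc", "offset": 400, "length": 100},
}

def byte_range_for(key):
    """Look up where a given chunk's bytes live, without reading any data."""
    entry = manifest[key]
    return entry["path"], entry["offset"], entry["length"]
```

Generating these references is purely a metadata operation; no chunk data ever needs to be read.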

Arguments for doing this without using kerchunk + fsspec are essentially:

TomNicholas commented 7 months ago

From @sharkinsspatial:

I just had a quick look at using hidefix to support ChunkManifest generation without kerchunk. Digging in a bit more, IIUC hidefix is still using the core HDF5 library to generate an Index object, which can then be used to bypass the HDF5 concurrency limitations. As we are only iterating over the chunk offsets and not actually reading any data, this doesn't provide us any advantage over h5py. Given that, I'll try to focus on just creating an h5py based ChunkManifest lib for the near term.

sharkinsspatial commented 7 months ago

I took a quick first look at VirtualiZarr last night (amazing 🎊 , thank you for pushing this forward). I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating ChunkManifests.

A few questions for creating a PR for this.

  1. kerchunk's SingleHdf5ToZarr will generate corresponding Zarr groups for HDF5 groups in a file. I may be misunderstanding the open_virtual_dataset logic, but it is not currently attempting to build Dataset containers representing nested HDF5 groups in a file, correct? If I am understanding this correctly, is this something we do want to support, or should variables in nested HDF5 groups be flattened?
  2. As this PR would only replace kerchunk use for ChunkManifest generation for HDF5 files, we would still require the existing logic for using KerchunkStoreRefs for other formats. What would be the least intrusive way to incorporate this? Some format-specific branching logic directly in open_virtual_dataset?
TomNicholas commented 7 months ago

I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating ChunkManifests.

That would be awesome, thanks @sharkinsspatial ❤️ . I think whilst we're doing this we should try hard to improve test coverage and our understanding of behaviour in nasty cases. I'm thinking about #38 in particular.

is this something we do want to support, or should variables in nested HDF5 groups be flattened?

This is totally equivalent to the xarray.Dataset vs xarray.DataTree correspondence. We should actually add an open_virtual_datatree function which opens all the groups, and add an (optional) group kwarg to open_virtual_dataset. I'll make a new issue for groups now. (See also https://github.com/TomNicholas/VirtualiZarr/issues/11)

What would be the least intrusive way to incorporate this? Some format-specific branching logic directly in open_virtual_dataset?

I think what you're suggesting is the least intrusive way. The main thing is to keep code that actually depends on the kerchunk library isolated from the rest of the code base. I would perhaps make a new readers directory or something to distinguish it from the kerchunk.py wrapper of the kerchunk backends.
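To make the shape of that branching concrete, here is a minimal sketch (names and suffix table are assumptions, not VirtualiZarr's actual layout): dispatch on file format so the kerchunk-dependent readers stay isolated in their own module.

```python
# Sketch (assumption): format-based dispatch so that only non-HDF5 formats
# fall through to the kerchunk-wrapping code path.
from pathlib import Path

def infer_format(filepath: str) -> str:
    """Guess the file format from its suffix; default to the kerchunk path."""
    suffix = Path(filepath).suffix.lower()
    return {
        ".h5": "hdf5",
        ".hdf5": "hdf5",
        ".nc": "hdf5",   # assumes netCDF4-on-HDF5; netCDF3 would need its own branch
        ".grib2": "grib",
        ".tif": "tiff",
    }.get(suffix, "kerchunk")

def choose_reader(filepath: str) -> str:
    """Pick a reader name: the native h5py-based one, or the kerchunk fallback."""
    if infer_format(filepath) == "hdf5":
        return "native_hdf5_reader"   # no kerchunk dependency
    return "kerchunk_reader"          # existing KerchunkStoreRefs logic
```

Keeping the mapping in one place means open_virtual_dataset itself never imports kerchunk directly.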