Open · TomNicholas opened 7 months ago
From @sharkinsspatial:

> I just had a quick look at using `hidefix` to support `ChunkManifest` generation without `kerchunk`. Digging in a bit more detail, IIUC `hidefix` is still using the core `HDF5` library to generate an `Index` object which can then be used to bypass the `HDF5` concurrency limitations. As we are only iterating over the chunk offsets and not actually reading any data, this doesn't provide us any advantage over `h5py`. Given that, I'll try to focus on just creating an `h5py`-based `ChunkManifest` lib for the near term.
>
> I took a quick, first look at VirtualiZarr last night (amazing 🎊, thank you for pushing this forward). I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating `ChunkManifest`s.
> A few questions for creating a PR for this:
>
> 1. `kerchunk`'s `SingleHdf5ToZarr` will generate corresponding Zarr groups for HDF5 groups in a file. I may be misunderstanding the `open_virtual_dataset` logic, but it is not currently attempting to build `Dataset` containers representing nested HDF5 groups in a file, correct? If I am understanding this correctly, is this something we do want to support, or should variables in nested HDF5 groups be flattened?
> 2. If we replace the `kerchunk` use for `ChunkManifest` generation for HDF5 files, we would still require the existing logic for using `KerchunkStoreRefs` for other formats. What would be the least intrusive way to incorporate this, some format-specific branching logic directly in `open_virtual_dataset`?

> I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating `ChunkManifest`s.
That would be awesome, thanks @sharkinsspatial ❤️ . I think whilst we're doing this we should try hard here to improve test coverage and understanding of behaviour in nasty cases. I'm thinking about #38 in particular.
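For illustration, a minimal sketch of the `h5py`-based approach described above: the low-level dataset API exposes each chunk's byte offset and size without reading any chunk data (this requires a reasonably recent h5py/HDF5; the manifest layout below is illustrative, not VirtualiZarr's actual format):

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small chunked HDF5 file to demonstrate (paths/names are illustrative).
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("temp", data=np.arange(100.0), chunks=(25,))

manifest = {}
with h5py.File(path, "r") as f:
    dset = f["temp"]
    dsid = dset.id  # low-level h5py.h5d.DatasetID
    # get_num_chunks / get_chunk_info expose each chunk's byte offset and
    # size without reading any data -- all that's needed for a manifest.
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)  # StoreInfo: chunk_offset, byte_offset, size
        key = ".".join(str(c // s) for c, s in zip(info.chunk_offset, dset.chunks))
        manifest[key] = {"path": path, "offset": info.byte_offset, "length": info.size}

print(sorted(manifest))
```

Note that, as pointed out above, this still goes through the HDF5 C library under the hood, but it touches only chunk metadata, never chunk data.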
> is this something we do want to support or should variables in nested HDF5 groups be flattened?
This is totally equivalent to the `xarray.Dataset` vs `xarray.DataTree` correspondence. We should actually add an `open_virtual_datatree` function which opens all the groups, and add an (optional) `group` kwarg to `open_virtual_dataset`. I'll make a new issue for groups now. (See also https://github.com/TomNicholas/VirtualiZarr/issues/11)
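To make the proposal concrete, here is a sketch of what a hypothetical `group` kwarg could do for an HDF5 file: pick up only the variables directly under the requested group, matching the one-`Dataset`-per-group model (an `open_virtual_datatree` would instead walk every group). The function and file names are made up for illustration:

```python
import os
import tempfile

import h5py
import numpy as np

# Create a file with one root-level variable and one nested-group variable.
path = os.path.join(tempfile.mkdtemp(), "nested.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("root_var", data=np.zeros(2))
    f.create_dataset("inner/group_var", data=np.zeros(3))

def variables_in_group(f, group="/"):
    """Names of datasets directly under `group` (no recursion into subgroups)."""
    return [name for name, item in f[group].items() if isinstance(item, h5py.Dataset)]

with h5py.File(path, "r") as f:
    root_vars = variables_in_group(f)           # only root-level datasets
    inner_vars = variables_in_group(f, "inner") # only the nested group's datasets

print(root_vars, inner_vars)
```

Selecting one group per call sidesteps the flattening question entirely: nothing is flattened, and a datatree opener can compose this per-group logic.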
> What would be the least intrusive way to incorporate this, some format-specific branching logic directly in `open_virtual_dataset`?
I think what you're suggesting is the least intrusive way. The main thing is to keep code that actually depends on the `kerchunk` library isolated from the rest of the code base. I would perhaps make a new `readers` directory or something to distinguish it from the `kerchunk.py` wrapper of the kerchunk backends.
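A hypothetical sketch of that branching, assuming a suffix-based dispatch (real code might sniff magic bytes instead, since e.g. `.nc` can also be netCDF3; all names here are illustrative, not VirtualiZarr's actual API):

```python
from pathlib import Path

# HDF5/netCDF4 files go to a dedicated h5py-based reader; everything else
# falls back to the existing kerchunk-based KerchunkStoreRefs path.
HDF5_SUFFIXES = {".h5", ".hdf5", ".nc", ".nc4"}

def choose_reader(filepath: str) -> str:
    if Path(filepath).suffix.lower() in HDF5_SUFFIXES:
        return "hdf5"      # would call a new readers/ implementation
    return "kerchunk"      # would fall back to the kerchunk wrapper

print(choose_reader("data/ocean.nc"), choose_reader("data/model.grib2"))
```

Keeping the dispatch in one place means `open_virtual_dataset`'s signature stays unchanged while the kerchunk dependency becomes optional for HDF5 inputs.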
VirtualiZarr + zarr chunk manifests re-implement so much of kerchunk that the only part left is kerchunk's backends - the part that actually generates the byte ranges from a given legacy file. It's interesting to imagine whether we could make virtualizarr work without using kerchunk or fsspec at all.
https://github.com/TomNicholas/VirtualiZarr/issues/61#issuecomment-2047826810 discusses how the Rust `object-store` crate might allow us to read actual bytes from Zarr v3 stores with chunk manifests over S3, without using fsspec.

The other place we use fsspec (+ kerchunk) is to generate the references in the first place. But can we imagine alternative implementations for generating that byte range information?
Arguments for doing this without using kerchunk + fsspec are essentially: