Open maxrjones opened 6 months ago
I think this is a good idea, and one I already had a mental issue open for anyway 😅 One neat thing this would allow is testing writing to zarr stores as manifests by round-tripping.
> Kerchunk's `single_zarr`
We could use this, and it would certainly be the quickest way, but perhaps we might be better off just writing the function ourselves? Otherwise we would be starting with a zarr store, opening it using a dependency (kerchunk and fsspec), getting the result back as a kerchunk reference dict, then immediately converting that reference dict to `ManifestArray`s. Maybe we should just skip kerchunk, read the zarr json manually, and create the `ManifestArray`s immediately.
I suspect also that if we do it that way (the "zarr-native" way), we might later find that either we can import and use code from zarr-python, or zarr-python can take inspiration from code we write here.
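To make the "zarr-native" idea concrete, here is a minimal sketch of reading a v2 store's `.zarray` plus a directory listing into a chunk-manifest dict. This is illustrative only: it assumes a local filesystem store, and the `manifest_from_zarr_v2` helper and its entry shape (`path`/`offset`/`length`) are hypothetical, not VirtualiZarr's actual API.

```python
import json
import os
import tempfile

# Build a tiny fake Zarr v2 array store on disk so the sketch is runnable.
store = tempfile.mkdtemp()
meta = {"shape": [4, 4], "chunks": [2, 2], "dtype": "<i4",
        "compressor": None, "fill_value": 0, "order": "C",
        "zarr_format": 2, "filters": None}
with open(os.path.join(store, ".zarray"), "w") as f:
    json.dump(meta, f)
for key in ["0.0", "1.1"]:  # only two of the four possible chunks are written
    with open(os.path.join(store, key), "wb") as f:
        f.write(b"\x00" * 16)

def manifest_from_zarr_v2(store_path):
    """Read .zarray plus a store listing into a chunk-manifest dict."""
    with open(os.path.join(store_path, ".zarray")) as f:
        zarray = json.load(f)
    manifest = {}
    for key in os.listdir(store_path):
        if key.startswith("."):  # skip .zarray / .zattrs metadata documents
            continue
        path = os.path.join(store_path, key)
        # Each entry records where the chunk's bytes live, as a byte range.
        manifest[key] = {"path": path, "offset": 0,
                         "length": os.path.getsize(path)}
    return zarray, manifest

zarray, manifest = manifest_from_zarr_v2(store)
print(sorted(manifest))  # only the initialized chunks appear
```

Note that uncompressed chunk files map cleanly to whole-file byte ranges; a real implementation would also carry the codec metadata through so the chunks can be decoded on read.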
Actually wait, I think I misunderstood what you were suggesting @maxrjones. There are two types of Zarr stores we could read byte ranges from:
1) Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal zarr stores). These we could and should read using kerchunk's `single_zarr` as you suggested (although maybe we could change the implementation in the future).
2) Zarr stores containing `manifest.json` files. This is what my comment above was referring to. This would be the inverse operation to `vds.virtualize.to_zarr()`.
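For case (2), the inverse operation is essentially deserialization: parse each `manifest.json` back into per-chunk byte-range entries and hand them to the in-memory manifest class. The JSON schema below (chunk key mapped to `path`/`offset`/`length`) is an assumption for illustration, as are the example paths and numbers.

```python
import json

# Hypothetical manifest.json contents: chunk key -> byte range in some file.
manifest_json = """
{
  "0.0": {"path": "s3://bucket/file.nc", "offset": 100, "length": 500},
  "0.1": {"path": "s3://bucket/file.nc", "offset": 600, "length": 500}
}
"""

entries = json.loads(manifest_json)

# Reconstructing ManifestArrays would just mean handing these entries back
# to the in-memory manifest class, i.e. the inverse of writing them out.
for key, entry in entries.items():
    byte_range = (entry["offset"], entry["offset"] + entry["length"])
    print(key, entry["path"], byte_range)
```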
> Maybe we should just skip kerchunk and read the zarr json manually and create the `ManifestArray`s immediately.
When you say "read the zarr json manually" do you mean using zarr-python? I'm not sure how just loading the `.zarray` (V2) or `zarr.json` (V3) would work, because it doesn't tell you which chunks are initialized.
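To spell out that gap: the array metadata only determines how many chunk keys *could* exist, not which ones were ever written (uninitialized chunks are simply absent from the store, and readers fall back to `fill_value`), so a reader still has to list the store's keys. A small pure-Python illustration with made-up numbers:

```python
import math

# From a hypothetical .zarray / zarr.json document:
shape, chunks = [10, 10], [4, 4]

# The metadata tells you the full chunk grid...
grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
possible_keys = {f"{i}.{j}" for i in range(grid[0]) for j in range(grid[1])}
print(len(possible_keys))  # 9 possible chunk keys (3 x 3 grid)

# ...but not which chunks were actually written. The only way to know is
# to list the store and intersect with the possible keys:
keys_actually_in_store = {"0.0", "0.1", "2.2"}  # e.g. from a store listing
initialized = possible_keys & keys_actually_in_store
print(sorted(initialized))
```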
Oops, didn't notice your comment before posting my last response.
> Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal zarr stores). These we could and should read using kerchunk's `single_zarr` as you suggested (although maybe we could change the implementation in the future).
I was referring to this case, which seems simpler to implement right now. Although (2) is also important.
> I was referring to this case, which seems simpler to implement right now.
Yep! If you want to make a PR for case (1) then go for it!
> Although (2) is also important.
Yeah, but maybe I'll just add that in as part of #45.
> allowing manifests to replace consolidated metadata for V3.
@maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.
> allowing manifests to replace consolidated metadata for V3.
>
> @maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.
In my mind the datasets containing multiple manifests served the same purpose as consolidated metadata, with the added bonus of including key mappings for individual arrays, and so `dataset.to_dict(data=False)` would be a version of the manifests that accomplishes the same thing as consolidated metadata and could be used for V3. But I see now that you're talking about manifests only as the virtual representation of a single array.
Thanks for this package! Just want to add I'm interested in this.
Started some work on kerchunk to make `ZarrToZarr` have parity with `SingleHdf5ToZarr` (https://github.com/fsspec/kerchunk/pull/442), and I thought this package could help in the interim.
@norlandrhagen and I were discussing creating virtual datasets from zarr stores earlier today (placeholder already in `_automatically_determine_filetype`). @TomNicholas what are your thoughts on trying out Kerchunk's `single_zarr` for this purpose? I think this could be a helpful step towards virtual concatenation of Zarr stores and allowing manifests to replace consolidated metadata for V3.