zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

Virtual datasets from Zarr stores #63

Open maxrjones opened 6 months ago

maxrjones commented 6 months ago

@norlandrhagen and I were discussing creating virtual datasets from zarr stores earlier today (placeholder already in _automatically_determine_filetype). @TomNicholas what are your thoughts on trying out Kerchunk's single_zarr for this purpose? I think this could be a helpful step towards virtual concatenation of Zarr stores and allowing manifests to replace consolidated metadata for V3.

TomNicholas commented 6 months ago

I think this is a good idea; I had a mental issue open for it already anyway 😅 One neat thing this would allow is testing writing to zarr stores as manifests by round-tripping.

Kerchunk's single_zarr

We could use this, and it would certainly be the quickest way, but perhaps we would be better off just writing the function ourselves? Otherwise we would start with a zarr store, open it using a dependency (kerchunk and fsspec), get the results back as a kerchunk reference dict, then immediately convert that reference dict to ManifestArrays. Maybe we should just skip kerchunk, read the zarr JSON manually, and create the ManifestArrays directly.
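As a rough sketch of that "zarr-native" route for a local directory store (all names here are hypothetical, only the Zarr v2 layout is assumed, and a real implementation would go through a store abstraction rather than `os.listdir`):

```python
import json
import os

def build_chunk_manifest(store_path: str, array_name: str) -> tuple[dict, dict]:
    """Read a Zarr v2 array's .zarray metadata and map each initialized
    chunk file to a (path, offset, length) entry, roughly the shape a
    ManifestArray-style manifest takes. Illustrative only."""
    array_dir = os.path.join(store_path, array_name)
    with open(os.path.join(array_dir, ".zarray")) as f:
        zarray = json.load(f)

    manifest = {}
    for fname in os.listdir(array_dir):
        if fname.startswith("."):  # skip .zarray / .zattrs metadata files
            continue
        full = os.path.join(array_dir, fname)
        manifest[fname] = {
            "path": full,
            "offset": 0,  # in a plain Zarr store, each file is one whole chunk
            "length": os.path.getsize(full),
        }
    return zarray, manifest
```

The key point is that no kerchunk reference dict ever exists as an intermediate: the `.zarray` JSON gives the array metadata, and the store listing gives the chunk entries.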

I suspect also that if we do it that way (the "zarr-native" way), we might later find that either we can import and use code from zarr-python, or zarr-python can take inspiration from code we write here.

TomNicholas commented 6 months ago

Actually wait I think I misunderstood what you were suggesting @maxrjones . There are two types of Zarr stores we could read byte ranges from:

1) Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal zarr stores). These we could and should read using kerchunk's single_zarr as you suggested (although maybe we could change the implementation in the future).

2) Zarr stores containing manifest.json files. This is what my comment above was referring to. This would be the inverse operation to vds.virtualize.to_zarr().
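For case (2), a rough sketch of what parsing such a manifest.json back might look like (the file layout here is an assumption for illustration, not VirtualiZarr's actual on-disk format):

```python
import json

def load_manifest(text: str) -> dict:
    """Parse a hypothetical manifest.json string, assumed to hold one
    entry per chunk key with path/offset/length fields, into
    {chunk_key: (path, offset, length)}."""
    raw = json.loads(text)
    return {k: (v["path"], v["offset"], v["length"]) for k, v in raw.items()}
```

Reading such a file back into ManifestArrays would be the inverse of vds.virtualize.to_zarr(): no chunk data is touched, only the byte-range references.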

maxrjones commented 6 months ago

Maybe we should just skip kerchunk and read the zarr json manually and create the ManifestArrays immediately.

When you say "read the zarr json manually" do you mean using zarr-python? I'm not sure how just loading the .zarray (V2) or zarr.json (V3) would work because it doesn't tell you which chunks are initialized.
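To illustrate the gap: the array metadata only determines which chunk keys are possible; finding the ones that are actually initialized still requires listing the store. A hypothetical helper, assuming Zarr v2 key conventions (a "." dimension separator):

```python
import itertools
import math

def expected_chunk_keys(shape, chunks, dim_sep="."):
    """Yield every chunk key a Zarr v2 array *could* have given its
    shape and chunk sizes. Which of these actually exist can only be
    determined by listing the store, which is the gap noted above."""
    grid = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    for idx in itertools.product(*(range(n) for n in grid)):
        yield dim_sep.join(map(str, idx))
```

Intersecting this set with the store's actual keys would give the initialized chunks, so the metadata file alone is indeed not enough.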

maxrjones commented 6 months ago

Oops, didn't notice your comment before posting my last response.

Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal zarr stores). These we could and should read using kerchunk's single_zarr as you suggested (although maybe we could change the implementation in the future.)

I was referring to this case, which seems simpler to implement right now. Although (2) is also important.

TomNicholas commented 6 months ago

I was referring to this case, which seems like simpler to implement right now.

Yep! If you want to make a PR for case (1) then go for it!

Although (2) is also important.

Yeah, but maybe I'll just add that in as part of #45.

jhamman commented 6 months ago

allowing manifests to replace consolidated metadata for V3.

@maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.

maxrjones commented 6 months ago

allowing manifests to replace consolidated metadata for V3.

@maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.

In my mind, a dataset containing multiple manifests served the same purpose as consolidated metadata, with the added bonus of including key mappings for individual arrays, so dataset.to_dict(data=False) would be a version of the manifests that accomplishes the same thing as consolidated metadata and could be used for V3. But I see now that you're talking about manifests only as the virtual representation of a single array.
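A minimal illustration of that idea using xarray's public to_dict API (the Dataset here is just a toy example):

```python
import numpy as np
import xarray as xr

# to_dict(data=False) returns only the schema of the dataset
# (dims, dtypes, shapes, attrs) with no array values, which is the
# sense in which it could stand in for consolidated metadata.
ds = xr.Dataset({"temperature": (("x",), np.arange(4.0))})
schema = ds.to_dict(data=False)
```

A manifest-bearing version of this schema would additionally carry the chunk key mappings per array, which consolidated metadata does not.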

raybellwaves commented 6 months ago

Thanks for this package! Just want to add that I'm interested in this.

Started some work on kerchunk to give ZarrToZarr parity with SingleHdf5ToZarr (https://github.com/fsspec/kerchunk/pull/442), and I thought this package could help in the interim.