
Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

TomNicholas commented 3 months ago

The reason I made this package was to handle one particularly challenging use case - the [C]Worthy mCDR Atlas - which I still haven't actually virtualized. Once it's done I plan to write a blog post about it, and maybe add it as a usage example to this repository.
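For concreteness, "virtualizing" here means something like the following sketch, using VirtualiZarr's `open_virtual_dataset` and the `.virtualize.to_kerchunk` accessor (the file paths are placeholders, and the exact kwargs needed may differ between versions):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Open each archival netCDF file as a "virtual" dataset: byte ranges
# pointing at chunks on disk, rather than loaded array data.
# indexes={} avoids building in-memory pandas indexes for coordinates.
vds_list = [
    open_virtual_dataset(path, indexes={})
    for path in ["file1.nc", "file2.nc"]  # placeholder paths
]

# Combine the virtual datasets with ordinary xarray syntax.
combined = xr.concat(vds_list, dim="time", coords="minimal", compat="override")

# Persist the combined references as a kerchunk-style sidecar file.
combined.virtualize.to_kerchunk("combined.json", format="json")
```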

This dataset has some characteristics that make it really challenging to kerchunk/virtualize[^1]:

This dataset is therefore comparable to some of the largest datasets already available in Zarr (at least in terms of the number of chunks and variables, if not on-disk size), and is very similar to the pathological case described in #104.

24MB per array means that even a really big store with 100 variables, each with a million chunks, still only takes up 100 × 24MB = 2.4GB in memory - i.e. the xarray "virtual" dataset representing the entire store would be ~2.4GB.
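As a back-of-the-envelope check (the ~24 bytes/chunk-entry figure below is illustrative, assuming each entry stores a byte offset, a byte length, and a reference to a path):

```python
# Rough estimate of in-memory chunk-manifest size for a large store.
BYTES_PER_CHUNK_ENTRY = 24  # hypothetical: offset (8) + length (8) + path ref (8)

n_variables = 100
chunks_per_variable = 1_000_000

per_array = chunks_per_variable * BYTES_PER_CHUNK_ENTRY  # 24 MB per array
total = n_variables * per_array                          # 2.4 GB for the store

print(f"per array: {per_array / 1e6:.0f} MB, whole store: {total / 1e9:.1f} GB")
```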

If we can virtualize this, we should be able to virtualize most things 💪


Getting this done requires implementing many features:

Additionally, once zarr-python actually understands some kind of chunk manifest, I want to go back and create an actual Zarr store for this dataset. That will also require:
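For reference, the kind of chunk manifest being discussed maps each Zarr chunk key to a byte range inside an archival file. Schematically it looks like the dict below (paths, offsets, and lengths are made up; the exact serialized format is still being designed):

```python
# Illustrative chunk manifest: each chunk key points at a byte range
# within an existing archival file, so no array data is copied.
manifest = {
    "0.0.0": {"path": "s3://bucket/file1.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/file1.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/file2.nc", "offset": 300, "length": 100},
}
```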

[^1]: In fact, pretty much the only ways in which this dataset could be worse would be if it had differences in encoding between netCDF files, variable-length chunks, or netCDF groups - but thankfully it has none of those 😅