zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Listing every format that could be represented as virtual zarr #218

Open TomNicholas opened 1 month ago

TomNicholas commented 1 month ago

Let's list all the file formats that could potentially be represented efficiently as "virtual zarr" - i.e. zarr + chunk manifests.

The important criterion here is that the format must store data in a small number of contiguous chunks, so that access via HTTP range requests to object storage is efficient. This rules out some formats. For example, I don't think we can efficiently access the format that @kmuehlbauer mentioned over in https://github.com/openradar/xradar/issues/187#issuecomment-2271327041:

file formats where variables are written interleaved within one chunk of data (eg: 100 bytes v1, 100 bytes v2, 100 bytes v3, 100 bytes v1, 100 bytes v2, 100 bytes v3, ...)? Is there something like strides available?
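
To make the contiguity criterion concrete, here is a minimal sketch (not VirtualiZarr code) that counts the byte ranges needed to read each variable under two layouts. The helper names `contiguous_layout` and `interleaved_layout`, the three variables, and the 1 MB variable size are all assumptions made for illustration; only the 100-byte interleaving comes from the quote above.

```python
from typing import NamedTuple


class ByteRange(NamedTuple):
    offset: int  # byte offset from the start of the file
    length: int  # number of bytes to read


def contiguous_layout(n_vars: int, var_nbytes: int) -> dict[str, list[ByteRange]]:
    """Each variable is stored as one contiguous block, one after another."""
    return {
        f"v{i + 1}": [ByteRange(i * var_nbytes, var_nbytes)]
        for i in range(n_vars)
    }


def interleaved_layout(
    n_vars: int, var_nbytes: int, record_nbytes: int = 100
) -> dict[str, list[ByteRange]]:
    """Variables alternate in fixed-size records: v1, v2, v3, v1, v2, v3, ..."""
    n_records = var_nbytes // record_nbytes
    stride = n_vars * record_nbytes  # distance between records of one variable
    return {
        f"v{i + 1}": [
            ByteRange(r * stride + i * record_nbytes, record_nbytes)
            for r in range(n_records)
        ]
        for i in range(n_vars)
    }


# Hypothetical file: 3 variables of 1 MB each.
contiguous = contiguous_layout(n_vars=3, var_nbytes=1_000_000)
interleaved = interleaved_layout(n_vars=3, var_nbytes=1_000_000)

for name in contiguous:
    print(name, len(contiguous[name]), "range request(s) vs", len(interleaved[name]))
```

In the contiguous case each variable maps to a single manifest entry, while the interleaved case would need one entry (and one range request) per 100-byte record unless something like strides were supported.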

If we start thinking of Zarr as a "SuperFormat" (super as in superset, not as in super-duper), then this is the list of existing formats that make up that set, i.e. everything that can be referenced using chunk manifests (see https://github.com/zarr-developers/zarr-specs/issues/287).


Definitely can support:

Probably can support:

Maybe can support?

Probably can't support:

(The checkboxes indicate whether or not a working implementation already exists, either going through kerchunk's in-memory format as an intermediate or creating a ManifestArray directly.)

cc @jhamman @d-v-b

maxrjones commented 1 month ago

Unfortunately, based on https://gdal.org/user/virtual_file_systems.html#jpeg2000, JPEG2000 is likely in the 'probably can't support' category. I would've liked it if these datasets could be virtualized, but they're all JPEG2000 to optimize for the download-to-disk model :(

Another way to phrase this question, which may help the search: which of the formats supported by GDAL's raster drivers can be virtualized?

martindurant commented 3 weeks ago

I like this issue! It's worth saying that anything kerchunk can chunk can be v-zarrred, right? In that repo, there are suggestions of other worthwhile formats; DICOM and NIfTI (medical imaging) spring to mind. The latter is nice but often whole-file-gzipped; the former is evil in the way that other 90s standards are evil, but extremely widespread.

norlandrhagen commented 3 weeks ago

... the former is evil in the way that other 90s standards are evil, but extremely widespread.

❤️

TomNicholas commented 3 weeks ago

anything kerchunk can chunk can be v-zarrred, right?

Yes, that's the idea. This function does kerchunk refs -> virtual dataset, and this function does virtual dataset -> kerchunk refs. Any additional kerchunk file readers can be called as another if...else... in here.
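
For readers unfamiliar with the two representations, here is a rough sketch of the round trip in plain Python. It is not VirtualiZarr's actual implementation: the function names `refs_to_manifests` and `manifests_to_refs`, the `s3://bucket/air.nc` path, and the byte offsets are all hypothetical. It assumes kerchunk-style references of the form `[path, offset, length]` and manifest entries with `path`, `offset` and `length` fields.

```python
def refs_to_manifests(refs: dict) -> dict:
    """Group kerchunk-style chunk references into one manifest per variable.

    Metadata keys (.zgroup, .zarray, .zattrs) and inlined data are skipped.
    """
    manifests: dict = {}
    for key, value in refs.items():
        var, _, chunk_key = key.rpartition("/")
        is_byte_range = isinstance(value, list) and len(value) == 3
        if chunk_key.startswith(".") or not is_byte_range:
            continue
        path, offset, length = value
        manifests.setdefault(var, {})[chunk_key] = {
            "path": path,
            "offset": offset,
            "length": length,
        }
    return manifests


def manifests_to_refs(manifests: dict) -> dict:
    """Flatten per-variable manifests back into kerchunk-style chunk references."""
    return {
        f"{var}/{chunk_key}": [entry["path"], entry["offset"], entry["length"]]
        for var, manifest in manifests.items()
        for chunk_key, entry in manifest.items()
    }


# Hypothetical references for one variable split into two chunks.
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "air/.zarray": '{"shape": [4, 4], "chunks": [2, 4]}',
    "air/0.0": ["s3://bucket/air.nc", 8192, 32],
    "air/1.0": ["s3://bucket/air.nc", 8224, 32],
}

manifests = refs_to_manifests(refs)
print(manifests["air"]["1.0"])
```

The sketch deliberately drops array metadata to keep the chunk mapping visible; a real conversion also has to carry the `.zarray` and `.zattrs` information along.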