zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0

Appending to references on disk #21

Open TomNicholas opened 8 months ago

TomNicholas commented 8 months ago

Kerchunk has some support for appending to references stored on disk as parquet. This pattern makes sense when you have operational data that might be appended to with new days of data but you don't want to risk mutating the existing data.

Is there a sensible way to handle this kind of use case in the context of VirtualiZarr's API?

forrestfwilliams commented 6 months ago

I'd be curious to know if this is possible (or planned) as well!

TomNicholas commented 6 months ago

@forrestfwilliams as it's possible in kerchunk it should be possible here! The basic pattern to follow might look like xarray's to_zarr method when using the append_dim kwarg (see docs on modifying existing zarr stores).
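For reference, the plain-xarray append pattern looks roughly like this (a minimal sketch; the file names and the time dimension are just placeholders):

```python
import xarray as xr

# Write the initial dataset to a Zarr store.
ds_day1 = xr.open_dataset("day1.nc")
ds_day1.to_zarr("store.zarr", mode="w")

# Later, append a new day of data along the time dimension
# without rewriting what is already in the store.
ds_day2 = xr.open_dataset("day2.nc")
ds_day2.to_zarr("store.zarr", append_dim="time")
```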

Personally this is not the highest-priority feature for me, so if you were interested in it I would be happy to help you think it through / contribute here 😄

forrestfwilliams commented 6 months ago

Hey @TomNicholas thanks. I work on a project called ITS_LIVE, which monitors glacier velocities across the entire globe, and we're trying to find a way to efficiently represent our entire archive of NetCDF files (hosted in an open S3 bucket) as a Zarr store.

We're creating new NetCDF files every day, so we'd like to find a way to use VirtualiZarr that doesn't involve re-creating the entire dataset every time.

@jhkennedy and @betolink (who are also on the ITS_LIVE team) may also be interested in this issue.

TomNicholas commented 6 months ago

@forrestfwilliams cool!

This pattern of appending could become quite neat if we use zarr chunk manifests instead of kerchunk's format. See this comment https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-2093359295

TomNicholas commented 6 months ago

Instead of trying to append to the kerchunk parquet references on disk, we could simply re-open the existing references as a virtual dataset, find the byte ranges in the new files using open_virtual_dataset, concatenate the new and the old together, and write the complete set of references back out.

The advantage of this is that it works without updating part of the on-disk representation in place (you simply re-write the entire thing in one go instead), and it doesn't require re-finding the byte-range information for the files you already indexed. The disadvantage is that you are re-writing on-disk references you already created. I think this could be a nice solution in the short term though.

To allow this we need a way to open existing kerchunk references as a virtual dataset, i.e.

vds = xr.open_virtual_dataset('kerchunk_refs.json', filetype='kerchunk_json')

which I've opened #118 to track.
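Roughly, the whole re-write workflow could then look something like this (a sketch only, assuming the kerchunk reader from #118 exists; the exact filetype string and concat kwargs may differ):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Re-open the references we already wrote as a virtual dataset
# (no byte ranges need to be re-found for the old files).
old_vds = open_virtual_dataset("kerchunk_refs.json", filetype="kerchunk")

# Find byte ranges only for the newly-arrived file.
new_vds = open_virtual_dataset("new_day.nc")

# Concatenate along the append dimension and write the complete
# set of references back out in one go.
combined = xr.concat(
    [old_vds, new_vds], dim="time", coords="minimal", compat="override"
)
combined.virtualize.to_kerchunk("kerchunk_refs.json", format="json")
```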

TomNicholas commented 3 weeks ago

> To allow this we need a way to open existing kerchunk references as a virtual dataset

This was added in #251!


> The basic pattern to follow might look like xarray's .to_zarr method when using the append_dim kwarg

This would be a pain with Kerchunk-formatted references, but with Icechunk it should be straightforward! We can simply add append_dim to virtualizarr's new .to_icechunk method. See https://github.com/earth-mover/icechunk/issues/104#issuecomment-2375303136 for more explanation.

This would be incredibly useful but it's not something I personally need right now, so if someone wants to have a crack at adding append_dim in the meantime then please go for it and I will advise 🙂
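Purely as a sketch of what the usage might eventually look like (hypothetical: the append_dim kwarg doesn't exist yet, and the Icechunk store/session calls below are assumptions to be checked against the Icechunk docs):

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Assumed Icechunk API for getting a writable store; verify against the Icechunk docs.
storage = icechunk.local_filesystem_storage("/tmp/its_live_refs")
repo = icechunk.Repository.open_or_create(storage)
session = repo.writable_session("main")

# Virtualize only the newly-arrived file.
new_vds = open_virtual_dataset("new_day.nc")

# Hypothetical: append the new references along 'time' instead of
# rewriting the whole set of references.
new_vds.virtualize.to_icechunk(session.store, append_dim="time")
session.commit("append one day of references")
```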

rsignell commented 3 weeks ago

Whoa, this will be awesome! Maybe I can find someone to do this!