TomNicholas opened 8 months ago
I'd be curious to know if this is possible (or planned) as well!
@forrestfwilliams as it's possible in kerchunk it should be possible here! The basic pattern to follow might look like xarray's `to_zarr` method when using the `append_dim` kwarg (see the docs on modifying existing Zarr stores).
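For reference, the xarray append pattern looks roughly like this (file paths and the dimension name are just illustrative):

```python
import xarray as xr

# Write the first day's data to a brand-new Zarr store
ds_day1 = xr.open_dataset("day1.nc")
ds_day1.to_zarr("store.zarr", mode="w")

# Append each subsequent day along the time dimension
ds_day2 = xr.open_dataset("day2.nc")
ds_day2.to_zarr("store.zarr", append_dim="time")
```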
Personally this is not the highest-priority feature for me, so if you were interested in it I would be happy to help you think it through / contribute here 😄
Hey @TomNicholas, thanks. I work on a project called ITS_LIVE, which monitors glacier velocities across the entire globe, and we're trying to find a way to efficiently represent our entire archive of NetCDF files (hosted in an open S3 bucket) as a Zarr store. We're creating new NetCDF files every day, so we'd like a way to use VirtualiZarr that doesn't involve re-creating the entire dataset each time.
@jhkennedy and @betolink (who are also on the ITS_LIVE project) may also be interested in this issue.
@forrestfwilliams cool!
This pattern of appending could become quite neat if we use zarr chunk manifests instead of kerchunk's format. See this comment https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-2093359295
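For anyone following along: a chunk manifest is just a mapping from chunk indices to byte ranges in existing files. A minimal sketch using VirtualiZarr's `ChunkManifest` class (the paths, offsets, and lengths here are made up):

```python
from virtualizarr.manifests import ChunkManifest

# Each key is a chunk index; each entry records which file holds that
# chunk's bytes and where they live within it
manifest = ChunkManifest(
    entries={
        "0.0": {"path": "s3://its-live-data/day1.nc", "offset": 6144, "length": 48},
        "1.0": {"path": "s3://its-live-data/day2.nc", "offset": 6144, "length": 48},
    }
)
```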
Instead of trying to append to the kerchunk parquet references on disk, what we could do is simply re-open the existing references as a virtual dataset, find byte ranges in the new files using `open_virtual_dataset`, concatenate the new and the old together, and re-write out the complete set of references.

The advantage of this is that it works without trying to update part of the on-disk representation in place (you simply re-write the entire thing in one go), and it doesn't require re-finding all the byte-range information in the files you already indexed. The disadvantage is that you are re-writing on-disk references you already created. I think this could be a nice solution in the short term though.
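A minimal sketch of that whole cycle, assuming references can be re-opened as a virtual dataset (the `filetype` string and file paths are assumptions, and the concat kwargs follow the usual pattern for concatenating virtual datasets):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Re-open the existing on-disk references as a virtual dataset
# (assumes support for a kerchunk filetype, tracked in #118 below)
old_vds = open_virtual_dataset("combined_refs.json", filetype="kerchunk")

# Find byte ranges in the newly-arrived file only
new_vds = open_virtual_dataset("s3://its-live-data/new_day.nc")

# Concatenate old and new along the append dimension without
# loading or comparing any of the underlying chunk data
combined = xr.concat(
    [old_vds, new_vds], dim="time", coords="minimal", compat="override"
)

# Re-write the complete set of references in one go
combined.virtualize.to_kerchunk("combined_refs.json", format="json")
```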
To allow this we need a way to open existing kerchunk references as a virtual dataset, i.e.

```python
vds = xr.open_virtual_dataset('kerchunk_refs.json', filetype='kerchunk_json')
```

which I've opened #118 to track.
> To allow this we need a way to open existing kerchunk references as a virtual dataset

This was added in #251!
> The basic pattern to follow might look like xarray's `.to_zarr` method when using the `append_dim` kwarg
This would be a pain with Kerchunk-formatted references, but with Icechunk it should be straightforward! We can simply add `append_dim` to VirtualiZarr's new `.to_icechunk` method. See https://github.com/earth-mover/icechunk/issues/104#issuecomment-2375303136 for more explanation.
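A sketch of what that might look like, assuming the proposed `append_dim` kwarg lands on `.to_icechunk` (the Icechunk session setup follows its docs, but the storage path and file names are illustrative):

```python
import icechunk
from virtualizarr import open_virtual_dataset

# Open the existing Icechunk repo and start a writable session
storage = icechunk.local_filesystem_storage("./its_live_store")
repo = icechunk.Repository.open(storage)
session = repo.writable_session("main")

# Virtualize only the newly-arrived file, then append its references
new_vds = open_virtual_dataset("s3://its-live-data/new_day.nc")
new_vds.virtualize.to_icechunk(session.store, append_dim="time")  # proposed kwarg
session.commit("Append one new day of ITS_LIVE data")
```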
This would be incredibly useful, but it's not something I personally need right now, so if someone wants to have a crack at adding `append_dim` in the meantime then please go for it and I will advise 🙂
Whoa, this will be awesome! Maybe I can find someone to do this!
Kerchunk has some support for appending to references stored on disk as parquet. This pattern makes sense when you have operational data that gets extended with new days of data, but you don't want to risk mutating the existing references.
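For reference, kerchunk's side of this is the `MultiZarrToZarr.append` classmethod, which updates an existing combined reference set with references for new files. Roughly (paths and arguments are illustrative):

```python
from kerchunk.combine import MultiZarrToZarr

# Fold references for the new file(s) into the existing combined refs
mzz = MultiZarrToZarr.append(
    ["references/new_day.json"],              # refs for newly-arrived files
    original_refs="references/combined.parquet",
    concat_dims=["time"],
)
new_refs = mzz.translate()
```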
Is there a sensible way to handle this kind of use case in the context of VirtualiZarr's API?