zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Rewrite paths in a manifest #130

Closed TomNicholas closed 3 months ago

TomNicholas commented 4 months ago

One minor feature that might be useful would be a convenience method for rewriting the string paths in a manifest. The use case: if you move or rename the underlying files, you shouldn't have to regenerate all the byte ranges when you could reuse the existing ones and only edit the paths.

This is therefore related to #118, as you could open previously-written kerchunk references, change the paths, and then re-save them, without having to find byte ranges again.
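To make the kerchunk round trip concrete, here is a minimal sketch in plain Python. It assumes the version-1 kerchunk reference layout, where each chunk key under "refs" maps to a list whose first element is the file path (typically `[path, offset, length]`) and inlined data or metadata is stored as a plain string. The function name `rewrite_kerchunk_paths` and the example paths are hypothetical; nothing here is an existing VirtualiZarr or kerchunk API.

```python
import json
from typing import Callable


def rewrite_kerchunk_paths(refs: dict, rename: Callable[[str], str]) -> dict:
    """Return a copy of a kerchunk-style reference dict with chunk paths renamed.

    Assumes chunk entries are lists whose first element is the path; any
    other value (e.g. inlined .zgroup/.zarray strings) is copied unchanged.
    """
    new_refs = {}
    for key, value in refs["refs"].items():
        if isinstance(value, list) and value:
            # Only the path changes; offset and length are reused as-is.
            new_refs[key] = [rename(value[0]), *value[1:]]
        else:
            new_refs[key] = value
    return {**refs, "refs": new_refs}


# Hypothetical references for a two-chunk array stored in one local file.
old = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "air/0.0": ["/data/air.nc", 100, 50],
        "air/0.1": ["/data/air.nc", 150, 50],
    },
}

new = rewrite_kerchunk_paths(old, lambda p: p.replace("/data/", "s3://my_bucket/"))
print(json.dumps(new["refs"]["air/0.0"]))
```

The byte offsets and lengths pass through untouched, which is the whole point: only the path strings are edited, and the result could be written back out with `json.dump`.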

I'm imagining adding an API something like this:

class Manifest:
    ...

    def rename_paths(
        self,
        new: str | Callable[[str], str],
    ) -> "Manifest":
        """
        Rename paths to chunks in this manifest.

        Accepts either a string, in which case this new path will be used for all chunks, or 
        a function which accepts the old path and returns the new path.

        Parameters
        ----------
        new
            New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

        Returns
        -------
        manifest

        Examples
        --------
        Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

        >>> def local_to_s3_url(old_local_path: str) -> str:
        ...     from pathlib import Path
        ...
        ...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
        ...
        ...     filename = Path(old_local_path).name
        ...     return new_s3_bucket_url + filename

        >>> manifest.rename_paths(local_to_s3_url)
        """
        ...

This method would be implemented on Manifest, but also present on ManifestArray and on the VirtualiZarrDatasetAccessor.
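As a rough illustration of how the proposed method could behave, the sketch below uses a hypothetical `ToyManifest` class that stores each chunk as a `(path, offset, length)` tuple. This is an assumption made for the example, not VirtualiZarr's actual `Manifest` internals. It handles both forms of `new`: a string applied to every chunk, or a function mapping old paths to new ones.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Dict, Tuple, Union


@dataclass(frozen=True)
class ToyManifest:
    """Hypothetical stand-in for a manifest: chunk key -> (path, offset, length)."""

    entries: Dict[str, Tuple[str, int, int]]

    def rename_paths(self, new: Union[str, Callable[[str], str]]) -> "ToyManifest":
        # A plain string means "use this one path for every chunk".
        rename = new if callable(new) else (lambda _old: new)
        # Build a new manifest; byte ranges are carried over unchanged.
        return ToyManifest(
            {
                key: (rename(path), offset, length)
                for key, (path, offset, length) in self.entries.items()
            }
        )


def local_to_s3_url(old_local_path: str) -> str:
    new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
    filename = PurePosixPath(old_local_path).name
    return new_s3_bucket_url + filename


manifest = ToyManifest(
    {"0.0": ("/data/air.nc", 100, 50), "0.1": ("/data/air.nc", 150, 50)}
)
moved = manifest.rename_paths(local_to_s3_url)
print(moved.entries["0.0"])
```

Returning a new object rather than mutating in place matches the `-> Manifest` return type in the proposal, and leaves the original manifest usable if the rename turns out to be wrong.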

The option to set all chunks to have the same path might not be particularly useful, though perhaps more so if we support indexing (https://github.com/TomNicholas/VirtualiZarr/issues/51).