zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/stable/api.html
Apache License 2.0
123 stars 24 forks source link

Paths as URIs #243

Open TomNicholas opened 1 month ago

TomNicholas commented 1 month ago

This PR closes #242 at the data model level - all paths are coerced to absolute URIs (i.e. file:///directory/test.nc or s3://bucket/test.nc) as they go into the Manifest.

As this forbids constructing manifests using relative paths, it requires minor changes to many tests (e.g. test.nc-> /test.nc). It also will require slightly more invasive changes to any tests that involve kerchunk references.

Sub-tasks:

TomNicholas commented 1 month ago

What should we do with the paths in kerchunk references? Are they are always meant as absolute? I guess we should assume they are absolute, unless they have ./ or ../ at the start? Would fsspec ever produce relative paths like that?

cc @martindurant

martindurant commented 1 month ago

Are they are always meant as absolute?

They are always meant "as interpreted by the target filesystem". The nature of that filesystem might be implied by the protocol of a path alone, but commonly additional arguments are also required. This means, that relative paths do work if the target happens to be the local filesystem (file://), but I think of the other filesystems, only ssh supports this concept at all. I would not expect this to be meaningful for basically any practical case.

Note that the dir:// filesystem adds prefixes to URLs for any filesystem, if that's useful at all.

martindurant commented 1 month ago

(I am happy to require absolute paths even if it makes some tests slightly more verbose)

TomNicholas commented 1 month ago

Thanks @martindurant !

The nature of that filesystem might be implied by the protocol of a path alone, but commonly additional arguments are also required.

But is the nature of the filesystem explicitly recorded in the kerchunk references format anywhere? Obviously if the prefix is explicit (e.g. s3://) then you know but otherwise?

that relative paths do work

I guess we should assume they are absolute, unless they have ./ or ../ at the start?

Would this approach work then?

(I am happy to require absolute paths even if it makes some tests slightly more verbose)

This might be helpful if the above approach doesn't work.

martindurant commented 1 month ago

is the nature of the filesystem explicitly recorded in the kerchunk references format

No. The original intention was to have these in the "templates", but in practice, the remote_protocol, remote_options and fss arguments to ReferenceFileSystem are used (and often encoded in Intake prescriptions) in cases of ambiguity.

TomNicholas commented 4 days ago

@sharkinsspatial and I discussed this earlier and we think that we need to add some kind of optional fs_root kwarg to specify the location of the root of the fsspec filesystem that the relative paths were created on. (This is the same thing as specifying what to prefix the relative paths with to turn them into absolute paths.)

The user could then do

cwd = str(Path.cwd().as_uri())
vds = open_virtual_dataset('refs.json', reader='kerchunk', reader_options={'fs_root': cwd})

which would prefix any relative paths found to be relative to the current working directly on this system.

EDIT: I guess when running a kerchunk-based reader on an archival file we know the fsspec filesystem, so we can convert to absolute paths automatically... It's only a problem when ingesting previously-written kerchunk json/parquet reference files (cc @norlandrhagen ).

EDIT2: Also it's an issue when reading other sidecar references file formats, such as dmrpp @ayushnag

If fs_root was not specified and a relative path found then an exception would be raised.

I am happy to require absolute paths

Changing kerchunk upstream actually wouldn't really help as we still want to be able to parse kerchunk references that were created with older versions of kerchunk.

TomNicholas commented 4 days ago

The above comment seems to be correct, in that now I've not used fs_root anywhere, and got the only remaining failures down to the tests for the kerchunk format reader and the dmr++ reader.

ayushnag commented 4 days ago

@TomNicholas the dmrpp failures may be here in the test since the filepath within the manifest is hardcoded. You could try updating that and let me know if that fixes it.

TomNicholas commented 4 days ago

You could try updating that and let me know if that fixes it.

I would expect that to fix it - what I'm wondering about is whether or not existing DMR++ files in the wild generally contain absolute or relative paths.

ayushnag commented 4 days ago

I have seen both. Sometimes the file contains s3://data.nc but more commonly if the path is s3://data.nc, the file just contains <Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#" ... name="data.nc">

TomNicholas commented 4 days ago

the file just contains <Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#" ... name="data.nc">

And this does not tell me that the true path is an S3 path does it?

ayushnag commented 4 days ago

the file just contains <Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#" ... name="data.nc">

And this does not tell me that the true path is an S3 path does it?

No, the true path is not known with only the dmrpp file contents

TomNicholas commented 4 days ago

No, the true path is not known with only the dmrpp file contents

Thanks. So I was right - this is equivalent to reading kerchunk json/parquet in that sense.

TomNicholas commented 3 days ago

Okay important question: Are we trying to support manifests with http URLs in them? Right now we have some tests which create virtual datasets containing http:// URLs as the path, e.g:

virtualizarr/tests/test_backend.py::TestReadFromURL::test_read_from_url[netcdf3-https://github.com/pydata/xarray-data/raw/master/air_temperature.nc-HDF5VirtualBackend]

But these tests (cc @scottyhq ) doesn't actually try to read the data back as loadable xarray variables. AFAIK Icechunk could not read data from a manifest containing http URLs (cc @mpiannucci ), but fsspec presumably could?

Handling this case in the manifest validation is possible but a bit annoying as cloudpathlib doesn't support http paths (https://github.com/drivendataorg/cloudpathlib/issues/455 - at least not yet https://github.com/drivendataorg/cloudpathlib/pull/468), and pathlib will incorrectly conclude that the http url is a relative posix path.

TomNicholas commented 3 days ago

AFAIK Icechunk could not read data from a manifest containing http URLs

It's planned for Icechunk to support HTTP apparently.

Handling this case in the manifest validation is possible

I added support for HTTP in https://github.com/zarr-developers/VirtualiZarr/pull/243/commits/fefab909540ad790db811295c38e69f5b8190565

scottyhq commented 2 days ago

Are we trying to support manifests with http URLs in them? I added support for HTTP in https://github.com/zarr-developers/VirtualiZarr/commit/fefab909540ad790db811295c38e69f5b8190565

If I'm following correctly, I'd say yes! There are lots of datasets out there that are not in cloud buckets, but are on servers that support http range requests. Agreed that it would be nice if cloudpathlib handled http:// paths