Open TomNicholas opened 1 month ago
What should we do with the paths in kerchunk references? Are they are always meant as absolute? I guess we should assume they are absolute, unless they have ./
or ../
at the start? Would fsspec ever produce relative paths like that?
cc @martindurant
Are they are always meant as absolute?
They are always meant "as interpreted by the target filesystem". The nature of that filesystem might be implied by the protocol of a path alone, but commonly additional arguments are also required. This means, that relative paths do work if the target happens to be the local filesystem (file://), but I think of the other filesystems, only ssh supports this concept at all. I would not expect this to be meaningful for basically any practical case.
Note that the dir:// filesystem adds prefixes to URLs for any filesystem, if that's useful at all.
(I am happy to require absolute paths even if it makes some tests slightly more verbose)
Thanks @martindurant !
The nature of that filesystem might be implied by the protocol of a path alone, but commonly additional arguments are also required.
But is the nature of the filesystem explicitly recorded in the kerchunk references format anywhere? Obviously if the prefix is explicit (e.g. s3://
) then you know but otherwise?
that relative paths do work
I guess we should assume they are absolute, unless they have ./ or ../ at the start?
Would this approach work then?
(I am happy to require absolute paths even if it makes some tests slightly more verbose)
This might be helpful if the above approach doesn't work.
is the nature of the filesystem explicitly recorded in the kerchunk references format
No. The original intention was to have these in the "templates", but in practice, the remote_protocol, remote_options and fss arguments to ReferenceFileSystem are used (and often encoded in Intake prescriptions) in cases of ambiguity.
@sharkinsspatial and I discussed this earlier and we think that we need to add some kind of optional fs_root
kwarg to specify the location of the root of the fsspec filesystem that the relative paths were created on. (This is the same thing as specifying what to prefix the relative paths with to turn them into absolute paths.)
The user could then do
cwd = str(Path.cwd().as_uri())
vds = open_virtual_dataset('refs.json', reader='kerchunk', reader_options={'fs_root': cwd})
which would prefix any relative paths found to be relative to the current working directly on this system.
EDIT: I guess when running a kerchunk-based reader on an archival file we know the fsspec filesystem, so we can convert to absolute paths automatically... It's only a problem when ingesting previously-written kerchunk json/parquet reference files (cc @norlandrhagen ).
EDIT2: Also it's an issue when reading other sidecar references file formats, such as dmrpp @ayushnag
If fs_root
was not specified and a relative path found then an exception would be raised.
I am happy to require absolute paths
Changing kerchunk upstream actually wouldn't really help as we still want to be able to parse kerchunk references that were created with older versions of kerchunk.
The above comment seems to be correct, in that now I've not used fs_root
anywhere, and got the only remaining failures down to the tests for the kerchunk format reader and the dmr++ reader.
@TomNicholas the dmrpp failures may be here in the test since the filepath within the manifest is hardcoded. You could try updating that and let me know if that fixes it.
You could try updating that and let me know if that fixes it.
I would expect that to fix it - what I'm wondering about is whether or not existing DMR++ files in the wild generally contain absolute or relative paths.
I have seen both. Sometimes the file contains s3://data.nc
but more commonly if the path is s3://data.nc
, the file just contains <Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#" ... name="data.nc">
the file just contains
<Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#" ... name="data.nc">
And this does not tell me that the true path is an S3 path does it?
the file just contains
<Dataset xmlns="http://xml.opendap.org/ns/DAP/4.0#" ... name="data.nc">
And this does not tell me that the true path is an S3 path does it?
No, the true path is not known with only the dmrpp file contents
No, the true path is not known with only the dmrpp file contents
Thanks. So I was right - this is equivalent to reading kerchunk json/parquet in that sense.
Okay important question: Are we trying to support manifests with http URLs in them? Right now we have some tests which create virtual datasets containing http://
URLs as the path, e.g:
virtualizarr/tests/test_backend.py::TestReadFromURL::test_read_from_url[netcdf3-https://github.com/pydata/xarray-data/raw/master/air_temperature.nc-HDF5VirtualBackend]
But these tests (cc @scottyhq ) doesn't actually try to read the data back as loadable xarray variables. AFAIK Icechunk could not read data from a manifest containing http URLs (cc @mpiannucci ), but fsspec presumably could?
Handling this case in the manifest validation is possible but a bit annoying as cloudpathlib doesn't support http paths (https://github.com/drivendataorg/cloudpathlib/issues/455 - at least not yet https://github.com/drivendataorg/cloudpathlib/pull/468), and pathlib will incorrectly conclude that the http url is a relative posix path.
AFAIK Icechunk could not read data from a manifest containing http URLs
It's planned for Icechunk to support HTTP apparently.
Handling this case in the manifest validation is possible
I added support for HTTP in https://github.com/zarr-developers/VirtualiZarr/pull/243/commits/fefab909540ad790db811295c38e69f5b8190565
Are we trying to support manifests with http URLs in them? I added support for HTTP in https://github.com/zarr-developers/VirtualiZarr/commit/fefab909540ad790db811295c38e69f5b8190565
If I'm following correctly, I'd say yes! There are lots of datasets out there that are not in cloud buckets, but are on servers that support http range requests. Agreed that it would be nice if cloudpathlib handled http:// paths
This PR closes #242 at the data model level - all paths are coerced to absolute URIs (i.e.
file:///directory/test.nc
ors3://bucket/test.nc
) as they go into theManifest
.As this forbids constructing manifests using relative paths, it requires minor changes to many tests (e.g.
test.nc
->/test.nc
). It also will require slightly more invasive changes to any tests that involve kerchunk references.docs/releases.rst
api.rst
Sub-tasks:
.rename_paths
method automaticallyfs_root
option internallyreader_kwargs
to all backends (see https://github.com/zarr-developers/VirtualiZarr/pull/315)reader_options
(rename tofsspec_kwargs
?) (also see https://github.com/zarr-developers/VirtualiZarr/pull/315)fs_root
forkerchunk
anddmrpp
readers as an option toreader_kwargs
(requires https://github.com/zarr-developers/VirtualiZarr/pull/315)fs_root
kerchunk
reader (requires https://github.com/zarr-developers/VirtualiZarr/pull/317)dmrpp
readerfs_root
forkerchunk
anddmrpp
readersfs_root
automatically in other readers