zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Allow other fsspec protocols than local and s3 #126

Closed TomAugspurger closed 4 months ago

TomAugspurger commented 4 months ago

This change simplifies the handling of filepaths in _fsspec_openfile_from_filepath and removes some restrictions on what can be passed in. Most notably, it allows filepaths that use fsspec protocols other than S3 and local.
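To illustrate what "other protocols" means here, below is a minimal sketch of inferring the protocol from a filepath using only the standard library. It is not the code from this PR: infer_protocol is a hypothetical helper, and the example paths are made up. fsspec has its own utilities for this, which the real implementation may well rely on instead.

```python
from urllib.parse import urlsplit


def infer_protocol(filepath: str) -> str:
    """Return the URL scheme of ``filepath``, or "file" for plain local paths.

    Hypothetical helper for illustration; not part of VirtualiZarr or fsspec.
    """
    scheme = urlsplit(filepath).scheme
    # An empty scheme means a plain path; a one-letter scheme is most likely
    # a Windows drive letter (e.g. "C:\\data\\air.nc"), so treat both as local.
    if len(scheme) <= 1:
        return "file"
    return scheme


# Made-up example paths covering local, S3, HTTP and GCS.
for path in [
    "/data/air.nc",
    "s3://bucket/air.nc",
    "http://example.org/air.nc",
    "gs://bucket/air.nc",
]:
    print(path, "->", infer_protocol(path))
```

With the protocol known, a generic call such as fsspec.filesystem(protocol, **storage_options), which is what the traceback further down shows being used, no longer needs to special-case S3 and local.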

I see that virtualizarr/tests/test_xarray.py::test_anon_read_s3 covers this, but that's failing for me on main with

E           aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host carbonplan-share.s3.regionone.amazonaws.com:443 ssl:default [Name or service not known]

Is that just a local configuration issue for me, or is it failing for others as well?

norlandrhagen commented 4 months ago

Hey @TomAugspurger, I'm not getting that failure when running the tests on your branch, which is probably due to some AWS credential magic on my end. Updating the tests to run against minio or something similar would probably help with that.

jbusecke commented 4 months ago

Thanks for the PR @TomAugspurger.

I just tried this out and am running into some unexpected behavior:

I installed virtualizarr from this PR branch:

!pip install git+https://github.com/TomAugspurger/VirtualiZarr.git@user/tom/feature/filesystems

For the setup

from virtualizarr import open_virtual_dataset
from virtualizarr.kerchunk import FileType

urls = [
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc",
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_202001-202412.nc",
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_202501-202912.nc",
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_203001-203412.nc",
]

Then I tried to naively do this:

vds_list = []
for url in urls:
    vds = open_virtual_dataset(
        url, indexes={}
    )
    vds_list.append(vds)

which failed with this error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:422, in HTTPFileSystem._info(self, url, **kwargs)
    420 try:
    421     info.update(
--> 422         await _file_info(
    423             self.encode_url(url),
    424             size_policy=policy,
    425             session=session,
    426             **self.kwargs,
    427             **kwargs,
    428         )
    429     )
    430     if info.get("size") is not None:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:831, in _file_info(url, session, size_policy, **kwargs)
    830 elif size_policy == "get":
--> 831     r = await session.get(url, allow_redirects=ar, **kwargs)
    832 else:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/aiohttp/client.py:978, in ClientSession.get(self, url, allow_redirects, **kwargs)
    976 """Perform HTTP GET request."""
    977 return _RequestContextManager(
--> 978     self._request(hdrs.METH_GET, url, allow_redirects=allow_redirects, **kwargs)
    979 )

TypeError: ClientSession._request() got an unexpected keyword argument 'key'

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
Cell In[13], line 4
      2 vds_list = []
      3 for f in tqdm(urls):
----> 4     vds = open_virtual_dataset(
      5         f, indexes={}
      6     )
      7     vds_list.append(vds)
      9 #, reader_options={}

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/xarray.py:108, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
    102     return open_virtual_dataset_from_v3_store(
    103         storepath=filepath, drop_variables=drop_variables, indexes=indexes
    104     )
    105 else:
    106     # this is the only place we actually always need to use kerchunk directly
    107     # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
--> 108     vds_refs = kerchunk.read_kerchunk_references_from_file(
    109         filepath=filepath,
    110         filetype=filetype,
    111         reader_options=reader_options,
    112     )
    113     virtual_vars = virtual_vars_from_kerchunk_refs(
    114         vds_refs,
    115         drop_variables=drop_variables + loadable_variables,
    116         virtual_array_class=virtual_array_class,
    117     )
    118     ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/kerchunk.py:76, in read_kerchunk_references_from_file(filepath, filetype, reader_options)
     60 """
     61 Read a single legacy file and return kerchunk references to its contents.
     62
   (...)
     72 so ensure reader_options match selected Kerchunk reader arguments.
     73 """
     75 if filetype is None:
---> 76     filetype = _automatically_determine_filetype(
     77         filepath=filepath, reader_options=reader_options
     78     )
     80 # if filetype is user defined, convert to FileType
     81 filetype = FileType(filetype)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/kerchunk.py:117, in _automatically_determine_filetype(filepath, reader_options)
    113 def _automatically_determine_filetype(
    114     *, filepath: str, reader_options: Optional[dict] = {}
    115 ) -> FileType:
    116     file_extension = Path(filepath).suffix
--> 117     fpath = _fsspec_openfile_from_filepath(
    118         filepath=filepath, reader_options=reader_options
    119     )
    121     if file_extension == ".nc":
    122         # based off of: https://github.com/TomNicholas/VirtualiZarr/pull/43#discussion_r1543415167
    123         magic = fpath.read()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/utils.py:58, in _fsspec_openfile_from_filepath(filepath, reader_options)
     56 # using dict merge operator to add in defaults if keys are not specified
     57 storage_options = protocol_defaults | storage_options
---> 58 fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
     60 return fpath

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/spec.py:1298, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1296 else:
   1297     ac = kwargs.pop("autocommit", not self._intrans)
-> 1298     f = self._open(
   1299         path,
   1300         mode=mode,
   1301         block_size=block_size,
   1302         autocommit=ac,
   1303         cache_options=cache_options,
   1304         **kwargs,
   1305     )
   1306     if compression is not None:
   1307         from fsspec.compression import compr

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:361, in HTTPFileSystem._open(self, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
    359 kw["asynchronous"] = self.asynchronous
    360 kw.update(kwargs)
--> 361 size = size or self.info(path, **kwargs)["size"]
    362 session = sync(self.loop, self.set_session)
    363 if block_size and size:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54     coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:435, in HTTPFileSystem._info(self, url, **kwargs)
    432 except Exception as exc:
    433     if policy == "get":
    434         # If get failed, then raise a FileNotFoundError
--> 435         raise FileNotFoundError(url) from exc
    436     logger.debug("", exc_info=exc)
    438 return {"name": url, "size": None, **info, "type": "file"}

FileNotFoundError: http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc

TypeError: ClientSession._request() got an unexpected keyword argument 'key'

Makes me think that the protocol is not properly detected?

When I add reader_options={} it works as intended:

from tqdm import tqdm  # progress bar; not imported in the setup above

vds_list = []
for url in tqdm(urls):
    vds = open_virtual_dataset(
        url, indexes={}, reader_options={}
    )
    vds_list.append(vds)

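I believe that the default values for reader_options are effectively invalidating this logic: the 'key' argument in the TypeError looks like an S3-style storage option that is being forwarded to the HTTP filesystem.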
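A rough sketch of one way to keep S3-only defaults from leaking into other protocols follows. Everything in it is an assumption for illustration: PROTOCOL_DEFAULTS and resolve_storage_options are hypothetical names, and the S3 default values are guessed from the 'key' keyword in the TypeError rather than taken from the VirtualiZarr source. Only the dict merge operator mirrors what the traceback shows in utils.py.

```python
from typing import Optional

# Assumed per-protocol defaults, for illustration only. The "key" entry is
# inferred from the TypeError above, not copied from VirtualiZarr.
PROTOCOL_DEFAULTS = {
    "s3": {"key": "", "secret": "", "anon": True},
}


def resolve_storage_options(
    protocol: str, reader_options: Optional[dict] = None
) -> dict:
    """Merge protocol-specific defaults with user-supplied storage options.

    Hypothetical helper. Using ``None`` as the default for ``reader_options``
    (rather than a dict pre-filled with S3 options) means an HTTP path never
    inherits keywords that only make sense for S3.
    """
    user_options = (reader_options or {}).get("storage_options", {})
    defaults = PROTOCOL_DEFAULTS.get(protocol, {})
    # Dict merge operator (Python 3.9+): user-supplied values win over defaults.
    return defaults | user_options


print(resolve_storage_options("http"))
print(resolve_storage_options("s3"))
print(resolve_storage_options("s3", {"storage_options": {"anon": False}}))
```

Under these assumptions, an HTTP URL gets an empty set of storage options whether or not reader_options is passed, which is the behaviour I only get today by passing reader_options={} explicitly.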

TomNicholas commented 4 months ago

Thanks for this contribution @TomAugspurger !

If @norlandrhagen is happy with this (including @jbusecke 's fix), then I am happy to merge it.

norlandrhagen commented 4 months ago

100% Thanks for the fixes @TomAugspurger and @jbusecke!

jbusecke commented 4 months ago

Oh weird. My test case is still failing after pulling from main? My fix might not have been sufficient?

TomNicholas commented 4 months ago

Damn - is there a test/reproducer for this issue @jbusecke (that you can raise in a new issue)?

jbusecke commented 3 months ago

Yeah I have that on my list, but very busy this week, so might have to push to next week. Please ping me as needed 😆

jbusecke commented 3 months ago

@TomNicholas see https://github.com/zarr-developers/VirtualiZarr/issues/135