zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0

Opening http remote files still not working with defaults #135

Closed jbusecke closed 3 months ago

jbusecke commented 3 months ago

Continuation of https://github.com/zarr-developers/VirtualiZarr/pull/126

Over at https://github.com/jbusecke/esgf-virtual-zarr-data-access I am maintaining a simple example of how to virtualize CMIP6 datasets.

This is a minimal example to reproduce the bug:

from virtualizarr import open_virtual_dataset
url = "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc"
vds = open_virtual_dataset(
    url,
    indexes={},
    reader_options={},
)

This works as intended, but if I leave out reader_options={} it fails:

from virtualizarr import open_virtual_dataset
url = "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc"
vds = open_virtual_dataset(
    url,
    indexes={},
)
File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/implementations/http.py:361, in HTTPFileSystem._open(self, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
    359 kw["asynchronous"] = self.asynchronous
    360 kw.update(kwargs)
--> 361 size = size or self.info(path, **kwargs)["size"]
    362 session = sync(self.loop, self.set_session)
    363 if block_size and size:

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54     coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/implementations/http.py:435, in HTTPFileSystem._info(self, url, **kwargs)
    432 except Exception as exc:
    433     if policy == "get":
    434         # If get failed, then raise a FileNotFoundError
--> 435         raise FileNotFoundError(url) from exc
    436     logger.debug("", exc_info=exc)
    438 return {"name": url, "size": None, **info, "type": "file"}

FileNotFoundError: http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc

I suspect that there is still something going wrong with the default keyword logic, as I raised back in https://github.com/zarr-developers/VirtualiZarr/pull/126#issuecomment-2146059544.

Wondering if we should add some end-to-end tests with small files on every supported filesystem?

norlandrhagen commented 3 months ago

Wondering if we should add some end-to-end tests with small files on every supported filesystem?

The fsspec.open behavior feels very inconsistent. Ryan went on a super deep dive looking into behavior across https/local etc.

TomNicholas commented 3 months ago

The fsspec.open behavior feels very inconsistent. Ryan went on a super deep dive looking into behavior across https/local etc.

God how annoying. Literally what is the point of fsspec if it doesn't actually specify a common behaviour across different filesystems? It really is all a symptom of https://github.com/fsspec/filesystem_spec/issues/1446.


What can we do about it in virtualizarr, though? We could try to test all the filesystems we might care about (again, fsspec should really be doing that...), we could maintain our own little open context manager that actually has consistent behaviour, or perhaps we could use some other learnings from pangeo-forge?
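
For example, a rough sketch of a small parametrized end-to-end test. This is not VirtualiZarr's actual test suite: the netcdf4_file fixture (a small local NetCDF file as a pathlib.Path) and the HTTP case are assumptions, and the HTTP parameter would need a local test-server fixture that is elided here.

import pytest

from virtualizarr import open_virtual_dataset


@pytest.mark.parametrize(
    "to_url",
    [
        # Local filesystem: just use the plain path.
        lambda path: str(path),
        # HTTP: would need a fixture serving the same file over http://.
        pytest.param(
            lambda path: f"http://localhost:8000/{path.name}",
            marks=pytest.mark.skip(reason="needs an HTTP test-server fixture"),
        ),
    ],
)
def test_open_virtual_dataset_with_default_reader_options(to_url, netcdf4_file):
    # The point of the test: do NOT pass reader_options, so the library
    # defaults are exercised for each protocol.
    vds = open_virtual_dataset(to_url(netcdf4_file), indexes={})
    assert len(vds.variables) > 0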

TomAugspurger commented 3 months ago

I suspect that there is still something going wrong with the default keyword logic, as I raised back in https://github.com/zarr-developers/VirtualiZarr/pull/126#issuecomment-2146059544.

The problem is in virtualizarr.kerchunk.read_kerchunk_references_from_file: https://github.com/zarr-developers/VirtualiZarr/blob/87221ea769ddc3a2971b118f8dace9159690a0cb/virtualizarr/kerchunk.py#L56-L58

That's defining reader_options with s3-specific values.
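
For what it's worth, if I'm reading the fsspec internals correctly, the traceback above is what happens when those s3-style keywords reach HTTPFileSystem: they get forwarded to aiohttp's request call, which rejects them, and the exception gets swallowed and re-raised as the FileNotFoundError. A minimal, hedged illustration (the exact default values live in the linked kerchunk.py lines):

import fsspec

# s3-style storage options handed to an http filesystem end up in
# HTTPFileSystem's generic kwargs and are forwarded to aiohttp, which does
# not accept them; fsspec then reports FileNotFoundError even though the
# file exists.
fs = fsspec.filesystem("http", key="", secret="", anon=True)
fs.info(
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/"
    "DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/"
    "tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc"
)  # raises FileNotFoundError, mirroring the traceback above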

It's technically API-breaking, but I'd recommend just setting that to Optional[dict[str, Any]] = None. I don't know whether virtualizarr has a sufficiently better opinion about the default arguments than the backend (s3fs / adlfs / etc.) does, but probably not.
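
A rough sketch of what that could look like. The function name follows the linked kerchunk.py, but the other annotations and the body here are simplified and illustrative, not the actual implementation:

from typing import Any, Optional


def read_kerchunk_references_from_file(
    filepath: str,
    filetype: Optional[str] = None,  # simplified placeholder for the real filetype argument
    reader_options: Optional[dict[str, Any]] = None,
) -> dict:
    if reader_options is None:
        # No s3-specific defaults: let fsspec / the relevant backend pick
        # whatever is appropriate for the protocol (http, s3, local, ...).
        reader_options = {}
    ...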

TomNicholas commented 3 months ago

Oh thanks @TomAugspurger! I thought I had fixed that bug in a commit I added to #126, but I obviously missed another occurrence of the default s3-specific values.

It would still be great to have tests that check this all works on the main types of storage we care about.

jbusecke commented 3 months ago

It would still be great to have tests that check this all works on the main types of storage we care about.

I like this idea, but maybe I should take a backseat, since I have the tendency to always write massive end-to-end tests 😆.