zarr-developers / VirtualiZarr

Create virtual Zarr stores from archival data files using xarray syntax
https://virtualizarr.readthedocs.io/en/latest/
Apache License 2.0
95 stars 18 forks

Use cloudpathlib instead of fsspec? #172

Open TomNicholas opened 3 months ago

TomNicholas commented 3 months ago

AFAIK the only filesystems we need to read from are local and cloud, so could we just use pathlib and cloudpathlib?

TomNicholas commented 3 months ago

In particular we could just call cloudpathlib.AnyPath

https://cloudpathlib.drivendata.org/stable/anypath-polymorphism/

norlandrhagen commented 3 months ago

This would be really cool @TomNicholas!

Seems like it can read over s3 into xarray:


from cloudpathlib import CloudPath
import xarray as xr

cloudpath = CloudPath("s3://carbonplan-share/air_temp.nc")
ds = xr.open_dataset(cloudpath)

norlandrhagen commented 3 months ago

A little more exploration. It looks like SingleHdf5ToZarr works for both s3 and local files.


import io

from cloudpathlib import CloudPath
from kerchunk.hdf import SingleHdf5ToZarr
import xarray as xr

# from cloudpathlib import AnyPath

cloudpath = CloudPath("s3://carbonplan-share/air_temp.nc")

# Read the whole file into memory, then pass kerchunk a file-like object.
with open(cloudpath, "rb") as f:
    contents = f.read()

refs = SingleHdf5ToZarr(io.BytesIO(contents)).translate()
refs

TomNicholas commented 2 months ago

Some more thoughts - one way to smooth this transition would be to replace all uses of UPath (which is based on fsspec) with cloudpathlib's AnyPath. They are both very similar - for example they both implement a .stat method, which is used in https://github.com/zarr-developers/VirtualiZarr/pull/187/files#r1678802398.

The snag here is that I don't think cloudpathlib supports https...
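To make the gap concrete, here is a rough sketch of the scheme dispatch a pathlib + cloudpathlib approach implies. It uses only the standard library. The function name choose_backend and the CLOUD_SCHEMES set are hypothetical, and the set assumes cloudpathlib covers the s3://, gs:// and az:// prefixes, with https left unhandled as described above.

```python
from urllib.parse import urlparse

# Assumption: the URI prefixes cloudpathlib handles (S3, Google Cloud Storage, Azure).
CLOUD_SCHEMES = {"s3", "gs", "az"}


def choose_backend(uri: str) -> str:
    """Return the name of the library a hypothetical reader would dispatch to."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in CLOUD_SCHEMES:
        return "cloudpathlib"
    if scheme in ("", "file"):
        # No scheme (or file://) means a plain local path.
        return "pathlib"
    # http(s) and any other scheme end up here: neither library covers them.
    raise ValueError(f"no path backend for scheme {scheme!r}: {uri}")


print(choose_backend("s3://carbonplan-share/air_temp.nc"))
print(choose_backend("data/air_temp.nc"))
```

Under these assumptions, any https:// URL raises ValueError, so remote files served over plain HTTP would still need fsspec or some other reader.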

TomNicholas commented 2 months ago

> The snag here is that I don't think cloudpathlib supports https...

I raised https://github.com/drivendataorg/cloudpathlib/issues/455