Open jbusecke opened 7 months ago
Oh I got it! (this is from @mgrover1's notebook)
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
ds
works brilliantly!
Honestly no clue what is happening here, but also not that important in the long term I guess hehe.
To me this indicates that somehow the required storage_options={'anon':True} is not properly passed.
That might well be the case.
I actually forgot we hadn't merged #67 yet - it would be great to have that tested and merged.
Once this is written as a zarr, will the need to pass storage options go away?
Once it's written using the chunk manifest specification, and zarr-python implements the same ZEP, then it will be read from S3 however zarr-python implements it, which I think will be using the Rust object-store crate. I don't know anything about what options have to be passed to that.
Is there a way to not use fsspec to use the reference files at the moment?
You need fsspec to understand the reference files if they are written out following the kerchunk format.
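For context, here is a rough sketch of what fsspec does with the options from the snippet above: they are the arguments of its "reference" filesystem, where `fo` points at the reference file and `remote_options` is forwarded to the filesystem that holds the actual chunks (S3 here). The helper names `reference_storage_options` and `open_reference_mapper` are made up for illustration, and whether the resulting mapper can be handed straight to xarray depends on the installed zarr version, so treat this as an assumption rather than a tested recipe.

```python
def reference_storage_options(reference_file, remote_protocol="s3", anon=True):
    """Build the options understood by fsspec's "reference" filesystem.

    Hypothetical helper: it only assembles the same dict used in the thread.
    """
    return {
        "fo": reference_file,                # the kerchunk-format reference file
        "remote_protocol": remote_protocol,  # where the referenced chunks live
        "remote_options": {"anon": anon},    # passed on to that remote filesystem
    }


def open_reference_mapper(reference_file):
    """Return a key/value mapper over the referenced chunks (needs fsspec, s3fs)."""
    import fsspec  # imported lazily so the helper above has no dependencies

    opts = reference_storage_options(reference_file)
    fs = fsspec.filesystem("reference", **opts)
    return fs.get_mapper("")


if __name__ == "__main__":
    print(reference_storage_options("combined.json"))
```

If `anon` never reaches `remote_options`, the S3 filesystem will look for credentials, which would be consistent with the error described in this issue.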
@jbusecke nice! If that works but engine='kerchunk' doesn't work, then presumably there is a bug with kerchunk's xarray backend...
I would double-check that you can load this data and that the values are as you expect (watch out for subtleties with encoding)...
Ok I tried this for loading:
from dask.diagnostics import ProgressBar
import xarray as xr
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
with ProgressBar():
    da_plot = ds.uo.mean(['time', 'lev']).load()
da_plot.plot()
That seems totally fine to me on the loading side.
Do you have recommendations how to check the encoding in a comprehensive manner?
That seems totally fine to me on the loading side.
Great!
Do you have recommendations how to check the encoding in a comprehensive manner?
Well, the source of truth here would be to open the files directly using xarray without using kerchunk, i.e. open_mfdataset on the raw netCDFs.
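One possible way to make that comparison systematic is sketched below. It assumes `ds_ref` is the dataset opened through the reference file and `ds_truth` is the result of `open_mfdataset` on the raw netCDFs; the helper names and the list of encoding keys are my own choices, not something kerchunk or VirtualiZarr prescribe.

```python
# Encoding keys that affect decoded values; this selection is an assumption.
VALUE_KEYS = ("dtype", "scale_factor", "add_offset", "_FillValue",
              "missing_value", "units", "calendar")


def _same(a, b):
    """Equality that treats two missing values or two NaNs as equal."""
    if a is None or b is None:
        return a is None and b is None
    if a != a and b != b:  # both are NaN
        return True
    return bool(a == b)


def diff_encoding(enc_ref, enc_truth, keys=VALUE_KEYS):
    """Return {key: (ref_value, truth_value)} for every key that differs."""
    return {
        key: (enc_ref.get(key), enc_truth.get(key))
        for key in keys
        if not _same(enc_ref.get(key), enc_truth.get(key))
    }


def compare_to_truth(ds_ref, ds_truth):
    """Check decoded values, then report per-variable encoding differences."""
    import xarray as xr  # imported lazily; diff_encoding itself needs no xarray

    xr.testing.assert_equal(ds_ref, ds_truth)  # raises AssertionError on mismatch
    report = {}
    for name in ds_truth.variables:
        diffs = diff_encoding(ds_ref[name].encoding, ds_truth[name].encoding)
        if diffs:
            report[name] = diffs
    return report


if __name__ == "__main__":
    print(diff_encoding({"scale_factor": 0.1}, {"scale_factor": 1.0}))
```

`assert_equal` compares the decoded values, so it will load data; for a dataset this size it is probably worth running on a small subset (one variable, a few time steps) first.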
Very cool to see a real world example @jbusecke!
Motivated to come up with a proof of concept by tomorrow for the ESGF conference I am at right now, I am trying to test VirtualiZarr on real-world CMIP6 data on S3 (a complex example for #61).
I am running the following:
This works until here, which is really phenomenal. Thanks for the great work here.
But when I try to read from the reference file
I get this error:
To me this indicates that somehow the required storage_options={'anon':True} is not properly passed. Adding
gets around that error, but the opening never works. After waiting for 10 minutes I get this trace:
I might be misinterpreting this, but this looks exactly like the trace of the 'pangeo-forge-rechunking-stall' issue (can't find the original issue right now).
I am def too tired to dig deeper but I am wondering a few things:
Super happy to keep working on this!