xpublish-community / xpublish-opendap

OpenDAP router for Xpublish
BSD 3-Clause "New" or "Revised" License

`IndexError` for .dods for certain datasets #59

Open sjordan29 opened 5 months ago

sjordan29 commented 5 months ago

Hi @abkfenris, I've been maintaining Catalog-To-Xpublish, and we're having some issues with the OpenDAP response for one of the USGS datasets we're working with. Most available datasets do work, so this seems to be an anomaly to work out with this particular dataset.

The Problem Dataset

We're having issues with PRISM_v2; you can view the xarray structure here: https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/

The Problem

When I try to open it with xarray like so:

import xarray as xr

OPENDAP_URL = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap'
xr_dataset = xr.open_dataset(
    OPENDAP_URL,
    engine='netcdf4',
)

I get the following error: `IndexError: The indexing operation you are attempting to perform is not valid on netCDF4.Variable object. Try loading your data into memory first by calling .load().`

Thoughts

Seems like this might have something to do with the tbnd dimension (which is not a coordinate) and/or its associated variable, time_bnds. Note that we don't see this issue when we request just the ppt variable, which has dimensions of lat/lon/time.

## time_bnds
import netCDF4 as nc
import xarray as xr

dods_url = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?time_bnds'
nc_dataset = nc.Dataset(dods_url)
# Opening with xarray fails with IndexError:
# "The indexing operation you are attempting to
# perform is not valid on netCDF4.Variable object."
xr_dataset = xr.open_dataset(
    xr.backends.NetCDF4DataStore(nc_dataset),
    decode_times=False,
)

## ppt
dods_url = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?ppt'
nc_dataset = nc.Dataset(dods_url)
# This works.
xr_dataset = xr.open_dataset(
    xr.backends.NetCDF4DataStore(nc_dataset),
    decode_times=False,
)
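For anyone following along, the suspect pattern is easy to reproduce with a toy dataset (a sketch with made-up values, not the actual PRISM data): a variable whose second dimension, tbnd, exists only as a bare dimension with no coordinate variable attached.

```python
import numpy as np
import xarray as xr

# Hypothetical minimal analogue of the structure in question:
# time_bnds carries a second dimension, "tbnd", that is a bare
# dimension with no coordinate variable attached to it.
time = np.array([0.0, 1.0], dtype="float32")
time_bnds = np.array([[-0.5, 0.5], [0.5, 1.5]], dtype="float32")

ds = xr.Dataset(
    data_vars={"time_bnds": (("time", "tbnd"), time_bnds)},
    coords={"time": ("time", time)},
)

# "tbnd" is a dimension of time_bnds but has no coordinate values,
# which is the pattern suspected of tripping up the DAP translation.
print("tbnd" in ds.dims, "tbnd" in ds.coords)
```

Serving a dataset like this through xpublish-opendap should exercise the same code path as PRISM_v2's time_bnds.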

As I understand it, xpublish-opendap uses dap_xarray to reformat xarray datasets into a format compatible with opendap-protocol. I'm wondering if something in one of these translations between data structures might be causing this problem. Do you have any ideas about what might be going on here? Happy to help rig up a fix, but your understanding of these libraries would be super helpful in nailing down the problem.

abkfenris commented 5 months ago

Hi @sjordan29 ,

I actually just ran into that same IndexError today while trying to work on https://github.com/xpublish-community/xpublish/discussions/246 . From the structure of your dataset, I'm guessing it's also an error with the coordinate attributes not currently being translated correctly into an OpenDAP representation, and Xarray is getting a little cranky when trying to open it.

Any chance you could open the original Zarr source locally, and slice it down to a minimum dataset that causes that error and share a NetCDF of that?

That way I can throw it into the test suite and compare what Xarray opens directly vs what it can load over OpenDAP.

sjordan29 commented 5 months ago

Thanks so much! I added a slice of the data and the notebook showing the error I'm getting on our fork here: https://github.com/LimnoTech/xpublish-opendap/tree/index-error/notebooks/dev_sandbox

aufdenkampe commented 5 months ago

@abkfenris, thanks for looking into this issue.

BTW, USGS just made their xpublish-builder repo public, which shows how we pull everything together to create the OpenDAP endpoints exposed at https://labs-beta.waterdata.usgs.gov/api/xpublish/catalogs. Our hope is that this builder repo might be useful for you or others working on https://ioos.github.io/ioos-code-sprint/2024/topics/06-xpublish.html

abkfenris commented 5 months ago

Great, thank you both! I'll try to take a look later or tomorrow.

abkfenris commented 5 months ago

Hmm, it looks like we're (at least) generating different .dds compared to what THREDDS will generate for this dataset, so I'm going to go back to the basics to see how to generate these.

THREDDS DDS

Dataset {
    Float32 time_bnds[time = 2][tbnd = 2];
    Float32 time[time = 2];
} datasets/PRISM_v2_slice.nc;

xpublish DDS

Dataset {
    Float32 time[time = 2];
    Float64 tbnd[tbnd = 2];
    Grid {
      Array:
        String time_bnds[time = 2][tbnd = 2];
      Maps:
        Float32 time[time = 2];
        Float64 tbnd[tbnd = 2];
    } time_bnds;
} ds;

abkfenris commented 5 months ago

Ok, got a bit farther into the weeds here.

When Hyrax or THREDDS encounters a DataArray that has a non-coordinate dimension, instead of returning a DAP Grid with the dimensions mapped, they return a DAP Array with multiple dimensions.

Currently the underlying OpenDAP library we are using is hard-coded to respond only with single-dimension arrays for variables that aren't Grids.

https://github.com/MeteoSwiss/opendap-protocol/blob/5d8b1261847c969cfa2eedc2a42b31c84b48035e/opendap_protocol/protocol.py#L443-L447

In #60 I've gone and added some extra encoding checks and a custom OpenDAP Array that encodes dimensions similarly to Hyrax and THREDDS. This seems to take care of IndexErrors in some cases.
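To make the DDS difference concrete, here's a standalone sketch (illustrative helper names, not the actual opendap-protocol internals) of single-dimension versus per-axis dimension encoding:

```python
# Illustrative only: these helpers mimic the two DDS encoding styles;
# they are not the real opendap-protocol functions.

def dds_single_dim(dtype, name, dim_name, size):
    # What a hard-coded single-dimension encoder can produce:
    return f"{dtype} {name}[{dim_name} = {size}];"

def dds_multi_dim(dtype, name, dims):
    # THREDDS/Hyrax-style: one [dim = n] clause per axis.
    clauses = "".join(f"[{dim} = {size}]" for dim, size in dims)
    return f"{dtype} {name}{clauses};"

print(dds_single_dim("Float32", "time", "time", 2))
# -> Float32 time[time = 2];
print(dds_multi_dim("Float32", "time_bnds", [("time", 2), ("tbnd", 2)]))
# -> Float32 time_bnds[time = 2][tbnd = 2];
```

The second form matches the `Float32 time_bnds[time = 2][tbnd = 2];` line that THREDDS emits for this dataset.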

I've included your PRISM dataset as test data and am running automated tests against it to make sure we don't have a regression. Strangely, with pytest locally and in the GitHub Actions matrix it passes, but when tested locally with a nox matrix it fails due to IndexErrors, so I'm even more confused.

It would be great if you all could try testing that PR and let me know if that does the trick.

Some related issues I found along the way:

sjordan29 commented 5 months ago

@abkfenris, thanks so much for digging into this! I will test #60 with PRISM in our Catalog-To-Xpublish implementation and let you know how it goes.

sjordan29 commented 5 months ago

Opening With xarray

Just took a look, and the following still gives me an IndexError:

import xarray as xr

opendap_url = 'http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap'
xr.open_dataset(
    opendap_url,
    engine='netcdf4',
    decode_times=False,
)

However, I can query the time_bnds variable individually now, which did not work before:

dods_url = 'https://labs-beta.waterdata.usgs.gov/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?time_bnds'
xr_dataset = xr.open_dataset(
    dods_url,
    decode_times=False,
)


I can query every individual coordinate and variable on its own (and together in a comma-separated list) except for the tbnd dimension, and the whole dataset will lazy-load:

opendap_url = 'http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?crs,ppt,tmn,tmx,time_bnds'

data_array = xr.open_dataset(
    opendap_url,
    engine='netcdf4',
    decode_times=False,
)


So this feels really close. It's just the full, unconstrained opendap_url = 'http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap' that still gives an IndexError.

I do see the changes reflected in the .dds and .das responses.
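For reference, the per-variable bisection above can be scripted by building one constrained OPeNDAP URL per variable (same hypothetical localhost endpoint as in the snippets above):

```python
# Build constrained OPeNDAP URLs so each variable can be opened in
# isolation, then all together. The endpoint and variable names
# mirror the ones used in this thread.
BASE = "http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap"
variables = ["crs", "ppt", "tmn", "tmx", "time_bnds"]

single_urls = [f"{BASE}?{var}" for var in variables]  # one variable at a time
combined_url = f"{BASE}?{','.join(variables)}"        # all of them together

print(combined_url)
# -> http://localhost:8000/api/xpublish/PRISM_v2/datasets/zarr-s3-osn/opendap?crs,ppt,tmn,tmx,time_bnds
```

Each URL in `single_urls` can then be passed to `xr.open_dataset(..., engine='netcdf4')` to find which variable triggers the error.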

abkfenris commented 5 months ago

Can you try building a new environment? I found that I was sporadically getting those IndexErrors in testing when nox resolved a different environment than I had locally or in GitHub Actions on this PR. I compared pip list outputs but didn't find any version differences that jumped out at me.

sjordan29 commented 4 months ago

Hi @abkfenris, just got around to building a new environment -- apologies for the delay. Still getting the error. A couple of notes:

  1. We're building Catalog-To-Xpublish in Docker, so that environment rebuilds every time I build the Docker image from scratch. The environment we're using can be found here: https://code.usgs.gov/wma/nhgf/xpublish/-/tree/xpublish-opendap?ref_type=heads. Attaching the pip list from inside that Docker container: docker-pip-list.txt

  2. Just rebuilt the local environment used when making requests against the service as well, and still encountered the same behavior as in my most recent comment. Attaching the pip list from that local environment as well: local-pip-list.txt
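The pip-list comparison mentioned above can be automated with a small script (a sketch assuming `pip freeze`-style `name==version` lines; the inline strings below are placeholders, not the actual attached lists):

```python
# Diff two `pip freeze` outputs captured to strings or files,
# reporting only the packages whose versions differ.

def parse_freeze(text):
    pkgs = {}
    for line in text.splitlines():
        if "==" in line:
            name, _, version = line.partition("==")
            pkgs[name.strip().lower()] = version.strip()
    return pkgs

def diff_envs(a_text, b_text):
    a, b = parse_freeze(a_text), parse_freeze(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }

# Placeholder contents -- in practice, read docker-pip-list.txt and
# local-pip-list.txt from disk instead.
local = "xarray==2024.3.0\nnetCDF4==1.6.5\n"
docker = "xarray==2024.2.0\nnetCDF4==1.6.5\n"
print(diff_envs(local, docker))
# -> {'xarray': ('2024.3.0', '2024.2.0')}
```

This makes it easier to spot a version skew between the Docker and local environments than eyeballing two long `pip list` dumps.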