whole-tale / girder_wholetale

Girder plugin providing basic Whole Tale functionality

Data Lookup Failing #176

Closed ThomasThelen closed 5 years ago

ThomasThelen commented 5 years ago

It looks like the /repository/lookup endpoint has been misbehaving since the Globus integration merge.

To Reproduce:

dataId: {"dataId": "urn:uuid:05d9dab8-5783-498d-a257-2b94da4dbe14" }
base_url: https://cn.dataone.org/cn/v2

Error:

{
  "message": "Exception: Exception('Failed to interpret \"dataId\" in any meaningful way',)",
  "trace": [
    "<FrameSummary file /girder/girder/api/rest.py, line 630 in endpointDecorator>",
    "<FrameSummary file /girder/girder/api/rest.py, line 1210 in GET>",
    "<FrameSummary file /girder/girder/api/rest.py, line 967 in handleRoute>",
    "<FrameSummary file /girder/girder/api/describe.py, line 709 in wrapped>",
    "<FrameSummary file /girder/plugins/wholetale/server/rest/repository.py, line 58 in lookupData>",
    "<FrameSummary file /girder/plugins/wholetale/server/lib/null_provider.py, line 15 in lookup>"
  ],
  "type": "internal"
}

Because we can't locate the datasets, data registration is failing. Note that the same lookup worked when I tested it on dev2, which doesn't have the update.

Xarthisius commented 5 years ago

Consensus on #165 was that we will no longer support raw identifiers that are not unique. However, using them with something that indicates a provider should still work. Try:

dataId = ["https://search.dataone.org/view/urn:uuid:05d9dab8-5783-498d-a257-2b94da4dbe14"]

ThomasThelen commented 5 years ago

This means that anyone trying to register a dataset without the full URL won't be able to bring their data in, which would be a big regression from previous behavior. Your example above 100% works, but I think a number of people will try to register their data using a bare identifier.

Example: Someone comes into wholetale with an identifier, doi:10.5063/F12805V3. Their data won't be found unless they use https://search.dataone.org/view/doi:10.5063/F12805V3 instead.

Edit: After chatting with Kacper it looks like my example works in the dashboard. Continuing to investigate....

Just for the sake of tracking examples (maybe edge cases) where the full URI works but the DataONE identifier doesn't.

Yes - https://search.dataone.org/view/ess-dive-77b46fa58849483-20181114T175016467
No - ess-dive-77b46fa58849483-20181114T175016467
No - doi:10.15485/1464233 (DOI of the above)

Yes - https://search.dataone.org/view/https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/363/19
No - https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/363/19

Yes - https://search.dataone.org/view/https://pasta.lternet.edu/package/metadata/eml/edi/192/3
No - https://pasta.lternet.edu/package/metadata/eml/edi/192/3

If this is the case, @craig-willis, let me know if you want me to update the user documentation.

Xarthisius commented 5 years ago

dataId = ["doi:10.5063/F12805V3"] should also work.
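For anyone reproducing these lookups outside the dashboard, here is a minimal sketch of how the query could be assembled. It assumes the endpoint is a GET on /repository/lookup that takes `dataId` as a JSON-encoded list plus `base_url`, as the examples in this thread suggest. The host name and the `/api/v1` mount point are placeholders, and `lookup_url` is a hypothetical helper, not part of girder_wholetale.

```python
import json
from urllib.parse import urlencode

# Assumption: placeholder host, with Girder's REST API mounted at /api/v1.
GIRDER_API = "https://girder.example.org/api/v1"


def lookup_url(data_ids, base_url="https://cn.dataone.org/cn/v2"):
    """Build a GET URL for /repository/lookup (hypothetical helper).

    data_ids is a list of identifiers; it is sent as one JSON-encoded
    query parameter, mirroring the dataId = [...] form used above.
    """
    query = urlencode({"dataId": json.dumps(data_ids), "base_url": base_url})
    return f"{GIRDER_API}/repository/lookup?{query}"


# Both forms reported as working in this thread.
print(lookup_url(["doi:10.5063/F12805V3"]))
print(lookup_url(["https://search.dataone.org/view/doi:10.5063/F12805V3"]))
```

Building the URL separately from sending it makes it easy to paste the exact request into a bug report.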

craig-willis commented 5 years ago

What about 10.5065/D6862DM8 (without the doi: protocol)? This worked previously and is in the quickstart documentation, but no longer appears to. Note that https://citation.crosscite.org/ allows this, but of course assumes DOI at all times.

craig-willis commented 5 years ago

It seems that we should 1) do some validation at the UI end that the provided ID is valid and/or 2) propagate errors from the backend.
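A rough sketch of what option 1 could look like, written in Python for consistency with the plugin even though the real check would live in the UI. `classify_identifier` and the DOI pattern are illustrative assumptions, not existing Whole Tale code; the pattern only checks the general "10.<registrant>/<suffix>" shape and does not prove the DOI exists.

```python
import re

# Assumption: a bare DOI looks like "10.<4-9 digits>/<non-space suffix>".
_BARE_DOI = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_identifier(raw):
    """Return (kind, normalized) for a user-supplied identifier (hypothetical helper).

    kind is one of "url", "doi", "uuid", or "unknown". Bare DOIs get the
    "doi:" prefix added so that the backend sees an unambiguous form.
    """
    value = raw.strip()
    lowered = value.lower()
    if lowered.startswith(("http://", "https://")):
        return "url", value
    if lowered.startswith("doi:"):
        return "doi", "doi:" + value[4:]
    if _BARE_DOI.match(value):
        return "doi", "doi:" + value
    if lowered.startswith("urn:uuid:"):
        return "uuid", value
    # Nothing recognizable: surface this to the user instead of failing later.
    return "unknown", value


print(classify_identifier("10.5065/D6862DM8"))
print(classify_identifier("ess-dive-77b46fa58849483-20181114T175016467"))
```

An "unknown" result is where the UI could ask for a full URL, which also covers the error-propagation concern in option 2 for the most common failure.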

Xarthisius commented 5 years ago

doi:10.15485/1464233

This one is tricky: it resolves to https://www.osti.gov/servlets/purl/1464233/, which in turn returns 302 to https://data.ess-dive.lbl.gov/#view/doi:10.15485/1464233

https://pasta.lternet.edu

It's not listed as an MN by the CN, so how do we know it's DataONE?

ThomasThelen commented 5 years ago

This can be located with the DataONE CN resolve endpoint, which is the preferable way of locating resources as opposed to using the member node api. https://cn.dataone.org/cn/v2/resolve/doi:10.15485/1464233

Resolves to

https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-77b46fa58849483-20181114T175016467

We list out the nodes here and have ess-dive listed second to last. I think it's unreliable to try to match the Base URL parameter with the dataset location. It works in the case of ess-dive, but not with other services (https://doi.pangaea.de/10.1594/PANGAEA.895994 does not share the DataONE Base URL of https://pangaea-orc-1.dataone.org/mn).
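To make the mismatch concrete, here is a small sketch that compares hosts using only URLs quoted in this thread. `same_host` is a hypothetical helper used for illustration; it is not how the plugin matches providers.

```python
from urllib.parse import urlparse


def same_host(url_a, url_b):
    """Return True when both URLs share a host name (hypothetical helper)."""
    # urlparse(...).hostname is already lower-cased, so the comparison
    # is case-insensitive.
    return urlparse(url_a).hostname == urlparse(url_b).hostname


# ess-dive: the landing page and the resolved object live on the same host.
print(same_host(
    "https://data.ess-dive.lbl.gov/#view/doi:10.15485/1464233",
    "https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/"
    "ess-dive-77b46fa58849483-20181114T175016467",
))

# PANGAEA: the DOI landing host differs from the member node Base URL.
print(same_host(
    "https://doi.pangaea.de/10.1594/PANGAEA.895994",
    "https://pangaea-orc-1.dataone.org/mn",
))
```

Any provider check built on host equality would accept the first pair and reject the second, which is why asking the CN directly is the safer route.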

Xarthisius commented 5 years ago

> This can be located with the DataONE CN resolve endpoint, which is the preferable way of locating resources as opposed to using the member node api. https://cn.dataone.org/cn/v2/resolve/doi:10.15485/1464233
>
> Resolves to
>
> https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-77b46fa58849483-20181114T175016467

Yeah, I see data.ess-dive.lbl.gov; it's pasta.lternet.edu that's not listed.

Xarthisius commented 5 years ago

Just noting that none of these work in current prod env (using old lookup framework):

Others I'll address in PR shortly.

ThomasThelen commented 5 years ago

I don't think that pasta.lternet.edu is a DataONE member node since we list the LTER MN as https://gmn.lternet.edu/mn.

In the case of https://search.dataone.org/view/https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/363/19 we're using a URL as the identifier of the resource (we can't assume that this is where the package actually lives), and using resolve we can see that the location of this package is on the LTER MN listed on the node endpoint. This is super confusing because we can also get to the resource by visiting https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/363/19, but AFAIK that page is not using the DataONE API.
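Since the identifier here is itself a URL, one way to keep the two apart is to peel the identifier off the search.dataone.org view URL before doing anything else. This is a sketch under the assumption that everything after `/view/` is the identifier, possibly percent-encoded; `identifier_from_view_url` is a hypothetical helper.

```python
from urllib.parse import unquote

VIEW_PREFIX = "https://search.dataone.org/view/"


def identifier_from_view_url(url):
    """Return the identifier embedded in a view URL, or None (hypothetical helper)."""
    if not url.startswith(VIEW_PREFIX):
        return None
    # Everything after /view/ is treated as the identifier, which may
    # itself be a URL. Decode it in case the link was percent-encoded.
    return unquote(url[len(VIEW_PREFIX):])


print(identifier_from_view_url(
    "https://search.dataone.org/view/"
    "https://pasta.lternet.edu/package/metadata/eml/knb-lter-bnz/363/19"
))
```

The extracted value is only an identifier at this point; where the package actually lives still has to come from resolve.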

The issue (which we've confirmed in the PR) is that there isn't a common identifier format on the DataONE side; we have no way of telling that a URL is actually an identifier belonging to DataONE. The only resolution I can see is to send a query to the resolve endpoint and see if we get a hit.
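A minimal sketch of that probe, with several assumptions flagged: the function names are hypothetical, the identifier is percent-encoded as a precaution (the example earlier in the thread uses it unencoded), and a 200 or 303 status is treated as a hit while anything else is a miss. The status handling in particular should be checked against the DataONE API documentation before relying on it.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import HTTPRedirectHandler, build_opener

CN_BASE = "https://cn.dataone.org/cn/v2"


def resolve_url(pid, base_url=CN_BASE):
    """Build the CN resolve URL for an identifier (hypothetical helper)."""
    # Encode the whole identifier so ":" and "/" inside it cannot be
    # read as part of the URL path.
    return f"{base_url}/resolve/{quote(pid, safe='')}"


class _NoRedirect(HTTPRedirectHandler):
    """Stop urllib from following redirects so the first status is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def http_status(url, timeout=30):
    """Return the HTTP status of a GET without following redirects."""
    opener = build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # Unfollowed redirects and error responses both arrive here.
        return err.code


def is_dataone_identifier(pid, status_of=http_status):
    """Assumption: 200 or 303 from resolve means the CN knows this identifier."""
    return status_of(resolve_url(pid)) in (200, 303)
```

`status_of` is injectable so the decision logic can be exercised without network access, for example `is_dataone_identifier(pid, status_of=lambda url: 404)`. Network failures are deliberately left to propagate, since a timeout is not evidence that the identifier is unknown.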