whole-tale / girder_wholetale

Girder plugin providing basic Whole Tale functionality
BSD 3-Clause "New" or "Revised" License

Data Lookup Fails for doi:10.5063/F1BV7DV0 #397

Open ThomasThelen opened 4 years ago

ThomasThelen commented 4 years ago

doi:10.5063/F1BV7DV0 is a DataONE DOI that resolves to the link at the bottom of this issue. When attempting to register it, we get a single file that points to the package landing page rather than the package contents.

To reproduce:

  1. Navigate to the Manage page
  2. Attempt to register the dataset with the DOI doi:10.5063/F1BV7DV0
  3. Note that you get a file instead of a folder

https://knb.ecoinformatics.org/view/doi:10.5063/F1BV7DV0

Xarthisius commented 4 years ago

To reproduce:

#!/usr/bin/env girder-shell
# -*- coding: utf-8 -*-

from girder.plugins.wholetale.lib.dataone.provider import DataOneImportProvider
from girder.plugins.wholetale.lib.entity import Entity
from girder.plugins.wholetale.lib.dataone import DataONELocations

uri = "https://knb.ecoinformatics.org/view/doi:10.5063/F1BV7DV0"
entity = Entity(uri, None)
entity["base_url"] = DataONELocations.prod_cn
data_map = DataOneImportProvider().lookup(entity)

yields:

Traceback (most recent call last):
  File "<string>", line 11, in <module>
  File "server/lib/dataone/provider.py", line 43, in lookup
    dm = D1_lookup(entity.getValue(), entity['base_url'])
  File "server/lib/dataone/register.py", line 244, in D1_lookup
    package_pid = get_package_pid(path, base_url)
  File "server/lib/dataone/register.py", line 210, in get_package_pid
    pid = find_resource_pid(initial_pid, base_url)
  File "server/lib/dataone/register.py", line 114, in find_resource_pid
    base_url=base_url)
  File "server/lib/dataone/register.py", line 156, in find_nonobsolete_resmaps
    raise RestException('No results were found for identifier(s): {}.'.format(", ".join(pids)))
girder.exceptions.RestException: No results were found for identifier(s): resource_map_doi:10.5063/F1BV7DV0, urn:uuid:73cb0fbb-2ff2-452d-bc7b-d946968d1aad.

ThomasThelen commented 4 years ago

The request going out to DataONE, which gives us 0 results, is:

https://cn.dataone.org/cn/v2/query/solr/?q=identifier:(%22resource_map_doi:10.5063/F1BV7DV0%22%20OR%20%22urn:uuid:73cb0fbb-2ff2-452d-bc7b-d946968d1aad%22)+AND+-obsoletedBy:*&fl=identifier&rows=1000&start=0&wt=json

I believe the one we should be sending is the following (note that the DOI doesn't have the resource_map_ string prepended to it):

https://cn.dataone.org/cn/v2/query/solr/?q=identifier:(%22doi:10.5063/F1BV7DV0%22%20OR%20%22urn:uuid:73cb0fbb-2ff2-452d-bc7b-d946968d1aad%22)+AND+-obsoletedBy:*&fl=identifier&rows=1000&start=0&wt=json

We get the DOI pid with resource_map_ prepended when we call https://cn.dataone.org/cn/v2/query/solr/?q=identifier:"doi%3A10.5063%2FF1BV7DV0"&fl=identifier,formatType,formatId,resourceMap&rows=1000&start=0&wt=json

I need to check why we're using resource_map_doi:10.5063/F1BV7DV0 instead of doi:10.5063/F1BV7DV0.
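For reference, here is a minimal sketch of how a query of this shape could be assembled from a list of pids. The names `build_resmap_query` and `CN_SOLR` are hypothetical and this is not the code in server/lib/dataone/register.py; it only assumes the Solr endpoint and the parameters visible in the URLs above. Note that `urlencode` percent-encodes the query differently from the hand-written URLs, which should be equivalent once decoded.

```python
from urllib.parse import urlencode

# Coordinating node Solr endpoint, as seen in the URLs above.
CN_SOLR = "https://cn.dataone.org/cn/v2/query/solr/"


def build_resmap_query(pids, base_url=CN_SOLR):
    """Return a Solr URL asking which of `pids` are not obsoleted.

    Hypothetical helper: the pids are passed through unchanged, so no
    'resource_map_' prefix is added unless the caller supplies it.
    """
    quoted = " OR ".join('"{}"'.format(pid) for pid in pids)
    params = {
        "q": "identifier:({}) AND -obsoletedBy:*".format(quoted),
        "fl": "identifier",
        "rows": 1000,
        "start": 0,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


url = build_resmap_query([
    "doi:10.5063/F1BV7DV0",
    "urn:uuid:73cb0fbb-2ff2-452d-bc7b-d946968d1aad",
])
print(url)
```

Printing the URL before sending it makes it easy to spot an unexpected prefix on the identifier.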

ThomasThelen commented 4 years ago

Here's the full story of what's happening. The DOI that I was attempting to register was obsoleted by a newer version. When a package has newer versions, we attempt to register the latest one. However, the latest version here has a private resource map, which we can't read.

Per @mbjones, we shouldn't be trying to register the latest version; we should instead register the one that was entered into the field.

The behaviour that was encountered here shouldn't happen after we remove the logic for locating new versions.

Xarthisius commented 4 years ago

We will have to take care of the names in that case. For Zenodo we prepend the version number to the name of the dataset. See e.g.:

https://girder.stage.wholetale.org/#collection/596793f2ebde2c0001b03dbe/folder/5e3879ff8bec16c2663dd119

Is it possible to get info about the number of preceding versions in DataONE, or is it just a linked list of "ObsoletedBy" and "Replaces"?

mbjones commented 4 years ago

Right now it's just the list of obsoletes/obsoletedBy adjacency pairs. But we were just talking today about the need to provide full version chain metadata through our API without a client having to walk the chain. If that is critical to your implementation, we should discuss what you would need so we can incorporate it into the API. Note, of course, that we don't use simple serial version numbers.

ThomasThelen commented 4 years ago

It's a doubly linked list with obsoletedBy and obsoletes in the system metadata.

Example:

Initial package which has its system metadata here is obsoleted by this package which has a system metadata doc here which is in turn obsoleted by this package with this system metadata document.

Looking at the system metadata document of the first package, which has two newer versions, we can see that it only points to the one directly after it: <obsoletedBy>urn:uuid:cd77eb64-a9bf-4989-af6a-15d3f981188c</obsoletedBy>

We can ask Solr for the pid that obsoletes another pid. If Solr gives us a result, we run the same query on the new pid to see if it's obsoleted too, and repeat. It's O(n) in the length of the chain, but long obsoletion chains aren't common. If we walk the chain backwards (looking for things that were obsoleted by our package), we can get a count, which could serve as a version number.
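The walk described above could be sketched roughly as follows. Everything here is an assumption for illustration: `walk_chain` and `version_number` are hypothetical names, and the lookup is injected as a callable so the sketch runs without network access. In practice it would be backed by a Solr query or a system metadata read, and the short pids are made up.

```python
def walk_chain(pid, lookup, max_steps=100):
    """Follow links from `pid` until `lookup` returns None.

    `lookup(pid)` must return the neighbouring pid in one direction
    (e.g. the pid that obsoletes it) or None at the end of the chain.
    `max_steps` and the `seen` set guard against cycles or bad data.
    """
    chain = []
    seen = {pid}
    current = lookup(pid)
    while current is not None and current not in seen and len(chain) < max_steps:
        chain.append(current)
        seen.add(current)
        current = lookup(current)
    return chain


def version_number(pid, get_obsoletes):
    """1-based position in the chain: number of predecessors plus one."""
    return len(walk_chain(pid, get_obsoletes)) + 1


# In-memory stand-in for the obsoletes/obsoletedBy pairs (made-up pids).
obsoleted_by = {"pkg-a": "pkg-b", "pkg-b": "pkg-c"}
obsoletes = {new: old for old, new in obsoleted_by.items()}

newer = walk_chain("pkg-a", obsoleted_by.get)   # walk forwards
position = version_number("pkg-c", obsoletes.get)  # walk backwards
```

Each step costs one lookup, so the cost grows linearly with the chain length, as noted above.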

Xarthisius commented 4 years ago

Appending some sort of unique identifier to denote a version is purely a user-facing change. On the backend we match our internal uuid with an external uuid, so it is easy to detect that dataset A and dataset B differ, even though they have the same name. On the other hand, users only see the Catalog with the names and have to somehow know which one they want to pick.

While the "version" doesn't have to be a number corresponding to the position in the chain, I don't think we can use something as complicated as a urn:uuid.

There are only two requirements: it has to be unique and "pretty" :)
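One possible way to get a short label, purely as a sketch: derive a few hex characters from a hash of the pid and attach them to the dataset name. The function names, the 7-character length, and the "name (tag)" format are all assumptions, not anything Whole Tale or DataONE does. A truncated hash is only unique with high probability, so a real implementation would still need to handle collisions.

```python
import hashlib


def short_tag(pid, length=7):
    """Return the first `length` hex characters of the SHA-1 of `pid`."""
    return hashlib.sha1(pid.encode("utf-8")).hexdigest()[:length]


def versioned_name(name, pid):
    """Build a user-facing name such as 'My dataset (1a2b3c4)'."""
    return "{} ({})".format(name, short_tag(pid))


label = versioned_name(
    "Example dataset",  # placeholder name
    "urn:uuid:cd77eb64-a9bf-4989-af6a-15d3f981188c",
)
```

The tag is stable for a given pid, so re-registering the same version always produces the same name.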