Open ThomasThelen opened 4 years ago
To reproduce:
#!/usr/bin/env girder-shell
# -*- coding: utf-8 -*-
from girder.plugins.wholetale.lib.dataone.provider import DataOneImportProvider
from girder.plugins.wholetale.lib.entity import Entity
from girder.plugins.wholetale.lib.dataone import DataONELocations
uri = "https://knb.ecoinformatics.org/view/doi:10.5063/F1BV7DV0"
entity = Entity(uri, None)
entity["base_url"] = DataONELocations.prod_cn
data_map = DataOneImportProvider().lookup(entity)
yields:
Traceback (most recent call last):
  File "<string>", line 11, in <module>
  File "server/lib/dataone/provider.py", line 43, in lookup
    dm = D1_lookup(entity.getValue(), entity['base_url'])
  File "server/lib/dataone/register.py", line 244, in D1_lookup
    package_pid = get_package_pid(path, base_url)
  File "server/lib/dataone/register.py", line 210, in get_package_pid
    pid = find_resource_pid(initial_pid, base_url)
  File "server/lib/dataone/register.py", line 114, in find_resource_pid
    base_url=base_url)
  File "server/lib/dataone/register.py", line 156, in find_nonobsolete_resmaps
    raise RestException('No results were found for identifier(s): {}.'.format(", ".join(pids)))
girder.exceptions.RestException: No results were found for identifier(s): resource_map_doi:10.5063/F1BV7DV0, urn:uuid:73cb0fbb-2ff2-452d-bc7b-d946968d1aad.
is the request that's going out to DataONE giving us 0 results.
I believe the one we should be sending is the following (note that the DOI doesn't have the resource_map string prepended to it).
https://cn.dataone.org/cn/v2/query/solr/?q=identifier:(%22doi:10.5063/F1BV7DV0%22%20OR%20%22urn:uuid:73cb0fbb-2ff2-452d-bc7b-d946968d1aad%22)+AND+-obsoletedBy:*&fl=identifier&rows=1000&start=0&wt=json
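For illustration, a query of this shape could be assembled roughly as follows. This is a minimal sketch, not the code in register.py: the helper name build_nonobsolete_query is hypothetical, and it assumes the identifiers are passed through exactly as given, with no resource_map prefix added.

```python
from urllib.parse import quote


def build_nonobsolete_query(pids, base_url="https://cn.dataone.org/cn/v2"):
    """Build a Solr URL matching any of `pids` that has not been obsoleted.

    Hypothetical helper for illustration; not part of the wholetale plugin.
    """
    # Wrap each identifier in double quotes so Solr treats ':' and '/' literally.
    terms = " OR ".join('"{}"'.format(pid) for pid in pids)
    # Exclude every document that has an obsoletedBy value.
    query = "identifier:({}) AND -obsoletedBy:*".format(terms)
    params = "fl=identifier&rows=1000&start=0&wt=json"
    # Percent-encode the whole query string, including ':' and '/'.
    return "{}/query/solr/?q={}&{}".format(base_url, quote(query, safe=""), params)


url = build_nonobsolete_query([
    "doi:10.5063/F1BV7DV0",
    "urn:uuid:73cb0fbb-2ff2-452d-bc7b-d946968d1aad",
])
print(url)
```

The sketch encodes more characters than the hand-written URL above does, but both should decode to the same Solr query.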
We get the DOI pid with resource_map prepended when we call
https://cn.dataone.org/cn/v2/query/solr/?q=identifier:"doi%3A10.5063%2FF1BV7DV0"&fl=identifier,formatType,formatId,resourceMap&rows=1000&start=0&wt=json
I need to check out why we're using resource_map_doi:10.5063/F1BV7DV0 instead of doi:10.5063/F1BV7DV0.
Here's the full story of what's happening. The DOI that I was attempting to register was obsoleted by a newer version. When packages have newer versions, we attempt to register the latest version. However, the latest version has a private resource map, which we can't read.
Per @mbjones, we shouldn't be trying to register the latest version; instead we should register the one that was entered into the field.
The behaviour that was encountered here shouldn't happen after we remove the logic for locating new versions.
We will have to take care of the names in that case. For Zenodo we prepend the version number to the name of the dataset. See e.g.:
Is it possible to get info about number of preceding versions in DataONE, or is it just a linked list of "ObsoletedBy" and "Replaces"?
Right now it's just the list of obsoletes/obsoletedBy adjacency pairs. But we were just talking today about the need to be able to provide full version chain metadata through our API without a client having to walk the chain. If that is critical to your implementation, we should discuss what you would need so we can incorporate it into the API. Note, of course, that we don't use simple serial version numbers.
It's a doubly linked list with obsoletedBy and obsoletes in the system metadata.
Example:
Initial package which has its system metadata here is obsoleted by this package which has a system metadata doc here which is in turn obsoleted by this package with this system metadata document.
Looking at the system metadata document of the first package, which has two newer versions, we can see that it only points to the one directly above it.
<obsoletedBy>urn:uuid:cd77eb64-a9bf-4989-af6a-15d3f981188c</obsoletedBy>
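As a rough sketch of how a client could read those two links out of a system metadata document: the sample XML below is abridged and hand-written (the identifier value and the namespace URI are assumptions, not copied from the real document), and version_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abridged, hand-written sample; a real system metadata document has many more fields.
SAMPLE_SYSMETA = """<d1:systemMetadata xmlns:d1="http://ns.dataone.org/service/types/v1">
  <identifier>urn:uuid:example-first-version</identifier>
  <obsoletedBy>urn:uuid:cd77eb64-a9bf-4989-af6a-15d3f981188c</obsoletedBy>
</d1:systemMetadata>"""


def version_links(sysmeta_xml):
    """Return the obsoletes/obsoletedBy pids found in a system metadata string."""
    links = {"obsoletes": None, "obsoletedBy": None}
    for child in ET.fromstring(sysmeta_xml):
        # Drop a "{namespace}" prefix if the element happens to carry one.
        name = child.tag.rsplit("}", 1)[-1]
        if name in links:
            links[name] = (child.text or "").strip() or None
    return links


print(version_links(SAMPLE_SYSMETA))
```

A missing element comes back as None, which marks the head (no obsoletes) or the tail (no obsoletedBy) of the chain.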
We can ask SOLR for the pid that obsoletes another pid. If SOLR gives us a result, we would run that same query on the new pid to see if it's obsoleted, and repeat. It's O(n), but long obsoletion chains aren't common. If we walk the obsoletion chain backwards (looking for things that were obsoletedBy our package) we can get a count, which could serve as a version number.
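The walk described above can be sketched as follows. The lookups are passed in as plain callables so the example runs without a network; in practice each call would be a SOLR query or a system metadata fetch. The function names and the toy pids "A", "B", "C" are made up for illustration.

```python
def latest_version(pid, obsoleted_by):
    """Follow obsoletedBy links forward from `pid` to the newest version."""
    seen = {pid}
    while True:
        newer = obsoleted_by(pid)
        # Stop at the end of the chain, or if the links ever loop back.
        if newer is None or newer in seen:
            return pid
        seen.add(newer)
        pid = newer


def version_number(pid, obsoletes):
    """Count predecessors of `pid`; the first version in a chain is 1."""
    count = 1
    seen = {pid}
    older = obsoletes(pid)
    while older is not None and older not in seen:
        count += 1
        seen.add(older)
        older = obsoletes(older)
    return count


# Toy chain A -> B -> C standing in for real lookups.
OBSOLETED_BY = {"A": "B", "B": "C"}
OBSOLETES = {new: old for old, new in OBSOLETED_BY.items()}

print(latest_version("A", OBSOLETED_BY.get))  # C
print(version_number("C", OBSOLETES.get))     # 3
```

Each step costs one lookup, so the total is linear in the chain length, as noted above.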
Appending some sort of unique identifier to denote a version is purely a user-facing change. On the backend we match our internal uuid with an external uuid, so it is easy to detect that dataset A and dataset B differ, even though they have the same name. On the other hand, users only see the Catalog with the names and have to somehow know which one they want to pick.
While the "version" doesn't have to be a number corresponding to the position in the chain, I don't think we can use something as complicated as a urn:uuid.
There are only two requirements: it has to be unique and "pretty" :)
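One possible way to meet both requirements, assuming the position in the obsoletion chain is used as the label: the helper label_versions and the "(version N)" format are hypothetical choices for this sketch, not something the plugin does today.

```python
def label_versions(name, chain):
    """Map each pid in `chain` (oldest first) to a display name with a version label."""
    return {
        pid: "{} (version {})".format(name, position)
        for position, pid in enumerate(chain, start=1)
    }


labels = label_versions("My Dataset", ["A", "B", "C"])
print(labels["B"])  # My Dataset (version 2)
```

The labels are only unique within one chain; two unrelated datasets that share a name would still need something else to tell them apart.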
doi:10.5063/F1BV7DV0 is a DataONE doi that points to the link at the bottom of this issue. When attempting to register, we get a file that points us to the package landing page.
To reproduce:
https://knb.ecoinformatics.org/view/doi:10.5063/F1BV7DV0