Xarthisius closed this 2 years ago
Sad that I didn't catch this eons ago. The fix works for DataONE, Zenodo, and Dataverse.

My test case:

1. Create a new tale.
2. Register and add files from each of the following:
3. Export the tale, extract the zip, fetch, and validate as above:

   ```
   $ bdbag --resolve-fetch all .
   $ bdbag --validate full .
   ```
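For automated tests, the same roundtrip can be scripted with bdbag's Python API (a sketch, assuming the `bdbag_api` helper functions; `tale_bag` is a placeholder path to the extracted bag):

```python
from bdbag import bdbag_api

bag_path = "tale_bag"  # placeholder: directory of the extracted bag

# Pull remote payload files listed in fetch.txt into the bag.
bdbag_api.resolve_fetch(bag_path, force=True)
# Full validation: recompute every checksum in the manifests.
bdbag_api.validate_bag(bag_path, fast=False)
# Profile conformance is a separate call, mirroring --validate-profile.
bdbag_api.validate_bag_profile(bag_path)
```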
What are your thoughts on Globus/MDF? I can't even register http://dx.doi.org/doi:10.18126/M2301J right now:
```
Traceback (most recent call last):
  File "/girder/girder/api/rest.py", line 630, in endpointDecorator
    val = fun(self, path, params)
  File "/girder/girder/api/rest.py", line 1230, in POST
    return self.handleRoute(method, path, params)
  File "/girder/girder/api/rest.py", line 970, in handleRoute
    val = handler(**kwargs)
  File "/girder/girder/api/access.py", line 63, in wrapped
    return fun(*args, **kwargs)
  File "/girder/girder/api/rest.py", line 445, in wrapped
    val = fun(*args, **kwargs)
  File "/girder/girder/api/describe.py", line 709, in wrapped
    return fun(*args, **kwargs)
  File "/girder/plugins/wholetale/server/rest/dataset.py", line 241, in importData
    Job().scheduleJob(job)
  File "/girder/plugins/jobs/server/models/job.py", line 346, in scheduleJob
    events.trigger('jobs.schedule', info=job)
  File "/girder/girder/events.py", line 314, in trigger
    handler['handler'](e)
  File "/girder/plugins/jobs/server/__init__.py", line 43, in scheduleLocal
    fn(job)
  File "/girder/plugins/wholetale/server/tasks/register_dataset.py", line 33, in run
    importedData = register_dataMap(
  File "/girder/plugins/wholetale/server/lib/__init__.py", line 107, in register_dataMap
    objType, obj = provider.register(
  File "/girder/plugins/wholetale/server/lib/import_providers.py", line 86, in register
    for item in self._listRecursive(user, pid, name, base_url, progress=progress):
  File "/girder/plugins/wholetale/server/lib/globus/globus_provider.py", line 98, in _listRecursive
    tc = self.clients.getUserTransferClient(user)
  File "/girder/plugins/globus_handler/server/clients.py", line 39, in getUserTransferClient
    self.userClients[username] = TransferClient(authz)
TypeError: __init__() takes 1 positional argument but 2 were given
```
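The TypeError smells like a globus-sdk upgrade. A minimal sketch of the likely fix, assuming newer globus-sdk releases make `authorizer` keyword-only on client constructors (the token below is a placeholder):

```python
import globus_sdk

# Placeholder token; in the plugin this comes from the user's Globus auth.
authz = globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")

# Recent globus-sdk releases accept the authorizer only as a keyword,
# which would explain the TypeError raised by TransferClient(authz).
tc = globus_sdk.TransferClient(authorizer=authz)
```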
Globus unfortunately doesn't provide checksums. There might be a way to query them, but I'm still investigating.
Merging #518 (e9986ee) into master (2e2ae35) will decrease coverage by 0.57%. The diff coverage is 94.95%.
```diff
@@            Coverage Diff             @@
##           master     #518      +/-   ##
==========================================
- Coverage   93.41%   92.84%   -0.58%
==========================================
  Files          54       54
  Lines        4179     4265      +86
==========================================
+ Hits         3904     3960      +56
- Misses        275      305      +30
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| server/lib/dataone/register.py | 76.25% <ø> (-7.20%) | :arrow_down: |
| server/rest/tale.py | 96.44% <ø> (-0.04%) | :arrow_down: |
| server/lib/globus/globus_provider.py | 63.55% <66.66%> (-0.26%) | :arrow_down: |
| server/lib/exporters/__init__.py | 96.77% <85.71%> (-1.41%) | :arrow_down: |
| server/lib/dataverse/provider.py | 97.04% <92.30%> (-0.24%) | :arrow_down: |
| server/lib/manifest_parser.py | 79.05% <95.00%> (-14.70%) | :arrow_down: |
| server/__init__.py | 93.25% <100.00%> (+0.06%) | :arrow_up: |
| server/lib/dataone/provider.py | 100.00% <100.00%> (+3.17%) | :arrow_up: |
| server/lib/exporters/bag.py | 100.00% <100.00%> (ø) | |
| server/lib/manifest.py | 92.97% <100.00%> (+0.90%) | :arrow_up: |
| ... and 7 more | | |
Legend: Δ = absolute `<relative>` (impact), ø = not affected, ? = missing data
This has exposed how inadequate our testing has been for bags.

- Depending on when we validate the bag, the Payload-Oxum changes. The current oxum assumes validation before fetch and will fail post-fetch. I think the oxum needs to be valid for the post-fetch bag.
- For future testing, `bdbag --validate full` ignores oxum, `bdbag --validate fast` uses oxum, and `python -m bagit --validate` also uses oxum. Going forward, we should test all of these variations, both pre- and post-fetch. Also, to check the profile we need to run `bdbag --validate-profile full` as a separate command.

Responding to your points:

- I think the changes in point 1 make sense, but to confirm: before this PR only non-root folders without URIs were expanded (with `expand_folders=True`). Root folders weren't expanded because we relied (in theory) on bdbag to recursively fetch based on a DOI. What is the case of a folder with a URI -- how did that work with fetch, and can you give me an example?
- Can you clarify/provide an example of what would break the new `fold_hierarchy_smart` mentioned in the comment?
- `bdbag --validate-profile full` does fail. We can define our own profile. The only implication I can think of is when we try to publish our bags to DERIVA, since we're both using the same profile. Do we need to factor DERIVA integration into this PR? We probably should, but it would require fleshing out publish.
- What is the Dataverse tab-file workaround? Do we not calculate the checksum as part of DMS when/if the file is downloaded? Could we store the calculated value for future exports instead of re-calculating?
> This has exposed how inadequate our testing has been for bags.
>
> - Depending on when we validate the bag, the Payload-Oxum changes. The current oxum assumes validation before fetch and will fail post-fetch. I think the oxum needs to be valid for the post-fetch bag.

That was a simple mistake on my end. It's been fixed in https://github.com/whole-tale/girder_wholetale/pull/518/commits/1c2a06867cc9f9f533175ccd7a271c0699f4b170.
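For reference, Payload-Oxum is just `<octet count>.<stream count>` computed over the payload directory, so the post-fetch value can be recomputed with a helper like this (hypothetical, not code from the PR):

```python
import os

def payload_oxum(bag_dir):
    """Return Payload-Oxum ('<total octets>.<file count>') for data/."""
    octets, streams = 0, 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octets += os.path.getsize(os.path.join(root, name))
            streams += 1
    return f"{octets}.{streams}"
```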
> - For future testing, `bdbag --validate full` ignores oxum, `bdbag --validate fast` uses oxum, and `python -m bagit --validate` also uses oxum. Going forward, we should test all of these variations, both pre- and post-fetch. Also, to check the profile we need to run `bdbag --validate-profile full` as a separate command.
All of the above, ~~with the exception of profile validation~~, should work post-fetch.
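The oxum-vs-checksum split is also easy to exercise from Python with bagit-python (a sketch; `tale_bag` is a placeholder path):

```python
import bagit

bag = bagit.Bag("tale_bag")
bag.validate(fast=True)   # oxum check only: total bytes + file count
bag.validate()            # full validation: recompute per-file checksums
```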
> Responding to your points:
>
> - I think the changes in point 1 make sense, but to confirm: before this PR only non-root folders without URIs were expanded (with `expand_folders=True`). Root folders weren't expanded because we relied (in theory) on bdbag to recursively fetch based on a DOI. What is the case of a folder with a URI -- how did that work with fetch, and can you give me an example?
Take this fetch.txt entry (URL, length in bytes, target path):

```
globus://82f1b5c6-6e9b-11e5-ba47-22000b92c6ec//published/publication_113/data 710251866817 data/data/data/
```

It is part of doi:10.18126/M2301J and has a perfectly nice folder URI.
> - Can you clarify/provide an example of what would break the new `fold_hierarchy_smart` mentioned in the comment?
E.g. a `dataSet` where a file from dataset A was put in a folder from dataset B. Another example is doing something like "my_custom_label1/…". Paths in `dataSet` are arbitrary: if you set them manually, only the sky is the limit, and the silly assumption that the first element of the path means something goes out of the window...
> - `bdbag --validate-profile full` does fail. We can define our own profile. The only implication I can think of is when we try to publish our bags to DERIVA, since we're both using the same profile. Do we need to factor DERIVA integration into this PR? We probably should, but it would require fleshing out publish.
I reverted it for now (e805118), since I don't really need it atm. Fun fact though: apparently each manifest-<algorithm> file can cover a different subset of the payload, e.g. only `data/LICENSE`. BDBag is perfectly happy with that. So technically I could dump an empty sha256 manifest and keep using sha512 and md5 as the default...
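If we go that route, bagit-python can already emit several manifests side by side (a sketch, assuming the `checksums=` keyword of `make_bag`; the path is a placeholder):

```python
import bagit

# Writes one manifest-<alg>.txt per algorithm; validation then checks
# whichever manifests are present in the bag.
bag = bagit.make_bag("tale_payload_dir", checksums=["sha512", "md5"])
bag.validate()
```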
> - What is the Dataverse tab-file workaround? Do we not calculate the checksum as part of DMS when/if the file is downloaded? Could we store the calculated value for future exports instead of re-calculating?
It's calculated only once, during registration, by manually downloading the file and running md5() on it. The result is stored in `item["meta"]["checksum"]` (see `server/lib/dataverse/provider.py:_get_attrs_via_get`, line 83).
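A sketch of that workaround (hypothetical helper, not the actual plugin code): download the tab-file once at registration time and stash the digest on the item for later exports:

```python
import hashlib
import requests

def remember_checksum(item, url):
    """Download once during registration and store md5 for later exports."""
    md5 = hashlib.md5()
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1 << 20):
            md5.update(chunk)
    item.setdefault("meta", {})["checksum"] = {"md5": md5.hexdigest()}
    return item
```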
To summarize what happened in this PR, because there are a few nuances that made it work:

- Folders are no longer listed in the `aggregates` section, only files (see https://github.com/whole-tale/girder_wholetale/pull/518/commits/95d5b36f9d72955557b0814ae5a89322fd1eda75). This avoids situations where 1) an object doesn't have a `uri`, or 2) we end up with `doi:something data/folder` in the bag's manifest. The latter, while neat, wasn't even working with BDBag, because it doesn't have a generic DOI resolver.
- `dataSet` is now generated during Tale import. This is important for a roundtrip of Tales that have folders or subfolders from external datasets in `data/` (see https://github.com/whole-tale/girder_wholetale/pull/518/commits/46a6f3073fe09e210854790e9fabbf41ba40f972).
- Aggregate checksums are verified during export (see `TaleExporter.verify_aggregate_checksums` in https://github.com/whole-tale/girder_wholetale/pull/518/commits/ce194f106258355112e47ed7de2de3dd5f319fca). I haven't checked how it's gonna behave for "very large data"™.
- Globus simply doesn't work right now. I'll handle it as a separate PR (remind me to file an issue).
- Registered datasets no longer end up with an `identifier` that looks like `doi:doi:<>` (see the sketch after this list).
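To illustrate the `doi:doi:` problem, a hypothetical normalizer (not the PR's actual code) that collapses resolver URLs and duplicated schemes:

```python
def normalize_doi(identifier):
    """Collapse resolver URLs and duplicated schemes into 'doi:<suffix>'."""
    for prefix in ("https://dx.doi.org/", "http://dx.doi.org/",
                   "https://doi.org/", "http://doi.org/"):
        if identifier.startswith(prefix):
            identifier = identifier[len(prefix):]
    while identifier.lower().startswith("doi:doi:"):
        identifier = identifier[len("doi:"):]
    if not identifier.lower().startswith("doi:"):
        identifier = "doi:" + identifier
    return identifier

assert normalize_doi("http://dx.doi.org/doi:10.18126/M2301J") == "doi:10.18126/M2301J"
```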
tl;dr: Any exported Tale with arbitrary external data (from Zenodo, DataONE, Dataverse, or raw HTTP) should now work with the fetch and validation commands shown at the top.

How to test?