whole-tale / girder_wholetale

Girder plugin providing basic Whole Tale functionality
BSD 3-Clause "New" or "Revised" License

Create valid BDBags with external data #518

Closed. Xarthisius closed this 2 years ago.

Xarthisius commented 2 years ago

To summarize what happens in this PR, because there are a few nuances that made it work:

  1. External data in bags is unconditionally expanded, meaning no folders are recorded in the aggregates section, only files (see https://github.com/whole-tale/girder_wholetale/pull/518/commits/95d5b36f9d72955557b0814ae5a89322fd1eda75). This avoids situations where 1) an object doesn't have a URI or 2) we end up with a doi:something data/folder entry in the bag's manifest. The latter, while neat, wasn't even working with BDBag, because it doesn't have a generic DOI resolver. (See the sketch after this list.)
  2. Implementing point 1 allows for slightly easier/faster dataSet generation during Tale import. This is important for round-tripping Tales that have folders or subfolders from external datasets in data/ (see https://github.com/whole-tale/girder_wholetale/pull/518/commits/46a6f3073fe09e210854790e9fabbf41ba40f972).
  3. I silently switched the default sha256 algorithm in bags to sha512, since the latter is used by Girder and will be required anyway once we start exporting files directly from assetstores (that happened in https://github.com/whole-tale/girder_wholetale/pull/518/commits/ce194f106258355112e47ed7de2de3dd5f319fca). Note that it breaks the profile we're using (https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-ro-profile.json), which requires md5 and sha256. Note: bdbag doesn't complain... :-) Nevertheless, we should probably use our own profile?
  4. We now collect checksums for external data directly from providers and use them to create a valid manifest during bag export (see https://github.com/whole-tale/girder_wholetale/pull/518/commits/9e51c1f9f7716e97e4eb3517981ca3ec97ff2409). That comes with a caveat: not every provider has that info, or sometimes it's missing for some files (Dataverse and tab-files, hence the ugly workaround). There are two cases where we don't know hashes: raw HTTP/HTTPS resources and Globus. The former we handle by downloading the data and calculating the md5sum during Tale export, which is potentially a long process (see TaleExporter.verify_aggregate_checksums in https://github.com/whole-tale/girder_wholetale/pull/518/commits/ce194f106258355112e47ed7de2de3dd5f319fca). I haven't checked how it's gonna behave for "very large data"™. Globus simply doesn't work right now; I'll handle it in a separate PR (remind me to file an issue).
  5. I sneaked in a minor fix for the Globus provider, because I was too lazy to open a separate PR (see https://github.com/whole-tale/girder_wholetale/pull/518/commits/b3f1d04b213ffdc5fe57a34916dbf7313987d753). This fixes the case where registered datasets have an identifier that looks like doi:doi:<>.
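A minimal sketch of the expansion rule from point 1 (the names here are illustrative, not the actual plugin API): folders are walked recursively, and only file objects with resolvable URIs end up in the aggregates list.

    # Hypothetical sketch, assuming each object carries "type", "uri",
    # "children", "path" and "name" keys; the real code is in the linked commit.
    def expand_aggregates(objects):
        aggregates = []
        for obj in objects:
            if obj["type"] == "folder":
                # Recurse instead of emitting a doi:... folder entry
                # that bdbag could never fetch.
                aggregates.extend(expand_aggregates(obj["children"]))
            elif obj.get("uri"):
                aggregates.append({
                    "uri": obj["uri"],
                    "bundledAs": {"folder": obj["path"], "filename": obj["name"]},
                })
        return aggregates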

tl;dr: now any exported Tale with arbitrary external data (from Zenodo, DataONE, Dataverse, or raw HTTP) should work with the bdbag commands listed under "How to test?" below.

How to test?

  1. Register datasets from Zenodo, DataONE, and Dataverse, and register the LIGO Tale.
  2. Create a Tale that contains a subset of files from the above. A mix of subfolders, root dataset folders, and individual files would be best.
  3. Export Tale.
  4. Verify the bag locally via:
    $ bdbag --resolve-fetch all .
    $ bdbag --validate full .
  5. Keep the Tale's dataSet somewhere handy. Nuke everything, and on a clean install import the bag. Confirm you get the same Tale.
craig-willis commented 2 years ago

Sad that I didn't catch this eons ago. The fix works for D1, Zenodo, Dataverse.

My test case:

$ bdbag --resolve-fetch all .
$ bdbag --validate full .

What are your thoughts on Globus/MDF? I can't even register http://dx.doi.org/doi:10.18126/M2301J right now:

Traceback (most recent call last):
  File "/girder/girder/api/rest.py", line 630, in endpointDecorator
    val = fun(self, path, params)
  File "/girder/girder/api/rest.py", line 1230, in POST
    return self.handleRoute(method, path, params)
  File "/girder/girder/api/rest.py", line 970, in handleRoute
    val = handler(**kwargs)
  File "/girder/girder/api/access.py", line 63, in wrapped
    return fun(*args, **kwargs)
  File "/girder/girder/api/rest.py", line 445, in wrapped
    val = fun(*args, **kwargs)
  File "/girder/girder/api/describe.py", line 709, in wrapped
    return fun(*args, **kwargs)
  File "/girder/plugins/wholetale/server/rest/dataset.py", line 241, in importData
    Job().scheduleJob(job)
  File "/girder/plugins/jobs/server/models/job.py", line 346, in scheduleJob
    events.trigger('jobs.schedule', info=job)
  File "/girder/girder/events.py", line 314, in trigger
    handler['handler'](e)
  File "/girder/plugins/jobs/server/__init__.py", line 43, in scheduleLocal
    fn(job)
  File "/girder/plugins/wholetale/server/tasks/register_dataset.py", line 33, in run
    importedData = register_dataMap(
  File "/girder/plugins/wholetale/server/lib/__init__.py", line 107, in register_dataMap
    objType, obj = provider.register(
  File "/girder/plugins/wholetale/server/lib/import_providers.py", line 86, in register
    for item in self._listRecursive(user, pid, name, base_url, progress=progress):
  File "/girder/plugins/wholetale/server/lib/globus/globus_provider.py", line 98, in _listRecursive
    tc = self.clients.getUserTransferClient(user)
  File "/girder/plugins/globus_handler/server/clients.py", line 39, in getUserTransferClient
    self.userClients[username] = TransferClient(authz)
TypeError: __init__() takes 1 positional argument but 2 were given
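For what it's worth, the last frame looks like a globus-sdk API change: newer releases made client constructor arguments keyword-only, so the positional call would have to become a keyword one. A hypothetical fix, not verified:

    from globus_sdk import TransferClient

    # authz: the authorizer built by globus_handler's clients.py
    # instead of: TransferClient(authz)
    tc = TransferClient(authorizer=authz)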
Xarthisius commented 2 years ago

Globus unfortunately doesn't provide checksums. There might be a way to query them, but I'm still investigating.

codecov[bot] commented 2 years ago

Codecov Report

Merging #518 (e9986ee) into master (2e2ae35) will decrease coverage by 0.57%. The diff coverage is 94.95%.


@@            Coverage Diff             @@
##           master     #518      +/-   ##
==========================================
- Coverage   93.41%   92.84%   -0.58%     
==========================================
  Files          54       54              
  Lines        4179     4265      +86     
==========================================
+ Hits         3904     3960      +56     
- Misses        275      305      +30     
Impacted Files                            Coverage Δ
server/lib/dataone/register.py            76.25% <ø> (-7.20%) ↓
server/rest/tale.py                       96.44% <ø> (-0.04%) ↓
server/lib/globus/globus_provider.py      63.55% <66.66%> (-0.26%) ↓
server/lib/exporters/__init__.py          96.77% <85.71%> (-1.41%) ↓
server/lib/dataverse/provider.py          97.04% <92.30%> (-0.24%) ↓
server/lib/manifest_parser.py             79.05% <95.00%> (-14.70%) ↓
server/__init__.py                        93.25% <100.00%> (+0.06%) ↑
server/lib/dataone/provider.py            100.00% <100.00%> (+3.17%) ↑
server/lib/exporters/bag.py               100.00% <100.00%> (ø)
server/lib/manifest.py                    92.97% <100.00%> (+0.90%) ↑
... and 7 more


craig-willis commented 2 years ago

This has exposed how inadequate our testing has been for bags.

Responding to your points:

  1. I think the changes in point 1 make sense, but to confirm: before this PR, only non-root folders without URIs were expanded (with expand_folders=True). Root folders weren't expanded because we relied (in theory) on bdbag to recursively fetch based on a DOI. What is the case of a folder with a URI -- how did that work with fetch, and can you give me an example?
  2. Can you clarify/provide an example of what would break the new fold_hierarchy_smart mentioned in the comment?
  3. bdbag --validate-profile full does fail. We can define our own profile. The only implication I can think of is when we try to publish our bags to DERIVA, since we're both using the same profile. Do we need to factor DERIVA integration into this PR? We probably should, but it would require fleshing out publish.
  4. What is the Dataverse tab-file workaround? Do we not calculate checksum as part of DMS when/if the file is downloaded? Could we store the calculated value for future exports instead of re-calculating?
Xarthisius commented 2 years ago

This has exposed how inadequate our testing has been for bags.

  • Depending on when we validate the bag, the Payload-Oxum changes. The current oxum assumes validation before fetch and will fail post-fetch. I think the oxum needs to be valid for the post-fetch bag.

That was a simple mistake on my end. It's been fixed in https://github.com/whole-tale/girder_wholetale/pull/518/commits/1c2a06867cc9f9f533175ccd7a271c0699f4b170

  • For future testing: bdbag --validate full ignores oxum, bdbag --validate fast uses oxum, and python -m bagit --validate also uses oxum. Going forward, we should test all of these variations, post-fetch as well. Also, to check the profile we need to run bdbag --validate-profile full as a separate command.

All of the above, ~with the exception of profile validation~, should work post-fetch.
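For reference, Payload-Oxum is just <octet count>.<stream count> over the payload directory, and for a bag with external data it has to describe the post-fetch payload. A minimal sketch of computing it for an already-fetched bag (local files only; any entries still pending in fetch.txt would have to add their sizes and counts on top):

    import os

    def payload_oxum(bag_dir):
        total_bytes, file_count = 0, 0
        for root, _, files in os.walk(os.path.join(bag_dir, "data")):
            for name in files:
                total_bytes += os.path.getsize(os.path.join(root, name))
                file_count += 1
        return "{}.{}".format(total_bytes, file_count)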

Responding to your points:

  1. I think the changes in point 1 make sense, but to confirm: before this PR, only non-root folders without URIs were expanded (with expand_folders=True). Root folders weren't expanded because we relied (in theory) on bdbag to recursively fetch based on a DOI. What is the case of a folder with a URI -- how did that work with fetch, and can you give me an example?
globus://82f1b5c6-6e9b-11e5-ba47-22000b92c6ec//published/publication_113/data 710251866817 data/data/data/

which is part of doi:10.18126/M2301J and has a perfectly nice folder URI.

  2. Can you clarify/provide an example of what would break the new fold_hierarchy_smart mentioned in the comment?

e.g. a dataSet where a file from dataset A was put in a folder from dataset B. Another example is doing something like "my_custom_label1/". Mount paths in dataSet are arbitrary: if you set them manually, only the sky is the limit, and the silly assumption that the first element of the path means something goes out the window... (see the sketch below)
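A hypothetical dataSet showing both failure modes (field names trimmed for brevity; the point is just that mount paths don't have to start with anything resembling a dataset root):

    data_set = [
        # a file from dataset A mounted inside a folder that came from dataset B
        {"itemId": "<id of A's file>", "mountPath": "B_folder/file_from_A.csv"},
        # a completely made-up label as the first path element
        {"itemId": "<id of B's folder>", "mountPath": "my_custom_label1/"},
    ]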

  3. bdbag --validate-profile full does fail. We can define our own profile. The only implication I can think of is when we try to publish our bags to DERIVA, since we're both using the same profile. Do we need to factor DERIVA integration into this PR? We probably should, but it would require fleshing out publish.

I reverted it for now (e805118), since I don't really need it atm. Fun fact though: apparently each manifest-<alg>.txt doesn't have to contain all the files. Right now I only dump md5 fully; sha256 contains only the hash for data/LICENSE, and bdbag is perfectly happy with it (illustrated below). So technically I can dump an empty sha256 and keep using sha512 and md5 as defaults...
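Roughly what the manifests look like in that setup (placeholder hashes):

    manifest-md5.txt             # complete: one line per payload file
    <md5-hash>  data/LICENSE
    <md5-hash>  ...every other payload file...

    manifest-sha256.txt          # a single entry, and bdbag still validates
    <sha256-hash>  data/LICENSE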

  4. What is the Dataverse tab-file workaround? Do we not calculate checksum as part of DMS when/if the file is downloaded? Could we store the calculated value for future exports instead of re-calculating?

It's calculated only once, during registration, by manually downloading the file and running md5() on it. The result is stored in item["meta"]["checksum"] (see server/lib/dataverse/provider.py:_get_attrs_via_get, line 83).
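A sketch of that fallback, assuming requests is available (the real implementation is in the provider linked above):

    import hashlib

    import requests

    def _md5_via_get(url, chunk_size=65536):
        # Stream the tab-file once at registration time and hash it on the
        # fly, so later exports can reuse the stored value.
        md5 = hashlib.md5()
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            for chunk in resp.iter_content(chunk_size=chunk_size):
                md5.update(chunk)
        return md5.hexdigest()

    # stored once, reused by every future export:
    # item["meta"]["checksum"] = {"md5": _md5_via_get(url)}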