whole-tale / girder_wholetale

Girder plugin providing basic Whole Tale functionality
BSD 3-Clause "New" or "Revised" License
3 stars 5 forks source link

Bdbag imports #497

Closed hategan closed 3 years ago

hategan commented 3 years ago

Adds a BDBag import provider, which allows importing datasets from BDBags and an API endpoint that can be used to import the bags.

A minor change is also added to the existing import logic to allow a provider to import files directly into the dataset rather than with links and through the DMS.

hategan commented 3 years ago

Collection_16-E08W-no-fetch.zip Reporter_Cell_Line_14-3QDC-mixed.zip

@craig-willis Here are a couple of bags exported from DERIVA sites. One has only embedded files (no-fetch) while the other has both embedded files and links in fetch.txt. I do not have one that has links only.

The import should succeed, but accessing the files in fetch.txt may or may not work, depending on whether the files are accessible without logging in or not. The one file I tried downloaded just fine with wget.

The security/login part will likely need additional discussion. Unlike the handler for globus:// URLs, the https handler does not know to put Globus tokens in the headers. Furthermore, scopes would likely need to be added.

craig-willis commented 3 years ago

Per last dev call, I'm reviewing this with the assumption that we do want to go forward with storing data in the catalog and that this is required/desired for DERIVA integration.

My test case:

This all worked fine for me without authentication for the provided examples.

A few notes/observations:

hategan commented 3 years ago

Thanks for testing this.

  • Curious if a separate endpoint is needed for this, or if it might fit into our current search+register workflow - trying to plan how we can expose this functionality in the UI, but would like to avoid adding an "Is Bag" checkbox (if possible).. maybe a separate dropdown option under Run > Files > External Data > (+)?

It's my understanding that this will already differ from existing imports because of the need to upload the bag (e.g., you can't simply enter a URL, but have to pick a file using a file dialog).

That said, both url and upload can coexist in the same endpoint. We can change this accordingly depending on what we decide as a long term strategy.

  • Importing BDBag notifications showed up, but "View Logs" was empty - if we're not already, we could output some job logs from this import process to provide additional info to the end-user

This is supposed to use the existing import provider infrastructure, so I'm not sure why it's behaving differently, log-wise.

codecov[bot] commented 3 years ago

Codecov Report

Merging #497 (c577f67) into master (aafc76a) will decrease coverage by 0.02%. The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #497      +/-   ##
==========================================
- Coverage   93.31%   93.29%   -0.03%     
==========================================
  Files          53       54       +1     
  Lines        3967     4103     +136     
==========================================
+ Hits         3702     3828     +126     
- Misses        265      275      +10     
Impacted Files Coverage Δ
server/lib/bdbag/bdbag_provider.py 91.83% <91.83%> (ø)
server/rest/dataset.py 88.09% <94.44%> (+2.07%) :arrow_up:
server/lib/__init__.py 96.00% <100.00%> (+0.08%) :arrow_up:
server/lib/import_providers.py 91.08% <100.00%> (+0.27%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update aafc76a...c577f67. Read the comment docs.

Xarthisius commented 3 years ago

Looks like all that's left is to write some tests for it.