architecture for data provider integration

mbjones commented 7 years ago

Need an overview diagram showing the interactions between the front-end, back-ends, and data providers. In particular, we need to know how the following components will interact:

ownCloud
DataONE
iRODS
Jupyter
RStudio

ian-taylor commented 7 years ago

Here is a stab at connecting some of the dots:

[Image: wt-arch-ideas]

Xarthisius commented 7 years ago

To expand a little bit on the Data box, this is how I was imagining the role of each component:

iRODS - since we don't have any data in it right now, I assumed that it's gonna be used as a storage backend for Girder. This would require designing and implementing a new Girder assetstore. For the Research Frontend (RF), that would require either using iRODS' native FUSE filesystem or extending the current GirderFS (I'd vote for the latter :) ). Unsolved issue: handling auth between RF and iRODS.
DataONE - since there's a ton of data/metadata already there, that would require implementing an assetstore with just an Import feature, similar to what the filesystem assetstore offers.

kylechard commented 7 years ago

I'd still like to understand the use cases for iRODS a little better. If we're using an assetstore model and we need this data to be accessible to various frontends, perhaps it would be worth exploring an entirely object store-based approach to the workspace/data fabric?

Xarthisius commented 7 years ago

Well, that'd be fine by me, especially since Girder already supports S3-compatible assetstores. However, I think there are two downsides:

there's already a large-scale iRODS deployment that's been devoted to WT
none of our cloud providers (Nebula @ NCSA, Rodeo @ TACC) offer object storage at the moment. There's an ongoing effort at NCSA but I'm not aware of any ETA

ian-taylor commented 7 years ago

The other question is: does ownCloud use the WT API, which then relays requests to iRODS (or another backend) via assetstore implementations, or does it interface with iRODS directly (as we previously discussed)? The former seems cleaner, but it depends on the use cases, as Kyle said.

Xarthisius commented 7 years ago

@ian-taylor I think we can have both if necessary. The way I have it implemented for Filesystem assetstore in GirderFS is that:

you can access files remotely using Girder's API and utilizing Girder's auth token
you can access files directly if resources are available in the environment

Of course, for the filesystem assetstore that was fairly trivial; it's gonna be much more challenging for iRODS.
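A minimal sketch of the two access paths listed above, not the actual GirderFS code. The function name `plan_access` and the base URL are hypothetical; the `file/{id}/download` route and the `Girder-Token` header follow Girder's REST API as I understand it, so check them against the deployed version.

```python
import os
import urllib.request

# Hypothetical deployment URL, used only for illustration.
GIRDER_API = "https://girder.example.org/api/v1"


def plan_access(file_id, token, local_path=None):
    """Decide how a frontend should read a file.

    Returns ("direct", path) when the file is reachable in the local
    environment, otherwise ("remote", request), where request is a prepared
    (not yet sent) HTTP request carrying the user's Girder auth token.
    """
    if local_path is not None and os.path.exists(local_path):
        # The resource is available in this environment: read it in place.
        return ("direct", local_path)
    request = urllib.request.Request(
        f"{GIRDER_API}/file/{file_id}/download",
        headers={"Girder-Token": token},
    )
    return ("remote", request)
```

For the remote case the caller would pass the prepared request to `urllib.request.urlopen`. An iRODS-backed store would need its own check in place of `os.path.exists`, which is where the extra difficulty comes in.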

ian-taylor commented 7 years ago

Updated after comments:

[Image: wt-arch-ideas]

matthewturk commented 7 years ago

@Xarthisius Not sure that it will be too much different if we use the iRODS FUSE interface; then it can be done just as a composed filesystem, right?

matthewturk commented 7 years ago

@ian-taylor this is a good start; it codifies a lot of the things we've spoken about, and has all of the items that have come up during discussion and technology identification. What we need at this point is considerably more specific detail. I think that's where the components need to be broken out, either by mechanism of interaction, by use case, or by type of technology.

A few of the specific items that need to be identified:

The double-ended arrow between the collection of web technologies and the WholeTale API. What exactly is this, and how will it be communicated? Is this going to be fully executed via REST API calls through client-side JavaScript? If so, we will need to develop a list of "needs" for the API, and then implement that set of needs either in a WT-specific framework or in a collection of Girder + other things. That should be codified in a document, which I will get started in a separate issue.
The interaction between the "user data workspace" and WT API is not totally clear, and I think that the development of the "user data workspace" and the FUSE system needs to be clarified. As an example, a "proof of concept" that @Xarthisius and I implemented a long time ago (which is not suitable as-is, or perhaps ever) used TOTP tokens to create a one-time login that mounted ownCloud via DavFS2 inside the containers. How are we going to have a "user data workspace" and how would it need to communicate at all with the WT API?
"Data", such as DataONE, Globus, etc., needs to be much more clearly identified; some things will need to be pulled in and "copied" (which we discussed as always being an explicit operation) whereas others can be mounted remotely via FUSE. This was spelled out in the original proposal, but is now subject to review so that we can ensure it's forward-thinking. How will "data" be communicated to the research frontends?
User & Authorization is currently independent from the WT API, but I don't know if that's correct. One particular question I have is how deep down the stack we want to push authorization. Will WT decide which users are able to access specific pieces of data, and then allow those into FUSE? I don't think this is necessary, but until we understand the way auth is being passed around, we can't quite say.
I'm still not entirely sure where ORE fits in, and I would like to suggest that we not develop ORE bundles until absolutely necessary. I understand their utility, but unless we are both consuming and exporting ORE bundles, they are not necessary yet. (And I think the most likely sources and sinks of ORE are going to be Globus and DataONE.)
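For reference on the TOTP-based one-time login mentioned in the list above: the proof-of-concept code itself is not shown in this thread, so the following is only a generic sketch of time-based one-time passwords as specified in RFC 6238 (HMAC-SHA1 variant). The function names and the `window` parameter are illustrative assumptions, and how the code would be handed to DavFS2 is left out because it is not described here.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code using HMAC-SHA1."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte selects
    # a 4-byte window, and the top bit of that window is masked off.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, candidate: str, unix_time: int,
           step: int = 30, window: int = 1) -> bool:
    """Accept a code from the current time step or `window` steps either side."""
    for delta in range(-window, window + 1):
        t = unix_time + delta * step
        if t < 0:
            continue  # no time step exists before the epoch
        expected = totp(secret, t, step, len(candidate))
        if hmac.compare_digest(expected, candidate):
            return True
    return False
```

A code like this is only valid for a short interval, which is what makes it usable as a one-time mount credential. It does not answer the open question of how the container and the WT API would share the secret.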

ian-taylor commented 7 years ago

A few quick comments. User & Authorization is supposed to be an expansion of User Management so it is part of the API. I need to clean that up in the image.

As for ORE, I agree. ORE aggregations are something we need to be thinking about once we have everything in place. It can be implemented using simple URL dereferencing on the research link and can pull from whatever we decide.

But we should think about how things will tie together, i.e. how will we describe the files that researchers expose and how will we describe relationships? I personally think this needs to be independent of the physical storage, e.g. use GUIDs like Gdrive does, which allow files to be independent of paths/location, and use these GUIDs to link metadata in the collections. I am not sure Girder is the place to store this sort of metadata - it may be better to separate the physical layer in Girder by using identifiers and a separate DB that describes the metadata for search. This seems to be the way DataONE works, as far as I understood.
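To make the GUID idea concrete, here is a small in-memory sketch. The `Catalog` class and all of its method names are hypothetical. It only illustrates keeping physical locations and descriptive metadata in separate maps joined by an identifier, and says nothing about how Girder or DataONE actually store things.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog: GUIDs link storage locations, metadata and relationships."""
    locations: dict = field(default_factory=dict)  # guid -> physical location
    metadata: dict = field(default_factory=dict)   # guid -> descriptive metadata
    relations: list = field(default_factory=list)  # (subject, predicate, object)

    def register(self, location, **meta):
        """Assign a new GUID to a file and record where it currently lives."""
        guid = str(uuid.uuid4())
        self.locations[guid] = location
        self.metadata[guid] = dict(meta)
        return guid

    def move(self, guid, new_location):
        """Relocate a file; metadata and relationships are left untouched."""
        if guid not in self.locations:
            raise KeyError(guid)
        self.locations[guid] = new_location

    def relate(self, subject, predicate, obj):
        """Record a relationship between two registered GUIDs."""
        for guid in (subject, obj):
            if guid not in self.locations:
                raise KeyError(guid)
        self.relations.append((subject, predicate, obj))

    def search(self, **criteria):
        """Return GUIDs whose metadata matches every given key/value pair."""
        return [g for g, m in self.metadata.items()
                if all(m.get(k) == v for k, v in criteria.items())]
```

Because search and relationships only ever see GUIDs, a file can move between assetstores (say, from a filesystem path to iRODS) without any metadata record needing an update.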

Xarthisius commented 7 years ago

@MatthewTurk I don't think iRODS' native FUSE allows for selecting underlying objects. You can just export a path and you get everything with it. Nevertheless, I wouldn't worry about it; wrapping icommands with GirderFS should be fairly trivial.
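As a rough illustration of what "wrapping icommands" with per-object selection could look like (this is not GirderFS code): the helper below only builds the argument list for `iget` after checking the requested object against an explicit set of exported paths. The function name `build_iget` and the example paths are made up.

```python
import posixpath


def build_iget(irods_path, local_path, exported):
    """Return the argv for fetching one iRODS object with `iget`.

    `exported` is the set of iRODS paths the user is allowed to see. Anything
    else is refused, unlike a FUSE mount of a whole collection, which exposes
    everything under the mounted path.
    """
    normalized = posixpath.normpath(irods_path)
    if normalized not in exported:
        raise PermissionError(f"{irods_path} is not an exported object")
    return ["iget", normalized, local_path]
```

The resulting list could be passed to `subprocess.run(argv, check=True)` on a host where the icommands are installed and an iRODS session has been initialised.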

matthewturk commented 7 years ago

My presentation slides: https://docs.google.com/presentation/d/1a7a-jEPTTIx2Hka8fTn6DWcF_VRYrwgMYRp8pN5CllY/edit#slide=id.p

whole-tale / wt-design-docs

architecture for data provider integration #5