ThomasThelen opened 6 years ago
@hategan PTAL, this is relevant to the task you're currently working on. It'd be great if you two could brainstorm it together.
There is that model that takes everything that is potentially long-running and makes it asynchronous, and there's a nice task list somewhere: an icon with the active/failed/completed tasks, etc.
Do we have an infrastructure for this?
If you're referring to the image below (the job watcher), it can only be interfaced with through jobs, so any code in girder-wholetale isn't compatible with it (instead it uses the notification stream, shown at the bottom of this message).
I wrote the publishing code in girder-wholetale, but I'm in the process of porting it to gwvolman so that we can show its progress in the job watcher. As I'm doing this, I'm thinking about how registration is also in girder-wholetale; I'm considering moving it into gwvolman and running it as a job. That would give the user a central place to check the status of their tasks, and tale importing will probably use it too.
This would probably need to get an okay from the PI team, @mbjones might have some input on this.
By "do we have an infrastructure for this?" I mean do we have some plugin/library endowed with some reasonable user interface that can be used from any other plugin to wrap some long-running task?
I'm assuming "jobs" is a girder plugin. Can it be used as a dependency from girder-wholetale or does it require that the relevant code be moved to another plugin? If the latter, then it probably doesn't fit my idea of proper infrastructure.
@hategan we do. jobs is already a dependency for girder_wholetale. We're using that infra for building docker images and creating/destroying instances.
To be exact: jobs is rather abstract. There's a particular implementation using celery (the worker plugin) that ties into that abstract interface and implements the actual functionality.
To piggyback off of @Xarthisius, an example is in server/models/instance.py, where we

```python
from gwvolman.tasks import create_volume, launch_container
```

and then

```python
# Create a single job by chaining the two tasks
volumeTask = create_volume.signature(args=[payload])
serviceTask = launch_container.signature(queue='manager')
(volumeTask | serviceTask).apply_async()
```
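For anyone unfamiliar with the celery primitives here: `signature()` captures a task plus its arguments without running it, and `|` builds a chain where each task's return value is fed to the next. A pure-Python analogue of that chaining behavior (illustrative only, not celery itself):

```python
from functools import reduce

def chain(*funcs):
    """Compose functions left-to-right, feeding each result to the next,
    mimicking how a celery chain pipes task results downstream."""
    def run(initial):
        return reduce(lambda value, fn: fn(value), funcs, initial)
    return run

# Hypothetical stand-ins for create_volume and launch_container:
pipeline = chain(lambda payload: {'volume': payload},
                 lambda vol: ('launched', vol))
```

The real celery chain additionally serializes arguments and routes each task to a worker queue, which is exactly why the task code has to live somewhere the workers can import it.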
Sorry for dragging this out, but the part I don't understand then is "so any code in girder-wholetale isn't compatible with it".
@ThomasThelen "jobs" in that example are hidden. I think it's better to look at
https://github.com/whole-tale/girder_wholetale/blob/master/server/rest/image.py#L287-L307
I should have been more clear about what I meant by that. We can spawn the job from girder-wholetale, but the code that executes within the job must reside in the job plugin (in this case gwvolman, hence my porting the publishing stuff out). Unless I'm mistaken.
Technically it can be anywhere; it's just a matter of installing that code in a place where celery can access it. Here's an example of a Python package that's actually a combination of 1) a girder server extension, 2) a girder UI extension, and 3) a celery task:
https://github.com/kotfic/gwpca
If you think that makes more sense, we could talk about incorporating gwvolman into girder_ythub.
Are we planning to use the distributed aspect of celery or should we have a local, thread pool based implementation of jobs? I'm thinking the latter would allow us to pass objects/lambdas around, make it easier to keep code where it belongs, and, last but surely not least, make debugging significantly easier.
I don't think we were planning on it, simply because we didn't have a need for that. While I agree with the advantages you've mentioned, I'd rather avoid putting more computational burden on the girder side. It already needs to deal with data management, transfers, etc. (wt_data_manager is threaded already, right?)
Yes. It's I/O bound though.
Is celery running on a different machine?
Yup, it's running everywhere, except on the machine that has i/o storage and hosts girder.
That makes sense. I'm inclined to do a local implementation for the first iteration to limit the scope and switch to jobs once that works.
That sounds good to me. I did it that way too :)
Right now we're using the notification stream while registering data. This works fine, but for large datasets (https://search.dataone.org/view/doi:10.18739/A2NK36467) it's hard to see the progress. In addition, there isn't a way to check whether a dataset is currently being registered, which we might want (should we let a user launch/publish a tale while their data is still being imported?).
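One way to keep progress visible for large registrations is to throttle the updates so the stream isn't flooded. A sketch only: `send_notification` here is a hypothetical callback standing in for whatever the real notification-stream API exposes:

```python
import time

def register_items(items, send_notification, min_interval=1.0):
    """Register items, emitting at most one progress update per interval.

    `send_notification` is a hypothetical stand-in for the notification
    stream; it receives (current, total). Throttling keeps a huge dataset
    from flooding the stream while still showing steady progress.
    """
    total = len(items)
    last_sent = time.monotonic()
    for i, item in enumerate(items, start=1):
        # ... actual registration work for `item` would happen here ...
        now = time.monotonic()
        # Always emit the final update so the task visibly completes.
        if now - last_sent >= min_interval or i == total:
            send_notification(i, total)
            last_sent = now
```

A job-based implementation could report the same (current, total) pair through the job's progress field instead, which is what the job watcher UI renders.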
In addition, I think that Tale importing should also be a job. We're going to want to parse the EML metadata when it comes in and potentially create new items (with descriptions/names taken from the metadata), and we may also need to build a new image and recipe if the image doesn't exist.