custom task runners - Githubissues

mivade commented 6 years ago

It would be nice to be able to use a custom task runner instead of only having the choice between blocking the request thread or using Celery. This would be particularly useful in cases where one might have access to a server and be able to start local services but not have permissions to install a broker required by Celery. If the task runners were pluggable, a simpler custom service could be written (e.g. using ZeroMQ as the backend) to run tasks out of the request thread.

Chris7 commented 6 years ago

Hi @mivade,

It sounds like a great idea. I imagine a class could be written that has to be subclassed for each implementation to handle how each backend would update the database, run the job, re-run it, etc. Can you provide some more information about how you would like to submit tasks so I can get a better notion about how this can be implemented?

mivade commented 6 years ago

What I had in mind for a simple ZMQ-based task runner is a service that accepts a JSON string telling it what script to run and with what command-line arguments, then it can spawn a new process (or use an existing process pool or whatever). I imagine a base TaskRunner class that might look something like the following:

from uuid import uuid4
import zmq

class TaskRunner(object):
    def __init__(self, script, args):
        self.script = script
        self.args = args
        self.job_id = uuid4().hex

    def run(self):
        raise NotImplementedError

    def check_status(self):
        raise NotImplementedError

class ZMQTaskRunner(TaskRunner):
    def __init__(self, script, args):
        super(ZMQTaskRunner, self).__init__(script, args)
        self.ctx = zmq.Context()
        self._socket = None

    @property
    def socket(self):
        if self._socket is None:
            self._socket = self.ctx.socket(zmq.REQ)
        return self._socket

    def run(self):
        self.socket.send_json({
            'command': 'submit',
            'data': {
                'script': self.script,
                'id': self.job_id,
                'args': self.args
            }
        })

        # await acknowledgment and do whatever...
        self.socket.recv_json()

    def check_status(self):
        self.socket.send_json({
            'command': 'status',
            'data': {
                'id': self.job_id
            }
        })

        # do whatever with status...
        status = self.socket.recv_json()

I'm not sure exactly how this fits in with the Celery implementation, but there should be a way to generalize this. At any rate, I think Wooey need not directly implement alternative task runners, but it would be nice to provide some sort of interface like this.

Chris7 commented 6 years ago

It should be doable -- Wooey just needs to have a mechanism for sending out a task in JSON, and receiving information about that task. We just need to make a contract that a task runner needs to fulfill like:

Returning a unique job id when a task is received
Indicating error codes and job state
Some mechanism to indicate a job is finished. I don't want a polling mechanism for this, the task runner should be responsible for indicating when a job is finished.

And some optional nice-to-haves:

Returning the current stdout/stdin
Ability to abort

If you feel comfortable getting a task-runner together I can see about putting the relevant changes in Wooey.

Chris7 commented 6 years ago

There is also an issue of security. I see 2 approaches:

Use the UUID as a token.
Add in tokens for users. This would open up the possibility for calling scripts through an api as a user and every user can have their own token automagically made so it's backwards compatible and on by default. This, combined with a UUID per job seems fairly secure. I'm no crypto expert though.

The flaw with the first approach is there is nothing actually secure about it and it's impossible to fix once security is breached. It also assumes each task runner generates a UUID or a string of sufficient randomness from a PRNG. That is likely not true.

mivade commented 6 years ago

I don't want a polling mechanism for this, the task runner should be responsible for indicating when a job is finished

How is this handled now with Celery? I haven't used Celery too much, but my understanding is that you just check the status of a task to see if it's done or not which sounds like polling to me.

I think the other requirements are pretty straightforward (apart from the very valid security requirements, but I'm not very well versed in authentication strategies). I'm willing to take a stab at this but could use some guidance about where this would make most sense to plug in to the existing code.

Chris7 commented 6 years ago

It is handled with Celery by the task updating the models with the state of the job. The task drives the state and there is no polling. You can see the code here:

https://github.com/wooey/Wooey/blob/master/wooey/tasks.py#L78

If you want to poll with your task runner, that is fine. My point is that I don't want Wooey to have to poll anything to update a model with the state of a job.

I'm willing to take a stab at this but could use some guidance about where this would make most sense to plug in to the existing code.

That's awesome! I'll put together a guide for how I would get started on this.

mivade commented 6 years ago

Thanks for the clarification. I think I was a bit confused since the frontend has to poll for results, but what you're describing makes a lot of sense.

The tricky part in getting this to work without polling for general backends is that the backend would have to support it somehow. In the case of Celery this is pretty easy since tasks are literally just executing blocks of code that have access to the same underlying models as the web app itself. For other backends, such as a custom ZMQ one like above, that would have to explicitly be implemented on the server side. I'm actually more familiar with using async frameworks like Tornado than with WSGI, so it would seem to me like falling back on polling (e.g., triggered whenever the JS frontend calls the jobRefresh function) might be necessary to offer as an option.

Chris7 commented 6 years ago

I think I know what you might be getting at. Do you intend your worker to be somewhere without any relation to the Wooey server? i.e. no Django, and Wooey is simply a management/reporting interface?

mivade commented 6 years ago

Yes, that would be the idea.

Chris7 commented 6 years ago

The issue with the frontend controlling the polling is there is no promise of state being maintained by the task runner. If a user checks back in a month, will that job be guaranteed to still exist on the backend?

I think it's pretty clear there needs to be a non-blocking poller, but I'm not convinced of any particular implementation. I have two ideas on this front:

Use celery as the poller. It's already going to be installed, but this requires a rabbit/etc. somewhere
A separate management command can be added to start a poller process for the task runner. This might be the most appropriate because those who want a separate task runner likely are doing so because they won't be using celery at all.

The second option I think is clearly better and lets us add any needed functionality. I think just ironing out the spec is what's needed and your task runner example is pretty close to covering everything.

mivade commented 6 years ago

That's a good point. I like the approach of adding an additional command for running the poller, though I'm not sure offhand what the best way to implement that would be apart from defining custom pollers for each custom task runner. I'll think a bit more about this and try to come up with a demo.

wooey / Wooey

custom task runners #194