wkcn / MobulaOP

A Simple & Flexible Cross Framework Operators Toolkit
MIT License

Not working with multiple processes #40

Closed YutingZhang closed 4 years ago

YutingZhang commented 5 years ago

When calling MobulaOP in a subprocess, it gets stuck.
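
A hang like this is easier to report when the parent process waits on the future with a timeout, so a stuck worker shows up as an explicit status rather than a silent stall. The helper below is only a sketch: `run_with_timeout` and its status strings are hypothetical names, not part of MobulaOP or MXNet, and it works with any `concurrent.futures` executor.

```python
from concurrent import futures


def run_with_timeout(executor, fn, timeout_s, *args):
    """Submit fn(*args) and report how the call ended.

    Returns ("ok", value), ("error", exception) or ("timeout", None).
    """
    future = executor.submit(fn, *args)
    try:
        return "ok", future.result(timeout=timeout_s)
    except futures.TimeoutError:
        # The task did not finish in time. cancel() only helps if it has
        # not started yet; a task that is already running keeps running.
        future.cancel()
        return "timeout", None
    except Exception as exc:
        # Exceptions raised inside fn (e.g. a failed assert) surface here.
        return "error", exc
```

With a `ProcessPoolExecutor`, a `"timeout"` status would point to a stuck worker, while `"error"` would expose assertion failures that a bare `ex.submit(foo)` never reports. Note that shutting down a pool still waits for a worker that is genuinely hung.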

Environment: latest MXNet nightly build and Python 3.6.5

Example code, modified from dynamic_import_op.py, that reproduces the error:

from concurrent import futures

import sys
import mxnet as mx

def foo():
    import mobula
    # Import Custom Operator Dynamically
    mobula.op.load('./AdditionOP')
    AdditionOP = mobula.op.AdditionOP

    a = mx.nd.array([1, 2, 3])
    b = mx.nd.array([4, 5, 6])

    # Allocate gradient buffers so that a.grad and b.grad exist
    a.attach_grad()
    b.attach_grad()

    with mx.autograd.record():
        c = AdditionOP(a, b)

    dc = mx.nd.array([7, 8, 9])
    # Backpropagate dc through the custom operator
    c.backward(dc)

    assert ((a + b).asnumpy() == c.asnumpy()).all()
    assert (a.grad.asnumpy() == dc.asnumpy()).all()
    assert (b.grad.asnumpy() == dc.asnumpy()).all()

    print('Okay :-)')
    print('a + b = c \n {} + {} = {}'.format(a.asnumpy(), b.asnumpy(), c.asnumpy()))

def main():
    ex = futures.ProcessPoolExecutor(1)
    r = ex.submit(foo)

if __name__ == "__main__":
    main()

wkcn commented 5 years ago

Thanks for your report! I will check it.

YutingZhang commented 5 years ago


FYI, if you move import mxnet as mx into foo(), the bug disappears. But this is generally not feasible, because mxnet is usually imported in the main process. It may be related to how mxnet works with subprocesses.
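
The workaround described above amounts to deferring the import so that the library is first loaded inside the worker process. A minimal sketch of that pattern follows; `worker` and `run_isolated` are hypothetical names, and the module name is a parameter only so the sketch can be tried with any installed module instead of MXNet.

```python
from concurrent import futures
import importlib


def worker(module_name="mxnet"):
    """Runs in the worker process and imports the library there."""
    # Deferred import: the parent never imports the module itself, so the
    # library is first loaded inside the child process.
    lib = importlib.import_module(module_name)
    return lib.__name__


def run_isolated(module_name="mxnet"):
    """Run worker() in a one-process pool and return its result."""
    with futures.ProcessPoolExecutor(1) as ex:
        # result() re-raises any exception from the worker in the parent.
        return ex.submit(worker, module_name).result()
```

`run_isolated()` should be called under an `if __name__ == "__main__":` guard. As noted above, this only helps when the parent process has no need to import mxnet itself.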

wkcn commented 5 years ago

Moving import mobula and mobula.op.load('./AdditionOP') outside foo() may work, since MobulaOP registers the operator with MXNet when mobula.op.load('./AdditionOP') is called. I will add a check to avoid duplicate registration.

YutingZhang commented 5 years ago

I tried that, but it does not work. Example code:

from concurrent import futures

import sys
import mxnet as mx

import mobula
# Import Custom Operator Dynamically
mobula.op.load('./AdditionOP')

def foo():
    AdditionOP = mobula.op.AdditionOP

    a = mx.nd.array([1, 2, 3])
    b = mx.nd.array([4, 5, 6])

    # Allocate gradient buffers so that a.grad and b.grad exist
    a.attach_grad()
    b.attach_grad()

    with mx.autograd.record():
        c = AdditionOP(a, b)

    dc = mx.nd.array([7, 8, 9])
    # Backpropagate dc through the custom operator
    c.backward(dc)

    assert ((a + b).asnumpy() == c.asnumpy()).all()
    assert (a.grad.asnumpy() == dc.asnumpy()).all()
    assert (b.grad.asnumpy() == dc.asnumpy()).all()

    print('Okay :-)')
    print('a + b = c \n {} + {} = {}'.format(a.asnumpy(), b.asnumpy(), c.asnumpy()))

def main():
    ex = futures.ProcessPoolExecutor(1)
    r = ex.submit(foo)

if __name__ == "__main__":
    main()

wkcn commented 5 years ago

@YutingZhang Hi! I found the bug is not related to MobulaOP. It seems that MXNet triggers the bug.

from concurrent import futures

import mxnet as mx
import sys
sys.path.append('../../')  # Add MobulaOP Path (must precede the mobula import)
from mobula.testing import assert_almost_equal

class AdditionOP(mx.operator.CustomOp):
    def __init__(self):
        super(AdditionOP, self).__init__()
    def forward(self, is_train, req, in_data, out_data, aux):
        out_data[0][:] = in_data[0] + in_data[1]
    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        in_grad[0][:] = out_grad[0]
        in_grad[1][:] = out_grad[0]

# Register the operator so that op_type='AdditionOP' can be resolved
@mx.operator.register('AdditionOP')
class AdditionOPProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(AdditionOPProp, self).__init__()
    def list_arguments(self):
        return ['a', 'b']
    def list_outputs(self):
        return ['output']
    def infer_shape(self, in_shape):
        return in_shape, [in_shape[0]]
    def create_operator(self, ctx, shapes, dtypes):
        return AdditionOP()

def foo():
    a = mx.nd.array([1, 2, 3])
    b = mx.nd.array([4, 5, 6])

    # Allocate gradient buffers so that a.grad and b.grad exist
    a.attach_grad()
    b.attach_grad()

    with mx.autograd.record():
        c = mx.nd.Custom(a, b, op_type='AdditionOP')

    dc = mx.nd.array([7, 8, 9])
    # Backpropagate dc through the custom operator
    c.backward(dc)

    assert_almost_equal(a + b, c)
    assert_almost_equal(a.grad, dc)
    assert_almost_equal(b.grad, dc)

    print('Okay :-)')
    print('a + b = c \n {} + {} = {}'.format(a.asnumpy(), b.asnumpy(), c.asnumpy()))

def main():
    ex = futures.ProcessPoolExecutor(1)
    r = ex.submit(foo)

if __name__ == '__main__':
    main()

YutingZhang commented 5 years ago

So mx.nd.Custom is the actual problem ... MXNet just has lots of bugs when running in a subprocess ...

wkcn commented 5 years ago


YutingZhang commented 5 years ago

@wkcn I sent an email to your live.cn address :)

wkcn commented 5 years ago

Mail received. Thank you! : )

wkcn commented 5 years ago

Hi @YutingZhang, the two test cases you gave now pass with the latest MXNet and MobulaOP : )

YutingZhang commented 5 years ago

@wkcn Thanks a lot! Did you work around the problem in MobulaOP? Or is it due to MXNet's update to CustomOp (which you also contributed to)?

wkcn commented 5 years ago

@YutingZhang It is due to MXNet’s update, and other contributors fixed it.

wkcn commented 4 years ago

Closing this since the problem has been addressed. : )