wkcn / MobulaOP

A Simple & Flexible Cross Framework Operators Toolkit
MIT License

Low performance in gpu mode #39

Closed damNull closed 5 years ago

damNull commented 5 years ago

I wrote my first demo of a MobulaOP operator. The directory layout of my project:

mobula_test
├── main.py
└── TestOP
    └── TestOP.cpp

The contents of the files:

main.py:

import mobula
import mxnet as mx
from mxnet import nd
from tqdm import tqdm

if __name__ == '__main__':
    mobula.op.load('TestOP')
    ctx = mx.cpu()
    a = nd.ones((5000, 5000), ctx=ctx)
    b = nd.ones((5000, 5000), ctx=ctx)
    out = nd.empty(a.shape, ctx=ctx)

    print("cpu")
    for i in tqdm(range(1000)):
        mobula.func.TestOP(a.size, a, b, out)

    ctx = mx.gpu()
    a = nd.ones((5000, 5000), ctx=ctx)
    b = nd.ones((5000, 5000), ctx=ctx)
    out = nd.empty(a.shape, ctx=ctx)

    print("gpu")
    for i in tqdm(range(1000)):
        mobula.func.TestOP(a.size, a, b, out)

TestOP.cpp:

template<typename DType>
MOBULA_KERNEL TestOP_kernel(const int n, const DType* a, const DType* b, DType* out)
{
    parfor(n, [&](int i)
    {
        out[i] = a[i] + b[i];
    });
}

Time cost: CPU 14 s, GPU 226 s on an i7-7700K and a GTX 1080 Ti. CPU and GPU usage are both at 100%. OS environment: Windows 10 1809, CUDA 10.0.
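As an aside, timing loops like the ones above can be skewed by one-time warm-up costs and, when the framework executes asynchronously, by calls that return before the work actually finishes. A minimal, framework-agnostic sketch (plain Python; `bench` is a hypothetical helper, not a MobulaOP API):

```python
import time

def bench(fn, repeats=5, warmup=2):
    # Hypothetical helper: run warm-up iterations first so one-time costs
    # (lazy initialization, caches) are excluded, then report the best
    # wall-clock time over `repeats` timed runs.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

# With an asynchronous framework such as MXNet, fn must include an explicit
# barrier (e.g. nd.waitall()) -- otherwise you only measure how fast calls
# are queued, not how fast the kernels run.
elapsed = bench(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed:.6f} s")
```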

wkcn commented 5 years ago

Thanks for your report! I will look into it.

damNull commented 5 years ago

It seems the problem only occurs on Windows. I tested the code on an AWS K80 instance, and its performance is normal, as expected.

wkcn commented 5 years ago

Currently, MobulaOP uses MXTVMBridge to implement asynchronous execution; however, this is not supported on Windows because of an ABI incompatibility.
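To illustrate what asynchronous execution buys: the caller pushes work onto an engine thread and returns immediately, so per-call dispatch overhead stays off the critical path; an explicit barrier waits for completion. A toy model in plain Python (illustration only; this is not MXNet's or MobulaOP's actual engine):

```python
import queue
import threading

class AsyncEngine:
    # Toy model of an asynchronous execution engine: tasks are queued by
    # the caller and executed on a single worker thread.
    def __init__(self):
        self.tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            task()
            self.tasks.task_done()

    def push(self, task):
        # Returns immediately; the work happens on the worker thread.
        self.tasks.put(task)

    def wait_all(self):
        # Barrier: block until every queued task has finished,
        # analogous to nd.waitall() in MXNet.
        self.tasks.join()

engine = AsyncEngine()
results = []
for i in range(4):
    engine.push(lambda i=i: results.append(i * i))
engine.wait_all()
print(sorted(results))  # [0, 1, 4, 9]
```

When this mode is unavailable (as on Windows here), every call is executed synchronously, so dispatch overhead is paid on every iteration of the timing loop.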

I have disabled asynchronous execution mode by default on Windows: https://github.com/wkcn/MobulaOP/blob/master/mobula/build_utils.py#L45

If MXNet and MobulaOP are built with the same compiler, the flag USING_ASYNC_EXEC can be enabled manually.

I will fix the issue when MXNet provides the CPackedFunc API.

damNull commented 5 years ago

Thanks : )