wkcn / MobulaOP

A Simple & Flexible Cross Framework Operators Toolkit
MIT License

ROIAlign custom op runs slowly #34

Closed: makefile closed this issue 4 years ago

makefile commented 5 years ago

Hi, I have tried the ROIAlign custom op provided by this repo with the Faster R-CNN example. I simply replaced the symbol code:

roi_pool = mx.symbol.ROIPooling(name='roi_pool', data=conv_new_1_relu, rois=rois, 
           pooled_size=(7, 7), spatial_scale=spatial_scale)

with

roi_pool = mobula.op.ROIAlign(name='roi_pool', data=conv_new_1_relu, rois=rois,
           pooled_size=(7, 7), spatial_scale=spatial_scale, sampling_ratio=0)

The running time per iteration increases from 0.1 s to 1~2 s, and when using multiple GPUs the code does not run in parallel and becomes much slower. My MXNet version is 1.3.0-cu92, installed from pip. What might be the problem?

wkcn commented 5 years ago

It is a performance problem: MobulaOP calls wait_to_read and wait_to_write on the inputs and outputs, and it is implemented in Python, which hurts computation performance.

I will refactor it in C++.

For the ROIAlign operator, I recommend mxnet.symbol.contrib.ROIAlign.
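
For reference, a drop-in replacement for the snippet above might look like the sketch below; the exact argument name (sample_ratio) and its availability depend on the MXNet version, so please check the API docs of your release.

roi_pool = mx.symbol.contrib.ROIAlign(name='roi_pool', data=conv_new_1_relu, rois=rois,
           pooled_size=(7, 7), spatial_scale=spatial_scale, sample_ratio=0)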

makefile commented 5 years ago

Thanks for your great work; I am looking forward to the refactor. Writing a C++ layer and then compiling the MXNet source tree is really painful. It would be great if MobulaOP ran faster than a Python CustomOp based on NumPy.

wkcn commented 5 years ago

Thanks! : )

wkcn commented 5 years ago

@makefile Hi! MobulaOP now supports asynchronous execution with MXNet on Linux, and it should run faster.

hustzxd commented 5 years ago

Thanks for your wonderful work. By the way, can the code run on multiple GPUs?

makefile commented 5 years ago

@wkcn Thanks very much!

wkcn commented 5 years ago

@hustzxd Sorry, I do not have a multi-GPU server to test it on right now. In theory it works on multiple GPUs.

YutingZhang commented 5 years ago

@wkcn Thanks for your amazing work. It can run on multiple GPUs. The problem is that wait_to_read may block computation on the other GPUs as well, which results in low performance with multiple GPUs.
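
A minimal illustration of what I mean, with arbitrary shapes (plain MXNet code, not MobulaOP):

import mxnet as mx

# Work submitted to two GPUs can overlap, because the frontend thread
# only enqueues operations into MXNet's engine and returns immediately.
a = mx.nd.random.uniform(shape=(4096, 4096), ctx=mx.gpu(0))
b = mx.nd.random.uniform(shape=(4096, 4096), ctx=mx.gpu(1))
out0 = mx.nd.dot(a, a)

# A blocking call like wait_to_read() stops the frontend thread here until
# gpu(0) finishes, so the op below is not even enqueued on gpu(1) yet --
# the two GPUs end up running one after the other.
out0.wait_to_read()
out1 = mx.nd.dot(b, b)
mx.nd.waitall()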

wkcn commented 5 years ago

@YutingZhang Currently, on Linux, MobulaOP supports asynchronous execution and does not call wait_to_read.

wkcn commented 5 years ago

Hi @makefile and @YutingZhang, I have added a TVM bridge to MobulaOP, and MobulaOP now enables asynchronous execution for MXNet by default. There is no longer any wait_to_read lock to synchronize. Moreover, MobulaOP works in a multi-GPU context.
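
A rough usage sketch on two GPUs; the load call and keyword names follow the repo's examples, but treat the exact calling convention and the shapes as assumptions:

import mxnet as mx
import mobula

mobula.op.load('ROIAlign')  # load the ROIAlign operator, as in the repo's examples

for ctx in [mx.gpu(0), mx.gpu(1)]:
    data = mx.nd.random.uniform(shape=(1, 256, 38, 50), ctx=ctx)
    rois = mx.nd.array([[0, 10, 10, 100, 100]], ctx=ctx)  # (batch_idx, x1, y1, x2, y2)
    # The call only enqueues work on this device; no wait_to_read is issued,
    # so both GPUs can compute concurrently.
    out = mobula.op.ROIAlign(data=data, rois=rois, pooled_size=(7, 7),
                             spatial_scale=1.0 / 16, sampling_ratio=0)

mx.nd.waitall()  # synchronize once at the end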

makefile commented 5 years ago

Thanks! @wkcn

YutingZhang commented 5 years ago

Awesome! Thank you. @wkcn

wkcn commented 4 years ago

A simple benchmark: https://github.com/wkcn/MobulaOP/blob/master/benchmark/benchmark_roi_align.py (MXNet is the baseline, 100%).

MobulaOP (commit 83e10986740410aab314b29fd5ad3dfa2d3a8b47) uses the default configuration (OpenMP disabled).

Implementation   time (s)   ratio
MobulaOP         36.303     96.96%
MXNet            35.199     100%

Implementation   time (s)   ratio
MobulaOP         43.229     99.94%
MXNet            43.201     100%
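
The linked script is the reference; a minimal sketch of the timing idea (shapes, ROI count and iteration number are made up here, and the MobulaOP calling convention follows the repo's examples) would be something like:

import time
import mxnet as mx
import mobula

mobula.op.load('ROIAlign')  # as in the repo's examples

ctx = mx.gpu(0)
data = mx.nd.random.uniform(shape=(8, 256, 38, 50), ctx=ctx)
rois = mx.nd.array([[0, 10, 10, 200, 200]] * 128, ctx=ctx)

def bench(fn, n=1000):
    fn(); mx.nd.waitall()      # warm-up, then drain pending work
    t0 = time.time()
    for _ in range(n):
        fn()
    mx.nd.waitall()            # block only once, after everything is enqueued
    return time.time() - t0

t_mx = bench(lambda: mx.nd.contrib.ROIAlign(data=data, rois=rois,
             pooled_size=(7, 7), spatial_scale=1.0 / 16))
t_mob = bench(lambda: mobula.op.ROIAlign(data=data, rois=rois,
              pooled_size=(7, 7), spatial_scale=1.0 / 16, sampling_ratio=0))
print('MXNet: %.3f s  MobulaOP: %.3f s' % (t_mx, t_mob))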

The issue has been addressed. Closing it : )