hengck23 opened this issue 8 years ago
Thanks for the suggestion. Would that decrease the accuracy? For example, the recall per class might be lower since you only select the max-class score per bbox.
Since there is a threshold (0.01) that gets rid of many boxes, sorting is relatively cheap. But nms does indeed take quite a bit of time, especially given that GPUs are getting faster and faster.
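To illustrate the point about thresholding making the sort cheap, here is a minimal C++ sketch (not the actual layer code; `Det`, `threshold_then_sort`, and the 0.01 value are illustrative):

```cpp
#include <algorithm>
#include <vector>

// Illustrative only: apply the confidence threshold first, so the
// O(n log n) sort only runs on the (usually few) surviving boxes.
struct Det { float score; int box_id; };

std::vector<Det> threshold_then_sort(const std::vector<Det>& dets,
                                     float conf_thresh) {
  std::vector<Det> kept;
  for (const Det& d : dets)
    if (d.score >= conf_thresh) kept.push_back(d);  // cheap O(n) filter
  // Sort only the survivors by descending score.
  std::sort(kept.begin(), kept.end(),
            [](const Det& a, const Det& b) { return a.score > b.score; });
  return kept;
}
```

With a 0.01 threshold, most low-confidence boxes never reach the sort at all, which is why the sort itself is rarely the bottleneck.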
@weiliu89 Thanks for the comment.
Would that decrease the accuracy? How about selecting the top K classes per box, e.g. K=3? My target is the mobile TX1. It seems that Thrust-based sorting (e.g. thrust::sort_by_key) is not friendly on such devices.
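The top-K idea can be sketched in plain C++ (illustrative only; `top_k_classes` is a made-up helper, not part of the SSD code):

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of keeping the top-K class scores per box instead
// of only the single max class. Returns the K class indices with the
// highest scores, in descending score order.
std::vector<int> top_k_classes(const std::vector<float>& class_scores, int k) {
  std::vector<int> idx(class_scores.size());
  for (int i = 0; i < static_cast<int>(idx.size()); ++i) idx[i] = i;
  // Partial sort: only the first k positions need to be ordered, which
  // is cheaper than fully sorting all class scores.
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return class_scores[a] > class_scores[b]; });
  idx.resize(k);
  return idx;
}
```

Using `std::partial_sort` keeps the per-box cost close to O(C log K) rather than O(C log C) for C classes.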
@hengck23 I think it will sacrifice accuracy even if you keep the top 3 classes per box. I am curious why Thrust is not friendly on the TX1? My recent code doesn't use Thrust anymore because the data transfer time between GPU and CPU is more than the sorting time on the CPU.
I think another solution could be to do nms on the detections from all classes together. Then there is only one 'giant' sort (hopefully the TX1 GPU is much faster than the CPU at this) and only one data transfer between GPU and CPU. It might speed up the nms step a little bit.
@weiliu89, @hengck23 Besides the memory copy overhead, Thrust is not friendly for the Tx1 because Thrust does not support fp16. In an fp16-based Tx1 application, one has to convert the fp16 inputs to fp32 for the Thrust library to sort in 32-bit.
@weishengchong Is it slow to convert fp16 to fp32? So you have an fp16 version of SSD? If so, is it 2x faster than the fp32 version?
@weiliu89 Tx1's fp32/fp16 conversion is efficient because it is a single instruction realized in hardware. FYI, in SASS, Tx1's __float2half() would be STL.U16 [R2], R0;
After reading your paper, I used Thrust to optimize our Faster-RCNN network, which is still in fp32. I have not yet tried SSD. FYI, attached is an fp16 vs. fp32 benchmark from the Tx1 white paper, which is based on cuDNN 4.
Thanks for the information!! nms is always a bottleneck, and needs much engineering to make it fast :)
For detection_output_layer.cu, nms is done per class. E.g.
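The per-class flow can be sketched in plain C++ (illustrative only; the real layer runs in CUDA and its types and names differ):

```cpp
#include <algorithm>
#include <vector>

// Illustrative per-class NMS: one sort and one greedy suppression pass
// per class, i.e. a loop over classes.
struct Box { float x1, y1, x2, y2; };
struct Det { int cls; float score; Box box; };

float iou(const Box& a, const Box& b) {
  float iw = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  float ih = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  float inter = iw * ih;
  float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
            + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return uni > 0.f ? inter / uni : 0.f;
}

std::vector<Det> per_class_nms(const std::vector<Det>& dets, int num_classes,
                               float iou_thresh) {
  std::vector<Det> out;
  for (int c = 0; c < num_classes; ++c) {   // the class loop
    std::vector<Det> cls;
    for (const Det& d : dets) if (d.cls == c) cls.push_back(d);
    std::sort(cls.begin(), cls.end(),       // one sort per class
              [](const Det& a, const Det& b) { return a.score > b.score; });
    for (const Det& d : cls) {              // one greedy NMS pass per class
      bool suppressed = false;
      for (const Det& k : out)
        if (k.cls == d.cls && iou(d.box, k.box) > iou_thresh) { suppressed = true; break; }
      if (!suppressed) out.push_back(d);
    }
  }
  return out;
}
```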
Since sorting is an expensive process, a more efficient way to do multiclass nms would be:
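A minimal C++ sketch of that idea, assuming suppression is still restricted to boxes of the same class (illustrative, not the actual implementation; names are made up):

```cpp
#include <algorithm>
#include <vector>

// Illustrative combined multiclass NMS: all detections from all classes
// go into one list, so there is a single sort and a single greedy pass
// instead of one per class.
struct Box { float x1, y1, x2, y2; };
struct Det { int cls; float score; Box box; };

float iou(const Box& a, const Box& b) {
  float iw = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  float ih = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  float inter = iw * ih;
  float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
            + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return uni > 0.f ? inter / uni : 0.f;
}

std::vector<Det> multiclass_nms(std::vector<Det> dets, float iou_thresh) {
  // One 'giant' sort over all (class, score) pairs together.
  std::sort(dets.begin(), dets.end(),
            [](const Det& a, const Det& b) { return a.score > b.score; });
  std::vector<Det> kept;
  for (const Det& d : dets) {
    bool suppressed = false;
    // Suppress only against kept boxes of the same class.
    for (const Det& k : kept)
      if (k.cls == d.cls && iou(d.box, k.box) > iou_thresh) { suppressed = true; break; }
    if (!suppressed) kept.push_back(d);
  }
  return kept;
}
```

Because the single sort sees the detections from every class at once, no per-class iteration is needed afterwards.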
This gets rid of the class loop.