hengck23 opened this issue 8 years ago
Thanks for the suggestion. Would that decrease the accuracy? For example, the recall per class might be lower since you only select the max-class score per bbox.
Since there is a threshold (0.01) that gets rid of many boxes, sorting is relatively cheap. But nms does indeed take quite a bit of time, especially given that GPUs are getting faster and faster.
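To illustrate the point about thresholding making the sort cheap, here is a minimal C++ sketch (not the actual layer code; `Det`, `threshold_then_sort`, and the 0.01 value are illustrative):

```cpp
#include <algorithm>
#include <vector>

// Illustrative only: apply the confidence threshold first, so the
// O(n log n) sort only runs on the (usually few) surviving boxes.
struct Det { float score; int box_id; };

std::vector<Det> threshold_then_sort(const std::vector<Det>& dets,
                                     float conf_thresh) {
  std::vector<Det> kept;
  for (const Det& d : dets)
    if (d.score >= conf_thresh) kept.push_back(d);  // cheap O(n) filter
  // Sort only the survivors by descending score.
  std::sort(kept.begin(), kept.end(),
            [](const Det& a, const Det& b) { return a.score > b.score; });
  return kept;
}
```

With a 0.01 threshold, most low-confidence boxes never reach the sort at all, which is why the sort itself is rarely the bottleneck.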
@weiliu89 Thanks for the comment.
Would that decrease the accuracy? How about selecting the top K classes per box, e.g. K=3? My target is the mobile TX1. It seems that Thrust-based sorting (e.g. thrust::sort_by_key) is not friendly on such devices.
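The top-K idea can be sketched in plain C++ (illustrative only; `top_k_classes` is a made-up helper, not part of the SSD code):

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of keeping the top-K class scores per box instead
// of only the single max class. Returns the K class indices with the
// highest scores, in descending score order.
std::vector<int> top_k_classes(const std::vector<float>& class_scores, int k) {
  std::vector<int> idx(class_scores.size());
  for (int i = 0; i < static_cast<int>(idx.size()); ++i) idx[i] = i;
  // Partial sort: only the first k positions need to be ordered, which
  // is cheaper than fully sorting all class scores.
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return class_scores[a] > class_scores[b]; });
  idx.resize(k);
  return idx;
}
```

Using `std::partial_sort` keeps the per-box cost close to O(C log K) rather than O(C log C) for C classes.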
@hengck23 I think it will sacrifice accuracy even if you keep the top 3 classes per box. I am curious why Thrust is not friendly on the TX1? My recent code doesn't use Thrust anymore because the data transfer time between GPU and CPU is more than the sorting time on the CPU.
I think another solution could be to do nms on the detections from all classes together. Then there is only one 'giant' sort (hopefully the TX1 GPU is much faster than the CPU at this) and only one data transfer between GPU and CPU. It might speed up the nms step a little bit.
@weiliu89, @hengck23 Besides the memory copy overhead, Thrust is not friendly for the Tx1 because Thrust does not support fp16. In an fp16-based Tx1 application, one has to convert the fp16 inputs to fp32 for the Thrust library to sort in 32-bit.
@weishengchong Is it slow to convert fp16 to fp32? So you have an fp16 version of SSD? If so, is it 2x faster than the fp32 version?
@weiliu89 Tx1's fp32/fp16 conversion is efficient because it is a single instruction realized in hardware. FYI, in SASS, Tx1's __float2half() would be STL.U16 [R2], R0;
After reading your paper, I used Thrust to optimize our Faster-RCNN network, which is still in fp32. I have not yet tried SSD. FYI, attached is an fp16 vs. fp32 benchmark from the Tx1 white paper, which is based on cuDNN 4.
Thanks for the information!! nms is always a bottleneck, and needs much engineering to make it fast :)
For detection_output_layer.cu, nms is done per class. E.g.
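The per-class flow can be sketched in plain C++ (illustrative only; the real layer runs in CUDA and its types and names differ):

```cpp
#include <algorithm>
#include <vector>

// Illustrative per-class NMS: one sort and one greedy suppression pass
// per class, i.e. a loop over classes.
struct Box { float x1, y1, x2, y2; };
struct Det { int cls; float score; Box box; };

float iou(const Box& a, const Box& b) {
  float iw = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  float ih = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  float inter = iw * ih;
  float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
            + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return uni > 0.f ? inter / uni : 0.f;
}

std::vector<Det> per_class_nms(const std::vector<Det>& dets, int num_classes,
                               float iou_thresh) {
  std::vector<Det> out;
  for (int c = 0; c < num_classes; ++c) {   // the class loop
    std::vector<Det> cls;
    for (const Det& d : dets) if (d.cls == c) cls.push_back(d);
    std::sort(cls.begin(), cls.end(),       // one sort per class
              [](const Det& a, const Det& b) { return a.score > b.score; });
    for (const Det& d : cls) {              // one greedy NMS pass per class
      bool suppressed = false;
      for (const Det& k : out)
        if (k.cls == d.cls && iou(d.box, k.box) > iou_thresh) { suppressed = true; break; }
      if (!suppressed) out.push_back(d);
    }
  }
  return out;
}
```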
Since sorting is an expensive process, a more efficient way to do multiclass nms would be:
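A minimal C++ sketch of that idea, assuming suppression is still restricted to boxes of the same class (illustrative, not the actual implementation; names are made up):

```cpp
#include <algorithm>
#include <vector>

// Illustrative combined multiclass NMS: all detections from all classes
// go into one list, so there is a single sort and a single greedy pass
// instead of one per class.
struct Box { float x1, y1, x2, y2; };
struct Det { int cls; float score; Box box; };

float iou(const Box& a, const Box& b) {
  float iw = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  float ih = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  float inter = iw * ih;
  float uni = (a.x2 - a.x1) * (a.y2 - a.y1)
            + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return uni > 0.f ? inter / uni : 0.f;
}

std::vector<Det> multiclass_nms(std::vector<Det> dets, float iou_thresh) {
  // One 'giant' sort over all (class, score) pairs together.
  std::sort(dets.begin(), dets.end(),
            [](const Det& a, const Det& b) { return a.score > b.score; });
  std::vector<Det> kept;
  for (const Det& d : dets) {
    bool suppressed = false;
    // Suppress only against kept boxes of the same class.
    for (const Det& k : kept)
      if (k.cls == d.cls && iou(d.box, k.box) > iou_thresh) { suppressed = true; break; }
    if (!suppressed) kept.push_back(d);
  }
  return kept;
}
```

Because the single sort sees the detections from every class at once, no per-class iteration is needed afterwards.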
This gets rid of the class loop.