Larry-C opened this issue 5 years ago
Have you seen #219 and #210? They obtain slower timing than the ones reported but still much closer than yours.
It looks like your "net" time is much higher compared with #219.
Yes, and I don't know the reason. I also ran demo.py with the ctdet_coco_dla_2x weights downloaded from the MODEL_ZOO, and it still took about 0.23s.
I did some tests on a K80 instance (a much worse GPU than yours) with some sample images resized to 512x512:
Tot: 0:06:44 |ETA: 0:00:01 |tot 0.099s (0.101s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.096s (0.094s) |dec 0.001s (0.004s) |post 0.001s (0.002s) |merge 0.000s (0.000s)
Changed pin_memory to False in this line: https://github.com/xingyizhou/CenterNet/blob/master/src/test.py#L62
and on my GTX 1070, ran with
python src/test.py ctdet --load_model ./models/ctdet_coco_dla_2x.pth
The result is 22.73 fps with net time 0.039s.
By fixing the resolution, i.e. running with
python src/test.py ctdet --load_model ./models/ctdet_coco_dla_2x.pth --fix_res
the result is 25.04 fps with net time 0.035s.
When testing with pin_memory=True, it is only about 13~14 fps.
@ggsggs Could you check your result with pin_memory=False?
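As a rough sanity check (assuming the reported fps is simply the reciprocal of the average total time per image), the fps figures above line up with the per-image times:

```python
def fps_from_avg_time(avg_seconds):
    """Frames per second from an average per-image time (hypothetical helper)."""
    return 1.0 / avg_seconds

# 22.73 fps corresponds to roughly 0.044 s/image total, and 25.04 fps to
# roughly 0.040 s/image (net 0.035s plus pre/dec/post/merge overhead).
print(round(fps_from_avg_time(0.044), 1))  # -> 22.7
print(round(fps_from_avg_time(0.040), 1))  # -> 25.0
```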
Yes, @wangg12: I tried both pin_memory=False alone and pin_memory=False + --fix_res, per your request.
The timings are similar in my case; there are no significant differences, hovering between tot 0.095s and tot 0.100s.
I am using a retrained CenterNet with just 2 classes; I do not know how relevant that is for the timing.
@Larry-C: Instead of just forwarding one tensor through the net, have you tried forwarding multiple images in succession? I noticed that the timing decreases with each iteration:
e_90_save_pin_mem | | [0/3978]|Tot: 0:00:00 |ETA: 0:00:00 |tot 0.311s (0.311s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.305s (0.305s)
e_90_save_pin_mem | | [1/3978]|Tot: 0:00:00 |ETA: 0:24:30 |tot 0.119s (0.215s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.115s (0.210s)
e_90_save_pin_mem | | [2/3978]|Tot: 0:00:00 |ETA: 0:16:18 |tot 0.113s (0.181s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.109s (0.176s)
e_90_save_pin_mem | | [3/3978]|Tot: 0:00:00 |ETA: 0:13:26 |tot 0.107s (0.162s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.103s (0.158s)
...
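The number in parentheses in each log column is a running average, which is why it stays inflated for a while after the slow first iteration. A minimal sketch of that computation, using the per-iteration net times from the log above:

```python
# Running average as shown in parentheses in the log, e.g. "net 0.115s (0.210s)".
net_times = [0.305, 0.115, 0.109, 0.103]  # per-iteration net times from the log

running = []
total = 0.0
for i, t in enumerate(net_times):
    total += t
    running.append(round(total / (i + 1), 3))

print(running)  # -> [0.305, 0.21, 0.176, 0.158], matching the log's averages
```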
That is weird. Have you changed other things in the test script, like cudnn benchmark?
No, I haven't. Sorry, I edited my previous comment to add a note: I reported the timings for a retrained CenterNet that only detects 2 classes. I don't know how much that will affect the comparison.
I see that you posted the times for the first few images; you should benchmark by letting the network run over at least a few hundred images. The GPU is not fully warmed up at the beginning.
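One way to follow this advice is to time a few hundred iterations and discard the first ones before averaging. A minimal stdlib sketch, with a dummy workload standing in for the network forward pass (the function names here are illustrative, not from the repo):

```python
import time

def benchmark(fn, iters=300, warmup=20):
    """Average per-call time of fn, ignoring the first `warmup` calls."""
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    steady = times[warmup:]  # drop warm-up iterations, keep the stable tail
    return sum(steady) / len(steady)

# Dummy workload in place of a real forward pass such as model(inputs).
avg = benchmark(lambda: sum(i * i for i in range(1000)), iters=100, warmup=10)
print(f"steady-state average: {avg * 1e6:.1f} us/call")
```

Note that for a real GPU model you would also need to synchronize the device before reading the clock, since CUDA kernels launch asynchronously.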
In my previous response I wanted to show the diminishing times for Larry-C, but I did compare pin_memory=False when the timings were stable. For your reference:
e_90_save_pin_mem |## | [366/3978]|Tot: 0:00:37 |ETA: 0:06:07 |tot 0.097s (0.098s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.093s (0.094s) |dec 0.001s (0.002s) |post 0.001s (0.001s) |merge 0.000s (0.000s)
@ggsggs I ran test.py with pin_memory=False and --fix_res, and the result is
coco_dla | [37/5026]|Tot: 0:00:03 |ETA: 0:06:42 |tot 0.074s (0.077s) |load 0.000s (0.000s) |pre 0.001s (0.003s) |net 0.066s (0.067s) |dec 0.002s (0.002s) |post 0.004s (0.004s) |merge 0.000s (0.000s)
However, when I run on a new GTX TITAN the result is
det 0.077s
still far from the paper result. Forwarding multiple images also didn't help much; the running time didn't decrease. Thanks for your help!
So now you are at around 15 fps? At least it improved a bit from your previous ~4-5 fps, but it is still far from the paper results. I have no clue how to help with that 😟.
Thanks for your help!
No problem, glad I could hopefully help a bit
@Larry-C Which TITAN were you using? And did you benchmark the time without any other tasks running on the machine?
Thanks for the discussion. We also observed abnormal testing time on our Titan V GPU but didn't figure out why. We conjecture it is a hardware issue: on our Titan V, --flip_test runs faster than the non-flip test. I was not aware of the pin_memory trick; thanks for pointing it out.
I am also stuck on this problem; did you solve it?
I tested the fps with weights trained on my own dataset, but detecting 10 tensors of shape (1, 3, 512, 512) took about 2.35s, far from the paper result, so I wonder what is wrong with the test. Thanks!
output: 2.35
opt.arch: dla-34
env:
ubuntu 16.04
TITAN V, CUDA 9.0
Meanwhile, I ran python test.py ctdet --exp_id coco_dla and the result is:
tot 0.260s |load 0.005s |pre 0.020s |net 0.232s |dec 0.001s |post 0.001s |merge 0.000s
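For what it's worth, those numbers are internally consistent (assuming the 2.35s covers the 10 forward passes): about 0.235 s/image, close to the tot 0.260s line, i.e. roughly 4-5 fps rather than the paper's reported speed:

```python
total_seconds, n_images = 2.35, 10  # figures quoted in the comment above

per_image = total_seconds / n_images
print(round(per_image, 3))      # -> 0.235 (s/image), close to the tot 0.260s log line
print(round(1 / per_image, 1))  # -> 4.3 (fps)
```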