xingyizhou / CenterNet

Object detection, 3D detection, and pose estimation using center point detection:

fps doubts #247

Open Larry-C opened 5 years ago

Larry-C commented 5 years ago

# imports assumed from the CenterNet src/lib layout
import time

import torch

from opts import opts
from models.model import create_model, load_model
from datasets.dataset_factory import dataset_factory


def dec_fps(opt):
    Dataset = dataset_factory[opt.dataset]
    opt = opts().update_dataset_info_and_set_heads(opt, Dataset)

    # dummy input: a single 3x512x512 image
    x = torch.ones(1, 3, 512, 512).cuda()

    model = create_model(opt.arch, opt.heads, opt.head_conv)
    model = load_model(model, opt.load_model)
    model = model.cuda()
    model.eval()

    # time 10 forward passes on the dummy input
    start = time.time()
    for i in range(10):
        output = model(x)[-1]
    end = time.time()
    tol = end - start
    print('tol: ', tol)


if __name__ == '__main__':
    opt = opts().parse()
    dec_fps(opt)

output: 2.35

opt.arch: dla-34

env:
ubuntu 16.04
TITAN V cuda 9.0

Meanwhile, I ran python test.py ctdet --exp_id coco_dla and the result is:

  Tot: 0:05:40 |ETA: 0:00:01 |tot 0.340 |load 0.031 |pre 0.015 |net 0.288 |dec 0.002 |post 0.004 |merge 0.000 

I tested the fps with a weight trained on my own dataset; however, 10 forward passes on a (1, 3, 512, 512) tensor took about 2.35 s, far from the paper's result. So I wonder what is wrong with the test? Thanks!
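
One detail worth noting about the snippet above: CUDA kernels launch asynchronously, so wrapping the loop in time.time() without a synchronize can mis-measure the actual compute, and the first forward passes pay one-off cuDNN/driver initialization costs. A minimal sketch of a more careful timing loop (reusing the model and x defined above; the helper name and iteration counts are arbitrary) might look like:

def benchmark(model, x, warmup=20, iters=100):
    model.eval()
    with torch.no_grad():
        # warm-up passes: let cuDNN choose algorithms and the GPU reach steady state
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()  # wait for all queued kernels before starting the clock
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()  # make sure the timed kernels have actually finished
        elapsed = time.time() - start
    print('avg net time: %.4fs, ~%.1f fps' % (elapsed / iters, iters / elapsed))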

ggsggs commented 5 years ago

Have you seen #219 and #210? They obtain slower timings than the reported ones, but still much closer to them than yours.

It looks like your "net" time is much higher compared with #219.

Larry-C commented 5 years ago

Yes, and I don't know the reason. I also ran demo.py with the ctdet_coco_dla_2x weight downloaded from the MODEL_ZOO; it still took about 0.23 s.

ggsggs commented 5 years ago

I did some tests on a K80 instance (much worse than your GPU) with some sample images resized to 512x512:

Tot: 0:06:44 |ETA: 0:00:01 |tot 0.099s (0.101s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.096s (0.094s) |dec 0.001s (0.004s) |post 0.001s (0.002s) |merge 0.000s (0.000s)

wangg12 commented 5 years ago

I changed pin_memory to False in this line https://github.com/xingyizhou/CenterNet/blob/master/src/test.py#L62 and, on my GTX 1070, ran

python src/test.py ctdet --load_model ./models/ctdet_coco_dla_2x.pth

The result is

22.73 fps with net time 0.039s

Fixing the resolution as well, i.e. running with

python src/test.py ctdet --load_model ./models/ctdet_coco_dla_2x.pth --fix_res

The result is

25.04 fps with net time 0.035s

When testing with pin_memory=True, it is only about 13~14 fps.
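
For anyone looking for the exact change: the line linked above constructs the test DataLoader, and the edit is just flipping its pin_memory argument. Roughly (the surrounding argument names are recalled from test.py and may differ slightly in the current source):

data_loader = torch.utils.data.DataLoader(
    PrefetchDataset(opt, dataset, detector.pre_process),  # dataset wrapper used by test.py
    batch_size=1, shuffle=False, num_workers=1,
    pin_memory=False)  # originally True; False gave the speedup reported above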

wangg12 commented 5 years ago

@ggsggs Could you check your result for pin_memory=False?

ggsggs commented 5 years ago

Yes, @wangg12: I tried both pin_memory=False and pin_memory=False + --fix_res, per your request. The timings are similar in my case; there are no significant differences, hovering between tot = 0.095 and tot = 0.100. I am using a retrained CenterNet with just 2 classes; I do not know how relevant that is for the timing.

@Larry-C: Instead of just forwarding one tensor through the net, have you tried forwarding multiple images in succession? I noticed that the timing decreases with each iteration:

e_90_save_pin_mem | | [0/3978]|Tot: 0:00:00 |ETA: 0:00:00 |tot 0.311s (0.311s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.305s (0.305s)
e_90_save_pin_mem | | [1/3978]|Tot: 0:00:00 |ETA: 0:24:30 |tot 0.119s (0.215s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.115s (0.210s)
e_90_save_pin_mem | | [2/3978]|Tot: 0:00:00 |ETA: 0:16:18 |tot 0.113s (0.181s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.109s (0.176s)
e_90_save_pin_mem | | [3/3978]|Tot: 0:00:00 |ETA: 0:13:26 |tot 0.107s (0.162s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.103s (0.158s)
.
.
.

wangg12 commented 5 years ago

That is weird. Have you changed other things in the test script, like cudnn benchmark?
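
For context, the cudnn benchmark setting referred to here is the standard PyTorch flag; with fixed input sizes it lets cuDNN autotune convolution algorithms, which can noticeably change the measured net time:

import torch
torch.backends.cudnn.benchmark = True  # autotune conv algorithms for fixed input shapes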

ggsggs commented 5 years ago

No, I haven't. Sorry, I edited my previous comment to add a note: the timings I reported are for a retrained CenterNet that only detects 2 classes. I don't know how much that affects the comparison.

wangg12 commented 5 years ago

I noticed that you posted the times for the first few images. I think you should benchmark by letting the network run over at least a few hundred images; at the beginning the GPU is not fully warmed up.
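
As a rough illustration (a hypothetical helper, not part of the repo): collect the per-image net times and average them only after discarding the warm-up portion, e.g.

def average_fps(net_times, discard=100):
    # drop the first measurements, taken while the GPU is still warming up
    stable = net_times[discard:]
    return len(stable) / sum(stable)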

ggsggs commented 5 years ago

> I noticed that you posted the times for the first few images. I think you should benchmark by letting the network run over at least a few hundred images; at the beginning the GPU is not fully warmed up.

In my previous response, I wanted to show Larry-C the diminishing times, but I did compare pin_memory=False once the timings were stable. For your reference:

e_90_save_pin_mem |##        | [366/3978]|Tot: 0:00:37 |ETA: 0:06:07 |tot 0.097s (0.098s) |load 0.000s (0.000s) |pre 0.001s (0.001s) |net 0.093s (0.094s) |dec 0.001s (0.002s) |post 0.001s (0.001s) |merge 0.000s (0.000s) 

Larry-C commented 5 years ago

@ggsggs I ran test.py with pin_memory=False and --fix_res, and the result is

  coco_dla | [37/5026]|Tot: 0:00:03 |ETA: 0:06:42 |tot 0.074s (0.077s) |load 0.000s (0.000s) |pre 0.001s (0.003s) |net 0.066s (0.067s) |dec 0.002s (0.002s) |post 0.004s (0.004s) |merge 0.000s (0.000s) 

However, when I run it on a new GTX TITAN, the result is

  det 0.077s

Still far from the paper's result. Also, forwarding multiple images didn't help much; the running time didn't decrease. Thanks for your help!

ggsggs commented 5 years ago

> @ggsggs I ran test.py with pin_memory=False and --fix_res, and the result is
>
>   coco_dla | [37/5026]|Tot: 0:00:03 |ETA: 0:06:42 |tot 0.074s (0.077s) |load 0.000s (0.000s) |pre 0.001s (0.003s) |net 0.066s (0.067s) |dec 0.002s (0.002s) |post 0.004s (0.004s) |merge 0.000s (0.000s)

So now you are at around 15 fps? At least it improved a bit from your previous ~4-5 fps, but it is still far from the paper's results. I have no clue how to help with that 😟.

> Thanks for your help!

No problem, glad I could hopefully help a bit.

wangg12 commented 5 years ago

@Larry-C Which TITAN were you using? And did you benchmark the time without any other tasks running on the machine?

xingyizhou commented 5 years ago

Thanks for the discussion. We also observed abnormal testing times on our Titan V GPU but didn't figure out why. We conjecture it is a hardware issue: on our Titan V, --flip_test runs faster than the non-flip test. I was not aware of the pin_memory trick; thanks for pointing it out.

zhangbohnu commented 5 years ago

I am also stuck in this dilemma; did you solve the problem?

tot 0.260s |load 0.005s |pre 0.020s |net 0.232s |dec 0.001s |post 0.001s |merge 0.000s |