zzzxxxttt / pytorch_simple_CenterNet_45

A simple pytorch implementation of CenterNet (Objects as Points)

RuntimeError: The size of tensor a (152) must match the size of tensor b (150) at non-singleton dimension 3 #19

Closed · Kongsea closed this issue 4 years ago

Kongsea commented 4 years ago

I trained on my custom dataset using this PyTorch CenterNet implementation, but the following error was raised:

RuntimeError: The size of tensor a (152) must match the size of tensor b (150) at non-singleton dimension 3

Traceback (most recent call last):
  File "train.py", line 238, in <module>
    main()
  File "train.py", line 227, in main
    train(epoch)
  File "train.py", line 149, in train
    hmap_loss = _neg_loss(hmap, batch['hmap'])
  File "pytorch_simple_CenterNet_45/utils/losses.py", line 47, in _neg_loss
    pos_loss = torch.log(pred) * torch.pow(1 - pred, 2) * pos_inds
RuntimeError: The size of tensor a (152) must match the size of tensor b (150) at non-singleton dimension 3

Could you help me to fix it? Thank you.

Kongsea commented 4 years ago

I trained the model using these configs:

'--dataset', 'coco', '--arch', 'resdcn_101', '--img_size', '600'

zzzxxxttt commented 4 years ago

The img_size should be divisible by 32 when using the ResDCN backbone; you could try img_size=576 or 608.
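
With img_size=600 the mismatch above is what you would expect: the ResDCN backbone downsamples by 32 (rounding up at each stride-2 stage) and then upsamples by 8, so the prediction map comes out as ceil(600/32)*8 = 152, while the target heatmap is built at 600/4 = 150. A minimal sketch of rounding a custom img_size to a multiple of 32 before training (the round_to_stride helper below is only an illustration, not part of this repo):

# Round an arbitrary image size to a multiple of the backbone stride.
# ResDCN downsamples by a factor of 32, so height and width must be
# divisible by 32 for the prediction and target heatmaps to line up.
def round_to_stride(size, stride=32):
    """Return the nearest multiple of `stride` that is >= `size`."""
    return ((size + stride - 1) // stride) * stride

print(round_to_stride(600))  # -> 608 (round up)
print(600 // 32 * 32)        # -> 576 (round down)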

Kongsea commented 4 years ago

Thank you. Now the above error has been fixed.

However, I now get RuntimeError: CUDA out of memory when training with an img_size of 608 or even smaller. Could you help me fix this as well? Thank you. @zzzxxxttt

Kongsea commented 4 years ago

I used 4 × 1080Ti GPUs to train the model.

zzzxxxttt commented 4 years ago

Using a smaller batch size may help.

Kongsea commented 4 years ago

I have used the smallest batch size of 4 with 4 GPUs, but the error was still raised...

zzzxxxttt commented 4 years ago

Well, if one image per GPU still causes OOM, the last resort is to use a smaller backbone.
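
As a sketch, a smaller-footprint config in the same style as the one quoted earlier (the resdcn_18 arch name and the --batch_size flag are assumptions about this repo's options, not confirmed here):

'--dataset', 'coco', '--arch', 'resdcn_18', '--img_size', '512', '--batch_size', '4'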

Kongsea commented 4 years ago

I have also tried small_hourglass and resdcn50, but the OOM error still occurs... @zzzxxxttt

zzzxxxttt commented 4 years ago

Try training the model on COCO with your img_size and batch size settings; if no error occurs, then there must be something wrong in your modifications.

Kongsea commented 4 years ago
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "pytorch_simple_CenterNet_45/nets/resdcn.py", line 227, in forward
    x = self.layer3(x)
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "pytorch_simple_CenterNet_45/nets/resdcn.py", line 71, in forward
    out = self.conv1(x)
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 2.13 GiB (GPU 0; 10.92 GiB total capacity; 257.03 MiB already allocated; 1.35 GiB free; 14.97 MiB cached)

[1/10-0/3459]  hmap_loss= 2.47419 reg_loss= 0.48312 w_h_loss= 14.17547 (19 samples/sec) [2020-08-06 16:18:24,319]
[1/10-100/3459]  hmap_loss= 2.31409 reg_loss= 0.23472 w_h_loss= 13.98645 (10 samples/sec) [2020-08-06 16:19:01,029]
[1/10-200/3459]  hmap_loss= 2.20654 reg_loss= 0.27148 w_h_loss= 8.71340 (11 samples/sec) [2020-08-06 16:19:34,465]
[1/10-300/3459]  hmap_loss= 2.31515 reg_loss= 0.23678 w_h_loss= 2.86637 (12 samples/sec) [2020-08-06 16:20:06,034]
[1/10-400/3459]  hmap_loss= 2.37423 reg_loss= 0.24649 w_h_loss= 12.40236 (13 samples/sec) [2020-08-06 16:20:36,087]
[1/10-500/3459]  hmap_loss= 2.41782 reg_loss= 0.28343 w_h_loss= 6.85204 (13 samples/sec) [2020-08-06 16:21:05,488]
[1/10-600/3459]  hmap_loss= 2.33895 reg_loss= 0.22797 w_h_loss= 5.50138 (12 samples/sec) [2020-08-06 16:21:36,287]
[1/10-700/3459]  hmap_loss= 2.32114 reg_loss= 0.24712 w_h_loss= 5.93947 (13 samples/sec) [2020-08-06 16:22:05,832]
[1/10-800/3459]  hmap_loss= 2.02382 reg_loss= 0.25070 w_h_loss= 5.02053 (12 samples/sec) [2020-08-06 16:22:36,965]
[1/10-900/3459]  hmap_loss= 2.38114 reg_loss= 0.24117 w_h_loss= 6.88888 (12 samples/sec) [2020-08-06 16:23:08,835]

Although the OOM error was raised, the training process continued and no new error was raised. The GPU usage is no more than 3 GB on all 4 GPUs.
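
If it helps to pin down where the memory goes, here is a minimal diagnostic sketch using PyTorch's built-in counters (a standalone example, not part of train.py):

import torch

# Peak tensor memory successfully allocated on each visible GPU, in GiB.
# A failed allocation (like the 2.13 GiB request in the traceback above)
# raises an OOM but never shows up in this counter, which is why reported
# usage can stay under 3 GB even though the error was raised.
for i in range(torch.cuda.device_count()):
    peak_gib = torch.cuda.max_memory_allocated(i) / 1024 ** 3
    print(f'GPU {i}: peak allocated {peak_gib:.2f} GiB')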

zzzxxxttt commented 4 years ago

Try setting this to False in train.py at line 79: torch.backends.cudnn.benchmark = True  # disable this if OOM at beginning of training
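
For clarity, the change being suggested in train.py (the line number is as quoted above and may differ across versions):

# train.py, around the line quoted above.
# cuDNN benchmarking lets cuDNN try several convolution algorithms, some of
# which request large temporary workspaces; on a nearly full GPU this can
# trigger an OOM right at the start of training even though steady-state
# usage is low.
torch.backends.cudnn.benchmark = False  # was True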

Kongsea commented 4 years ago

Thank you @zzzxxxttt. The error has been fixed after setting it to False.