I trained the model using these configs:
'--dataset', 'coco', '--arch', 'resdcn_101', '--img_size', '600'
The img_size should be divisible by 32 when using the ResDCN backbone; you could try img_size=576 or 608.
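A minimal sketch of picking a valid size, assuming only that the backbone's total downsampling stride is 32 (the helper name below is mine, not from the repo):

import math

def round_img_size(img_size, stride=32):
    # ResDCN downsamples the input by a total stride of 32, so each
    # side must be a multiple of 32; round up to the nearest valid size.
    return int(math.ceil(img_size / stride)) * stride

print(round_img_size(600))  # -> 608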
Thank you. The above error has now been fixed.
However, I get RuntimeError: CUDA out of memory.
when training with an img_size of 608 or even smaller. Could you help me again to fix it? Thank you. @zzzxxxttt
I used 4 × 1080Ti GPUs to train the model.
Using a smaller batch size may help.
I have used the smallest batch size of 4 with 4 GPUs, but the error was still raised...
Well, if one image per GPU still causes OOM, the last resort is to use a smaller backbone.
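For example, a smaller configuration might look like this (the arch name is one mentioned later in this thread; the 512 img_size is an assumption on my part, chosen to be divisible by 32):

'--dataset', 'coco', '--arch', 'small_hourglass', '--img_size', '512'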
I have also tried small_hourglass and resdcn50, but the OOM still exists... @zzzxxxttt
Try training the model on COCO with your img_size and batch size settings; if no error occurs, then there must be something wrong in your modification.
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "pytorch_simple_CenterNet_45/nets/resdcn.py", line 227, in forward
x = self.layer3(x)
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "pytorch_simple_CenterNet_45/nets/resdcn.py", line 71, in forward
out = self.conv1(x)
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
return self.conv2d_forward(input, self.weight)
File "anaconda/envs/mb/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 2.13 GiB (GPU 0; 10.92 GiB total capacity; 257.03 MiB already allocated; 1.35 GiB free; 14.97 MiB cached)
[1/10-0/3459] hmap_loss= 2.47419 reg_loss= 0.48312 w_h_loss= 14.17547 (19 samples/sec) [2020-08-06 16:18:24,319]
[1/10-100/3459] hmap_loss= 2.31409 reg_loss= 0.23472 w_h_loss= 13.98645 (10 samples/sec) [2020-08-06 16:19:01,029]
[1/10-200/3459] hmap_loss= 2.20654 reg_loss= 0.27148 w_h_loss= 8.71340 (11 samples/sec) [2020-08-06 16:19:34,465]
[1/10-300/3459] hmap_loss= 2.31515 reg_loss= 0.23678 w_h_loss= 2.86637 (12 samples/sec) [2020-08-06 16:20:06,034]
[1/10-400/3459] hmap_loss= 2.37423 reg_loss= 0.24649 w_h_loss= 12.40236 (13 samples/sec) [2020-08-06 16:20:36,087]
[1/10-500/3459] hmap_loss= 2.41782 reg_loss= 0.28343 w_h_loss= 6.85204 (13 samples/sec) [2020-08-06 16:21:05,488]
[1/10-600/3459] hmap_loss= 2.33895 reg_loss= 0.22797 w_h_loss= 5.50138 (12 samples/sec) [2020-08-06 16:21:36,287]
[1/10-700/3459] hmap_loss= 2.32114 reg_loss= 0.24712 w_h_loss= 5.93947 (13 samples/sec) [2020-08-06 16:22:05,832]
[1/10-800/3459] hmap_loss= 2.02382 reg_loss= 0.25070 w_h_loss= 5.02053 (12 samples/sec) [2020-08-06 16:22:36,965]
[1/10-900/3459] hmap_loss= 2.38114 reg_loss= 0.24117 w_h_loss= 6.88888 (12 samples/sec) [2020-08-06 16:23:08,835]
Although the OOM error was raised, the training process continued and no new error was raised... The GPU usage is no more than 3 GB on all 4 GPUs...
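For reference, a minimal sketch of checking per-GPU usage from inside the training script with standard PyTorch calls (on older PyTorch versions torch.cuda.memory_reserved was named memory_cached):

import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024 ** 3  # GiB held by live tensors
    reserved = torch.cuda.memory_reserved(i) / 1024 ** 3    # GiB held by the caching allocator
    print(f'GPU {i}: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved')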
Try setting this to False in train.py, line #79:
torch.backends.cudnn.benchmark = True # disable this if OOM at beginning of training
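With benchmark mode enabled, cuDNN times several convolution algorithms during the first iterations, and some candidates allocate large temporary workspaces; that is why the OOM can appear only at the very beginning of training. The change is simply:

torch.backends.cudnn.benchmark = False  # use default algorithm selection; skips the memory-hungry autotuning probes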
Thank you @zzzxxxttt. The error was fixed after setting it to False.
I trained on my custom dataset with this PyTorch CenterNet, but the following error was raised:
RuntimeError: The size of tensor a (152) must match the size of tensor b (150) at non-singleton dimension 3
Could you help me to fix it? Thank you.
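Given the earlier discussion, a mismatch like 152 vs. 150 at a feature-map dimension is consistent with an input size that is not divisible by the backbone stride (note that 152 × 4 = 608 and 150 × 4 = 600, matching CenterNet's output stride of 4). A quick sanity check, with hypothetical sizes (600 deliberately fails):

img_h, img_w = 608, 600  # hypothetical values; substitute your actual input size
assert img_h % 32 == 0, 'height must be divisible by 32 for the ResDCN backbone'
assert img_w % 32 == 0, 'width must be divisible by 32 for the ResDCN backbone'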