Training Resnest50 backbone in KeypointRCNN has large loss value??

ztrobertyang commented 3 years ago

Hello,

I am interested in ResNeSt and found your source code here on GitHub. I see that the code is adapted from the PyTorch ResNet source code, so I thought it might work as a backbone for Keypoint R-CNN in PyTorch: (link here)

Keypoint R-CNN builds on Mask R-CNN to detect the keypoints of the human body. The link above shows how to combine "resnet50" with FPN as the backbone for keypoint detection. I tried to plug ResNeSt into the function "resnet_fpn_backbone()", which adds FPN to the backbone so that the backbone can be passed to the KeypointRCNN function. To do this, I modified "resnet_fpn_backbone()" (the source code is: in here). I removed the line:

backbone = resnet.__dict__[backbone_name](pretrained=pretrained, norm_layer=norm_layer)

Then I added the lines:

from resnest.torch import resnest50
backbone = resnest50(pretrained=True, norm_layer=norm_layer)

After that, I loaded my human keypoint data and trained the KeypointRCNN model with the plan below:

learning rate: 0.01
learning rate schedule: reduce to 1/10 after 60 epochs
backbone trainable layer: 1
backbone: using pre-trained weights
train for 200 epochs
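
The learning-rate plan above can be written out as a small function. This is only a sketch under an assumption: it reads "60 epoch later reduce 1/10" as a single drop to one tenth at epoch 60. The name `lr_at_epoch` and its `milestones` and `gamma` parameters are illustrative and not taken from the original training script.

```python
def lr_at_epoch(epoch, base_lr=0.01, milestones=(60,), gamma=0.1):
    """Learning rate for a 0-based epoch under a step schedule.

    Assumption: the rate is multiplied by `gamma` once for every
    milestone that has already been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # Print the rate just before and after the assumed drop at epoch 60.
    for epoch in (0, 59, 60, 199):
        print(epoch, lr_at_epoch(epoch))
```

In a real PyTorch script the same shape of schedule is available as `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[60]` and `gamma=0.1`.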

With this plan, I train the keypoint head, the keypoint predictor, and backbone layer 4. However, I ran into a problem with "resnest50": the training loss is extremely large from the very beginning, and training stops. Part of the training log is shown below:

Epoch: [1] [ 0/415] eta: 0:19:05 lr: 0.020000 loss: 9694228665860096.0000 (9694228665860096.0000) loss_classifier: 1607300510908416.0000 (1607300510908416.0000) loss_box_reg: 1338518907387904.0000 (1338518907387904.0000) loss_keypoint: 6723557090394112.0000 (6723557090394112.0000) loss_objectness: 8629163393024.0000 (8629163393024.0000) loss_rpn_box_reg: 16223118032896.0000 (16223118032896.0000) backbone_lr: 0.0020 (0.0020) time: 2.7608 data: 1.8687 max mem: 5360
Epoch: [1] [400/415] eta: 0:00:15 lr: 0.020000 loss: 290602221568.0000 (7552529166450680.0000) loss_classifier: 2527694848.0000 (796015671153475.2500) loss_box_reg: 4396419072.0000 (588817204939649.2500) loss_keypoint: 11655905280.0000 (3677645552211085.0000) loss_objectness: 13333134.0000 (1833580384001665.5000) loss_rpn_box_reg: 19661014.0000 (656470306177560.1250) backbone_lr: 0.0020 (0.0020) time: 1.0459 data: 0.0146 max mem: 5374
Epoch: [1] [414/415] eta: 0:00:01 lr: 0.020000 loss: 10113544.0000 (7307309145100474.0000) loss_classifier: 1084518.1250 (771343789305753.6250) loss_box_reg: 827791.3125 (571295473125373.3750) loss_keypoint: 1400615.0000 (3556088035860469.5000) loss_objectness: 2333587.5000 (1773060066997531.2500) loss_rpn_box_reg: 1333699.0000 (635521733803018.2500) backbone_lr: 0.0020 (0.0020) time: 1.0415 data: 0.0142 max mem: 5374
Epoch: [1] Total time: 0:07:11 (1.0408 s / it)
Validation: [ 0/100] eta: 0:00:42 loss: 3432679328448512.0000 (3432679328448512.0000) loss_classifier: 885848748851200.0000 (885848748851200.0000) loss_box_reg: 2463024157818880.0000 (2463024157818880.0000) loss_keypoint: 83709944397824.0000 (83709944397824.0000) loss_objectness: 18419695616.0000 (18419695616.0000) loss_rpn_box_reg: 77873094656.0000 (77873094656.0000) pixDist: 0.0000 (0.0000) model_time: 0.1368 (0.1368) time: 0.4294 data: 0.2891 max mem: 5374
Validation: [ 99/100] eta: 0:00:00 loss: 6945174.5000 (97094747778534.3750) loss_classifier: 2456236.0000 (13676220802396.5918) loss_box_reg: 2116573.5000 (29444467474317.7188) loss_keypoint: 83345.1797 (5494503879989.7949) loss_objectness: 310360.0000 (47935150848000.5703) loss_rpn_box_reg: 158117.8438 (544403231660.1614) pixDist: 0.0000 (0.0000) model_time: 0.0909 (0.1050) time: 0.1035 data: 0.0030 max mem: 5374
Validation: Total time: 0:00:11 (0.1138 s / it)
Averaged stats: loss: 6945174.5000 (97094747778534.3750) loss_classifier: 2456236.0000 (13676220802396.5918) loss_box_reg: 2116573.5000 (29444467474317.7188) loss_keypoint: 83345.1797 (5494503879989.7949) loss_objectness: 310360.0000 (47935150848000.5703) loss_rpn_box_reg: 158117.8438 (544403231660.1614) pixDist: 0.0000 (0.0000) model_time: 0.0909 (0.1050)
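
To compare runs without reading the raw log by eye, the `name: value (average)` pairs in the log above can be pulled out with a regular expression. This is a sketch and not part of the training script: `parse_metrics`, `suspicious_metrics`, and the `1e6` threshold are hypothetical, and reading the number in parentheses as a running average is an assumption about the logger.

```python
import math
import re

# A plain decimal or scientific number, or the literals "nan" / "inf".
_NUM = r"[-+]?(?:\d+\.?\d*(?:[eE][-+]?\d+)?|nan|inf)"
# "name: value (average)" as it appears in the log lines above.
_METRIC_RE = re.compile(rf"(\w+): ({_NUM}) \(({_NUM})\)")


def parse_metrics(line):
    """Map each metric name in a log line to (value, average) floats."""
    return {
        name: (float(value), float(average))
        for name, value, average in _METRIC_RE.findall(line)
    }


def suspicious_metrics(metrics, threshold=1e6):
    """Names of loss terms that are non-finite or larger than `threshold`.

    The threshold is an arbitrary illustrative cut-off, not a known limit.
    """
    return [
        name
        for name, (value, _average) in metrics.items()
        if name.startswith("loss")
        and (not math.isfinite(value) or value > threshold)
    ]
```

Calling `suspicious_metrics(parse_metrics(line))` on each line of a saved log gives a quick list of which loss terms blew up first. Fields without a parenthesised second number, such as `lr` or `time`, are skipped by the pattern.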

As you can see, the training loss is extremely large; for example, at "Epoch: [1] [ 0/415]" the loss is "9694228665860096.0000 (9694228665860096.0000)". Training eventually stops because the loss becomes NaN. I guessed that the frozen layers of the backbone might be causing the problem, so I set "backbone trainable layer" to 2 and lowered the learning rate to 0.001. However, Keypoint R-CNN still stops training in the first epoch, again because the loss is NaN. I cannot understand this. Do you have any idea what the cause might be?
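
For reference, the "backbone trainable layer" setting can be pictured as a prefix filter on parameter names. The sketch below is pure Python and only imitates the selection logic that torchvision's `resnet_fpn_backbone()` appears to use; `select_trainable`, `UNFREEZE_ORDER`, and the example parameter names are hypothetical.

```python
# Order in which blocks are unfrozen, counted from the output side of the backbone.
UNFREEZE_ORDER = ["layer4", "layer3", "layer2", "layer1", "conv1"]


def select_trainable(param_names, trainable_layers):
    """Return the parameter names that stay trainable.

    A parameter stays trainable if its name starts with one of the
    first `trainable_layers` entries of UNFREEZE_ORDER; every other
    parameter would be frozen in a real model.
    """
    if not 0 <= trainable_layers <= len(UNFREEZE_ORDER):
        raise ValueError("trainable_layers must be between 0 and 5")
    prefixes = tuple(UNFREEZE_ORDER[:trainable_layers])
    return [name for name in param_names if name.startswith(prefixes)]


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    names = [
        "conv1.0.weight",
        "layer1.0.conv1.weight",
        "layer3.5.conv3.weight",
        "layer4.0.conv1.weight",
        "fc.weight",
    ]
    print(select_trainable(names, 1))
    print(select_trainable(names, 2))
```

Under this reading, a value of 1 leaves only `layer4` trainable and a value of 2 adds `layer3`, which corresponds to the two settings tried above.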

zhanghang1989 commented 3 years ago

The easiest way to train a keypoint detector with ResNeSt may be to use the d2 wrapper: https://github.com/zhanghang1989/ResNeSt/tree/master/d2

I would recommend trying that.

ztrobertyang commented 3 years ago

Hi,

Because of some development constraints, such as the data pipeline and the loss function, I have to use PyTorch, and I have never used Detectron2. If I use d2 to train KeypointRCNN on my data, do you think it can produce a weight file that can be loaded by the PyTorch Keypoint RCNN function? If so, I may try it.

zhanghang1989 commented 3 years ago

detectron2 is built on PyTorch. Its implementation is also similar to the torchvision one.

Training Resnest50 backbone in KeypointRCNN has large loss value?? #147