Closed: convnets closed this issue 3 years ago
It seems that the issue is this code snippet:
seg_label_copy = torch.squeeze(seg_label_tensor.clone())
bg_label = seg_label_copy.clone()
fg_label = seg_label_copy.clone()
bg_label[seg_label_copy != 0] = 255  # keep only background pixels; ignore everything else
fg_label[seg_label_copy == 0] = 255  # keep only foreground pixels; ignore everything else
bg_celoss = criterion(pred, bg_label.long().cuda())
fg_celoss = criterion(pred, fg_label.long().cuda())
celoss = bg_celoss + fg_celoss
Separating bg_celoss and fg_celoss makes training very unstable. I'm not sure why calling the criterion twice on the pred tensor would produce a NaN loss. I hope the author can explain why the code is designed this way. @zbf1991
Maybe in some cases there is no bg/fg seed mask in the pseudo label from the classification branch, which makes the prediction unstable. I have tried using a single CE loss, but the performance was not satisfactory. Maybe a larger batch size can make the model more stable. Sorry about that.
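A minimal sketch of the failure mode described above, assuming the criterion is `nn.CrossEntropyLoss(ignore_index=255)` (the class count and tensor shapes below are illustrative, not taken from the repo): when a pseudo label contains no fg (or no bg) seed at all, every target pixel equals the ignore index, so the mean reduction averages over zero valid pixels and returns NaN, which then poisons the summed loss.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=255)

# Hypothetical shapes: batch=1, 21 classes, 4x4 spatial map.
pred = torch.randn(1, 21, 4, 4)

# Pseudo label with no foreground seed: every pixel is the ignore index.
fg_label = torch.full((1, 4, 4), 255, dtype=torch.long)

# Mean over zero valid pixels -> 0/0 -> NaN loss (and NaN gradients).
fg_celoss = criterion(pred, fg_label)
print(fg_celoss)  # tensor(nan)
```

One common guard is to check that a label contains at least one non-ignored pixel before adding its loss term.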
I trained the network with train_cls_weight.py and initialized the model with the res38_cls.pth you provided, but the loss becomes NaN. Can you help check why?