wvalcke opened this issue 4 years ago (Open)
I have the same issue, haven't found a solution yet.
I think I just found a solution to this problem, see #174.
Training is now at iteration 200 and it works well so far. I hope this helps :)
Hi @xvyaward
Indeed, that was an error in the code. Thanks for pointing it out. The reason it only occurs sporadically is that it happens when the training set contains a 'hard negative': an image on which no object should be detected. When such an image passes through the network you get the error. A tensor value of 0 is added to the regression loss, which is logical since no detection is expected, but that tensor was not moved to the CUDA device when training on CUDA. After fixing this I trained again and did not see the error anymore. Can you confirm you have at least one hard negative sample in your dataset? Then this issue can be closed.
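To illustrate (a minimal sketch with made-up values, not the project code): mixing a CPU-resident zero loss into a list of CUDA loss tensors fails as soon as the list is reduced, which is why the error only shows up for batches containing a hard negative.

```python
import torch

# Minimal illustration of the mismatch: one loss tensor stays on the CPU
# while the others live on the GPU, so stacking them raises a device error.
if torch.cuda.is_available():
    regression_losses = [
        torch.tensor(1.5).cuda(),  # image with annotations: loss computed on the GPU
        torch.tensor(0).float(),   # hard negative: zero loss, accidentally left on the CPU
    ]
    try:
        torch.stack(regression_losses).mean()
    except RuntimeError as err:
        print(err)  # complains that the tensors are on different devices
```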
I used train2017 of the COCO dataset, and I'm not sure whether it contains hard negative samples. I guess all the images in train2017 have at least one annotation.
Thank you @wvalcke. I removed all hard negative samples, and it worked!
But can you help make the hard negative samples work too? I want to include them in training.
You need to check losses.py
```python
if torch.cuda.is_available():
    # Hard negative image: no annotations, so only the classification
    # (focal) loss is computed and the regression loss is a plain zero.
    alpha_factor = torch.ones(classification.shape).cuda() * alpha
    alpha_factor = 1. - alpha_factor
    focal_weight = classification
    focal_weight = alpha_factor * torch.pow(focal_weight, gamma)

    bce = -(torch.log(1.0 - classification))

    # cls_loss = focal_weight * torch.pow(bce, gamma)
    cls_loss = focal_weight * bce
    classification_losses.append(cls_loss.sum())

    # The zero regression loss must live on the same device as the other losses.
    regression_losses.append(torch.tensor(0).float().cuda())
```
The last line (around line 62) is probably missing the .cuda() at the end in your copy; fix this and you can train with hard negative samples. I did this and it works correctly.
To be precise, it's line 65 where the fix is needed
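A device-agnostic alternative (a sketch, not the code as it stands in the repository; the helper name is made up) is to create the zero loss on the same device as the classification output, so the same line works for both CPU and CUDA training:

```python
import torch

def zero_regression_loss(classification: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: build the scalar zero loss on the same device and
    # with the same dtype as the model output, so no explicit .cuda() is needed.
    return torch.tensor(0., device=classification.device, dtype=classification.dtype)

# Inside losses.py one could then write, instead of torch.tensor(0).float().cuda():
# regression_losses.append(zero_regression_loss(classification))
```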
Thank you~
During training I get a sporadic message: "input tensors must be on the same device. Received cpu and cuda:0". In the end it does not seem to have any influence on the training, which works correctly; I'm just wondering if somebody else has seen the same kind of warning. I checked the code and I cannot directly see anything wrong. If somebody has the same issue, or some ideas, let me know.
Kind regards