yhenon / pytorch-retinanet

Pytorch implementation of RetinaNet object detection.
Apache License 2.0

Sporadic message: input tensors must be on the same device. Received cpu and cuda:0 #197

Open wvalcke opened 4 years ago

wvalcke commented 4 years ago

During training I get a sporadic message: `input tensors must be on the same device. Received cpu and cuda:0`. In the end it does not seem to influence the training, which works correctly; I'm just wondering if somebody else has seen the same kind of warning. I checked the code and cannot directly see anything wrong. If somebody has the same issue, or some ideas, let me know.

Kind regards

xvyaward commented 4 years ago

I have the same issue, haven't found a solution yet.

xvyaward commented 4 years ago

I think I just found a solution to this problem, see #174

I'm working on iteration 200 and it works well so far. I hope this helps :)

wvalcke commented 4 years ago

Hi @xvyaward

Indeed, that was an error in the code. Thanks for pointing it out. The reason it only occurs sporadically is that it happens when the training set contains a 'hard negative' image, i.e. an image on which no object should be detected. When such an image passes through the network you get the error: a tensor value of 0 is added to the regression loss (which is logical, since no detection is expected), but it was not moved to the CUDA device when training on CUDA. After the fix I trained again and did not see the error anymore. Can you confirm that you also have at least one hard negative sample in your dataset? Then this issue can be closed.
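
For illustration, here is a minimal sketch (hypothetical names, not the actual losses.py code) of how the mismatch shows up: the zero regression loss appended for a hard-negative image stays on the CPU while the losses of the other images live on cuda:0.

    import torch

    if torch.cuda.is_available():
        per_image_losses = [
            torch.rand(10).sum().cuda(),  # loss of a normal image, already on cuda:0
            torch.tensor(0).float(),      # placeholder for a hard negative, still on cpu
        ]
        # torch.stack(per_image_losses)  # raises "input tensors must be on the same device"
        per_image_losses[1] = per_image_losses[1].cuda()   # the fix: move the zero to cuda:0
        total_loss = torch.stack(per_image_losses).mean()  # now works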

xvyaward commented 4 years ago

I used the train2017 split of the COCO dataset; I'm not sure whether it contains hard negative samples. I guess all the images in train2017 have at least one annotation.

ccl-private commented 3 years ago

Thank you @wvalcke. I removed all hard negative samples, and it worked!

But can you help make the hard negative samples work too? I want to train with these hard negative samples included.

wvalcke commented 3 years ago

You need to check losses.py

                if torch.cuda.is_available():
                    alpha_factor = torch.ones(classification.shape).cuda() * alpha

                    alpha_factor = 1. - alpha_factor
                    focal_weight = classification
                    focal_weight = alpha_factor * torch.pow(focal_weight, gamma)

                    bce = -(torch.log(1.0 - classification))

                    # cls_loss = focal_weight * torch.pow(bce, gamma)
                    cls_loss = focal_weight * bce
                    classification_losses.append(cls_loss.sum())
                    regression_losses.append(torch.tensor(0).float().cuda())  # the trailing .cuda() here is the fix

In your copy, the last line (around line 62) is probably missing the .cuda() at the end; fix this and you can train with hard negative samples. I did this and it works correctly.

wvalcke commented 3 years ago

To be precise, it's line 65 where the fix is needed.
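
As a side note, a device-agnostic way to write that line (an untested sketch, not what the repository currently does) is to create the zero tensor on the same device as the classification output, so the same code path works on CPU and GPU:

    # hypothetical alternative to the explicit .cuda() call in losses.py
    regression_losses.append(torch.tensor(0., device=classification.device))

This avoids hard-coding CUDA while still keeping all per-image losses on one device.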

ccl-private commented 3 years ago

Thank you~