Closed zhiyi-set closed 1 year ago
Hi Zhiyi,
Did you find the cause behind this error or any solution?
Thanks!
@Zaid-Hameed No, I tried different seeds/initializations to train the model, and I got a numerical loss only once or twice; most of the time it showed NaN loss. I didn't figure out the reason, but you can try that approach.
@Zhiyi-Dong-6 In my experiments, this issue occurs with PyTorch releases >= 1.13 and is most likely caused by a change in the KLDivLoss implementation in PyTorch. A very simple workaround is to add a small epsilon (~1e-8), or clamp the target probabilities to at least this epsilon, in `trades.py` like this:

```python
EPS = 1e-8  # small constant to keep the target probabilities away from exact zero
loss_robust = (1.0 / batch_size) * criterion_kl(
    F.log_softmax(model(x_adv), dim=1),
    torch.clamp(F.softmax(model(x_natural), dim=1), min=EPS))
```

This solves the issue. For details on this PyTorch issue, please check out this link: https://github.com/pytorch/pytorch/issues/89558
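For reference, the clamping pattern can be reproduced in isolation, outside of TRADES. This is a minimal sketch, not the repository's code: the batch size, number of classes, and the `EPS` value are illustrative assumptions, and random logits stand in for the model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EPS = 1e-8  # illustrative epsilon, matching the ~1e-8 suggested above

criterion_kl = torch.nn.KLDivLoss(reduction="sum")

# Toy logits standing in for model(x_adv) and model(x_natural).
# A very large negative logit makes one target softmax probability
# underflow to exactly 0.0, the situation the workaround guards against.
logits_adv = torch.randn(4, 10)
logits_nat = torch.randn(4, 10)
logits_nat[:, 0] = -1e9

target = F.softmax(logits_nat, dim=1)

# Clamped version: no exact zeros remain in the target distribution,
# so the KL term stays finite.
loss_clamped = criterion_kl(F.log_softmax(logits_adv, dim=1),
                            torch.clamp(target, min=EPS))
print(torch.isfinite(loss_clamped).item())  # → True
```

Note that `KLDivLoss` expects its first argument in log-space and the target in probability space, which is why only the target (the natural-example softmax) is clamped.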
Hope it helps!
@Zaid-Hameed Thanks!
Thank you for providing the code, it is excellent. But I ran into a problem: when I run `train_trades_cifar10.py` using ResNet, the training loss is always NaN.
What I modified is the epsilon size; I used epsilon = 2/255, 4/255, 8/255, and 16/255. Do you know the reason? Thank you again.