Closed zhiyi-set closed 1 year ago
Hi Zhiyi,
Did you find the cause behind this error or any solution?
Thanks!
@Zaid-Hameed No, I tried different seeds/initializations to train the model, and I got a numerical loss only once or twice; most of the time it showed NaN loss. I didn't figure out the reason, but you can try that approach.
@Zhiyi-Dong-6 In my experiments, this issue occurs with PyTorch releases >= 1.13 and is most likely caused by a change in the KLDivLoss implementation in PyTorch. A very simple workaround is to add a small epsilon (~1e-8), or clamp the target probabilities to at least this epsilon, in `trades.py` like this:

```python
EPS = 1e-8  # small constant to keep the target probabilities away from exact zero
loss_robust = (1.0 / batch_size) * criterion_kl(
    F.log_softmax(model(x_adv), dim=1),
    torch.clamp(F.softmax(model(x_natural), dim=1), min=EPS))
```

This solves the issue. For details on this PyTorch issue, please check out this link: https://github.com/pytorch/pytorch/issues/89558
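For reference, the clamping pattern can be reproduced in isolation, outside of TRADES. This is a minimal sketch, not the repository's code: the batch size, number of classes, and the `EPS` value are illustrative assumptions, and random logits stand in for the model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EPS = 1e-8  # illustrative epsilon, matching the ~1e-8 suggested above

criterion_kl = torch.nn.KLDivLoss(reduction="sum")

# Toy logits standing in for model(x_adv) and model(x_natural).
# A very large negative logit makes one target softmax probability
# underflow to exactly 0.0, the situation the workaround guards against.
logits_adv = torch.randn(4, 10)
logits_nat = torch.randn(4, 10)
logits_nat[:, 0] = -1e9

target = F.softmax(logits_nat, dim=1)

# Clamped version: no exact zeros remain in the target distribution,
# so the KL term stays finite.
loss_clamped = criterion_kl(F.log_softmax(logits_adv, dim=1),
                            torch.clamp(target, min=EPS))
print(torch.isfinite(loss_clamped).item())  # → True
```

Note that `KLDivLoss` expects its first argument in log-space and the target in probability space, which is why only the target (the natural-example softmax) is clamped.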
Hope it helps!
@Zaid-Hameed Thanks!
Thank you for providing the code, it is excellent. But I ran into a problem: when I run `train_trades_cifar10.py` using ResNet, the training loss is always NaN.
What I modified is the epsilon size; I used epsilon = 2/255, 4/255, 8/255, and 16/255. Do you know the reason? Thank you again.