yuanyao366 / PRP

Apache License 2.0

Loss NaN error #4

Closed AKASH2907 closed 3 years ago

AKASH2907 commented 3 years ago

Hi,

I was running your code, and after a few epochs the loss became NaN. The log below is from epoch 99, but the NaN values first appeared around epoch 5.


Epoch:[99][200/278] data_time:0.128,batch time:1.571 loss:nan loss_recon:nan loss_class:nan accuracy:27.125
[TRAIN] loss_cls: nan, acc: 0.266
tensor([2367., 0., 0., 0.])
tensor([2367., 2146., 2168., 2215.])
tensor([1., 0., 0., 0.])
33%|3 | 99/300 [27:19:23<56:29:50, 1011.89s/it]
[VAL] loss_cls: nan, acc: 0.292
tensor([467., 0., 0., 0.])
tensor([467., 363., 385., 385.])
tensor([1., 0., 0., 0.])
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.

conv_lr:0.001 fc8_lr:0.010000000000000002
Epoch:[100][100/278] data_time:0.126,batch time:1.425 loss:nan loss_recon:nan loss_class:nan accuracy:26.969

I can't figure out what went wrong. What should I modify? I'm training on the Kinetics dataset.

When I trained on the UCF101 dataset, this didn't happen: I checked for 90 epochs, and the pretext-task accuracy kept increasing. Here it's stuck at about 26%.
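
A note on reading the log: the three tensors printed after each [TRAIN] and [VAL] line look like per-class correct counts, per-class sample counts, and per-class accuracy for the four pretext classes. The thread does not confirm this, so treat it as an assumption. Under that reading, the reported accuracy is simply the share of samples belonging to class 0, which is what one would expect if the network predicts class 0 for every clip once the loss is NaN. A minimal sketch of the arithmetic, with the numbers copied from the log and hypothetical variable names:

```python
# Per-class counts copied from the [TRAIN] and [VAL] lines of the log.
# Assumption: first tensor = correct predictions per class,
# second tensor = number of samples per class.
train_correct = [2367.0, 0.0, 0.0, 0.0]
train_total = [2367.0, 2146.0, 2168.0, 2215.0]
val_correct = [467.0, 0.0, 0.0, 0.0]
val_total = [467.0, 363.0, 385.0, 385.0]


def overall_accuracy(correct, total):
    """Fraction of all samples that were classified correctly."""
    return sum(correct) / sum(total)


train_acc = overall_accuracy(train_correct, train_total)
val_acc = overall_accuracy(val_correct, val_total)
print(f"{train_acc:.3f} {val_acc:.3f}")  # 0.266 0.292
```

Both values match the `acc` fields in the log (0.266 and 0.292), so the accuracy of about 26% probably reflects a collapsed prediction rather than any learning.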

yuanyao366 commented 3 years ago

Yes, this error occurs when training on the Kinetics-400 dataset. I have updated the code to fix it.
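
For anyone hitting a similar problem, it can help to stop training at the first non-finite loss instead of letting it run for dozens of epochs. The sketch below is not the fix applied in this repository (the thread does not describe that change); it is a generic guard in plain Python. The helper names `find_non_finite` and `check_step` and the sample loss values are hypothetical; only the term names `loss_recon` and `loss_class` come from the log.

```python
import math


def find_non_finite(losses):
    """Return the names of loss terms whose value is NaN or +/-Inf."""
    return [name for name, value in losses.items() if not math.isfinite(value)]


def check_step(epoch, step, losses):
    """Raise at the first step where any loss term stops being finite."""
    bad = find_non_finite(losses)
    if bad:
        raise FloatingPointError(
            f"non-finite loss at epoch {epoch}, step {step}: {', '.join(bad)}"
        )


# Hypothetical values: a healthy step passes silently.
check_step(4, 10, {"loss_recon": 0.83, "loss_class": 1.21})

# A step with a NaN term is reported together with its location.
try:
    check_step(5, 0, {"loss_recon": float("nan"), "loss_class": 1.2})
except FloatingPointError as err:
    message = str(err)
    print(message)
```

In a real training loop the dictionary would be filled with the scalar value of each loss term (for example via `.item()` on a PyTorch tensor), and the batch that triggered the error can then be inspected directly.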

AKASH2907 commented 3 years ago

OK, thanks for the update. I'll check it out.