yuanyao366 / PRP


sample step manipulation #6

Closed: AKASH2907 closed this issue 3 years ago

AKASH2907 commented 3 years ago

If I change the sample step from 1,2,4,8 to 1,2,4 or 1,2, do I need to modify some lines in pat_region.py? If so, can you point them out? I changed the sample step list, and the loss NaN error starts appearing again in each epoch at some iteration over a batch. I'm using the Kinetics dataset for pre-training.

yuanyao366 commented 3 years ago

No, you needn't. Did you modify num_classes in train_predict.py (line 286) according to the sample step list?
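
For illustration, a minimal sketch of that change (the variable names here are guesses, not the repo's exact code):

```python
sample_steps = [1, 2, 4]         # changed from [1, 2, 4, 8]
num_classes = len(sample_steps)  # the value at train_predict.py line 286
                                 # should match: 3 here, 2 for [1, 2]
```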

AKASH2907 commented 3 years ago

The error is still present, but it's happening only during validation; the training loop is working fine.

The tensor value at index 3 (the 4th element) is NaN in every validation pass.

Epoch: [5][100/100] data_time:0.025, batch time:0.311 loss:0.13519 loss_recon:0.02778 loss_class:1.07413 accuracy:43.625
[VAL] loss_cls: 1.074, acc: 0.436
tensor([373., 302., 23., 0.])
tensor([579., 500., 521., 0.])
tensor([0.6442, 0.6040, 0.0441, nan])
WARNING:root:NaN or Inf found in input tensor. (repeated 10 times)

AKASH2907 commented 3 years ago

There's one additional change needed at lines 96, 97, 193, and 194 when the sample step changes from 1,2,4,8 to 1,2,4 or 1,2: the size of the torch.zeros tensors must be set to the number of classes, i.e. the length of the sample step list (3 and 2 respectively for the above cases). I don't think it has much effect on training itself, but it stops the NaN warnings from popping up in the training logs. A sketch of the fix follows below.
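
A minimal sketch of that fix, assuming the counters at those lines look roughly like this (names are taken from the later comment in this thread; the exact code may differ):

```python
import torch

sample_steps = [1, 2, 4]          # was [1, 2, 4, 8]
num_classes = len(sample_steps)

# Before: hard-coded for four playback rates (lines 96/97 and 193/194)
# correct_clas_cnt = torch.zeros(4)
# total_cls_cnt = torch.zeros(4)

# After: sized to match the current sample step list
correct_clas_cnt = torch.zeros(num_classes)
total_cls_cnt = torch.zeros(num_classes)
```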

yuanyao366 commented 3 years ago

If you modify lines 96, 97, 193, 194, and 286 of train_predict.py according to your sample step, is there still a loss NaN?

AKASH2907 commented 3 years ago

No. The loss NaN was generated because the last element was zero for both correct_clas_cnt and total_cls_cnt, and 0/0 = NaN. If we modify the size of torch.zeros according to the length of the sample step list, training runs fine without any warnings.
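
As a standalone illustration of the failure mode (using the counts from the validation log above, not the repo's code):

```python
import torch

# Counters sized for 4 classes, but only 3 playback rates are ever sampled,
# so the last slot of both tensors stays at zero.
correct_clas_cnt = torch.tensor([373., 302., 23., 0.])
total_cls_cnt = torch.tensor([579., 500., 521., 0.])

per_class_acc = correct_clas_cnt / total_cls_cnt
print(per_class_acc)
# tensor([0.6442, 0.6040, 0.0441, nan]) -- the 0/0 slot yields NaN,
# which then triggers the "NaN or Inf found in input tensor" warnings.
```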