Problem in training the ResNet-18 model: Loss-nan

luisfernandes9 commented 2 years ago

Hello, I am trying to reproduce the results from the paper, but during training of the ResNet-18 model the loss values suddenly become NaN (see the attached figure). I think this is because, after some time, the gradient matrix also becomes NaN. Any help in resolving this issue would be greatly appreciated. Thanks :)

[Image: error]
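One way to narrow down where the NaN first appears is to stop at the first non-finite loss instead of letting training continue. The following is a minimal sketch in plain Python, not code from the repository: `first_non_finite` is a hypothetical helper, and it assumes the per-step loss has already been converted to a Python float (for example with `loss.item()` in PyTorch).

```python
import math
from typing import Iterable, Optional


def first_non_finite(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Illustrative values only, not taken from an actual training run.
history = [2.31, 1.87, 1.52, float("nan"), float("nan")]
bad_step = first_non_finite(history)
if bad_step is not None:
    print(f"loss became non-finite at step {bad_step}")
```

Knowing the first bad step makes it possible to inspect the batch and the gradients right before the failure, rather than after every value has already turned into NaN.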

zhiCHEN96 commented 2 years ago

Thanks a lot for reporting the problem. Could you provide more details? For example, which script were you running, which layer did you apply CW to, and which concepts were chosen?

luisfernandes9 commented 2 years ago

Hello, I used the script train_places.py with the following line added:

args = parser.parse_args(["--ngpu", "1", "--workers", "4", "--arch", "resnet_cw", "--depth", "18", "--epochs", "200", "--batch-size", "64", "--lr", "0.05", "--whitened_layers", "5", "--concepts", "airplane,bed,person", "--prefix", "RESNET18_PLACES365_CPT_WHITEN_TRANSFER", "data_256"])

I am using Spyder to run the scripts, which is why I added this line. The error does not always appear when CW is applied to a single layer, but it always appears when CW is applied to more than one layer. For example, changing the code to "--whitened_layers", "1,5,7,8" results in the following error.

[Image: error 2]

zhiCHEN96 commented 2 years ago

Thanks. Actually, our current code CANNOT handle multiple CW layers; the model can only have one CW layer. I've updated the README file to make this clear.
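Given that restriction, a guard on the `--whitened_layers` value could fail fast instead of letting training run into NaN. This is a minimal sketch, not code from the repository: the parser below is a hypothetical stand-in that defines only this one flag, and `parse_single_layer` is an illustrative name. It assumes the flag takes a comma-separated string, as in the commands above.

```python
import argparse


def parse_single_layer(value: str) -> int:
    """Parse a comma-separated layer list and require exactly one entry."""
    layers = [int(item) for item in value.split(",") if item.strip()]
    if len(layers) != 1:
        raise argparse.ArgumentTypeError(
            f"expected exactly one whitened layer, got {len(layers)}: {value!r}"
        )
    return layers[0]


# Hypothetical stand-in parser; the real script defines many more options.
parser = argparse.ArgumentParser()
parser.add_argument("--whitened_layers", type=parse_single_layer, default=None)

args = parser.parse_args(["--whitened_layers", "5"])
print(args.whitened_layers)
```

With this check in place, passing "1,5,7,8" makes argparse exit with a usage error before any training starts.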

zhiCHEN96 / ConceptWhitening

Problem in training the ResNet-18 model: Loss-nan #12