Runtime error when running 21st epoch

Kiwis2012 commented 3 years ago

Hello, I was running the script with 'python train.py' and it was interrupted by this runtime error:

[W python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
  File "train.py", line 527, in <module>
    main()
  File "train.py", line 190, in main
    train_loss = train(args,dataloader['train'], enc, dec,cl,disc_l,disc_v,
  File "train.py", line 366, in train
    logits_Dl_l1 = disc_l(disc_l_l1)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pankiwi/src/OCGAN-Pytorch/ocgan/networks.py", line 164, in forward
    output = self.dense_5(output)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
 (function _print_stack)
Traceback (most recent call last):
  File "train.py", line 527, in <module>
    main()
  File "train.py", line 190, in main
    train_loss = train(args,dataloader['train'], enc, dec,cl,disc_l,disc_v,
  File "train.py", line 421, in train
    loss_ae_all.backward()
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead.
Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

It looks like some variable was modified in place before backward() was called. Do you have any idea what causes this? Thank you in advance.

xiehousen commented 3 years ago

Try adding this statement: torch.autograd.set_detect_anomaly(True). Also, which version of PyTorch are you using? Can you try pytorch-1.2?
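For readers unfamiliar with anomaly detection, here is a minimal sketch of what the statement does. It is a toy example, not code from train.py: the tensors x and y are made up for illustration. With the mode enabled, PyTorch records the forward traceback of each operation, so a failing backward call also prints where the offending tensor was produced.

```python
import torch

# Record forward tracebacks so a failing backward can point at its origin.
torch.autograd.set_detect_anomaly(True)

x = torch.ones(3, requires_grad=True)
y = x.exp()      # exp keeps its output around for the backward pass
y.add_(1.0)      # in-place change invalidates that saved output

try:
    y.sum().backward()
    message = None
except RuntimeError as err:
    message = str(err)
finally:
    # Anomaly mode slows training down, so switch it off after debugging.
    torch.autograd.set_detect_anomaly(False)

print(message)
```

The extra warning with the forward traceback goes to stderr, which is why the report in this thread starts with a "[W python_anomaly_mode.cpp ...]" block before the usual Python traceback.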

Kiwis2012 commented 3 years ago

Thanks for the reply, but sorry, I can't try pytorch-1.2. I'm using Python 3.8.5 with torch 1.7.1 and torchvision 0.8.2, but I will try an older Python later. The message above is what I got after setting torch.autograd.set_detect_anomaly(True). Without that statement, the message is:

Traceback (most recent call last):
  File "train.py", line 527, in <module>
    main()
  File "train.py", line 190, in main
    train_loss = train(args,dataloader['train'], enc, dec,cl,disc_l,disc_v,
  File "train.py", line 421, in train
    loss_ae_all.backward()
  File "/home/pankiwi/anaconda3/envs/ocgan_torch_test/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pankiwi/anaconda3/envs/ocgan_torch_test/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead.
Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

xiehousen commented 3 years ago

You can try the following two solutions:

You can set inplace=False in nn.ReLU and nn.LeakyReLU.
You can use .clone() in the loss_ae_all part (some of the losses are added up there).

If there is any result, we can continue to talk about it.

Kiwis2012 commented 3 years ago

Hello, I tried 'inplace=False' in all ReLUs and tried some .clone() calls, but neither worked. In the end I changed part of the code in train.py, like this:

    l1_copy = enc(inputs + n)
    logits_Dl_l1_copy = disc_l(l1_copy.view(l1_copy.size(0),32,3,3))
    ones_logits_Dl_l1 = Variable(Tensor(logits_Dl_l1.shape[0], 1).fill_(1.0), requires_grad=False)
    loss_AE_l = criterion_ce(logits_Dl_l1_copy,ones_logits_Dl_l1)

The original code was:

    ones_logits_Dl_l1 = Variable(Tensor(logits_Dl_l1.shape[0], 1).fill_(1.0), requires_grad=False)
    loss_AE_l = criterion_ce(logits_Dl_l1,ones_logits_Dl_l1)

At least it's running now.

What do you think about it? Will it cause any potential problems? Thank you.

xiehousen commented 3 years ago

How are the results? And how did you get l1_copy? Is l1_copy the same as l1?

Kiwis2012 commented 3 years ago

Yep, so far the best accuracy in testing is 0.9717. l1_copy is computed as: l1_copy = enc(inputs + n)

pbruneau commented 2 years ago

Sorry to pile on to this one, but I did a bit of additional digging to understand exactly what went wrong. If we look at the initial stack trace, the error happens when loss_ae_all.backward() computes gradients through the disc_l model. The problem is that we have the following sequence:

logits_Dl_l1 is computed using disc_l (l363)
disc_l model is updated (l374)
loss_AE_l uses logits_Dl_l1 (l408), and is then combined into loss_ae_all.

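So the logits_Dl_l1 tensor used on l408 was computed with disc_l weight values that have been updated in place in the meantime (l374). The autograd graph of logits_Dl_l1 still refers to the weights as they were at l363, so backward detects the version mismatch and raises the error. The [16, 1] tensor named in the message, "output 0 of TBackward", is the transposed weight of a linear layer in disc_l.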
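The "is at version 2; expected version 1" part of the message refers to a per-tensor version counter that autograd uses to detect this situation. A small sketch of how an optimizer step bumps it, assuming a stand-in nn.Linear(16, 1) and plain SGD rather than the actual disc_l and its optimizer; note that _version is an internal attribute, used here only for inspection:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one linear layer of disc_l.
layer = nn.Linear(16, 1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

version_before = layer.weight._version

# One ordinary training step: forward, backward, in-place parameter update.
layer(torch.randn(8, 16)).sum().backward()
opt.step()

version_after = layer.weight._version
print(version_before, version_after)
```

Any graph built before opt.step() saved the weight at the earlier version, which is what backward later compares against.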

In the end, the solution by @Kiwis2012 works, but it is slightly overkill. The minimal change would be to insert:

    logits_Dl_l1 = disc_l(disc_l_l1)

just before l408, so that logits_Dl_l1 is refreshed w.r.t. the updated disc_l parameters.
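To make the sequence concrete, here is a self-contained sketch that reproduces the error and the one-line fix on CPU. Everything in it is a simplified assumption: enc and disc are single nn.Linear layers rather than the OCGAN networks, the optimizer is SGD, and the function name run and its refresh_logits flag are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the encoder and the latent discriminator.
enc = nn.Linear(4, 16)
disc = nn.Linear(16, 1)
opt_disc = torch.optim.SGD(disc.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

x = torch.randn(8, 4)
ones = torch.ones(8, 1)
zeros = torch.zeros(8, 1)


def run(refresh_logits):
    """Return None if backward succeeds, else the RuntimeError message."""
    latent = enc(x)
    logits = disc(latent)                  # 1. logits computed with disc

    opt_disc.zero_grad()
    loss_d = criterion(disc(latent.detach()), zeros)
    loss_d.backward()
    opt_disc.step()                        # 2. disc weights updated in place

    if refresh_logits:
        logits = disc(latent)              # the fix: recompute after the update

    loss_ae = criterion(logits, ones)      # 3. loss built from the logits
    try:
        loss_ae.backward()
        return None
    except RuntimeError as err:
        return str(err)


stale = run(refresh_logits=False)
fresh = run(refresh_logits=True)
print("stale logits:", stale)
print("refreshed logits:", fresh)
```

With stale logits the backward pass should fail with the same "modified by an inplace operation" message as above, and with refreshed logits it goes through. Recomputing only the discriminator output is cheaper than also re-running the encoder, which is why it is the smaller change.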

xiehousen / OCGAN-Pytorch
