Try adding this line:

```python
torch.autograd.set_detect_anomaly(True)
```

Also, what version of PyTorch are you using? Can you try pytorch-1.2?
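For reference, a minimal sketch of both suggestions; placing the flag near the top of train.py is an assumption, any point before training starts works:

```python
import torch

# Check which PyTorch build is installed.
print(torch.__version__)

# Enable anomaly detection before any forward/backward runs; autograd then
# records the forward stack trace of the op that later fails in backward.
# It slows training down, so treat it as a debugging-only switch.
torch.autograd.set_detect_anomaly(True)
```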
Thanks for the reply, but sorry, I can't try pytorch-1.2; I'm using Python 3.8.5 with torch 1.7.1 and torchvision 0.8.2. I will try an older version later.
And that was the message after I set torch.autograd.set_detect_anomaly(True); without that line, the message would be like:

```
Traceback (most recent call last):
  File "train.py", line 527, in <module>
```
You can try the following two solutions: set `inplace=False` in the ReLU layers, and call `.clone()` on the tensors that are later modified in place.
If there is any result, we can continue to talk about it.
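A minimal sketch of both ideas (the layer and tensor names are illustrative, not from this repo):

```python
import torch
import torch.nn as nn

# Solution 1: non-in-place ReLU, so activations needed for backward
# are not overwritten.
relu = nn.ReLU(inplace=False)

# Solution 2: clone a tensor before anything might modify it in place,
# so autograd keeps an untouched copy in the graph.
x = torch.randn(16, 1, requires_grad=True)
y = relu(x)
y_safe = y.clone()  # gradients still flow from y_safe back to x
y_safe.sum().backward()
```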
Hello, I tried `inplace=False` in all the ReLUs and tried some `.clone()` calls, but they didn't work. In the end I changed a part of the code in train.py, like this:

```python
l1_copy = enc(inputs + n)
logits_Dl_l1_copy = disc_l(l1_copy.view(l1_copy.size(0), 32, 3, 3))
ones_logits_Dl_l1 = Variable(Tensor(logits_Dl_l1.shape[0], 1).fill_(1.0), requires_grad=False)
loss_AE_l = criterion_ce(logits_Dl_l1_copy, ones_logits_Dl_l1)
```

The original code was:

```python
ones_logits_Dl_l1 = Variable(Tensor(logits_Dl_l1.shape[0], 1).fill_(1.0), requires_grad=False)
loss_AE_l = criterion_ce(logits_Dl_l1, ones_logits_Dl_l1)
```

and at least it's running now.
What do you think about it? Will it cause any potential problem? Thank you.
How about the result? And how did you get `l1_copy`? Is `l1_copy` the same as `l1`?
Yep, so far the best test accuracy is 0.9717, and `l1_copy = enc(inputs + n)`.
Sorry to add to this one, but I did a bit of additional backtracking to understand exactly what went wrong. If we look at the initial stack trace, the error happens when `loss_ae_all.backward()` computes gradients through the `disc_l` model. The problem is that we have the following sequence:

1. `logits_Dl_l1` is computed using `disc_l` (l363),
2. the `disc_l` model is updated (l374),
3. `loss_AE_l` uses `logits_Dl_l1` (l408), then is combined into `loss_ae_all`.

So the `logits_Dl_l1` tensor on l408 depends on `disc_l` weight values which have been updated in the meantime, and naturally `backward` is not happy when it tries to compute the gradients.

In the end, the solution by @Kiwis2012 works, but it is slightly overkill. The minimal change would be to insert

```python
logits_Dl_l1 = disc_l(disc_l_l1)
```

just before l408, so that `logits_Dl_l1` is refreshed w.r.t. the updated `disc_l` parameters.
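To make that failure mode concrete, here is a self-contained sketch of the same sequence; all names are illustrative stand-ins, not the repo's actual code:

```python
import torch
import torch.nn as nn

disc_l = nn.Linear(288, 1)
opt = torch.optim.SGD(disc_l.parameters(), lr=0.1)

feats = torch.randn(16, 288, requires_grad=True)
logits_Dl_l1 = disc_l(feats)     # step 1: logits computed (cf. l363)

disc_l(feats).mean().backward()  # some discriminator loss
opt.step()                       # step 2: disc_l updated in place (cf. l374)
opt.zero_grad()

# Step 3: backward through the stale logits fails, because the gradient
# w.r.t. the input of Linear needs the weight values saved at step 1,
# and those have since been overwritten in place:
#   logits_Dl_l1.mean().backward()  # RuntimeError: ... inplace operation

# The minimal fix: recompute the logits with the updated parameters
# (cf. inserting the refresh just before l408).
logits_Dl_l1 = disc_l(feats)
logits_Dl_l1.mean().backward()   # works
```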
Hello, I was trying to run the script with `python train.py` and was interrupted by this runtime error:
```
[W python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward. Traceback of forward call that caused the error:
  File "train.py", line 527, in <module>
    main()
  File "train.py", line 190, in main
    train_loss = train(args,dataloader['train'], enc, dec,cl,disc_l,disc_v,
  File "train.py", line 366, in train
    logits_Dl_l1 = disc_l(disc_l_l1)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pankiwi/src/OCGAN-Pytorch/ocgan/networks.py", line 164, in forward
    output = self.dense_5(output)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
 (function _print_stack)
```
```
Traceback (most recent call last):
  File "train.py", line 527, in <module>
    main()
  File "train.py", line 190, in main
    train_loss = train(args,dataloader['train'], enc, dec,cl,disc_l,disc_v,
  File "train.py", line 421, in train
    loss_ae_all.backward()
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/pankiwi/anaconda3/envs/kiwi_base/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
It looks like some variables were modified before backward was called? Do you have any idea about this? Thank you in advance~~
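As a side note on the error message itself: the "is at version 2; expected version 1" part refers to autograd's per-tensor version counter, which every in-place operation increments. A tiny illustration, using the internal `_version` attribute purely for demonstration:

```python
import torch

w = torch.randn(16, 1)
print(w._version)  # 0
w.fill_(1.0)       # in-place write bumps the version counter
print(w._version)  # 1
```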