weird error when training on gpu

wiseodd / controlled-text-generation

Reproducing Hu et al. (ICML 2017), "Toward Controlled Generation of Text"

BSD 3-Clause "New" or "Revised" License


weird error when training on gpu #14

Closed edchengg closed 6 years ago

edchengg commented 6 years ago

Traceback (most recent call last):
  File "train_discriminator.py", line 308, in <module>
    main()
  File "train_discriminator.py", line 239, in main
    loss_G.backward()
  File "/home-nfs/yangc1/anaconda3/envs/speech/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home-nfs/yangc1/anaconda3/envs/speech/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: backward_input can only be called in training mode

Hi, this code works fine on CPU. However, the error above occurs during GPU training. I have checked that the model is in training mode as well.

wiseodd commented 6 years ago

It runs just fine in my env (PyTorch 0.3) on GPU. I suspect this is caused by PyTorch 0.4.

edchengg commented 6 years ago

Thanks for your reply! Could you please look at the following response from a PyTorch team member? https://github.com/pytorch/pytorch/issues/7961#event-1654524045

I don't see that there is a bug, since it runs fine on CPU in 0.4.

wiseodd commented 6 years ago

Maybe adding model.train() below this line (also in train_discriminator.py) will do. https://github.com/wiseodd/controlled-text-generation/blob/a09ad1d9272b9e40fc0084ad1779a6d44f07862d/train_vae.py#L58

I do not have PyTorch 0.4, so I cannot test it. Would you mind trying it?
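
For reference, a minimal sketch of what the suggestion above toggles. `model.train()` and `model.eval()` only flip the `training` flag on a module and its submodules. The small `nn.GRU` here is a hypothetical stand-in, not the repository's model, and the example runs on CPU, so it illustrates the flag rather than reproducing the GPU error.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the repo's model: a single small GRU layer.
model = nn.GRU(input_size=4, hidden_size=8)

model.eval()                      # sets training=False on the module and its children
flag_after_eval = model.training

model.train()                     # sets training=True again
flag_after_train = model.training

# With the module back in training mode, run a forward and backward pass.
x = torch.randn(5, 2, 4)          # (seq_len, batch, input_size)
out, _ = model(x)
out.sum().backward()

# Every parameter should now hold a gradient of its own shape.
grads_ok = all(
    p.grad is not None and p.grad.shape == p.shape for p in model.parameters()
)
```

If the error still appears after such a call, it may be that some later `model.eval()` switches the module back before `backward()` runs.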

TobiasLee commented 6 years ago

@wiseodd I've hit this problem too, even though I followed your instruction and added model.train() below the loop. It occurs with PyTorch 0.4 on GPU, but works fine on CPU.

edchengg commented 6 years ago

@TobiasLee I removed all the model.eval() calls and it works fine on GPU. Maybe the sleep-wake algorithm needs to be implemented with a different approach. Anyway, the generated results seem OK to me. See my NLP project repo for the results.
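
One possible alternative to dropping `model.eval()` entirely, sketched under assumptions: if the goal of `eval()` was to keep part of the network from being updated, its parameters can be frozen with `requires_grad_(False)` while the module stays in training mode. That way the backward pass through the RNN is presumably still permitted on GPU. The names `upstream` and `frozen_rnn` are hypothetical and do not come from the repository.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins: a trainable layer feeding an RNN that should not be updated.
upstream = nn.Linear(4, 4).to(device)
frozen_rnn = nn.GRU(input_size=4, hidden_size=8).to(device)

# Freeze the RNN weights instead of calling frozen_rnn.eval().
for p in frozen_rnn.parameters():
    p.requires_grad_(False)
frozen_rnn.train()                # the module itself stays in training mode

x = torch.randn(5, 2, 4, device=device)   # (seq_len, batch, input_size)
out, _ = frozen_rnn(upstream(x))
out.sum().backward()              # gradients flow through the RNN to `upstream`

upstream_has_grad = all(p.grad is not None for p in upstream.parameters())
rnn_has_no_grad = all(p.grad is None for p in frozen_rnn.parameters())
```

Note that training mode also keeps layers such as dropout active, so this is not a drop-in replacement for `eval()` when the model uses them. The GRU in this sketch has no dropout.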