wengong-jin / icml18-jtnn

Junction Tree Variational Autoencoder for Molecular Graph Generation (ICML 2018)
MIT License
505 stars · 189 forks

pretrain molopt or molvae not working #5

Open thegodone opened 6 years ago

thegodone commented 6 years ago

Try this command in molvae or molopt:

CUDA_VISIBLE_DEVICES=0 python pretrain.py --train ../data/train.txt --vocab ../data/vocab.txt \
    --hidden 450 --depth 3 --latent 56 --batch 40 \
    --save_dir pre_model/

For molvae:

  File "pretrain.py", line 69, in <module>
    loss, kl_div, wacc, tacc, sacc, dacc = model(batch, beta=0)
  File "/usr/lib64/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/orca/icml18-jtnn/jtnn/jtnn_vae.py", line 91, in forward
    loss = word_loss + topo_loss + assm_loss + 2 * stereo_loss + beta * kl_loss
RuntimeError: std::bad_alloc

For molopt:

  File "pretrain.py", line 69, in <module>
    loss, kl_div, wacc, tacc, sacc, dacc, pacc = model(batch, beta=0)
  File "/usr/lib64/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/orca/icml18-jtnn/jtnn/jtprop_vae.py", line 100, in forward
    loss = word_loss + topo_loss + assm_loss + 2 * stereo_loss + beta * kl_loss + prop_loss
RuntimeError: std::bad_alloc

My settings: RedHat 7.4 / rdkit 2018_03_1 / Tesla K40c card / numpy 1.14.2 / pyyaml 3.12 / torch 0.3.1 / CUDA compilation tools, release 9.0, V9.0.176.

Remarks: python reconstruct.py and python sample.py both work without issues.

thegodone commented 6 years ago

I have the same issue with a Pascal P100 card on Ubuntu 14.04 / Python 2.7 / CUDA 9.0 / rdkit 2018_03_1, installed on an NVIDIA DGX-1 server.

wengong-jin commented 6 years ago

Hi, I apologize for my late reply. I tested my code on Ubuntu 14.04 / Python 2.7 / CUDA 8.0 / pytorch 0.3.1 / rdkit 2017_09 with an NVIDIA TITAN X GPU, and it runs normally.

Since the error comes from the loss computation, perhaps something is wrong with CUDA? Could you print out the values of word_loss, topo_loss, etc. to see what the problem is?
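For reference, a minimal way to do that print check, assuming the term names shown in the tracebacks above (the dump_loss_terms helper below is hypothetical and not part of the repository; it would be called just before the loss sum in forward()):

```python
def dump_loss_terms(**terms):
    """Print each loss term's shape and value; a zero-element tensor here is the red flag."""
    for name, t in terms.items():
        data = t.data if hasattr(t, "data") else t  # torch 0.3.1 wraps losses in Variables
        print(name, tuple(data.size()), data)

# Example call inside forward():
# dump_loss_terms(word_loss=word_loss, topo_loss=topo_loss, assm_loss=assm_loss,
#                 stereo_loss=stereo_loss, kl_loss=kl_loss)
```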

pyeguy commented 6 years ago

I'm getting this too with Ubuntu 18.04 / Python 2.7 / cudatoolkit 8.0 / pytorch 0.3.1 / rdkit 2018.03.1.0.

It looks like stereo_loss is dropping out to a tensor of size 0 (see the printed output below the traceback). Could that be the problem? I'm a PyTorch noob...

word_loss
Variable containing:
 51.4962
[torch.cuda.FloatTensor of size 1 (GPU 0)]

topo_loss
Variable containing:
 13.5945
[torch.cuda.FloatTensor of size 1 (GPU 0)]

assm_loss
Variable containing:
 4.2711
[torch.cuda.FloatTensor of size 1 (GPU 0)]

stereo_loss
Variable containing:
 1.3863
[torch.cuda.FloatTensor of size 1 (GPU 0)]

beta
0
kl_loss
Variable containing:
 266.0024
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Traceback (most recent call last):
  File "pretrain.py", line 69, in <module>
    loss, kl_div, wacc, tacc, sacc, dacc = model(batch, beta=0)
  File "/home/cpye/anaconda3/envs/molvae/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cpye/pydev/icml18-jtnn/jtnn/jtnn_vae.py", line 104, in forward
    loss = word_loss + topo_loss + assm_loss + 2 * stereo_loss + beta * kl_loss 
RuntimeError: std::bad_alloc
word_loss
Variable containing:
 44.3971
[torch.cuda.FloatTensor of size 1 (GPU 0)]

topo_loss
Variable containing:
 11.4075
[torch.cuda.FloatTensor of size 1 (GPU 0)]

assm_loss
Variable containing:
 4.2267
[torch.cuda.FloatTensor of size 1 (GPU 0)]

stereo_loss
Variable containing:[torch.cuda.FloatTensor with no dimension]

beta
0
kl_loss
Variable containing:
 224.0390
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Oktai15 commented 6 years ago

@pyeguy @thegodone in jtnn_vae.py replace:

if len(labels) == 0: return create_var(torch.Tensor(0)), 1.0

with

if len(labels) == 0: return create_var(torch.Tensor([0])), 1.0
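
The difference between the two calls is where the empty tensor comes from: with an integer argument, torch.Tensor treats it as a size, while a list argument supplies values. A minimal sketch of the distinction (checked on recent PyTorch; the "[torch.cuda.FloatTensor with no dimension]" printout above suggests torch 0.3.1 behaves the same way for these constructors):

```python
import torch

# An integer argument is treated as a size, so this is an empty (zero-element)
# tensor -- the "FloatTensor with no dimension" seen in the debug output above:
empty = torch.Tensor(0)
print(empty.numel())   # 0

# A list argument supplies values, giving a proper one-element tensor holding 0:
zero = torch.Tensor([0])
print(zero.numel())    # 1
print(zero)            # tensor([0.])
```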

NamanChuriwala commented 5 years ago

in jtnn_vae.py replace:

if len(labels) == 0: return create_var(torch.Tensor(0)), 1.0

with

if len(labels) == 0: return create_var(torch.Tensor([0])), 1.0

Hi Oktai15, that works perfectly. Thanks for the response. May I know why this was causing the stereo function to return a null tensor value?
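
Roughly what appears to have been happening, based on the thread: when a batch contains no stereo candidates (len(labels) == 0), the buggy branch returned torch.Tensor(0), an empty tensor rather than a zero, and adding that empty tensor into the loss sum is what raised std::bad_alloc on torch 0.3.1. A sketch of the two cases (the numbers are placeholders; the no-crash behaviour shown in the comments is from recent PyTorch, which broadcasts instead of aborting):

```python
import torch

other_losses = torch.tensor([51.4962])  # stands in for word_loss + topo_loss + assm_loss + ...

# Buggy branch: an empty tensor where a zero was intended.
empty_stereo = torch.Tensor(0)
# On torch 0.3.1 this addition is the line that raised std::bad_alloc; on recent
# PyTorch it broadcasts to an empty result, silently discarding the loss:
print(other_losses + 2 * empty_stereo)  # tensor([])

# Fixed branch: a genuine one-element zero, so the sum is unaffected.
fixed_stereo = torch.Tensor([0])
print(other_losses + 2 * fixed_stereo)  # tensor([51.4962])
```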