wengong-jin / hgraph2graph

Hierarchical Generation of Molecular Graphs using Structural Motifs

README error #34

Open · muammar opened this issue 2 years ago

muammar commented 2 years ago

After you generate the vocabulary in the first step of the README,

python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt 

the next command should be:

python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single

Otherwise, you get the following error:

IndexError: tuple index out of range
orubaba commented 2 years ago

I have a question: how long does it take for the training to finish? I have been running "python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single" for a whole day and it has not completed. Is there something I am doing wrong?

muammar commented 2 years ago

> I have a question: how long does it take for the training to finish? I have been running "python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 16 --mode single" for a whole day and it has not completed. Is there something I am doing wrong?

That's not normal; it took a couple of hours for me. I had to change the number of CPUs used because it was exhausting the RAM on my workstation, and I have 256 GB of RAM.
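For reference, that only means adjusting the --ncpu flag; the value below is an arbitrary example, so pick whatever your RAM and core count allow:

python preprocess.py --train data/chembl/all.txt --vocab vocab.txt --ncpu 8 --mode single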

orubaba commented 2 years ago

Wow, thanks. I was relying on my 16 GB RAM laptop to do the work; it seems that was an ambitious thought. Now I see why I wasn't making any headway.

muammar commented 2 years ago

> Wow, thanks. I was relying on my 16 GB RAM laptop to do the work; it seems that was an ambitious thought. Now I see why I wasn't making any headway.

The ChEMBL dataset is huge, and I think the script is doing its job but keeping everything in memory, so at some point you will run out of RAM. There are libraries, like Dask, that let you work with processes requiring a huge amount of RAM, but you would need to implement that yourself. If you read the preprocess.py script, you will see it does a pickle.dump at the end of the preprocessing procedure; if you could find a way to write the results out earlier instead of waiting until the end, you could clear the garbage collector and free memory as you go.
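The sketch below only illustrates that idea; it is not the repo's actual code. process_batch is a hypothetical stand-in for whatever tensorization the script does per batch of molecules, and the chunked output files are my assumption:

    import gc
    import pickle

    def preprocess_in_chunks(smiles_list, process_batch, chunk_size=10000):
        # Dump each processed chunk to its own pickle file instead of
        # accumulating everything in memory and dumping once at the end.
        for i in range(0, len(smiles_list), chunk_size):
            batch = process_batch(smiles_list[i:i + chunk_size])
            with open(f"tensors-{i // chunk_size}.pkl", "wb") as fout:
                pickle.dump(batch, fout, pickle.HIGHEST_PROTOCOL)
            del batch     # drop the only reference to the processed chunk
            gc.collect()  # reclaim memory before starting the next chunk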

orubaba commented 2 years ago

Thanks so much for the suggestion. I am trying to run the get_vocab.py code on a much smaller subset of the ChEMBL dataset, but I got this error:

multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f6c291da0a0>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed"

I have checked online but I haven't worked it out. Kindly assist.

muammar commented 2 years ago

> Thanks so much for the suggestion. I am trying to run the get_vocab.py code on a much smaller subset of the ChEMBL dataset, but I got this error:
>
> multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f6c291da0a0>'. Reason: 'PicklingError("Can't pickle <class 'Boost.Python.ArgumentError'>: import of module 'Boost.Python' failed"
>
> I have checked online but I haven't worked it out. Kindly assist.

See https://github.com/wengong-jin/hgraph2graph/issues/33
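For context, and this is only my reading of the traceback, not something I verified against get_vocab.py: a MaybeEncodingError wrapping a PicklingError usually means the real exception was raised inside a worker process, and the exception object itself (here a Boost.Python ArgumentError coming from RDKit) could not be pickled to be sent back to the parent. A minimal sketch that reproduces the mechanism with a deliberately unpicklable exception:

    from multiprocessing import Pool

    class UnpicklableError(Exception):
        # Simulates an exception (like Boost.Python.ArgumentError) that the
        # pool cannot pickle when sending the worker's failure to the parent.
        def __reduce__(self):
            raise TypeError("this exception cannot be pickled")

    def work(x):
        raise UnpicklableError("failure inside the worker")

    if __name__ == "__main__":
        with Pool(2) as pool:
            # The parent raises multiprocessing.pool.MaybeEncodingError,
            # hiding the original error raised inside the worker.
            pool.map(work, range(4))

So the PicklingError is usually just a symptom; the underlying cause is whatever RDKit/Boost error the worker hit on one of the input SMILES.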

muammar commented 2 years ago

Forget about the message above. It is not using multiprocessing at all.