Open bsaldivaremc2 opened 1 year ago
I have the same problem. Did you solve it?
I did not solve it, but I am skipping some functionality to make it work with the provided pre-trained model and vocabulary.
I noticed that an error occurs in decoder.decode (decoder.py) when anchor_smiles contains more than one entry.
So I limited anchor_smiles to a single entry by adding:
if len(anchor_smiles)>1: continue
in hgraph/decoder.py
Before:
    inter_cands, anchor_smiles, attach_points = graph_batch.get_assm_cands(fa_cluster, fa_used, ismiles)
    <- Here I added the check
    if len(inter_cands) == 0:
After:
    inter_cands, anchor_smiles, attach_points = graph_batch.get_assm_cands(fa_cluster, fa_used, ismiles)
    if len(anchor_smiles)>1: continue
    if len(inter_cands) == 0:
I probably solved the problem. It works the first 900 million times you generate. Instead of the original vocab, use this one: https://github.com/bsaldivaremc2/hgraph2graph/blob/master/data/chembl/recovered_vocab_2000.txt
python generate.py --vocab data/chembl/recovered_vocab_2000.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000
I captured all the motifs that were causing the problem and added them to the original vocab list, replacing 27 less-used motif pairs. Details of the files are here: https://github.com/bsaldivaremc2/hgraph2graph/tree/master/data/chembl
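For anyone who wants to try the same kind of patch on a different vocab, here is a rough sketch of the replacement step. It is not the script that produced recovered_vocab_2000.txt. It assumes each vocab entry is one text line, that you already have a list of the missing entries, and that you have some usage count per entry; `patch_vocab`, `missing`, and `usage` are hypothetical names made up for this illustration. Replacing entries in place, rather than appending, keeps the list the same length as the original.

```python
def patch_vocab(vocab, missing, usage):
    """Return a copy of `vocab` where the least-used entries are replaced
    by the entries in `missing`, so the list keeps its original length.

    vocab   -- list of vocab lines (hypothetical: one motif pair per line)
    missing -- entries that caused errors and should be present
    usage   -- dict mapping a vocab line to how often it is used
               (hypothetical; entries without a count are treated as 0)
    """
    present = set(vocab)
    # Drop duplicates and anything that is already in the vocab.
    to_add = [m for m in dict.fromkeys(missing) if m not in present]
    if len(to_add) > len(vocab):
        raise ValueError("more missing entries than vocab slots")

    # Positions sorted by ascending usage; ties keep file order.
    ranked = sorted(range(len(vocab)), key=lambda i: (usage.get(vocab[i], 0), i))

    patched = list(vocab)
    for idx, new_entry in zip(ranked[:len(to_add)], to_add):
        patched[idx] = new_entry
    return patched
```

All other entries stay at their original positions, and only the slots of the least-used ones are overwritten.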
When following the instructions in README.md, none of the commands shown seems to work out of the box. So far I have added py_modules=['hgraph'] to setup.py and added ",clearAromaticFlags=True)" in the chemutils.py file.
Sample from checkpoint does not work:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000
So I tried to reproduce the vocab with:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > new_vocab.txt
It works, but new_vocab.txt has 5625 lines while data/chembl/vocab.txt has 5623, and there are multiple differences, not just two.
Do you have any way to sample from the checkpoint without issues? Also, why am I getting a different vocab from the same data/chembl/all.txt file? Is there some random operation? I left all random seeds as they are in the scripts.
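To pin down how the two files differ, a set comparison of their lines is one option. The sketch below ignores line order and blank lines, so it only reports entries that exist in one file and not the other. The helper names `diff_vocab_lines` and `diff_vocab_files` are made up for this example, and it assumes each vocab entry sits on its own line.

```python
def diff_vocab_lines(old_lines, new_lines):
    """Return (only_in_old, only_in_new) as sorted lists of stripped lines."""
    old = {ln.strip() for ln in old_lines if ln.strip()}
    new = {ln.strip() for ln in new_lines if ln.strip()}
    return sorted(old - new), sorted(new - old)


def diff_vocab_files(old_path, new_path):
    """Same comparison, reading the entries from two text files."""
    with open(old_path) as f_old, open(new_path) as f_new:
        return diff_vocab_lines(f_old, f_new)


if __name__ == "__main__":
    # Paths taken from the commands above; adjust to your checkout.
    only_old, only_new = diff_vocab_files("data/chembl/vocab.txt", "new_vocab.txt")
    print(len(only_old), "entries only in data/chembl/vocab.txt")
    print(len(only_new), "entries only in new_vocab.txt")
```

If both lists come back empty, the files contain the same entries and differ only in ordering or duplicate lines.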