wengong-jin / hgraph2graph

Hierarchical Generation of Molecular Graphs using Structural Motifs

vocab size of data/chembl/vocab.txt is 5623, but get_vocab.py produces a new vocab.txt with 5625 entries #47

Open bsaldivaremc2 opened 1 year ago

bsaldivaremc2 commented 1 year ago

When following the instructions in the README.md, neither of the commands shown seems to work out of the box. So far I have added py_modules=['hgraph'] to setup.py and added ", clearAromaticFlags=True)" in the chemutils.py file.
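For reference, a minimal sketch of what the second change likely looks like, assuming the flag is being passed to RDKit's Chem.Kekulize (the exact call site in chemutils.py is an assumption; the helper below is hypothetical):

```python
from rdkit import Chem

def get_kekulized_mol(smiles):
    # Hypothetical helper illustrating the chemutils.py change:
    # clearAromaticFlags=True makes RDKit drop the aromatic flags while
    # kekulizing, which avoids kekulization errors on some molecules.
    mol = Chem.MolFromSmiles(smiles)
    Chem.Kekulize(mol, clearAromaticFlags=True)
    return mol
```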

Sampling from the checkpoint does not work:

python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

So I tried to reproduce the vocab with:

python get_vocab.py --ncpu 16 < data/chembl/all.txt > new_vocab.txt

It runs, but new_vocab.txt has 5625 lines while data/chembl/vocab.txt has 5623, and there are multiple differences between the two files, not just two.
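A quick way to see exactly how the two files differ (a small sketch that treats each vocab line as an opaque string):

```python
# Compare the shipped vocab with the regenerated one, line by line.
with open('data/chembl/vocab.txt') as f:
    shipped = {line.strip() for line in f if line.strip()}
with open('new_vocab.txt') as f:
    regenerated = {line.strip() for line in f if line.strip()}

print(f'{len(shipped - regenerated)} entries only in the shipped vocab')
print(f'{len(regenerated - shipped)} entries only in the regenerated vocab')
for entry in sorted(shipped - regenerated):
    print('missing from regenerated:', entry)
```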

Do you have any way to sample from the checkpoint without issues? Also, why am I getting a different vocab from the same data/chembl/all.txt file? Is there some random operation involved? I left all the random seeds as they are in the scripts.

FlexxofIvan commented 1 year ago

I have the same problem. Did you solve it?

bsaldivaremc2 commented 1 year ago

I did not solve it, but I am skipping some functionality to make it work with the provided pre-trained model and vocabulary. I noticed that when anchor_smiles in decoder.decode (hgraph/decoder.py) contains more than one entry, there is an error. So I restricted decoding to single-anchor candidates by adding if len(anchor_smiles) > 1: continue right after the call to graph_batch.get_assm_cands:

```python
inter_cands, anchor_smiles, attach_points = graph_batch.get_assm_cands(fa_cluster, fa_used, ismiles)
if len(anchor_smiles) > 1:  # <- added: skip candidates with more than one anchor
    continue
if len(inter_cands) == 0:
```

bsaldivaremc2 commented 1 year ago

I probably solved the problem. It works the first 900 million times you generate. Instead of the original vocab, use this one: https://github.com/bsaldivaremc2/hgraph2graph/blob/master/data/chembl/recovered_vocab_2000.txt

python generate.py --vocab data/chembl/recovered_vocab_2000.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

I captured all the motifs that were causing the problem and included them in the original vocab list, replacing 27 of the less-used motif pairs. Details of the files are here: https://github.com/bsaldivaremc2/hgraph2graph/tree/master/data/chembl
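For anyone reproducing this, a hedged sketch of the splicing step described above (the file names and the usage-count ordering are assumptions, not the author's actual script). Replacing entries rather than appending presumably keeps the vocab at the exact size the pretrained checkpoint's embedding tables expect:

```python
# Splice recovered motifs into the shipped vocab while keeping its length fixed.
with open('data/chembl/vocab.txt') as f:
    vocab = [line.strip() for line in f if line.strip()]

# Hypothetical file listing the motif/attachment pairs that crashed generation.
with open('missing_motifs.txt') as f:
    missing = [line.strip() for line in f if line.strip()]

# Drop as many of the least-used pairs as we add, assuming the shipped vocab
# is ordered with rarer motifs toward the end (an assumption).
recovered = vocab[:len(vocab) - len(missing)] + missing
assert len(recovered) == len(vocab)  # size must match the checkpoint

with open('data/chembl/recovered_vocab.txt', 'w') as f:
    f.write('\n'.join(recovered) + '\n')
```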