wengong-jin / icml18-jtnn

Junction Tree Variational Autoencoder for Molecular Graph Generation (ICML 2018)
MIT License
509 stars, 190 forks

How are the logp-sa scores in the training data computed? #62

Open zichaow opened 2 years ago

zichaow commented 2 years ago

Hi there! I have a question about how the logP-SA scores are computed. Using the simple script below, I can reproduce the logP-SA scores in the test data data/zinc/opt.test.logP-SA, but not those in the training data data/zinc/train.logP-SA. Any suggestions or explanations for this mismatch would be appreciated. Thanks!

from rdkit import Chem
from rdkit.Chem import Descriptors
from molopt import sascorer

# the smiles below is the first one in `data/zinc/opt.test.logP-SA`
# the score computed below (-2.5248038322) matches that in the file (-2.5248038322)
smiles = 'CC(C)OC(=O)c1cccc(-c2ccc([C@H]3[NH2+][C@H](C(=O)[O-])C(C)(C)S3)o2)c1'
score = Descriptors.MolLogP(Chem.MolFromSmiles(smiles)) - sascorer.calculateScore(Chem.MolFromSmiles(smiles))

# the smiles below is the first one in `data/zinc/train.logP-SA`
# the score computed below (3.412092566642019) DOES NOT match that in the file (2.878620321486616174)
smiles = 'CCCCCCC1=NN2C(=N)/C(=C\c3cc(C)n(-c4ccc(C)cc4C)c3C)C(=O)N=C2S1'
score = Descriptors.MolLogP(Chem.MolFromSmiles(smiles)) - sascorer.calculateScore(Chem.MolFromSmiles(smiles))
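
To make the comparison above easier to repeat over more molecules, here is a minimal sketch of the same check wrapped in two helpers. The names `logp_minus_sa` and `compare_scores` and the tolerance of 1e-6 are my own assumptions, not anything from the repository. `sa_fn` is expected to be `sascorer.calculateScore` as imported in the script above. This only restates the logP minus SA computation from the script; it does not explain where the training-file values come from.

```python
import math


def logp_minus_sa(smiles, sa_fn):
    """Return MolLogP(mol) - sa_fn(mol) for one SMILES string.

    sa_fn is assumed to be sascorer.calculateScore (passed in by the caller).
    RDKit is imported lazily so the comparison helper below works without it.
    """
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)  # parse once and reuse the molecule
    if mol is None:
        raise ValueError("RDKit could not parse SMILES: %r" % smiles)
    return Descriptors.MolLogP(mol) - sa_fn(mol)


def compare_scores(computed, recorded, tol=1e-6):
    """Return (matches, difference) for a computed score vs. the file value.

    tol is an assumed absolute tolerance, chosen only for illustration.
    """
    diff = computed - recorded
    return math.isclose(computed, recorded, rel_tol=0.0, abs_tol=tol), diff


# Values quoted in the script above (no RDKit needed for this part).
test_match, test_diff = compare_scores(-2.5248038322, -2.5248038322)
train_match, train_diff = compare_scores(3.412092566642019, 2.878620321486616174)
print("test:", test_match, test_diff)
print("train:", train_match, train_diff)
```

With RDKit and `sascorer` available, a call would look like `logp_minus_sa(smiles, sascorer.calculateScore)`. Printing the difference rather than only a match flag may help show whether the gap is constant across training molecules or varies per molecule.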