wengong-jin / icml18-jtnn

Junction Tree Variational Autoencoder for Molecular Graph Generation (ICML 2018)
MIT License

Exception: Explicit valence for atom # 4 C, 5, is greater #34

Closed ManvithaPonnapati closed 5 years ago

ManvithaPonnapati commented 5 years ago

I was trying to run the fast_molvae code on my own dataset, and I modified preprocess.py slightly so it would run on my parallel worker setup. In the process I removed the line that sets the RDKit logger to critical, and I then noticed a bunch of these exceptions:

[23:37:03] Explicit valence for atom # 1 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 3 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 6, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 6, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 6, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 6, is greater than permitted
[23:37:03] Explicit valence for atom # 3 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 7, is greater than permitted
[23:37:03] Explicit valence for atom # 3 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 7, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 5, is greater than permitted
[23:37:03] Explicit valence for atom # 1 C, 6, is greater than permitted
[[2323::3737::0303] ] Explicit valence for atom # 1 C, 5, is greater than permitted Explicit valence for atom # 4 C, 5, is greater

I think they are coming from get_clique_mol -> sanitize() in chemutils.py. I am not sure if I am missing something, but is this supposed to happen? I repeated the same thing with your training set and the same logs appear. Next to the return statement of the sanitize method I also noticed this comment: #We assume this is not None. So are we supposed to "clean" our data so that it does not include the SMILES for which this happens?

ManvithaPonnapati commented 5 years ago

Specifically, here are some of the SMILES strings for which the sanitize method in chemutils returns None:


```
Mol is None
C1=CC=C2C(=C1)[CH2:14][CH:14]=[CH:14][CH:14]=2
[00:00:46] Explicit valence for atom # 1 C, 5, is greater than permitted
Mol is None
C[C:4]1(=O)[CH2:4][NH:4][CH2:4][CH:4]([NH2:2])[NH2+:4]1
[00:00:46] Explicit valence for atom # 5 C, 5, is greater than permitted
Mol is None
C[CH:4]1[CH2:4][NH:4][CH2:4][C:4](=O)([NH2:2])[NH2+:4]1
[00:00:46] Explicit valence for atom # 1 C, 5, is greater than permitted
Mol is None
[CH3:2][C:2]1([CH3:3])=[CH:8][N:8]=[CH:8][CH:8]=[CH:8]1
[00:00:46] Explicit valence for atom # 3 C, 5, is greater than permitted
Mol is None
C1=CC=C2C(=C1)=[CH:14][CH2:14][CH2:14][CH:14]=2
[00:00:46] Explicit valence for atom # 3 C, 5, is greater than permitted
Mol is None
C1=CC=C2C(=C1)=[CH:14][CH2:14][CH2:14][CH:14]=2
[00:00:46] Explicit valence for atom # 1 C, 5, is greater than permitted
Mol is None
C[C:4]1(=O)[CH2:4][NH:4][CH:4]([NH2:2])[CH2:4][NH2+:4]1
[00:00:46] Explicit valence for atom # 4 C, 5, is greater than permitted
Mol is None
C[CH:4]1[CH2:4][NH:4][C:4](=O)([NH2:2])[CH2:4][NH2+:4]1
[00:00:46] Explicit valence for atom # 1 C, 5, is greater than permitted
Mol is None
[OH:2][C:15]1([NH2:3])[CH:15]=[CH:15][CH:15]=[CH:15][CH:15]=1
```

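
To make the message concrete, here is a rough sketch in plain Python (no RDKit) that counts explicit valence for the fragment `[CH3:2][C:2]1([CH3:3])=[CH:8][N:8]=[CH:8][CH:8]=[CH:8]1` from the list above. The atom and bond tables are transcribed by hand from that SMILES, and `MAX_VALENCE`, `explicit_valences` and `valence_violations` are hypothetical names for illustration only. RDKit's real valence model also handles charges, aromaticity and other cases, so this is only a simplified stand-in for its check.

```python
# Toy valence check: a simplified stand-in for RDKit's sanitization step,
# not RDKit's actual algorithm. Neutral atoms only (assumption).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}

# (symbol, explicit hydrogen count) per atom, indexed in SMILES order.
atoms = [("C", 3), ("C", 0), ("C", 3), ("C", 1),
         ("N", 0), ("C", 1), ("C", 1), ("C", 1)]

# (begin atom, end atom, bond order); the last entry is the ring-closure bond.
bonds = [(0, 1, 1), (1, 2, 1), (1, 3, 2), (3, 4, 1),
         (4, 5, 2), (5, 6, 1), (6, 7, 2), (7, 1, 1)]


def explicit_valences(atoms, bonds):
    """Sum explicit hydrogens and bond orders for every atom."""
    valences = [h_count for _, h_count in atoms]
    for begin, end, order in bonds:
        valences[begin] += order
        valences[end] += order
    return valences


def valence_violations(atoms, bonds):
    """Return (index, symbol, valence) for atoms exceeding MAX_VALENCE."""
    violations = []
    for idx, valence in enumerate(explicit_valences(atoms, bonds)):
        symbol = atoms[idx][0]
        if valence > MAX_VALENCE[symbol]:
            violations.append((idx, symbol, valence))
    return violations


violations = valence_violations(atoms, bonds)
for idx, symbol, valence in violations:
    print(f"Explicit valence for atom # {idx} {symbol}, {valence}, "
          "is greater than permitted")
```

Atom 1 carries two single bonds to the methyl groups, one double bond, and the single ring-closure bond, which adds up to 5. That appears to be why the ring carbon is rejected and the molecule comes back as None.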
wengong-jin commented 5 years ago

Hi,

This is expected behavior once the RDKit logger is no longer set to critical. The invalid SMILES come from the graph assembly algorithm in the graph decoder, not from get_clique_mol -> sanitize(). They arise when the assembler attaches two cliques in an invalid way, and those invalid attachments are removed during the enumeration.
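
The filtering described here can be pictured with a small sketch. This is not the repository's enumeration code: `keep_valid`, `stub_build` and the candidate tuples are hypothetical, and the stub stands in for an RDKit-backed builder such as the sanitize step discussed above, so the example runs without RDKit installed.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A candidate attachment, reduced here to (SMILES, largest carbon valence).
# This representation is an assumption made purely for illustration.
Candidate = Tuple[str, int]


def keep_valid(candidates: Iterable[Candidate],
               build: Callable[[Candidate], Optional[str]]) -> List[str]:
    """Build every candidate and drop those for which `build` returns None."""
    valid = []
    for candidate in candidates:
        mol = build(candidate)
        if mol is None:
            # Invalid attachment (e.g. a 5-valent carbon): skip it.
            continue
        valid.append(mol)
    return valid


def stub_build(candidate: Candidate) -> Optional[str]:
    """Stand-in for a sanitizer: returns None when a carbon exceeds valence 4."""
    smiles, max_carbon_valence = candidate
    return smiles if max_carbon_valence <= 4 else None


candidates = [("CCO", 4), ("C(C)(C)(C)(C)C", 5), ("c1ccccc1", 4)]
valid = keep_valid(candidates, stub_build)
print(valid)
```

With a real RDKit-backed builder, each rejected candidate would also emit one of the "Explicit valence" messages unless the logger level is raised, which would be consistent with the messages showing up only after the logger setting was removed.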