Open roselightheart opened 2 years ago
Getting the same issue - here's the exact error message for others' reference:
python preprocess.py --train data/chembl/all.txt --vocab data/chembl/vocab.txt --ncpu 16 --mode single
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/marcase/hgraph2graph/preprocess.py", line 19, in tensorize
x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 153, in tensorize
tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 194, in tensorize_graph
fnode[v] = vocab[attr]
File "/home/marcase/hgraph2graph/hgraph/vocab.py", line 43, in __getitem__
return self.hmap[x[0]], self.vmap[x]
KeyError: 'C1=NN=CN1'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/marcase/hgraph2graph/preprocess.py", line 106, in <module>
all_data = pool.map(func, batches)
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 771, in get
raise self._value
KeyError: 'C1=NN=CN1'
Found a super easy solution to this problem - just generate a fresh vocab from your own dataset rather than using the one provided. I think an RDKit update changed how some SMILES strings are generated, particularly for aromatic groups (this was mentioned in another issue thread), so fragments like C1=NN=CN1 no longer match the entries in the provided vocab.
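If you want to see how far the provided vocab has drifted from a freshly generated one, a quick diff of the two files helps. The sketch below is only an illustration, not part of the repo: it assumes each vocab line is whitespace-separated with the fragment SMILES in the first column (which is how preprocess.py appears to read it), and the helper names and file paths are made up. I believe the repo ships a get_vocab.py script for producing the fresh file, but check the README for the exact invocation.

```python
from pathlib import Path


def load_fragments(path):
    """Return the set of first-column entries from a vocab file.

    Assumption: one vocab entry per line, whitespace-separated,
    with the fragment SMILES in the first column.
    """
    fragments = set()
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            fragments.add(fields[0])
    return fragments


def diff_vocabs(provided_path, fresh_path):
    """Compare two vocab files by their first column.

    Returns (missing, unused):
      missing - fragments in the fresh vocab that the provided vocab lacks
                (these are the ones that would raise KeyError)
      unused  - fragments in the provided vocab that the fresh one lacks
    """
    provided = load_fragments(provided_path)
    fresh = load_fragments(fresh_path)
    return sorted(fresh - provided), sorted(provided - fresh)


if __name__ == "__main__":
    # Hypothetical paths: the vocab shipped with the repo and one you regenerated.
    missing, unused = diff_vocabs("data/chembl/vocab.txt", "vocab_fresh.txt")
    print(f"{len(missing)} fragments missing from the provided vocab")
    for smiles in missing[:20]:
        print("  ", smiles)
```

A non-empty "missing" list on your RDKit version is a good sign that the KeyError comes from a vocab mismatch rather than from your data.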
In order to use the model checkpoint trained on ChEMBL, you need to be on rdkit=2019.03.4, which isn't mentioned in the README. If you're on a newer version, you'll get a KeyError when the model tries to look up SMILES in its vocabulary. I know this repo is sparsely maintained, so I'm mostly leaving this as a search term for anyone else who wants to use that checkpoint in the future.
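For anyone who wants to fail fast instead of hitting the KeyError deep inside a worker pool, here is a minimal sketch of a version guard. It is not something the repo provides: the function names are hypothetical, and the only facts it relies on are the version number reported above and that RDKit exposes its version as `rdkit.__version__`.

```python
import re

# Version reported in this thread as compatible with the ChEMBL checkpoint.
REQUIRED_RDKIT = "2019.03.4"


def parse_version(version):
    """Turn a version string like '2019.03.4' into a tuple of ints."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def rdkit_matches(installed, required=REQUIRED_RDKIT):
    """True if the installed version equals the required one."""
    return parse_version(installed) == parse_version(required)


def check_installed_rdkit():
    """Return True/False for the installed RDKit, or None if it is not installed."""
    try:
        import rdkit
    except ImportError:
        return None
    return rdkit_matches(rdkit.__version__)


if __name__ == "__main__":
    status = check_installed_rdkit()
    if status is None:
        print("RDKit is not installed in this environment")
    elif status:
        print("RDKit version matches", REQUIRED_RDKIT)
    else:
        print("RDKit version differs from", REQUIRED_RDKIT,
              "- expect KeyError when using the provided vocab/checkpoint")
```

Calling `check_installed_rdkit()` at the top of your own driver script, before any preprocessing or decoding, gives a clear message up front.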