Pretrained chembl model requires old rdkit

roselightheart commented 2 years ago

In order to use the model checkpoint trained on chembl, you need to be on rdkit=2019.03.4, which isn't mentioned in the readme. If you're on a newer version, you'll get a KeyError when the model tries to look up SMILES in its vocabulary. I know this repo is sparsely maintained, so I'm mostly leaving this as a search term for anyone else who wants to use that checkpoint in the future.

marshallcase commented 2 years ago

Getting the same issue - here's the exact error message for others' reference:

python preprocess.py --train data/chembl/all.txt --vocab data/chembl/vocab.txt --ncpu 16 --mode single
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/marcase/hgraph2graph/preprocess.py", line 19, in tensorize
    x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
  File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 153, in tensorize
    tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
  File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 194, in tensorize_graph
    fnode[v] = vocab[attr]
  File "/home/marcase/hgraph2graph/hgraph/vocab.py", line 43, in __getitem__
    return self.hmap[x[0]], self.vmap[x]
KeyError: 'C1=NN=CN1'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/marcase/hgraph2graph/preprocess.py", line 106, in <module>
    all_data = pool.map(func, batches)
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'C1=NN=CN1'

marshallcase commented 2 years ago

Found a super easy solution to this problem - just generate a fresh vocab from the dataset rather than using the one provided. I think an rdkit update changed a couple of the ways the smiles strings are generated, particularly from the aromatic groups (this was mentioned in another issue thread).

wengong-jin / hgraph2graph

Pretrained chembl model requires old rdkit #44