whitead / dmol-book

Deep learning for molecules and materials book
https://dmol.pub

string vs. graph representation #66

Closed. kjappelbaum closed this issue 2 years ago.

kjappelbaum commented 2 years ago

I think you mention that there is currently no paper on this, and I agree. A nice hint is the GuacaMol leaderboard (https://www.benevolent.com/guacamol); when we tested SELFIES, we also did not find it to outperform the graph-based models. Clearly this is also related to the modeling, but as you write, I also think that the representation matters.

It would be interesting to do a proper benchmark that also considers the latent space of a model trained on a string-based representation (e.g., a VAE, perhaps trained jointly with some properties).
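
As a rough illustration of what that joint training could look like, here is a minimal PyTorch sketch of a character-level SMILES VAE whose latent space is shaped by a property head. All names, dimensions, and loss weights are illustrative assumptions; tokenization and the training loop are omitted.

```python
# Illustrative sketch only: a SMILES VAE trained jointly with a property head,
# so the latent space is shaped by both reconstruction and the property.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmilesVAE(nn.Module):  # hypothetical name/dims, not from the book
    def __init__(self, vocab_size=40, emb=64, hidden=128, latent=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.z_to_h = nn.Linear(latent, hidden)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
        # property head: predicts a scalar property from the latent code
        self.prop = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tokens):
        x = self.embed(tokens)
        _, h = self.encoder(x)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)  # latent -> decoder init state
        dec, _ = self.decoder(x, h0)  # teacher forcing with the input tokens
        return self.out(dec), mu, logvar, self.prop(z)

def loss_fn(logits, tokens, mu, logvar, prop_pred, prop_true, beta=0.1, gamma=1.0):
    # reconstruction: predict token t+1 from tokens up to t
    recon = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                            tokens[:, 1:].reshape(-1))
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    prop = F.mse_loss(prop_pred.squeeze(-1), prop_true)
    return recon + beta * kl + gamma * prop

model = SmilesVAE()
tokens = torch.randint(0, 40, (8, 20))  # stand-in for tokenized SMILES
y = torch.randn(8)                      # stand-in property labels
logits, mu, logvar, yhat = model(tokens)
loss_fn(logits, tokens, mu, logvar, yhat, y).backward()
```

The property term pulls molecules with similar properties together in latent space, which is exactly the part such a benchmark would probe.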

kjappelbaum commented 2 years ago

(and one nice reference for the symmetry and GNN chapter is probably https://geometricdeeplearning.com/)

whitead commented 2 years ago

I think you're referring to this section at the end of the GNN chapter. That is a little old; I wrote a more nuanced discussion this summer specifically about SELFIES.

After thinking about it more since I wrote that: a GNN has atom permutation equivariance, so you can predict things per atom or per bond. A string is atom permutation invariant, so you can only predict per-molecule properties. The advantages of the string representation are that implementations are easier, there are fewer architecture choices (relative to GNNs), and generation is trivial. As far as accuracy, I agree it's not clear. There is a persistent folklore that SMILES is worse than graphs because there are long-range effects (matching ring-closure digits) that are difficult to capture, but I've never seen convincing evidence. Definitely need to update the GNN section. Thanks for bringing this up!
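
A quick way to see both points, assuming RDKit is available (this is an illustrative sketch, not code from the book):

```python
# Renumbering the atoms of a molecule leaves the graph unchanged, but it
# changes the non-canonical SMILES token sequence, so a string model must
# learn permutation invariance from data, while a GNN is equivariant by
# construction.
from rdkit import Chem

mol = Chem.MolFromSmiles("c1ccc2ccccc2c1O")   # 1-naphthol
perm = list(range(mol.GetNumAtoms()))[::-1]   # reverse the atom ordering
mol2 = Chem.RenumberAtoms(mol, perm)

print(Chem.MolToSmiles(mol, canonical=False))   # one token sequence
print(Chem.MolToSmiles(mol2, canonical=False))  # a different token sequence
print(Chem.MolToSmiles(mol) == Chem.MolToSmiles(mol2))  # True: same graph

# The long-range folklore: in a macrocycle the two matching "1" ring-closure
# tokens are separated by the entire ring.
print(Chem.MolToSmiles(Chem.MolFromSmiles("C1CCCCCCCCCCC1")))
```

The renumbered molecule yields a different token sequence even though the canonical SMILES (and any GNN output, up to the same permutation) is identical, and the last line shows how far apart the matching ring-closure tokens can sit.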

whitead commented 2 years ago

Gonna call this good for now, but will revisit once SELFIES perspective is done and there is a bit more lit on this topic.