yuyangw / MolCLR

Implementation of MolCLR: "Molecular Contrastive Learning of Representations via Graph Neural Networks" in PyG.

About QM9 #18

Closed: DuanhaoranCC closed this issue 1 year ago

DuanhaoranCC commented 1 year ago

Hi dear author, after reading your impressive paper, we have the following questions.

  1. The QM9 dataset size reported in the paper is 130,829, but the commonly cited size is 130,831. What causes this difference?
  2. QM9 has 12 regression tasks; why does the paper report only 8?
  3. How is QM9 split for training? More precisely, what is the fine-tuning protocol in Appendix 2?
  4. Existing mainstream methods pre-train on the ZINC15 dataset and then fine-tune on eight downstream datasets; you seem to use a larger pre-training dataset.
  5. How did you obtain the molecular representation visualization in the paper, and more specifically, how are the node colors determined? I already understand how t-SNE works.

Thank you for your answers and help, and good luck with your research!

yuyangw commented 1 year ago

Hi, thanks for your interest in our work. To answer your questions:

  1. We borrowed the number from previous literature, which may explain the mismatch.
  2. We ran 8 of the 12 QM9 regression tasks during the revision. Given limited revision time, we did not exhaust all the tasks, but the data is available and you can test the remaining 4 tasks if interested (see the first sketch below).
  3. We use a random split for QM9, as mentioned in the main manuscript (also illustrated in the first sketch below).
  4. Yes, that's a good point, and we have included the discussion in the manuscript. Compared with pre-training methods that use 10M molecules (e.g., ChemBERTa), we achieve better performance.
  5. In the t-SNE plot, each node is colored by its molecular weight, which can be obtained with RDKit (see the second sketch below).
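Below is a minimal sketch of what a random split on QM9 with single-task target selection could look like in PyG. The 80/10/10 ratio, `target_idx`, batch size, and loop body are illustrative assumptions rather than the paper's exact fine-tuning setup; note also that PyG's packaged QM9 exposes 19 target columns, a superset of the 12 tasks discussed above.

```python
# A hypothetical sketch: random 80/10/10 split of QM9 in PyG and
# selection of a single regression target. Not the repo's actual code.
import torch
from torch_geometric.datasets import QM9
from torch_geometric.loader import DataLoader

dataset = QM9(root="data/QM9")   # downloads and processes QM9 on first use
target_idx = 0                    # which target column to regress on (assumption)

# Shuffle indices and split into train/valid/test.
perm = torch.randperm(len(dataset))
n_train = int(0.8 * len(dataset))
n_valid = int(0.1 * len(dataset))
train_set = dataset[perm[:n_train]]
valid_set = dataset[perm[n_train:n_train + n_valid]]
test_set = dataset[perm[n_train + n_valid:]]

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
for batch in train_loader:
    y = batch.y[:, target_idx]   # single-task label for this batch
    # a model forward pass and loss on `y` would go here
    break                        # sketch only: stop after one batch
```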
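And a sketch of how a t-SNE projection can be colored by RDKit molecular weight, as in answer 5. `smiles_list` and `embeddings` are placeholder inputs; in practice the embeddings would come from the pre-trained encoder.

```python
# A hypothetical sketch: color a t-SNE plot of molecular representations
# by RDKit molecular weight. Inputs below are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.manifold import TSNE

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]  # toy molecules
embeddings = np.random.rand(len(smiles_list), 512)   # stand-in for learned representations

# Molecular weight per molecule via RDKit.
weights = [Descriptors.MolWt(Chem.MolFromSmiles(s)) for s in smiles_list]

# Project to 2-D and color each point by its molecular weight.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=weights, cmap="viridis")
plt.colorbar(label="Molecular weight (g/mol)")
plt.show()
```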

Hope this helps.

Best, Yuyang