yuyangw / MolCLR

Implementation of MolCLR: "Molecular Contrastive Learning of Representations via Graph Neural Networks" in PyG.
MIT License

Discussion of three issues with "Molecular Contrastive Learning of Representations via Graph Neural Networks" #16

Closed · liuyunwu closed this issue 1 year ago

liuyunwu commented 1 year ago

Dear Authors,

I have read your work "Molecular Contrastive Learning of Representations via Graph Neural Networks" thoroughly, and I have learned and benefited a lot from it. Thank you to the authors for their hard work.

However, I would like to ask the following questions.

First, the dataset sizes do not match. The datasets I downloaded from the CodeOcean capsule linked in the "Code availability" section of the article do not correspond to the dataset sizes reported in the paper. The record counts I obtained from "data.zip" are: QM9: 133,885; QM8: 21,786; QM7: 6,834; Lipo: 4,200; ESOL: 1,128; FreeSolv: 642; BBBP: 2,050; Tox21: 7,831; ClinTox: 1,483; HIV: 82,254; BACE: 1,513; SIDER: 1,427; MUV: 186,175. MUV in particular differs greatly: the MolCLR paper reports 93,087 records, while the downloaded data contain 186,175. Some of the other datasets also do not match exactly. Could the authors please provide datasets of the same sizes as those in the paper? (A minimal sketch of how I tallied these counts is included below.)

Second, could you provide source code for applying the three augmentation strategies directly in the downstream molecular property prediction task? I have had difficulty implementing this myself.

Finally, since my own ability is limited, could you please provide the source code for Figure 3 on page 7 and Figure 4 on page 8 of the paper, as well as Figure 4 on page 16 of the Supplementary Information?
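For reference, here is a minimal sketch of how such counts can be tallied, assuming each dataset extracted from data.zip is a CSV file with one molecule per row; the directory name and glob pattern are placeholders, not the actual layout of the archive.

```python
# Illustrative only: count records per downloaded dataset CSV.
# The path pattern below is a placeholder; substitute the actual files
# extracted from data.zip.
import glob
import pandas as pd

for path in sorted(glob.glob("data/*.csv")):
    df = pd.read_csv(path)
    print(f"{path}: {len(df)} records")
```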

I would like to thank the authors again for providing me with a learning opportunity. I look forward to your early reply! My email address is liuyw19@lzu.edu.cn.

I wish you all the best.

Yunwu Liu

yuyangw commented 1 year ago

Hi Yunwu,

Sorry for the late reply. I've been extraordinarily busy recently and couldn't work on the GitHub issues. To answer your questions:

  1. The link to download the exact datasets we used in the paper can be found in the repo README; they can also be downloaded from the MoleculeNet webpage. We took the reported molecule counts from previously published works, but we'll look into the discrepancy.
  2. To clarify, the data augmentations are only applied during the pretraining. To apply the augmentations on downstream tasks, you can refer to our AugLiChem package (https://github.com/BaratiLab/AugLiChem).
  3. I didn't keep the source code used to plot the figures. For the t-SNE embeddings, I used the TSNE implementation from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), and the figures were drawn with the scatter and hist functions from matplotlib (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html & https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html). A minimal plotting sketch is included after this list.
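For anyone trying to reproduce similar figures, below is a minimal plotting sketch along the lines described above, not the original figure code (which was not kept). The `embeddings` and `labels` arrays are random placeholders standing in for pretrained MolCLR representations and a molecular property.

```python
# Sketch of the plotting pipeline: t-SNE projection + scatter/hist figures.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))  # placeholder for pretrained MolCLR embeddings
labels = rng.normal(size=1000)             # placeholder for a molecular property

# Project the high-dimensional embeddings to 2D with t-SNE.
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)

# Scatter plot of the 2D embedding, colored by the property value.
plt.figure(figsize=(5, 4))
sc = plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="viridis")
plt.colorbar(sc, label="property value")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.tight_layout()
plt.savefig("tsne_embedding.png", dpi=300)

# Histogram of the property distribution, drawn with matplotlib's hist.
plt.figure(figsize=(5, 4))
plt.hist(labels, bins=50)
plt.xlabel("property value")
plt.ylabel("count")
plt.tight_layout()
plt.savefig("property_hist.png", dpi=300)
```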

Hope this helps.

Best, Yuyang

liuyunwu commented 1 year ago

Dear Yuyang, Thank you very much for your reply. After fine-tuning the GIN pre-trained model provided on GitHub, my results on the Tox21, BACE, and MUV datasets do not reach the results reported in the paper. I don't know what's wrong. Could you give me some guidance?

Sincerely, Yunwu

yuyangw commented 1 year ago

Hi Yunwu,

To reproduce the performance on the downstream benchmarks, we suggest a hyperparameter search for each dataset during finetuning. Some hyperparameters that we found important for performance include the dropout ratio, batch size, and learning rate. On BACE and Tox21, I would suggest a higher dropout ratio. For Tox21 and MUV, which contain multiple tasks, tuning each task individually can also help. A rough grid-search sketch is shown below.
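As an illustration of such a search (not part of the MolCLR codebase), the sketch below runs a simple grid over the hyperparameters mentioned above; the `run_finetune` wrapper is a hypothetical stand-in for whatever you use to launch the repo's finetuning for a given dataset and settings.

```python
# Illustrative grid search over finetuning hyperparameters (sketch only).
import itertools
import random

search_space = {
    "drop_ratio": [0.0, 0.3, 0.5],   # dropout ratio
    "batch_size": [32, 64, 128],
    "init_lr": [5e-4, 1e-3, 5e-3],   # learning rate
}

def run_finetune(dataset, **hparams):
    # Hypothetical placeholder: replace with a call into the actual
    # finetuning code, returning a validation metric such as ROC-AUC.
    return random.random()

best_score, best_hparams = float("-inf"), None
for values in itertools.product(*search_space.values()):
    hparams = dict(zip(search_space, values))
    score = run_finetune("BACE", **hparams)
    if score > best_score:
        best_score, best_hparams = score, hparams

print("best validation score:", best_score)
print("best hyperparameters:", best_hparams)
```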

Hope this helps.

Best, Yuyang

JongKook-Heo commented 1 year ago

Hi, Yuyang Wu!

Thanks for the detailed explanation. I'm wondering whether, for the multi-target tasks, each mean value reported in Table 1 and Table 2 of your paper comes from different hyperparameter settings per target (i.e., picking heterogeneous hyperparameter combinations, the best-performing one for each target in ClinTox, and then averaging over the individual runs).

Sincerely, JongKook, Heo