Questions about dataset

ziqi92 commented 4 years ago

Hi, I read your paper and want to build new pair datasets with other thresholds. Your paper says the training datasets are curated from ZINC250K. However, using ZINC250K, I cannot construct as many pairs as you did.

I also found that some molecules in your training pairs don't exist in ZINC250K. I'm not sure whether there is an issue with my copy of the dataset. Would you mind sharing your scripts for constructing the training pairs?

Besides, I found that the logp06 training pairs in this project are different from the pairs in the latest hgraph2graph project. Were there any changes to the training datasets in hgraph2graph?

Thanks

wengong-jin commented 4 years ago

Hi, I used the chemfp library to find molecular pairs in ZINC250K. I don't have specific scripts for that; I basically ran chemfp from the command line. For DRD2, the molecular pairs are extracted from the DRD2 dataset used in Olivecrona et al. 2017 instead of ZINC250K. You can find the data here: https://github.com/MarcusOlivecrona/REINVENT/releases/download/v1.0/data.tar.gz
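For readers who want to experiment with other thresholds, here is a minimal sketch of threshold-based pair finding. It is not chemfp and not the author's actual workflow: it is plain Python that assumes fingerprints are already available as sets of on-bit indices, and the names `tanimoto`, `similar_pairs`, and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def similar_pairs(fingerprints, threshold):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold."""
    items = sorted(fingerprints.items(), key=lambda kv: kv[0])
    pairs = []
    for (id_a, fp_a), (id_b, fp_b) in combinations(items, 2):
        score = tanimoto(fp_a, fp_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Hypothetical toy fingerprints (sets of on-bit indices), not real molecules.
toy = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),
    "mol_c": frozenset({7, 8, 9}),
}
result = similar_pairs(toy, threshold=0.5)
```

The all-pairs loop is quadratic, so it only illustrates the idea; for a dataset the size of ZINC250K a dedicated similarity-search tool such as chemfp is the practical choice.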

For hgraph2graph, the dataset is exactly the same. The SMILES look different because of different SMILES ordering. If you canonicalize all the SMILES, they will be the same.
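One way to check this is to canonicalize both datasets and compare the resulting molecule sets. The sketch below assumes RDKit is installed and that each dataset has already been loaded as a list of (source, target) SMILES tuples; the helper names `rdkit_canonical`, `molecule_set`, and `compare_datasets` are hypothetical.

```python
def rdkit_canonical(smiles):
    """Canonicalize one SMILES string with RDKit (imported only when called)."""
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"unparsable SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)  # canonical output by default


def molecule_set(pairs, canonicalize=rdkit_canonical):
    """Collect every molecule that appears in any pair, in canonical form."""
    return {canonicalize(smiles) for pair in pairs for smiles in pair}


def compare_datasets(pairs_a, pairs_b, canonicalize=rdkit_canonical):
    """Return (molecules only in A, molecules only in B)."""
    set_a = molecule_set(pairs_a, canonicalize)
    set_b = molecule_set(pairs_b, canonicalize)
    return set_a - set_b, set_b - set_a
```

If both returned sets are empty, the two datasets contain the same molecules. The same comparison can be run between a pair dataset and a reference list such as ZINC250K to see which molecules have no match.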

ziqi92 commented 4 years ago

Thanks for your reply!

I did as you said and found that these datasets do share the same molecules. Thank you for your help! Actually, the reason I asked is that the number of pairs differs between these datasets. hgraph2graph has slightly fewer pairs than the datasets in this project. I guess the datasets in this project may contain some repeated molecule pairs, and these were removed in hgraph2graph?
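The repeated-pair guess can be tested by counting pairs after canonicalization. This is a minimal sketch that assumes the pairs are already loaded as (source, target) tuples; `duplicate_pairs` is a hypothetical helper, and the `canonicalize` argument can be any function that maps a SMILES string to its canonical form (for example one built on RDKit).

```python
from collections import Counter


def duplicate_pairs(pairs, canonicalize=lambda smiles: smiles):
    """Map each canonical (source, target) pair seen more than once to its count."""
    counts = Counter(
        (canonicalize(source), canonicalize(target)) for source, target in pairs
    )
    return {pair: n for pair, n in counts.items() if n > 1}
```

An empty result means the difference in pair counts comes from something other than repeated pairs.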

Another thing: I have already converted all SMILES into canonical SMILES. However, I cannot match some molecules in your dataset against ZINC250K. For example, I found the molecule "O=C(CC[NH+]1CCOCC1)Nc1ccc(NC(=O)CCC2CCCC2)cc1F" in your logp06 dataset; its canonical SMILES should be "O=C(CCC1CCCC1)Nc1ccc(NC(=O)CC[NH+]2CCOCC2)c(F)c1". But I can't find this canonical SMILES in ZINC250K (I have already canonicalized all SMILES in ZINC250K).

I can provide many other examples from the logp06 dataset, such as: "O=C(CCOc1cccc(Cl)c1)N1CCC(CC(=O)c2ccccc2)C1" "O=S(=O)(NC(C)c1oc2ccccc2c1C)N(CC)Cc1ccc(Cl)cc1"

I would really appreciate any suggestions about how to find these molecules in ZINC250K. Thank you!

wengong-jin commented 4 years ago

Hi,

Thank you for pointing that out. I don't remember the exact details of how I curated the dataset. I think I may have added molecules from the ZINC database (a superset of ZINC250K). I will change the description in the paper from ZINC250K to ZINC.

ziqi92 commented 4 years ago

Hi,

Thanks for your reply! I would like to close this issue.

wengong-jin / iclr19-graph2graph
