xiaoyeye / CNNC

Convolutional neural network based coexpression analysis
MIT License

Unable to replicate results from the paper #7

Closed adyprat closed 4 years ago

adyprat commented 4 years ago

Hi, I'm trying to reproduce the results from your PNAS paper on CNNC. Specifically, I'm trying to reproduce the leave-one-TF-out results presented in Fig. 2 for GTRD TF-target prediction on the macrophage dataset. However, my validation accuracy and loss don't seem to improve despite running with the recommended parameters from the supplement for over 200 epochs. Moreover, when I test on the held-out TF, all the predicted output values are nearly identical, which results in close-to-random predictor performance.

Here are the steps I followed for the sake of reproducibility:

1) I first generated the NEPDFs using the following command: python get_xy_label_data_cnn_combine_from_database.py None data/sc_gene_list.txt data/bone_marrow_gene_pairs_200.txt data/bone_marrow_gene_pairs_200_num.txt None bone_marrow_cell.h5 1.

2) I then used this command to train a new model: KERAS_BACKEND=theano python train_new_model/train_with_labels_wholedatax.py 12 NEPDF_data/ 2

3) Here's the output at the end of training (screenshot attached). Note that the validation accuracy never improves beyond 0.45, and neither does the training accuracy, which is close to that of a random predictor.

4) I then tested it on the held-out NEPDF data using: python predict_no_y.py 1 NEPDF_data/ 2 xwhole_saved_models_T_32-32-64-64-128-128-512_e200/keras_cnn_trained_model_shallow.h5. The output has all values set to 1 (i.e., high confidence for all edges); predict_results_no_y_1/y_predict.npy is just an array filled with identical values.
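For reference, the degenerate output in step 4 can be checked programmatically. This is a small helper I used for the sanity check (the function name `is_degenerate` is my own, not part of the CNNC code):

```python
import numpy as np

def is_degenerate(preds, tol=1e-6):
    """Return True if all predicted scores are numerically identical,
    i.e. the symptom described in step 4 above."""
    preds = np.asarray(preds, dtype=float).ravel()
    return bool(preds.size > 0 and np.ptp(preds) <= tol)

# e.g. after prediction:
# y_pred = np.load("predict_results_no_y_1/y_predict.npy")
# print(is_degenerate(y_pred))
```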

I'm not sure what I'm missing here. Did you observe similar training and validation plots in your end_result.pdf at the end of training on this dataset? I'm seeing similar behavior for the mESC and dendritic cell datasets, and on my own scRNA-seq datasets. I also attached the full log I obtained while training here: Could you please help me with this?

Thanks in advance, Aditya

xiaoyeye commented 4 years ago

Hi, thanks for your interest. The code in the repo is actually for the KEGG tasks, which have three labels; I had not uploaded the code for the TF-target task. I have now uploaded the TF code to the folder "train_new_model". The reason it did not work is clear: the bone marrow TF-target list has a three-label format, i.e. "TF target" 1, "target TF" 2, "TF random" 0, but you set the class number to 2. The solution is in the newly uploaded "train_with_labels_three_foldx_3fold_TF_two_labels.py", whose data-loading function simply removes label 2. If everything goes well, you should see a training result like the attached figure (screenshot attached).
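The key change in the two-label data loading can be sketched as follows (this is a simplified illustration, not the exact code from train_with_labels_three_foldx_3fold_TF_two_labels.py; the function name `drop_label_two` is mine):

```python
import numpy as np

def drop_label_two(x, y):
    """Two-label variant: discard samples labeled 2 ("target TF" direction)
    so that only labels {0, 1} remain for binary training."""
    x = np.asarray(x)
    y = np.asarray(y)
    mask = y != 2
    return x[mask], y[mask]
```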

BTW, the number is 13, not 12. Let me know if you have any problems.

adyprat commented 4 years ago

Hi, thank you for the quick response. To train a new model, would it then suffice to simply keep only the edges with labels 0 and 1 in https://github.com/xiaoyeye/CNNC/blob/master/data/bone_marrow_gene_pairs_200.txt before I invoke get_xy_label_data_cnn_combine_from_database.py? I'm not sure that is the only problem, since my custom ydata_tf files already had just two classes on my own datasets.
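In other words, I was planning to pre-filter the pairs file along these lines (assuming each line ends with the numeric label, as in "TF target 1"; `keep_two_labels` is just my sketch, not repo code):

```python
def keep_two_labels(lines):
    """Keep only gene-pair lines whose trailing label field is 0 or 1,
    dropping the label-2 ("target TF") direction before NEPDF generation."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[-1] in ("0", "1"):
            kept.append(line)
    return kept
```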

In any case, it looks like the newly uploaded code is for 3-fold edge CV. Do you happen to have the code you used to generate Figure 2 in the paper, i.e. the leave-one-TF-out evaluation on the GTRD dataset? I set the number to 12 because I wanted to train on 12 TFs and test on the 13th; I'm not sure I was doing that right. Moreover, I was using train_with_labels_wholedatax.py to train the model, not the 3-fold CV script. I'd really appreciate it if you could upload code/directions to regenerate Figure 2, or even just the leave-one-TF-out evaluation script. Or did you also use train_with_labels_wholedatax.py for that?

Best, Aditya

xiaoyeye commented 4 years ago

Hi, you are welcome. The new code I uploaded removes label 2 in https://github.com/xiaoyeye/CNNC/blob/master/data/bone_marrow_gene_pairs_200.txt . If your ground truth does not contain label 2, you should use the original one. You can find the detailed difference in the data-loading code.

For the leave-one-TF-out setting, we first did three-fold cross validation, where each fold may contain several TFs. For example, for the bone marrow cells, the three folds may contain 4, 4, and 5 TFs respectively. We then divided each fold by TF to approximate leave-one-TF-out evaluation. We discussed the details in the PNAS supplement. The new 3-fold code I uploaded is exactly what we used to generate Fig. 2.

Of course, the best way is to train on 12 TFs and test on the remaining one, repeated 13 times, but that would take too much time. If you want to, you can simply change the code from 3-fold to 13-fold, something like 'for test_indel in range(13): train_list = [i for i in whole_list if i != test_indel]'. It should be quite convenient. Hope it helps.
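Spelled out, the 13-fold leave-one-TF-out split enumeration would look like this (a sketch of the loop suggested above; `leave_one_tf_out_splits` is a name I'm introducing here, not a function in the repo):

```python
def leave_one_tf_out_splits(num_tfs=13):
    """Enumerate leave-one-TF-out splits: for each of the num_tfs TFs,
    hold that TF out for testing and train on all the others."""
    whole_list = list(range(num_tfs))
    splits = []
    for test_indel in range(num_tfs):
        train_list = [i for i in whole_list if i != test_indel]
        splits.append((train_list, test_indel))
    return splits
```

Each `(train_list, test_indel)` pair then drives one training run and one evaluation on the held-out TF.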

adyprat commented 4 years ago

Thank you for the clarification. I just have one more question about NEPDF generation. Specifically, line 134 here: https://github.com/xiaoyeye/CNNC/blob/9ce5172863fa77999ebfaefeac475b498bc44856/get_xy_label_data_cnn_combine_from_database.py#L134

What is the purpose of taking log10 twice? I understand the normalization step in line 130, but why take log10 of the histogram values again in line 134?

xiaoyeye commented 4 years ago

Hi, yes, we did take the log twice. We were dealing with single-cell data, which has heavy dropout, so the NEPDF would be concentrated at (0, 0) with a very high peak. We used one more log to mitigate it. For bulk data, or other data without dropout, it is not necessary.
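The idea can be illustrated roughly as follows. This is a simplified sketch of the double-log transform, not the exact code from get_xy_label_data_cnn_combine_from_database.py (the function name and the small offsets are my own choices to keep the logs finite):

```python
import numpy as np

def nepdf_double_log(x, y, bins=32):
    """Sketch of a double-log NEPDF: a 2-D histogram of one gene pair's
    expression, log-transformed twice so the dropout-induced spike at
    (0, 0) does not dominate the image fed to the CNN."""
    H, _, _ = np.histogram2d(x, y, bins=bins)
    H = H / H.sum()                  # normalize counts to a joint density
    H = np.log10(H + 1e-4)           # first log: compress the dynamic range
    H = np.log10(H - H.min() + 1.0)  # second log: further flatten the peak
    return H
```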

Ye Yuan, Postdoc, Machine Learning Department, Carnegie Mellon University


adyprat commented 4 years ago

Thanks for the clarification.