xiaoyeye / CNNC

convolutional neural network based coexpression analysis
MIT License
72 stars 23 forks

What's the meaning of "data separation index list" file? #14

Open JFF1594032292 opened 3 years ago

JFF1594032292 commented 3 years ago

Hello! I want to predict some TF-target gene pairs with CNNC. However, I'm confused about the purpose of the "data separation index list" file in sections 7.2 and 8.2. The annotation says this file is a number list that divides gene_pair_list into small parts. There are 92472 gene pairs in CNNC-master/data/mmukegg_new_new_unique_rand_labelx.txt, which is the gene pair list. However, the mmukegg_new_new_unique_rand_labelx_num_sy.txt file contains only 10 numbers, so I don't understand how this file splits the gene pair file. How should I create this "data separation index list" for my own gene pair list?

xiaoyeye commented 3 years ago

Hi, this file is used to separate the gene pair list. Let's regard a gene with an outgoing edge in KEGG as a TF, and a gene with an incoming edge as a target. The file then marks where the gene pairs of each TF begin and end in the gene pair list. "mmukegg_new_new_unique_rand_labelx_num.txt" contains all such numbers, and "mmukegg_new_new_unique_rand_labelx_num_sy.txt" contains only the top 10 entries of the former file, which can be used to quickly test whether the code and environment work. If you have your own TF-target list, you need to create your own index list. For example, suppose TF1, TF2 and TF3 have 10, 5 and 6 targets respectively, each with the same number of random non-targets, written as 'TF \t target \t 1' + 'TF \t nontarget \t 0'. Then the index list would be 0 20 30 42. If you also want direction prediction, one more sample, "Target \t TF \t 2", is added for each target, so the index list becomes 0 30 45 63. A good cross-validation scheme would be to use all gene pairs of TF1 and TF2 for training and validation, and those of TF3 for testing. Thanks.
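The arithmetic in the example above can be checked with a minimal Python sketch. It is not code from the CNNC repository: the helper names `build_pair_list` and `split_by_tf`, the gene names, and the ordering of lines inside each TF block are assumptions made for illustration. The only thing taken from the thread is that the index list holds the cumulative line count at each TF boundary, starting at 0.

```python
def build_pair_list(targets_by_tf, nontargets_by_tf, with_direction=False):
    """Build labelled gene-pair lines and the matching cumulative index list.

    Labels follow the thread: 1 = TF -> target, 0 = TF -> random non-target,
    2 = reversed pair (target -> TF), added only for direction prediction.
    The line order inside a TF block is an assumption.
    """
    lines = []
    index_list = [0]  # first block starts at line 0
    for tf, targets in targets_by_tf.items():
        for target in targets:
            lines.append(f"{tf}\t{target}\t1")
            if with_direction:
                lines.append(f"{target}\t{tf}\t2")
        for nontarget in nontargets_by_tf[tf]:
            lines.append(f"{tf}\t{nontarget}\t0")
        index_list.append(len(lines))  # end of this TF's block
    return lines, index_list


def split_by_tf(lines, index_list):
    """Recover one block of lines per TF from consecutive index pairs."""
    return [lines[start:end] for start, end in zip(index_list, index_list[1:])]


# Hypothetical gene names; the sizes 10, 5 and 6 come from the example above.
sizes = {"TF1": 10, "TF2": 5, "TF3": 6}
targets_by_tf = {tf: [f"{tf}_target{i}" for i in range(n)] for tf, n in sizes.items()}
nontargets_by_tf = {tf: [f"{tf}_random{i}" for i in range(n)] for tf, n in sizes.items()}

lines, index_list = build_pair_list(targets_by_tf, nontargets_by_tf)
print(index_list)  # [0, 20, 30, 42]

_, index_list_dir = build_pair_list(targets_by_tf, nontargets_by_tf, with_direction=True)
print(index_list_dir)  # [0, 30, 45, 63]

blocks = split_by_tf(lines, index_list)
print([len(block) for block in blocks])  # [20, 10, 12]
```

With blocks recovered this way, holding out one TF for testing amounts to keeping its block out of the training data, for example `blocks[:2]` for training and validation and `blocks[2]` for testing.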

JFF1594032292 commented 3 years ago

Thanks for your response! So each TF-target pair would have three lines (for direction prediction): "TF \t target \t 1" + "target \t TF \t 2" + "TF \t nontarget \t 0", and it seems that the "nontarget" line can be randomly selected from the negative set. The index list should be the counts of TF targets (x 3) for each TF in the gene pair list. Am I right? And if I want to predict some TF-target pairs whose labels I don't know, how do I generate the gene pair file? Should the gene pair file contain only the "TF \t target" lines, or use some other format?

xiaoyeye commented 3 years ago

Hi, for the first paragraph: correct. For the second paragraph, I am not clear what you mean by "some TF-target pairs which I don't know the label". Do you mean that, for the same TF, you know 100 targets and want to know whether each of the remaining thousands of genes is a target or not? If so, one possible way is to randomly select 100 genes from the remaining genes as negatives to train the model. We have tried this before and found that it also worked. An alternative is to train the model using other TFs and predict on the TF you focus on.
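A minimal sketch of the random-negative idea, assuming the gene names are available as plain Python lists. The function name `sample_negatives` and its parameters are hypothetical and not part of CNNC; by default it draws as many negatives as there are known targets, as in the 100-versus-100 example.

```python
import random


def sample_negatives(all_genes, tf, known_targets, n=None, seed=0):
    """Randomly pick genes that are neither the TF nor one of its known targets.

    By default the number of negatives equals the number of known targets.
    """
    known = set(known_targets)
    # Sort the candidates so that a fixed seed gives a reproducible sample.
    candidates = sorted(g for g in set(all_genes) if g not in known and g != tf)
    if n is None:
        n = len(known)
    if n > len(candidates):
        raise ValueError("not enough remaining genes to sample from")
    return random.Random(seed).sample(candidates, n)


# Hypothetical example: 1000 genes, one TF with 100 known targets.
all_genes = [f"gene{i}" for i in range(1000)]
tf = "gene0"
known_targets = all_genes[1:101]

negatives = sample_negatives(all_genes, tf, known_targets)
pair_lines = [f"{tf}\t{g}\t1" for g in known_targets]
pair_lines += [f"{tf}\t{g}\t0" for g in negatives]
print(len(negatives), len(pair_lines))  # 100 200
```

Some of the sampled "negatives" may be true targets that are simply not yet known, so the 0 labels produced this way are noisy.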

JFF1594032292 commented 3 years ago

Thanks! Now I understand the index list. For my second question, you understood it correctly. For the prediction step in section 7.3, I think "NEPDF_pathway" is the folder of gene pair histogram files produced in section 7.2, without label files, which means I don't know the labels of any gene pairs. So to get the NEPDF files in section 7.2, I provide a gene pair list with only "TF1 \t target1" + "TF1 \t target2" + ... + "TFn \t target m", and an index list with each TF's count on its own line. The program runs fine, but I don't know whether I did it right.

xiaoyeye commented 3 years ago

Hi, it is expected that the code runs. However, if you have your own expression data, you have to generate your own training data to train the model, because a supervised model assumes that all inputs are independent and identically distributed. In your case, the model is trained on my data but used to predict on your data, which breaks that assumption, if I understand correctly.

JFF1594032292 commented 3 years ago

Got it! Thank you very much!