xiaoyeye / CNNC

Convolutional neural network based co-expression analysis
MIT License
73 stars 23 forks

Unable to achieve the accuracy from the paper #12

Closed hangqi98 closed 4 years ago

hangqi98 commented 4 years ago

Hi there, I'm interested in your CNNC model and have read your PNAS paper, but I've run into some problems reproducing the paper's results.

I ran the demo following the readme exactly, but the results are not good.

At around epoch 20 the model starts to overfit: over the remaining 180 epochs the validation loss keeps increasing while validation accuracy decreases (the result graph looks just like the ones posted in other issues). The peak validation accuracy before overfitting is about 48-52% (across my three tries with learning rates 0.01, 0.003, and 0.001), which is far from what the paper describes.

It seems that something is wrong with the current code. Is the demo command in readme.md up to date? Does the training code match the demo? How did you split the test set off from the raw data and get the test accuracy? From the current predict_no_y.py it looks like it reads the training set to do the prediction job on a badly overfit model.

Would you please check the code or upload the latest version? Thanks very much! Looking forward to your reply and to your next paper on GCNG!

xiaoyeye commented 4 years ago

Hi, thanks for your interest. An accuracy of about 50% is reasonable. Please note: 1) It is a three-label classification with a random-guess baseline of 0.33 accuracy. 2) Figure 3 was generated by first calculating the ROC area for each outgoing gene (a gene with outgoing edges in the KEGG database), of which there are many, and then collecting all these ROCs as the final ROC. It is possible for such a collected ROC to be very high while accuracy is only around 50%. Imagine a binary classification: the first outgoing gene has predictions [0.1, 0.2] for its label-0 and label-1 samples, while the second has predictions [0.6, 0.7] for its label-0 and label-1 samples. The overall accuracy is 0.5, yet each gene's ROC is 1, right? 3) I believe you used the code that trains the model on the whole dataset. If you use "train_with_labels_three_foldx.py", you should be able to get much higher ROC results.
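The accuracy-versus-collected-ROC point can be checked numerically with the toy numbers above (a minimal sketch; scikit-learn is assumed available, and the gene names are hypothetical):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy numbers from the explanation above: two "outgoing genes",
# each with one label-0 and one label-1 test sample.
gene_scores = {"gene1": ([0, 1], [0.1, 0.2]),   # (labels, predicted prob of label 1)
               "gene2": ([0, 1], [0.6, 0.7])}

# Overall accuracy at a 0.5 threshold: every gene1 sample is called 0 and
# every gene2 sample is called 1, so only half the calls are correct.
labels = np.concatenate([v[0] for v in gene_scores.values()])
scores = np.concatenate([v[1] for v in gene_scores.values()])
accuracy = np.mean((scores > 0.5).astype(int) == labels)   # 0.5

# Per-gene ROC: within each gene, the label-1 sample outranks the label-0
# sample, so every per-gene AUC is perfect.
per_gene_auc = {g: roc_auc_score(y, s) for g, (y, s) in gene_scores.items()}
# accuracy is 0.5 while every per-gene AUC is 1.0
```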

Regarding the so-called overfitting where validation accuracy and loss both increase: we also observed this in three-fold cross-validation. In folds 1 and 2 it appeared after around 50 training epochs, while in fold 3 it did not appear until after hundreds of epochs, which suggests it is caused by the arbitrary data split. Even using folds 1 and 2 with 50 epochs and fold 3 with 200 epochs, we can still reach a very high collected ROC (0.9+).

I am not sure it is appropriate to call "both accuracy and loss increasing" overfitting, given that biological data is always very noisy; it can occur for legitimate reasons. Imagine a binary classification where the model corrects its probability for a label-1 sample, say from 0.49 to 0.51, while also becoming more wrong on a very similar label-0 sample, say from 0.51 to 0.53. In that case accuracy and loss both increase. Such a scenario may arise because some samples are mislabeled, or because the dataset is too difficult for the classifier to separate.
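The two-sample example above can be verified with a quick cross-entropy calculation (using the hypothetical probabilities from the text):

```python
import math

def bce(p, y):
    """Binary cross-entropy for one sample; p is the predicted prob of label 1."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

# Epoch t:   sample A (label 1) gets p=0.49 (wrong call),
#            sample B (label 0) gets p=0.51 (wrong call)
loss_t = bce(0.49, 1) + bce(0.51, 0)
acc_t = 0 / 2                       # both calls wrong at a 0.5 threshold

# Epoch t+1: A moves to p=0.51 (now correct), B drifts to p=0.53 (more wrong)
loss_t1 = bce(0.51, 1) + bce(0.53, 0)
acc_t1 = 1 / 2                      # A is now right, B still wrong

# Accuracy rises (0.0 -> 0.5) while total loss also rises slightly,
# because B's penalty grows faster than A's shrinks.
```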

Last, as mentioned in the paper, to avoid information leakage we filtered the gene-pair list so that no gene pair overlaps between the training and test sets. That is to say, at test time CNNC never sees any training gene pair at all.

Thanks also for your interest in my GCNG paper.

hangqi98 commented 4 years ago

Thanks for the fast reply!

I no longer have a problem with the test set; it was actually a misunderstanding. In readme step 8.5, the 'NEPDF_data/' argument of predict_no_y.py makes it easy for a first-time user to pass the training-set data path to the prediction job. Maybe you could change it to something like 'NEPDF_data_predict' and explicitly ask users to make a new folder and generate new NEPDFs for the gene pairs they want to predict.


Regarding the overfitting: what I observed is not "both accuracy and loss increase" but "train accuracy up, train loss down, while val accuracy down, val loss up", with a peak validation accuracy of about 50%.

Yesterday, believing it was a kind of overfitting, I experimented with the model structure and unexpectedly found that by removing the last two conv layers and decreasing all dropout rates to 0.2 (I think 0.5 is too high and may cause information/feature loss), the model reaches 62% accuracy on both the train and validation sets after 30 epochs. I then ran it for 300 epochs and found the validation accuracy reaches 72% and is still slowly increasing, while the train accuracy is 79%, which is quite a good result for a three-label classification task on such implicit biological data.

For this try I used my own generated data. I retrieved about 39,000 activation edges and 11,000 inhibition edges from KEGG using the R package 'graphite' you mentioned in your paper. Then I randomly chose 17,000 gene pairs to generate the irrelevant edges; for these, I made sure that genes a and b of each pair never co-occur in any KEGG pathway, to reduce the chance of an undetected relationship. I labeled 'irrelevant' as 0, activation as 1, and inhibition as 2. With this gene-pair list I built the NEPDFs to feed the model and got a satisfactory accuracy.
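The negative-pair construction described above might look roughly like this (all names are hypothetical; in practice the pathway gene sets would come from the graphite exports):

```python
import random

def sample_irrelevant_pairs(pathways, n_pairs, seed=42):
    """Sample gene pairs whose two genes never co-occur in any pathway.
    pathways: list of sets of gene symbols (hypothetical input format)."""
    rng = random.Random(seed)
    genes = sorted(set().union(*pathways))
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(genes, 2)
        if any(a in p and b in p for p in pathways):
            continue  # a and b share a pathway, so a relationship may exist
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

toy_pathways = [{"A", "B", "C"}, {"C", "D"}]
negatives = sample_irrelevant_pairs(toy_pathways, 2)
# Only ("A", "D") and ("B", "D") qualify: every other pair
# co-occurs in one of the toy pathways.
```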


As for the ROC, I just don't understand how the final ROC is computed. How do you "collect all these ROCs as the final ROC"? I've never heard of calculating an ROC like that. Could you please explain it in more detail, or give me a tutorial link to learn from? Thanks very much!

xiaoyeye commented 4 years ago

Hi, you are welcome. I will think about how to make it clearer; thanks for your suggestions. It is great to hear that changing the model structure improved performance and that you got satisfactory accuracy with your own data.

This ROC is inspired by the TF-target task and was then applied to the KEGG prediction. Of course you can use any evaluation strategy you prefer. We actually discussed it in the SI; I copy it here:

For the TF-binding prediction task, we used binary classification. Take mESC as an example: mESC has 38 TFs. For each TF a and each of its targets b, the image for (a, b) was generated and fed to CNNC as input; a negative image (a, R) was also generated for TF a and a randomly selected gene R from the non-target gene set, to balance the dataset. In three-fold CV, the 38 TFs were divided into three roughly equal folds (13, 13, 12): two folds have 13 TFs and one has 12. CNNC was trained on any two of the folds and tested on the remaining one. Each test fold may contain several or dozens of TFs, so we split each fold's test results to get a score for every TF, and concatenated all TF results across the three folds as the final result.

For the KEGG pathway prediction, we used a three-category classification, with the same approach to balancing and splitting the dataset. For each known gene pair (a, b) with label 1, where a regulates b, we generated (b, a) with label 2 and a randomly selected gene pair (r, s) with label 0 from the KEGG gene set. In total the KEGG dataset has 3,057 genes with outgoing edges (gene a, the regulator), which were also divided into three roughly equal folds, and the same three-fold CV was adopted. This separation guarantees two things: a) there is no 'gene pair + label' overlap between the training and test datasets, which is the key to avoiding information leakage; b) in addition, the training and test sets are strictly separated by the regulator list, so that the training set has 2/3 of the regulators and the test set the remaining 1/3, with no overlap between the training and test regulator sets. This way, we can do a detailed evaluation for each regulator, just as for each TF. Each test fold contains around 1,000 regulators, so we split each fold's results to get a score for every regulator, and concatenated all regulators' AUCs across the three folds as the final result.
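The regulator-wise fold split described above can be sketched as follows (an illustrative helper with made-up names, not the repo's code):

```python
def regulator_folds(regulators, n_folds=3):
    """Split the regulator list into roughly equal folds. The model is
    trained on two folds' gene pairs and tested on the held-out fold,
    so the train and test regulator sets never overlap. (Sketch only.)"""
    return [regulators[i::n_folds] for i in range(n_folds)]

# 38 mESC TFs split as (13, 13, 12), as in the SI text
tfs = [f"TF{i}" for i in range(38)]
folds = regulator_folds(tfs)
sizes = [len(f) for f in folds]

# The final score is then built by computing one AUC per regulator in
# each test fold and concatenating those AUCs across the three folds.
```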

hangqi98 commented 4 years ago

Sorry for the late reply.

I did read the supporting info of your paper previously, but now I realize I didn't actually understand all of it, due to my limited English and lack of machine-learning knowledge. I'm still very new to machine learning and need more time to consolidate the fundamentals before I can discuss this properly with you. But I really love the CNNC model; the way it builds a histogram from two genes' expression values to feed the CNN is impressive. I really appreciate you teaching me so much!

xiaoyeye commented 4 years ago

Your English is very good. Thanks very much for your interest. You are welcome. Best