Data source of rank_total_gene_rpkm.h5

VicChen1998 commented 4 years ago

Hi!

I'm interested in CNNC; I have read your article and run the example script.

I have a few questions.

First, I noticed that your article says the scRNA-seq dataset comes from [13] A web server for comparative analysis of single-cell RNA-seq data. Nat. Commun. 9, 4768 (2018).

From that reference I visited their website, https://scquery.cs.cmu.edu/, and downloaded the dataset https://s3.us-east-2.amazonaws.com/sc-query/processed_data/expr_data.hdf5

But this dataset does not look the same as the provided rank_total_gene_rpkm.h5. Did I get something wrong, or did you reprocess scQuery's dataset? Also, could you please specify the source or the reprocessing procedure for the other provided datasets? Many thanks!
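One quick way to narrow down how two expression files differ is to compare their gene identifiers before comparing values. The sketch below is only an illustration: it assumes the gene names have already been read from each file into plain Python lists (how to do that depends on each file's internal layout, which is not shown here), it assumes a case-insensitive match is appropriate, and the function name `compare_gene_lists` and the toy gene lists are made up for the example.

```python
def compare_gene_lists(genes_a, genes_b):
    """Summarize how two collections of gene names overlap.

    Assumption: names are matched case-insensitively after stripping
    whitespace. Adjust this if the files use different ID systems.
    """
    set_a = {g.strip().lower() for g in genes_a}
    set_b = {g.strip().lower() for g in genes_b}
    return {
        "shared": len(set_a & set_b),
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
    }


# Toy, made-up gene lists standing in for names read from the two files.
downloaded_genes = ["Actb", "Gapdh", "Sox2", "Pou5f1"]
provided_genes = ["actb", "gapdh", "sox2", "Nanog"]

report = compare_gene_lists(downloaded_genes, provided_genes)
print(report)
```

If the gene sets match but the values differ, the difference is more likely to come from normalization or from the set of cells included than from the gene annotation.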

Second, I ran the training script following readme.md and ran into some problems.

1) Training took much longer than I expected. I remember the article's supplementary table says it needs about 6 hours to run the KEGG whole expression data on a 1080 Ti, while I ran it on a Tesla K40m (12 GB memory) and it took about 30 hours to complete the 200 epochs. During the run I checked GPU and memory usage: the GPU was running at full capacity (~98%), but the process used only ~120 MB of GPU memory and ~3 GB of memory, while occupying over 100 GB of virtual memory. (Does that mean most of the data is stored on the HDD, and is that what makes it so slow?)

2) After 200 epochs, here is the result:

In end_result.pdf, the train acc starts from 0.4 and reaches 0.65 after 20 epochs, but the val acc stays stuck below 0.5, and the val_loss even keeps increasing over the 200 epochs while the train loss is decreasing.

Could you please help me figure out what I did wrong? All the data is what readme.md provides, and my commands were:

python3 get_xy_label_data_cnn_combine_from_database.py \
    bulk_gene_list.txt \
    sc_gene_list.txt \
    mmukegg_new_new_unique_rand_labelx.txt \
    mmukegg_new_new_unique_rand_labelx_num.txt \
    mouse_bulk.h5 \
    rank_total_gene_rpkm.h5 \
    1

python3 train_with_labels_wholedatax.py \
    3057 \
    NEPDF_data/ \
    3

I'm very new to machine learning, so please forgive me if I made a stupid mistake or asked a stupid question :)

xiaoyeye commented 4 years ago

Hi, thanks for your interest.

1) For the dataset, there may be some differences: a) they may have added more cells, and b) the data I provide is normalized. But the number and names of the genes should be the same.

2) Because the whole dataset is about 3 GB, ~120 MB of GPU memory and ~3 GB of memory should be reasonable, and each batch may occupy 120 MB of memory. You can set a larger batch size to speed up training. It is surprising that 100 GB of virtual memory is occupied.

3) I believe such curves are due to the hyperparameter settings: the learning rate is too large. Of course, finding suitable settings is a really hard problem. I just tried the code with a learning rate of 0.0001 and a validation split of 0.5, and found that after 200 epochs the val loss reaches 0.97+ and the val accuracy is 0.48+. With a learning rate of 0.001 and a validation split of 0.2, after around 10 epochs the val loss reaches 0.97+ and the val accuracy reaches 0.49+.
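To make the effect of the batch size and validation split settings concrete, here is a small sketch of the arithmetic involved. It is not taken from the CNNC code: the helper name `split_and_steps` and the sample count are hypothetical, and the split rule (the training part is the first `int(n * (1 - validation_split))` samples) is the convention Keras uses for its `validation_split` argument, assumed here to apply to the training script.

```python
import math


def split_and_steps(n_samples, validation_split, batch_size):
    """Return (n_train, n_val, steps_per_epoch) for a fractional split.

    Assumption: Keras-style split, where the training set is the first
    int(n_samples * (1 - validation_split)) samples.
    """
    n_train = int(n_samples * (1.0 - validation_split))
    n_val = n_samples - n_train
    # Each epoch needs one weight update per batch of training samples.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, steps_per_epoch


n_samples = 10_000  # hypothetical number of gene-pair samples

for validation_split, batch_size in [(0.2, 32), (0.5, 32), (0.2, 256)]:
    n_train, n_val, steps = split_and_steps(n_samples, validation_split, batch_size)
    print(
        f"split={validation_split}, batch={batch_size}: "
        f"{n_train} train, {n_val} val, {steps} steps per epoch"
    )
```

A larger batch size means fewer steps per epoch, which is why it can shorten training time, although it usually also calls for re-tuning the learning rate.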

VicChen1998 commented 4 years ago

Thanks for the reply! I'm going to check this out.

VicChen1998 commented 4 years ago

Hello! I would like to cite both your CNNC and GCNG papers in my bachelor's degree thesis. For GCNG, though, I don't know whether it's appropriate to cite a preprint. If it's OK, what format should I use to cite it? Thanks!

xiaoyeye commented 4 years ago

Hi, you can search for the GCNG paper on Google Scholar and then click the double-quote icon to get the citation. Hope this helps. Best,

Ye Yuan, Postdoc

Machine Learning Department

Carnegie Mellon University

Tel:(+1)510213-3332

VicChen1998 commented 4 years ago

I got it. Thank you! Love your CNNC model! Best!

xiaoyeye / CNNC

Data source of rank_total_gene_rpkm.h5 #10