xiaoyeye / CNNC

convolutional neural network based coexpression analysis
MIT License

Error while training a new model #3

Closed adyprat closed 4 years ago

adyprat commented 4 years ago

Hi, I'm trying to follow your documentation on predicting edge scores on your example data. I ran into the following issues:

I get the following error when I run the command KERAS_BACKEND=theano python predict_no_y.py 9 NEPDF_data/ trained_models/KEGG_keras_cnn_trained_model_shallow.h5

The output is:

Using Theano backend.
select <class 'type'>
(309, 64, 32, 1) x_test samples
Traceback (most recent call last):
  File "predict_no_y.py", line 81, in <module>
    model.load_weights(model_path)
  File "anaconda3/envs/cnnc/lib/python3.7/site-packages/keras/engine/saving.py", line 458, in load_wrapper
    return load_function(*args, **kwargs)
  File "anaconda3/envs/cnnc/lib/python3.7/site-packages/keras/engine/network.py", line 1208, in load_weights
    with h5py.File(filepath, mode='r') as f:
  File "anaconda3/envs/cnnc/lib/python3.7/site-packages/h5py/_hl/files.py", line 394, in __init__
    swmr=swmr)
  File "anaconda3/envs/cnnc/lib/python3.7/site-packages/h5py/_hl/files.py", line 170, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 8650752, sblock->base_addr = 0, stored_eof = 8661656)

I'm using Linux and installed the packages using Anaconda. The .h5 files were obtained by extracting the .rar file containing the models. Could you provide the original .h5 files without .rar compression?
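For what it's worth, here is a minimal sketch of how one could check whether an extracted .h5 model file is complete before Keras tries to load it (the path below is just a placeholder for wherever the archive was extracted):

import os
import h5py

model_path = "trained_models/KEGG_keras_cnn_trained_model_shallow.h5"  # placeholder path

print("size on disk:", os.path.getsize(model_path), "bytes")
try:
    # h5py refuses to open truncated HDF5 files, so this reproduces the error early
    with h5py.File(model_path, "r") as f:
        print("opened OK, top-level keys:", list(f.keys()))
except OSError as err:
    # a "truncated file" message here means the download or .rar extraction was incomplete
    print("file looks corrupted or truncated:", err)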

So I tried training my own model using KERAS_BACKEND=theano python train_new_model/train_with_labels_wholedatax.py 9 NEPDF_data/ 3. Then I got another error:

Using Theano backend.
0 12
1 12
2 144
3 48
4 15
5 18
6 3
7 12
8 45
Traceback (most recent call last):
  File "train_new_model/train_with_labels_wholedatax.py", line 63, in <module>
    (x_train, y_train, count_set_train) = load_data_TF2(whole_data_TF, data_path)
  File "train_new_model/train_with_labels_wholedatax.py", line 51, in load_data_TF2
    yydata_x = yydata_array.astype('int')
ValueError: invalid literal for int() with base 10: 'olfr1136\tgnal'

I'm not sure how to fix this.
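Here is a minimal diagnostic sketch (the file naming pattern is my assumption, not CNNC's documented layout) that scans the generated label arrays in NEPDF_data and reports any that cannot be cast to int, which is what the ValueError above complains about:

import glob
import numpy as np

# Assumed naming pattern for the generated label files; adjust to the actual output names.
for path in sorted(glob.glob("NEPDF_data/ydata*.npy")):
    labels = np.load(path, allow_pickle=True)
    try:
        labels.astype('int')
        print(path, "labels look numeric")
    except ValueError as err:
        print(path, "contains non-numeric labels:", err)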

Also, the usage for the scripts get_xy_label_data_cnn_combine_from_database.py and predict_no_y.py differs between the documentation and the actual scripts. Which is the correct command?

I'd appreciate it if you could help me with this or update your documentation accordingly.

Thank you, -Aditya

xiaoyeye commented 4 years ago

Hi, thanks very much for your interest and attention. I am very sorry that there were some bugs and unclear parts. I have now corrected all potential bugs and made the README clearer. Please download the latest version.

Below are the command lines I use right now. All four work now. Please read the README carefully and make sure that all files have correct paths, since users may have different computing environments.

######## generate train and test data

@compute-0-16 CNNC-master]$ python get_xy_label_data_cnn_combine_from_database.py /home/yey/CNNC-master/data/bulk_gene_list.txt /home/yey/CNNC-master/data/sc_gene_list.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx_num_sy.txt /home/yey/sc_process_1/new_bulk_mouse/prs_calculation/mouse_bulk.h5 /home/yey/sc_process_1/rank_total_gene_rpkm.h5 1
(Pay attention to the flag setting: if you want to use the data to train new models, you should set it to 1.)

######## make prediction with trained model
@compute-0-16 CNNC-master]$ python predict_no_y.py 9 /home/yey/CNNC-master/NEPDF_data 3 /home/yey/CNNC-master/trained_models/KEGG_keras_cnn_trained_model_shallow.h5

################ 3-fold cross validation training
@compute-0-16 train_new_model]$ python train_with_labels_three_foldx.py 9 /home/yey/CNNC-master/NEPDF_data 3 > results.txt

############### train model using all data
@compute-0-16 train_new_model]$ python train_with_labels_wholedatax.py 9 /home/yey/CNNC-master/NEPDF_data 3 > results_whole.txt

############# prediction using the model trained on all data
@compute-0-16 CNNC-master]$ python predict_no_y.py 9 /home/yey/CNNC-master/NEPDF_data 3 /home/yey3/pnas_github/CNNC-master/train_new_model/xwhole_saved_models_T_32-32-64-64-128-128-512_e200/keras_cnn_trained_model_shallow.h5

If you have any problems, please do not hesitate to contact me.


Ye Yuan, Postdoc

Machine Learning Department

Carnegie Mellon University


xiaoyeye commented 4 years ago

One more reminder: all the command lines above are just for the demo. If you want to run on the whole real data set, please replace "mmukegg_new_new_unique_rand_labelx_num_sy.txt" with "mmukegg_new_new_unique_rand_labelx_num.txt", and replace all "9" with "3057", which is the real number of separations.
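For illustration, with those two substitutions and the same placeholder paths as above, the data generation and whole-data training commands would look like:

@compute-0-16 CNNC-master]$ python get_xy_label_data_cnn_combine_from_database.py /home/yey/CNNC-master/data/bulk_gene_list.txt /home/yey/CNNC-master/data/sc_gene_list.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx_num.txt /home/yey/sc_process_1/new_bulk_mouse/prs_calculation/mouse_bulk.h5 /home/yey/sc_process_1/rank_total_gene_rpkm.h5 1

@compute-0-16 train_new_model]$ python train_with_labels_wholedatax.py 3057 /home/yey/CNNC-master/NEPDF_data 3 > results_whole.txt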

adyprat commented 4 years ago

@xiaoyeye Thank you for looking into this.

I have another question about line 128 in get_xy_label_data_cnn_combine_from_database.py

128: HT_bulk = (log10(H_bulk / 43261 + 10 ** -4) + 4)/4

I understand that in line 134 you divide by 43261 since that is the number of cells in the single-cell data, but why divide the bulk data by 43261 instead of 249?

Also, could you explain the logic behind the "+ 10 ** -4) + 4)/4" part in that line, rather than just using log10(H_bulk / 43261)?

Thanks, Aditya

xiaoyeye commented 4 years ago

Hi, You are welcome.

Yes, you can also choose 249. Actually, for the follow-up projects I am working on, I just use the sample size for normalization. In my experience, 249 and 43261 both work well.

The logic behind the "+ 10 ** -4) + 4)/4" part is that, after normalization by 43261, the possible range of each histogram entry is approximately [0, 1]. Single-cell RNA-seq data always suffer from dropout, which leads to a very high density at zero that dominates the whole matrix, so we have to apply a log transformation to each entry to address this.

Note that the second smallest value is 1/43261 ≈ 10^-4. For convenience, we chose 10^-4 as a pseudocount so that it does not change the histogram too much, especially for those high entries which are larger than 2. After the log with the pseudocount, the entry range is around [-4, 0]. Finally, ([-4, 0] + 4)/4 gives a range of approximately [0, 1], which is the range fed into CNNC.
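As a small illustration of that transformation (just a NumPy restatement of line 128, with 43261 as the single-cell sample count; 249 would play the same role for bulk):

import numpy as np

def normalize_hist(H, n_samples=43261, pseudo=1e-4):
    # H / n_samples lies roughly in [0, 1]; the pseudocount keeps log10 finite for
    # zero entries, log10 maps the values to about [-4, 0], and (+ 4) / 4 rescales
    # them to about [0, 1] before they are fed into CNNC.
    return (np.log10(H / n_samples + pseudo) + 4) / 4

H = np.array([[0.0, 1.0, 500.0],
              [43261.0, 10.0, 2.0]])
print(normalize_hist(H))  # all entries fall in roughly [0, 1]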

Of course, you can use whatever normalization strategy you want to get such a [0, 1] range.

Best



xiaoyeye commented 4 years ago

Also, the [0, 1] range is perhaps not strictly necessary; as long as all training and test data are normalized uniformly, the network should work well.

adyprat commented 4 years ago

Sure, I'll go ahead and use the sample size for normalization. I was just curious because, in order to have a range of (0, 1], I thought we should divide by 249 for the bulk data: dividing the bulk data's histogram values by 43261 gives a range of at most (0, 249/43261], which is approximately (0, 0.006]. Thank you for taking the time to address my comments.

-Aditya