xiaoyeye / CNNC

covolutional neural network based coexpression analysis
MIT License
72 stars 23 forks source link

clarification on getting started with CNN for bulk data and example gene-gene interactions #5

Closed amjass12 closed 4 years ago

amjass12 commented 4 years ago

Hello,

I read your recent paper in PNAS with great interest and am really excited by the elegant way in which it is able to detect gene interaction networks. I would like to apply this to my data and was wondering if you could provide some clarification on a few questions I have, thanks!

I have built a standard classification NN with keras, that is aimed at understanding bulk RNA-seq data. The samples all consist of information such as sex, age and organ. so if I have an expression matrix of samples as cols and genes as rows:

sample 1 will have its own organ, sex and age label etc... This is therefore a multi-class multi label classification task (binary_crossentropy with sigmoid activation at the final layer)....

I read your paper with great interest:: I would like to combine this network with the CNN you have created so that I could identify gene-gene interactions between different sexes, different ages and organs etc.... basically, for the different classes...

One thing that is unclear to me from the source code, is where you obtain your gene-gene interaction data, and where this is fed in to the network as the reference interactions data? are y_train labels still the classes in my data such as organ age etc...

Is it possible for this network to be joined in to the sequential model at some concatenated layer between two networks?

sorry for the vague question, but I am just trying to understand the potential application here..

thanks!!

xiaoyeye commented 4 years ago

Hi,

Thanks for your interest. For the gene interaction data, It depends on the specific task you want and we extracted it from other datasets. For example, it is from Chip-seq for TF-target prediction, from KEGG and Reactome database when we are trying to predict all gene interactions based on a large comprehensive single cell dataset and from GSEA gene set when we are trying to predicton gene functions.

Of course it can merge two or more networks. You can use Keras functional API to implement it easily.

I am not sure if I have answered your question completely. Please let me know if you have more questions.

amjass12 commented 4 years ago

Hi @xiaoyeye

thank you so much for the really fast response!!

this indeed answer my questions, but i am also a little confused about getting started at training the model (specifically the input data is what is a little more confusing to me).... I am unsure about where the interaction data is entered in to the model when being trained... so if i take the example of the CNN in the 'train_new model' folder:: I am unsure of where my expression information comes in relative to the gene-gene interaction information....

so if the model is:

model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))

///

history = model.fit(x_train, y_train... etc

upstream of of model=Sequential()....

I have X_train data (this is my expression data, sample as rows and features as columns) this works for the traditional sequential model.

I am guessing that the y_train data are the gene interactions as per the gene_pair_list interactions? so 0. interacting, 1 not interacting?

How is this implemented upstream of the model=Sequential as input to the CNN along with my expression data... this is something less clear to me even with the pipeline in the 'train_new_model' folder. is this present in this line of code?

(x_train, y_train,count_set_train) = load_data_TF2(train_TF,data_path) (x_test, y_test,count_set) = load_data_TF2(test_TF,data_path) print(x_train.shape, 'x_train samples') print(x_test.shape, 'x_test samples') save_dir = os.path.join(os.getcwd(),str(test_indel)+'YYYY_saved_models_T_32-32-64-64-128-128-512_e'+str(epochs)) ## the result folder if num_classes >2: y_train = keras.utils.to_categorical(y_train, num_classes) y_test = keras.utils.to_categorical(y_test, num_classes) print(y_train.shape, 'y_train samples') print(y_test.shape, 'y_test samples')

is count_set_train the the gene pair interactions from a database such as KEGG. X_train the count data? and y_train the classes (interacting not interacting)..

is it possible to provide an example of where each file is for each component of the code upstream of the CNN model being defined?

thanks again for your assistance!

xiaoyeye commented 4 years ago

Hi, Sorry for the confusion. The basic idea is that, for each gene pair, after training, based on the two's expression, CNNC can determine if the two genes follow any relationship, like interaction, TF_target, in the same function and disease. As is shown in section 8 in readme, users first need to provide a gene pair list to define the ground truth, like A\tB\t0 A\tC\t1 D\tE\t0 D\tF\t1 Then, for each gene pair, we can generate NEPDF (it is a 2D histgram for the gene pair expression) with groundtruth list for each gene pair using script 'get_xy_label_data_cnn_combine_from_database.py'. By this way, we generate x_data, y_data, that can be separated as x_train, x_test, y_train y_test.

Then script' train_with_labels_three_foldx.py' is used to train the model.

You can find the details in the readme part.

xiaoyeye commented 4 years ago

let me know if you can not follow section 8.

amjass12 commented 4 years ago

Hi @xiaoyeye

thank you so much! i will work on this properly tomorrow and i will then message you again here if that is ok. even though these are command line instructions i will generate a script in an ide. I will use the kegg gene pair list provided just to generate the pipeline...

I will be in touch tomorrow. thank you so much for your rapid assistance once again!!

amjass12 commented 4 years ago

Hi @xiaoyeye

I am very close to completing the NEPDF matrix but have a couple of questions (and some errors i am running in to)::

I have modified the source code to generate the data directly as follows:

` import pandas as pd from numpy import * import json, re,os, sys

save_dir = os.path.join(os.getcwd(),'NEPDF_data') if not os.path.isdir(save_dir): os.makedirs(save_dir)

def get_gene_list_bulk(file_name): import re h={} s = open(file_name,'r') #gene symbol ID list of bulk RNA-seq for line in s: search_result = re.search(r'^([^\s]+)\s+([^\s]+)',line) h[search_result.group(1).upper()]=search_result.group(2) # h [gene ID] = gene symbol s.close() return h

def get_sepration_index (file_name): import numpy as np index_list = [] s = open(file_name, 'r') for line in s: index_list.append(int(line)) return (np.array(index_list))

h_gene_list_bulk =get_gene_list_bulk('CNNC/bulk_gene_listENSEMBL.txt') #'bulk_gene_list.txt')

########## generate NEPDF matrix gene_pair_label = [] s=open('CNNC/mmukegg_new_new_unique_rand_labelx.txt')#'mmukegg_new_new_unique_rand_labelx.txt')#) ### read the gene pair and label file for line in s:

print(line)

gene_pair_label.append(line)

gene_pair_index = get_sepration_index('CNNC/mmukegg_new_new_unique_rand_labelx_num_sy.txt')#'mmukegg_new_new_unique_rand_labelx_num.npy')#sys.argv[6]) # read file speration index s.close() `

This works fine... please note that, CNNC/bulk_gene_listENSEMBL.txt is the the txt file but inverted so that ensembl ID's come first and gene symbols second. th reason for this is the index of my counts amtrix is Ensembl IDs and not gene names (I believe this is the right way of changing the file, to just swap): it looks like:

ENSMUSG00000006403': 'adamts4', 'ENSMUSG00000006411': 'nectin4', 'ENSMUSG00000006412': 'pfdn2', 'ENSMUSG00000006418': 'rnf114',

etc....

I am running in to an error generating the x_tf_bulk data. My count matrix is not reading correctly the ENSEMBL IDs from h_gene_list_bulk. Also, i am unsure about which file format sys.argv[7] is... is there an example, can this be ignored?

for gene_pair in gene_pair_label_array[start_index:end_index]: ## each speration separation = gene_pair.split() if sys.argv[7] == '1': x_gene_name,y_gene_name,label = separation[0],separation[1],separation[2] y.append(label) else: x_gene_name, y_gene_name = separation[0], separation[1] z.append(x_gene_name+'\t'+y_gene_name) can the flag label be ignored?

secondly.. I am runing in to trouble (and also have some questions about the final bit of code)

` gene_pair_label_array = array(gene_pair_label) for i in range(len(gene_pair_index)-1): #### many separations print (i) start_index = gene_pair_index[i] end_index = gene_pair_index[i+1] x = [] y = [] z = [] for gene_pair in gene_pair_label_array[start_index:end_index]: ## each speration separation = gene_pair.split() x_gene_name,y_gene_name,label = separation[0],separation[1],separation[2] y.append(label)

    x_tf_bulk = normCounts[h_gene_list_bulk[x_gene_name]][0:10] + 10 ** -2)  ## 249 means the number of samples, users can just remove '[0:249]'
        x_gene_bulk = normCountsscale1[h_gene_list_bulk[y_gene_name]][0:249] + 10 ** -2)
        H_T_bulk = histogram2d(x_tf_bulk, x_gene_bulk, bins=32)
        H_bulk= H_T_bulk[0].T
        HT_bulk = (log10(H_bulk / 43261 + 10 ** -4) + 4)/4

        x.append(HT_bulk)

`

x_tf_bulk:: I am getting a key error: KeyError: 'gm20878'

or, if i run the code:: just to see how extraction is working::

x_tf_bulk = normCounts[h_gene_list_bulk][0:10] + 10 ** -2

I get an error, None of index:Index(['ENSMUSG00000000001', 'ENSM.... are in the [columns]

This is not true though as these gene symbols are a column in my count matrix.

and I had a couple of questions regarding that line of code: I see in the source code you log transform the data, I have left this out as the data are already VST transformed, but I have also scaled them for training. would you recommend leaving vst (basically log2) transformed counts or is it ok to keep the data scaled ot between 0 and 1 here?

My last question is the purpose of +10**-2 operation?

thanks so much!

xiaoyeye commented 4 years ago

Hi, There are really a lot of things to impact the result, and it would be very hard for me to debug since I do not have you data.

I wonder if you can just download the data provided by the project. Actually I have tested all of them several times. It should be the easiest way for both you and me. Once you completed the tutorial, you can compare all detailed difference between the provided data and your own data. including the file format, gene name, column row meanings and so on. I would be happy to discuss with you if you have any problem when using the provided data.

xiaoyeye commented 4 years ago

If the problem is that the single cell data is too large, you can just use the bulk data, and set the sc_gene_list and sc_expression files as "None". Let me know if you can do it. Best Ye

amjass12 commented 4 years ago

thank you so much! yes i have just done this with the single_cell (rank_total_gene_rpkm_SC.h5) file and it works perfectly. I believe the issue is with my h5 file! i will look into this further to see how i can solve this problem.

So to your last point, you recommend the input as being non-log transformed normalised counts because of the log function in the script?

xiaoyeye commented 4 years ago

Yes, Because the script does the log. Cheers!

amjass12 commented 4 years ago

great thank you!!

unfortunately, the problem is persisting.... I am getting a keyError when running the script linked to this line of code:

x_tf_bulk = log10(rpkm_bulk[h_gene_list_bulk[x_gene_name]]

I have saved my count matrix as follows (to h5 format) and as far as I am aware, this should be the same as the rank_rpkmSC.h5 file:

normCounts.to_hdf('bulkCounts.h5', key='rpkm', mode='w') The index is ensembl IDs and columns are samples with expression values (each row= one gene).. please note I have also tried by making the gene index columns also. still the same error..

thanks!

xiaoyeye commented 4 years ago

Hi, Is it solved yet?

amjass12 commented 4 years ago

Hi @xiaoyeye !

thank yo for following up!

Yes, i did solve and i managed to run the 3x train on the command line without issues... for the NEPDF matrix creating... there were 2 issue..

1.) the count matrix was not transposed! it was format, row genes, cols samples: so a simple transpose fixed this...

and 2.) one more error which was: KeyError, gene 'x'

was because it was not present in my count matrix! i simply removed these from the gene pair list. this reduced ~94k interactions from the MMUK_new unique rand.. etc to about ~89k interactions.

I will now run this network properly, understand the data, interpretation etc and then incorporate this as a merged model in to my sequential model. I am very interested in how gene-gene interactions change according to the classes in my data (age organ etc).

Please may i ask, do I have full permission to use this network for my own work? (on my own data of course)...

Thank you so much once again for all your support!! I may still come back here to ask another question haha!!

xiaoyeye commented 4 years ago

Sure. All data here is public, and please cite the corresponding papers if possible.

amjass12 commented 4 years ago

great!! many thanks, yes all work will be cited of course :)

amjass12 commented 4 years ago

Hi @xiaoyeye ,

Thank you again for all your help above! I have trained the bulk_data from your example based on the KEGG gene-gene interactions. Here is a couple of the predicted results and was wondering if you would be able to help me in some interpretation and a question: I merged the gene-gene interaction names with the final predictions... this generates the following:

gene_pair1 gene_pair2 interaction 0 1 2
olfr1491 lamc3 0 0.9973838 0.00196537 0.00065081
il2 il2ra 1 0.17841083 0.67426646 0.14732274

Is the following intepretation correct: label 0 is no interaction (and therefore no interaction between olfr1491 and lamc3)

label 1 means gene 1 regulates gene 2 (so il2 regulates il2ra) and vice-versa?

the second question i have is the most important... based on these known gene-gene interactions -- is it possible to infer gene-relationships between genes that are NOT on the gene pairs list? If i read the manuscript correctly this is possible but something I am confused about. How would i start to infer gene relationships between genes in my gene expression matrix that are not part of the gene pair lists?

thanks!!

xiaoyeye commented 4 years ago

Hi, Sorry for the dealy. I do not know why I did not receive any reminder. The 1st question: you are correct. The 2nd question: I think you can predict totally unknown gene pairs, just generate the NEPDF for any gene pair, then use the trained model to predict it. Actually we have tested such strategy and showed the result in supplement Fig. 8G. As can be seen, the performance is somehow lower, but still good.

amjass12 commented 4 years ago

Hi @xiaoyeye

no worries, please dont worry about the delays, i really appreciate that you are helping!

thank you for confirming question 1.

I need to look again at fig8g in the paper however, what i mean is.. not for known gene pairs that are not in the gene pair list at the beginning, i mean can it predict gene interactions in my count matrix if those genes are not in the original gene pair list and i also DONT know if they interact in real life? does this make sense?

thank you!

xiaoyeye commented 4 years ago

I am also confused. I think your question is that (A, B) are not in the training gene pair list, and you want to know if the two are interacting, right? That is exactly what the model did. The model first learned NEPDF patterns for the gene interactions in the training set, and then it should be able to predict new gene pairs based on their NEPDF pattern in the test set (the trained model never see 'the same gene pair' in the test, here we want to evaluate the performance, so we use known gene pair as test, of course you can let the trained model predict any gene pair), even though the model never saw 'the two individual genes' before (this result is in Fig. S8G).

amjass12 commented 4 years ago

Hi @xiaoyeye ,,

thank you! yes i think is what i mean... but I am still unsure if this is what i am after. The test set can give a score about whether 2 genes are interacting from the test set, but this is based on the known 94k KEGG gene-pairs that the model was trained on.

Can the model predict novel interactions and if so, how do i identify them (this is what i am stuck with)?

so for example: in my post above:

gene_pair1 gene_pair2 interaction 0 1 2
olfr1491 lamc3 0 0.9973838 0.00196537 0.00065081

This is an evaluation of the test dataset.. that shows that olfr1491 does not interact with lamc3.

but, olfr1491 and lamc3 are in the 'mmukegg_new_new_unique_rand_labelx.txt' file, which is the database of known gene-pairs that interact.

In my expression data, olf1491 and lamc3 are of course present, so their interaction can be predicted (trained) and then model.evaluate on the test set.

but in my expression data, i also have genes that are NOT in the 'mmukegg_new_new_unique_rand_labelx.txt' file.

so for example, in my dataset, i have genes called sparc and opn. these two genes are not in the mmukegg file.

but if they are interacting (or potentially interacting) is it possible to identify that these two genes could be interacting in my expression data based on the trained model?

if so how, would i generate these new gene pair predictions? this is what i am having trouble with. how would i identify or access the genes that are not in the mmukegg list as potential interactors and then pass them to the model.evaluate method.

thank you!! i would be happy to provide you my email to talk as well by email if you would like...

xiaoyeye commented 4 years ago

Sorry for the delay.... I think of course you can. After trainning, the model should be able to predict any gene pair. There are two kinds of "not in the list". 1) The gene pair is not while the single gene is in. That is the case in Fig3, where we just make sure that there is no 'gene pair overlap' between train and test. 2) The second is more strict. The two genes are not in the list so that train and prediction gene set are completely separated, so that there is no 'gene overlap' between train and test. That is the case in Supp Fig. 8G. Sure, you can provide your email if you have any more questions.

amjass12 commented 4 years ago

thank you @xiaoyeye ,

i think both of these cases still assume that the gene pairs need to have be trained on (or their interactions need to be known)... it wont be able to generalize or predict a new gene unknown interaction?

do you think this would work:

generate a a NEPDF matrix of random gene pairs in the matrix (admittedly this list would be incomprehensibly large) but it could be made small just to make examples.. and then run model.predict to see if there are interacting predicting pair based on the trained dataset?

thank you! please could you email me at:

aj930564@gmail.com

cheers!