xuxiaohan / MSNE

A network embedding based method for partial multi-omics integration in cancer subtyping

"MSNE avoid comparing the edge weights directly between different networks​" #3

Open annarerra opened 1 year ago

annarerra commented 1 year ago

Hello,

This question is not about the repository itself but about the MSNE algorithm. You mention that MSNE does not compare edge weights between different networks, but isn't that comparison important? What is the reason for avoiding it, and is there any advantage?

Thank you in advance :)

xuxiaohan commented 1 year ago

Thanks for your attention!

Different omics data typically follow different distributions and have different numerical scales. This difference may be reflected in the edge values of the similarity networks. (The same issue may arise if different similarity functions or kernel functions are used for the multi-omics data.) Similarity network-based integration methods typically need to account for the difference in numerical scale among networks. If the similarity values in one network are significantly bigger than the values in the other networks, the integration result may be dominated by the information from that one omics. When integrating data from multiple omics, researchers usually hope that the weights of these omics are equal or controllable.

In MSNE, we avoid uncontrollable omics weights by "not comparing the edge weights between different networks". In the design of MSNE, the weights are controlled by the transfer probability across networks. In our implementation and experiments, we simply use equal transfer probabilities across networks.

You can regard it as a normalization that avoids the potential problems caused by the different distributions of multi-omics data.
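
The normalization view can be illustrated with a minimal sketch. It is not taken from the MSNE code; the function name `neighbor_probs`, the sample IDs, and the similarity values are all hypothetical. It assumes that, within one layer, a neighbor is chosen with probability proportional to its edge weight, so only the relative weights inside a layer matter.

```python
def neighbor_probs(weights):
    """Normalize one sample's edge weights within a single layer (hypothetical helper)."""
    total = sum(weights.values())
    return {nbr: w / total for nbr, w in weights.items()}

# Hypothetical similarities of sample "s1" to its neighbors in two omics layers.
# Layer B uses a numerical scale 1000 times larger than layer A.
layer_a = {"s2": 0.2, "s3": 0.6, "s4": 0.2}
layer_b = {"s2": 200.0, "s3": 600.0, "s4": 200.0}

# Normalizing inside each layer gives the same neighbor distribution,
# so the scale of a layer never has to be compared with another layer.
probs_a = neighbor_probs(layer_a)
probs_b = neighbor_probs(layer_b)

# For contrast: if raw edge weights were pooled across layers, layer B
# would receive almost all of the probability mass.
pooled_total = sum(layer_a.values()) + sum(layer_b.values())
share_b = sum(layer_b.values()) / pooled_total

print(probs_a, probs_b, share_b)
```

With within-layer normalization, how much each omics contributes is decided only by how often its layer is selected, which is the transfer probability mentioned above.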

annarerra commented 1 year ago

During the random walk, it seems that to change omic layer you need to find an overlapping feature first? [image] For example, in this image you don't go directly from 1 (orange) to 11 (blue); you first have to pass through 1 (blue).

xuxiaohan commented 1 year ago

Thanks for your comments. Yes, in each step of the random walk, MSNE selects an omic layer (similarity network) that includes the current sample, and then selects a neighbor of the current sample in the selected layer. In this figure, one node represents a sample (not a feature), and an edge and its value characterize the similarity of two nodes (samples). The transfer across layers is not counted in the random-walk sequence, since the current sample does not change. In other words, blue(1) and orange(1) represent the same sample. The dashed lines in figure (B) should perhaps be drawn in two different colors to help readers understand the sampling process in MSNE. I hope my answer is helpful to you.

Feel free to contact me if you have any questions.
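
The sampling step described above can be sketched roughly as follows. This is an illustrative sketch, not the MSNE implementation: the function `multilayer_walk`, the layer dictionaries, and the edge weights are hypothetical, and it assumes equal transfer probability across layers plus weight-proportional neighbor selection within a layer.

```python
import random

def multilayer_walk(layers, start, length, rng):
    """Sample one walk over several similarity networks (hypothetical sketch).

    layers: list of dicts mapping sample -> {neighbor: similarity}.
    The layer choice itself is never appended to the walk.
    """
    walk = [start]
    current = start
    while len(walk) < length:
        # Layers that contain the current sample and give it at least one neighbor.
        candidates = [layer for layer in layers if layer.get(current)]
        if not candidates:
            break
        layer = rng.choice(candidates)  # equal transfer probability (assumed)
        neighbors = list(layer[current].keys())
        weights = list(layer[current].values())
        # Neighbor chosen in proportion to edge weight within the selected layer.
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        walk.append(current)
    return walk

# Hypothetical toy data: sample "1" is measured in both layers, "11" only in blue.
orange = {"1": {"2": 0.9, "3": 0.1}, "2": {"1": 0.9}, "3": {"1": 0.1}}
blue = {"1": {"11": 5.0}, "11": {"1": 5.0}}

walk = multilayer_walk([orange, blue], start="1", length=6, rng=random.Random(0))
print(walk)
```

Because the layer switch is not recorded, a walk can contain "1" followed directly by "11": the step from 1 (orange) to 1 (blue) leaves the current sample unchanged and adds nothing to the sequence.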

xuxiaohan commented 1 year ago

From a methodology perspective, the essence of network embedding lies in characterizing the neighbor relationships of nodes through their distances in the embedding space. Sampled sequences such as Orange(1)->Blue(1)->Green(1) carry no information about the similarity of samples. Furthermore, such sequence fragments separate the truly meaningful neighbor relationships in a sequence. We would then need a larger window_size for the skip window, and more training time, to capture the separated similarity information.
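
A small sketch can make the window argument concrete. It is only an illustration under stated assumptions: `window_pairs` is a hypothetical helper that lists (center, context) pairs the way a skip-style window would, and the two sequences are made-up walks in which layer hops are either omitted or recorded as repeats of the same sample.

```python
def window_pairs(seq, window):
    """List (center, context) pairs within a symmetric window (hypothetical helper)."""
    pairs = []
    for i, center in enumerate(seq):
        lo = max(0, i - window)
        hi = min(len(seq), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, seq[j]))
    return pairs

# Hypothetical walk 1 -> 11 -> 5, with layer hops left out of the sequence.
compact = ["1", "11", "5"]
# The same walk if every layer hop were recorded as a repeat of the current sample.
with_hops = ["1", "1", "1", "11", "11", "5"]

pairs_compact = window_pairs(compact, window=2)
pairs_hops = window_pairs(with_hops, window=2)

# Pairs of a sample with itself say nothing about similarity between samples.
self_pairs = [p for p in pairs_hops if p[0] == p[1]]

print(len(pairs_compact), len(pairs_hops), len(self_pairs))
print(("1", "5") in pairs_compact, ("1", "5") in pairs_hops)
```

In this toy case the recorded hops fill part of the window with self pairs, and the pair ("1", "5") falls outside a window of 2, so a larger window would be needed to recover it.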

annarerra commented 1 year ago

Another question: when you talk about partial datasets, do you mean that if the samples of one dataset have no overlap with the others, we can consider this dataset partial?

xuxiaohan commented 1 year ago

Suppose a dataset contains N matrices, where the rows (named by sampleID) and columns (named by featureID, such as gene symbols) of each matrix refer to the samples and their molecular features in one omics layer. We consider a dataset a 'full dataset' when the row names of all matrices are identical. If we randomly remove some rows from some of the omic matrices, we consider the result a partial multi-omics dataset, because some samples (partial samples) no longer have feature rows in all N omics. If there is an omic matrix whose row names have no overlap with the others, this is an extreme case of a partial dataset. MSNE cannot deal with this case now, because the random walk cannot establish a connection between the samples in this omic and the samples in the others. This case is an interesting computational challenge in single-cell multi-omics integration, since cells are typically consumed when one omics is measured. For bulk-seq datasets, however, we should first consider whether there are real application scenarios for this computational problem. If yes, it is worth tackling.
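
These definitions can be checked with a short sketch based only on the row names of each matrix. It is not part of MSNE: the function `describe_dataset`, the omic names, and the sample IDs are hypothetical, and each matrix is represented just by its list of sample IDs.

```python
def describe_dataset(omics):
    """Classify a dataset from the row names (sample IDs) of each omic matrix.

    omics: dict mapping omic name -> iterable of sample IDs (hypothetical input).
    Returns ("full" or "partial", list of omics sharing no sample with any other).
    """
    id_sets = {name: set(ids) for name, ids in omics.items()}
    all_sets = list(id_sets.values())
    is_full = all(s == all_sets[0] for s in all_sets)
    isolated = [
        name
        for name, ids in id_sets.items()
        if len(id_sets) > 1
        and all(ids.isdisjoint(other) for o, other in id_sets.items() if o != name)
    ]
    return ("full" if is_full else "partial"), isolated

# Hypothetical sample IDs for three scenarios.
full = {"mRNA": ["s1", "s2", "s3"], "methylation": ["s1", "s2", "s3"]}
partial = {"mRNA": ["s1", "s2", "s3"], "methylation": ["s1", "s3"]}
extreme = {"mRNA": ["s1", "s2"], "methylation": ["s1", "s2"], "miRNA": ["c7", "c8"]}

print(describe_dataset(full))
print(describe_dataset(partial))
print(describe_dataset(extreme))
```

An omic reported in the second return value corresponds to the extreme case above: no shared sample links it to the other layers, so a random walk starting elsewhere can never reach its samples.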