generation of plot after TSNE

annarerra commented 1 year ago

To generate the plot after the TSNE you are proposing this code:

But I am not sure what the variable "label" is.

Thank you in advance :)

xuxiaohan commented 1 year ago

label is the ground truth in your dataset. NMI (normalized mutual information) is a metric for evaluating how well the clusters assigned by the algorithm agree with the ground truth: it produces an NMI score for 'group' by comparing it with label. In real-world datasets for the cancer subtyping task, we usually have no ground-truth labels, but we can analyze the performance using additional clinical information, such as survival time. The demo script you ran uses a dataset with ground-truth labels. The dataset contains 2000 samples in 10 groups, with 200 samples in each group.
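To make the comparison between label and 'group' concrete, here is a minimal pure-Python sketch of NMI. It assumes the arithmetic-mean normalization (mutual information divided by the average of the two entropies); the demo script may compute NMI with a library function and a different normalization, so treat this as an illustration rather than the package's code. The toy `label` and `group_*` lists are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(label, group):
    """NMI between ground-truth labels and cluster assignments.

    Assumption: normalized by the arithmetic mean of the two entropies.
    """
    n = len(label)
    joint = Counter(zip(label, group))
    n_label = Counter(label)
    n_group = Counter(group)
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * math.log(c * n / (n_label[a] * n_group[b]))
    denom = (entropy(label) + entropy(group)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy example (hypothetical data, not the demo dataset).
label = [0, 0, 1, 1]
group_perfect = [1, 1, 0, 0]      # same partition, cluster ids swapped
group_unrelated = [0, 1, 0, 1]    # carries no information about label

print("NMI (perfect):", normalized_mutual_info(label, group_perfect))
print("NMI (unrelated):", normalized_mutual_info(label, group_unrelated))
```

NMI only compares partitions, so the cluster ids in 'group' do not need to match the ids used in label.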

xuxiaohan commented 1 year ago

Calculating the NMI score is not necessary for plotting the TSNE. If you do not care about the NMI score, just remove the line "print('NMI is'...."

annarerra commented 1 year ago

If you don't specify the number of clusters, can the algorithm choose it by itself? I tried it without specifying the number and it chose 5 clusters.

xuxiaohan commented 1 year ago

It cannot. The built-in clustering algorithm in the package is k-means, which needs the number of clusters to be specified. However, this is not a drawback of MSNE itself, because the core idea of MSNE is to integrate partial multi-omics data in a low-dimensional space by network embedding. You can apply any other clustering algorithm to the integrated embedding produced by MSNE, such as Louvain, Leiden, spectral clustering and so on. If you have no idea how to set the number of clusters for your dataset, you can apply Louvain to the integrated data instead of k-means.

In fact, it is an open question whether the number of clusters should equal the number of cancer subtypes you want to define. In some studies, researchers obtain many clusters with a computational algorithm and then manually merge the small clusters using background knowledge. In algorithm development research, we usually set the number of clusters to the number of ground-truth labels for all methods to ensure a fair comparison. There are also researchers who think a good cancer subtyping method should determine the number of clusters by itself rather than taking it as user input.

If you want to use a clustering method that requires the user to specify the number of clusters, I can share two common strategies for choosing it. (1) Inspect the sample groups in the dataset with a TSNE or UMAP visualization and roughly estimate the number of clusters. (2) Try every number of clusters from 2 to K_MAX (e.g., 20), and then pick the optimal k in that range using a quantitative metric (refer to section 2.3 in our paper).

You may find further information in the following paper: S. Xu, X. Qiao, L. Zhu, Y. Zhang, C. Xue, L. Li, Reviews on Determining the Number of Clusters, Applied Mathematics & Information Sciences. 10 (2016) 1493–1512. https://doi.org/10.18576/amis/100428.
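Strategy (2) can be sketched as follows. This is only an illustration under stated assumptions: the mean silhouette coefficient stands in for the "quantitative metric" (the metric in section 2.3 of the paper may be a different one), the k-means here is a small toy implementation with a deterministic farthest-point initialization, and `points` is made-up 2-D data. In practice you would run the scan on the integrated embedding produced by MSNE, with a library clustering routine.

```python
import math


def farthest_point_init(points, k):
    """Deterministic init: start at points[0], then repeatedly add the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, n_iter=50):
    """Toy Lloyd's k-means; returns one cluster id per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_max):
    """Scan k = 2..k_max and return the k with the best silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores


# Hypothetical toy data: three well-separated 3x3 grids of points.
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx in offsets for dy in offsets]

best_k, scores = choose_k(points, k_max=6)
print("best k:", best_k)
```

The silhouette is only one of many possible criteria; the review paper cited above surveys alternatives.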

xuxiaohan commented 1 year ago

It chose 5 clusters because the default value of this argument of the function is 5. This number probably has little reference value for your study.

annarerra commented 1 year ago

I have 4 datasets from Chronic lymphocytic leukemia with 200 samples and different amounts of features (mRNA, methylation, drugs, mutations), but no info about stages/types of disease.

I applied MSNE, here is my command: result=MSNE([view1,view2, view3, view4], n_clusters=2, k=20, workers=4, walk_length=20, num_walks=100, embed_size=100, window_size=10)

I used m=100 as it is a cancer dataset. I used n_clusters=2 as I have some binary metadata (IGHV mutations, trisomy 12) that I would like to associate with the clusters.

The TSNE looks like this:

What is your opinion on the clustering? Do you think I could play with the parameters to see if it makes some difference in the result?

Thank you again for the time and help :)

xuxiaohan commented 1 year ago

It seems that MSNE did not work well here. You need to check carefully whether the problem is on the algorithm side or the data side. On the algorithm side, you can increase walk_length and num_walks to capture more information, for example walk_length=40, 60, 80 and num_walks=100, 150, 200. These two numbers can be increased gradually as your computing resources allow.
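One way to organize such a sweep is to build the grid of settings first and then loop over it. The sketch below only constructs the grid from the values suggested above; the MSNE call is left as a comment that reuses the arguments from the command posted earlier in this thread, since how you inspect each result (TSNE plot, association with metadata) is up to you.

```python
from itertools import product

# Values suggested above; extend them as computing resources allow.
walk_lengths = [40, 60, 80]
num_walks_options = [100, 150, 200]

param_grid = [
    {"walk_length": wl, "num_walks": nw}
    for wl, nw in product(walk_lengths, num_walks_options)
]

for params in param_grid:
    # Sketch of the call, based on the command earlier in this thread:
    # result = MSNE([view1, view2, view3, view4], n_clusters=2, k=20, workers=4,
    #               embed_size=100, window_size=10, **params)
    print(params)
```

The grid is ordered from the cheapest setting to the most expensive one, so you can stop early once the results stop improving.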

On the data side, you need to check each omics dataset and whether it has a positive or negative influence on the integration results. Although MSNE can overcome the difference in numerical scale among similarity networks, it cannot distinguish which omics data are beneficial to the integration. MSNE simply assumes that the information from each omics has equal weight.
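A simple way to probe the influence of each omics dataset is a leave-one-out comparison: rerun the integration with one view removed at a time and see whether the result improves. The helper below is a hypothetical sketch that only enumerates the subsets; the view names and placeholder strings are assumptions standing in for your four data matrices, and the MSNE call is left as a comment.

```python
def leave_one_out(views, names):
    """Yield (dropped_name, remaining_views) for each view in turn."""
    for i, name in enumerate(names):
        remaining = [v for j, v in enumerate(views) if j != i]
        yield name, remaining


# Placeholders standing in for the real data matrices (hypothetical).
names = ["mRNA", "methylation", "drugs", "mutations"]
views = ["view1", "view2", "view3", "view4"]

subsets = list(leave_one_out(views, names))
for dropped, remaining in subsets:
    # Sketch of the call: result = MSNE(remaining, n_clusters=2, ...)
    print("without", dropped, "->", remaining)
```

If the clusters become clearer when a particular view is dropped, that view is a candidate for removal or for a different similarity measure.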

If one omics dataset has a negative influence on the integration, there are two possible reasons. One is that the information in this dataset conflicts with the similarity information of the other omics (you may consider removing this omics dataset). The other is that the Gaussian kernel cannot build a useful similarity network on this omics data. We used the Gaussian kernel to capture the similarity of samples, following [1]. The authors of [1] recommend using the chi-squared distance as the similarity measure for discrete omics data (such as your mutation data).

See this paper for more information about the similarity kernel selection for different omics data: [1] Wang, B., Mezlini, A., Demir, F. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods 11, 333–337 (2014). https://doi.org/10.1038/nmeth.2810
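For illustration, a chi-squared distance between discrete profiles can be sketched as below. The assumptions are stated plainly: the distance is the common form sum((x_i - y_i)^2 / (x_i + y_i)) over features with a non-zero sum (some definitions add a factor of 1/2), and the similarity is a plain exponential kernel with a single global bandwidth `sigma`, which is simpler than the locally scaled kernel described in [1]. Whether a precomputed similarity matrix can be passed to MSNE is not covered in this thread, and the toy `mutations` matrix is made up.

```python
import math


def chi2_distance(x, y):
    """Chi-squared distance between two non-negative feature vectors."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi2_similarity_matrix(samples, sigma=None):
    """Pairwise similarity exp(-d / sigma) built from chi-squared distances.

    Assumption: if sigma is not given, use the mean off-diagonal distance.
    """
    n = len(samples)
    dist = [[chi2_distance(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    if sigma is None:
        off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
        mean_dist = sum(off_diag) / len(off_diag) if off_diag else 0.0
        sigma = mean_dist if mean_dist > 0 else 1.0
    return [[math.exp(-d / sigma) for d in row] for row in dist]


# Hypothetical binary mutation profiles: rows are samples, columns are genes.
mutations = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

similarity = chi2_similarity_matrix(mutations)
for row in similarity:
    print([round(s, 3) for s in row])
```

Samples with identical mutation profiles get the maximum similarity of 1, and samples with no shared mutations get the lowest values.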

See this paper for guidance on selecting the combination of omics datasets: [2] Duan R, Gao L, Gao Y, et al. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Computational Biology 17(8), e1009224 (2021). Through extensive experimental analysis, Duan et al. [2] give useful recommendations on selecting omics data combinations for different cancers.

annarerra commented 1 year ago

One thing that I noticed is that my mutation dataset looks like this:

and one mutation dataset that you used from TCGA looked like this:

But I suppose MSNE doesn't care about the type of values, discrete or continuous. As you said before, though, maybe it's better to use chi-squared distance as the similarity measure.

xuxiaohan / MSNE

generation of plot after TSNE #4