welch-lab / liger

R package for integrating and analyzing multiple single-cell datasets
GNU General Public License v3.0

How can I determine the similarity scores between clusters using Liger's UINMF? #314

Status: Closed (HaixJiang closed this issue 1 month ago)

HaixJiang commented 2 months ago

Hi, how can I determine similarity scores between clusters across different species using LIGER's cross-species analysis with the UINMF method? I would like to derive numerical values rather than make subjective judgments from UMAP and plotSankey. Thanks!

mvfki commented 2 months ago

Hi,

Currently, LIGER only provides calcARI() and calcPurity() for quantitatively evaluating clustering similarity. They might not be exactly what you expect, but I'll explain a bit just in case. Looking at the Sankey plot is subjective, but it lets you use the joint clustering result as a reference and see how the (preferably existing) cell types from the two species map to it. The two metrics, ARI and purity, simply compare two sets of annotations applied to the same set of cells. So if you want to compare the known annotation of cells from species A directly against the annotation of cells from species B, these metrics won't work. But after running integration and obtaining the joint clustering, you can compare the joint clustering, subset to species A, with the known annotation of species A, and then do the same for species B. If the metric values for both comparisons turn out to be high, that suggests the integrated joint clustering captures the original cell type populations of each species well.
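To make the per-species comparison concrete, here is a minimal base-R sketch on toy data. The vectors `species`, `joint_cluster`, and `known_ann` are hypothetical stand-ins for what would come out of a real liger object, and `purity()` and `adjusted_rand_index()` are illustrative re-implementations of the standard formulas, not rliger's internals. With a real object, calcARI() and calcPurity() are the functions to use.

```r
# Toy stand-ins (assumptions): dataset of origin, joint cluster, known label per cell.
species <- c(rep("A", 6), rep("B", 6))
joint_cluster <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2)
known_ann <- c("T", "T", "T", "B", "B", "B",                            # species A
               "Tcell", "Tcell", "Tcell", "Bcell", "Bcell", "Bcell")    # species B

# Purity: fraction of cells that carry the majority label of their cluster.
purity <- function(truth, cluster) {
  tab <- unclass(table(cluster, truth))
  sum(apply(tab, 1, max)) / sum(tab)
}

# Adjusted Rand index computed from the cluster-by-label contingency table.
adjusted_rand_index <- function(truth, cluster) {
  tab <- unclass(table(cluster, truth))
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(sum(tab), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Evaluate the joint clustering separately within each species.
scores <- sapply(c("A", "B"), function(sp) {
  idx <- species == sp
  c(purity = purity(known_ann[idx], joint_cluster[idx]),
    ARI = adjusted_rand_index(known_ann[idx], joint_cluster[idx]))
})
print(round(scores, 3))
```

In this toy case species A is recovered perfectly, while one species B cell falls into the "wrong" joint cluster, so both of its scores drop below 1.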

Best, Yichen

HaixJiang commented 2 months ago

@mvfki Thank you for your prompt reply. Could I add the annotations for species A and species B to the metadata of the lig object, and then calculate the distance between the UMAP coordinates of each cell subset from the two species to quantify the similarity between the two species?

mvfki commented 2 months ago

You definitely can! To operate on the metadata, you can run cellMeta(lig, "ann", useDataset = "A") <- annotation_A and then do the same for B, or combine the annotations first and then insert them:

ann_comb <- c(annotation_A, annotation_B)
lig$ann <- ann_comb

Note that the expected order of cells can be checked with colnames(lig). There are more details in our documentation.
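Since the combined vector has to follow the object's cell order, one way to guard against a mismatch is to reorder by cell name before inserting. The sketch below assumes the annotation vectors are named by cell barcode. `cell_order` is a hypothetical stand-in for colnames(lig), and the cell names and labels are made up for illustration.

```r
# Hypothetical stand-in for colnames(lig): the cell order the object expects.
cell_order <- c("A_cell1", "A_cell2", "A_cell3", "B_cell1", "B_cell2")

# Named annotation vectors, deliberately not in the object's order.
annotation_A <- c(A_cell3 = "neuron", A_cell1 = "astrocyte", A_cell2 = "neuron")
annotation_B <- c(B_cell2 = "microglia", B_cell1 = "neuron")

ann_comb <- c(annotation_A, annotation_B)

# Fail early if any cell in the object has no annotation.
missing_cells <- setdiff(cell_order, names(ann_comb))
stopifnot(length(missing_cells) == 0)

# Reorder by name so that position i corresponds to cell i of the object.
ann_aligned <- factor(ann_comb[cell_order])
print(ann_aligned)

# With a real object, the insertion would then be: lig$ann <- ann_aligned
```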

Beyond that, if you would like to use the subset distance approach, I don't think inserting the metadata is really necessary: insertion is only required when an rliger function needs to use the annotation, and we don't currently have any function that does what you propose. Furthermore, if you would like to calculate distances, it makes more sense to use the integrated low-dimensional representation, which can be accessed with getMatrix(lig, "H.norm"), rather than UMAP, which distorts it to some degree.
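As a rough sketch of the subset distance idea, the base-R code below averages the factor loadings of each annotated subset into a centroid and then compares centroids across species. Everything here is an assumption for illustration: `h_norm` is a simulated stand-in for getMatrix(lig, "H.norm") (assumed to be cells by factors), the labels are made up, and centroid-based cosine similarity and Euclidean distance are just one possible choice of metric, not something rliger provides.

```r
set.seed(1)

# Simulated stand-in for getMatrix(lig, "H.norm"); assumed layout: cells x factors.
n_per_group <- 20
k <- 5
make_group <- function(center) {
  # Column j is drawn around center[j].
  matrix(rnorm(n_per_group * k, mean = rep(center, each = n_per_group), sd = 0.1),
         nrow = n_per_group, ncol = k)
}
neuron_center <- c(1, 0, 0, 0, 0)
glia_center <- c(0, 0, 1, 0, 0)
h_norm <- rbind(make_group(neuron_center), make_group(glia_center),
                make_group(neuron_center), make_group(glia_center))

# Hypothetical per-cell species and annotation labels, in the same row order.
species <- rep(c("A", "B"), each = 2 * n_per_group)
ann <- c(rep(c("A_neuron", "A_glia"), each = n_per_group),
         rep(c("B_neuron", "B_glia"), each = n_per_group))

# One centroid per annotated subset (rows: subsets, columns: factors).
centroids <- t(sapply(split(seq_along(ann), ann),
                      function(i) colMeans(h_norm[i, , drop = FALSE])))

a_types <- unique(ann[species == "A"])
b_types <- unique(ann[species == "B"])

# Cosine similarity between every species A centroid and every species B centroid.
cosine_sim <- function(x, y) {
  x_unit <- x / sqrt(rowSums(x^2))
  y_unit <- y / sqrt(rowSums(y^2))
  x_unit %*% t(y_unit)
}
sim <- cosine_sim(centroids[a_types, , drop = FALSE],
                  centroids[b_types, , drop = FALSE])

# Euclidean distance between the same centroids, as an alternative.
euclid <- as.matrix(dist(centroids))[a_types, b_types, drop = FALSE]

print(round(sim, 2))
print(round(euclid, 2))
```

Rows are species A subsets and columns are species B subsets, so each entry is a numerical score for one cross-species pair. Comparing centroids ignores the spread within each subset, so it is a coarse summary.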

HaixJiang commented 2 months ago

@mvfki Thank you for your advice. Is H.norm the multi-species integrated factor matrix I obtained after performing UINMF?

mvfki commented 2 months ago

Yes. You run runUINMF() first and then quantileNorm() to finish the factor alignment. H.norm is what you get after quantile normalization.