zdebruine / singlet

Single-cell analysis with non-negative matrix factorization
42 stars 13 forks source link

Advice for usage of Linked NMF for scATAC #46

Closed IsaacDiaz026 closed 8 months ago

IsaacDiaz026 commented 8 months ago

Hello,

I could use some suggestions on how to use linked NMF to resolve genotype effects in my single-cell dataset. I have scATAC-seq data for ~2000 cells. These cells are a pool of 19 different genotypes, and I would like to cluster them to annotate cell types but do not want the clustering to be confounded by genotype.

I tried using split.by ="genotype" in the RunNMF function, but it results in the factorization rank going from k=16 (without split.by), to (k=3). This led me to your article on linked NMF. Would it make sense to use linked NMF to define factors describing shared signal, and only use these factors for clustering?

Alternatively, I was thinking of just using the MetaDataPlot to visualize genotype representation within the NMFs, and removing NMFs with overrepresentation of specific genotypes?

Any suggestions are helpful, thanks!

zdebruine commented 8 months ago

@IsaacDiaz026 good question. split.by = "genotype" will scale the data from each genotype such that they sum to the same value. This ensures that both genotypes contribute equally to the NMF objective. This can be a problem if you have very different sample sizes between your genotypes.

It sounds like you want to use k = 16 NMF model, then look at whether you get some factors highly associated with one genotype over the other, and then fine-tune that NMF model after pruning the "links" between those factors and the samples belonging to the weakest of the genotypes to which they mapped. This "Linked NMF" fine-tuning process will help the model separate the signal better.

My suggestion is to run NMF without using split.by at a k = 16, then run LinkedNMF initializing with that NMF model. You don't need to worry about re-tuning the rank here, but you will need to use split.by = "genotype" to tell LNMF to prune factors based on relative genotype representation.

Here is a working example (you need to install latest version of singlet):

detachNamespace("singlet")
remotes::install_github("zdebruine/singlet")
set.seed(123)
library(ggplot2)
library(SeuratData)
library(Seurat)
library(singlet)
library(cowplot)
ifnb <- UpdateSeuratObject(ifnb)
ifnb <- RunNMF(ifnb, reps = 1, k = 30)
p1 <- MetadataPlot(ifnb, split.by = "stim", reduction = "nmf")
ifnb <- RunLNMF(ifnb, split.by = "stim", reduction.use = "nmf", link.cutoff = 0.5)
p2 <- MetadataPlot(ifnb, split.by = "stim", reduction = "lnmf")
plot_grid(p1 + ggtitle("Before Linked NMF"), p2 + ggtitle("After Linked NMF"))

image

Note that I did NOT use split.by in the first RunNMF function. In this case I assume that both categories of stim are about equally represented, or at least that the signal complexity of both is largely shared.

Happy to take this further via email if you need, debruinz@gvsu.edu. I am interested in your feedback once you work through the analysis.