do we need to remove batch effect?

Roger-GOAT commented 2 years ago

Hi team, this is much faster than the original NMF! When I ran the original NMF, I used A <- scRNA@assays$integrated@scale.data and all the samples matched. However, your article says: "For example, different normalizations of the data or batch effects can lead to fundamentally different SVD results across most factors. On the other hand, because NMF factors are collectively updated, distinct technical issues are usually explained by a single factor while other factors are left unaffected, making it robust across datasets." So I used A <- scRNA@assays$RNA@counts instead, but now the samples are dispersed. Which one should I use? Thanks!

zdebruine commented 2 years ago

Thanks for the question, and I'm happy to help.

Let me clarify/guess what exactly it is you are doing:

You have two different experiments, with a batch effect between them
You integrate using Seurat (possibly rPCA?) which involves scaling counts, then run NMF on scaled count data, and notice that UMAP on NMF coordinates looks nice (no obvious batch effect)
You go back and run NMF on raw counts, but find the UMAP on NMF coordinates does not look nice (obvious batch effect persists)

There's a reason for this: there is a factor (or several factors) in your NMF model that captures the batch effect, and so that factor(s) is moving the UMAP coordinates of affected cells by a corresponding amount. If the experiments describe similar information, you can look for factors in your model that are higher in samples from one experiment vs. the other experiment. It is at your discretion to remove these factors from the model, because they are explaining batch effects and not biological signal. Once you remove those "bad" factors, run UMAP again and samples should be aligned.

There is a reason why I prefer not to scale count data to “integrate” two experiments. This scaling requires some low-rank representation of the data which introduces bias itself (e.g. PCA), in addition to a manipulation of the data based on that reduction that likely confounds meaningful signal not captured by the reduction. Since batch effects are generally blocks of well-defined perturbations, NMF is a great way to capture them and discard them without scaling counts.

I'm planning to write a vignette on integrating latent models from multiple single-cell experiments in the future. I'll leave this issue open until I get to that.

If I'm not sufficiently clear, please ask further!

Roger-GOAT commented 2 years ago

@zdebruine thanks, that makes it clear! "If the experiments describe similar information, you can look for factors in your model that are higher in samples from one experiment vs. the other experiment. It is at your discretion to remove these factors from the model, because they are explaining batch effects and not biological signal." Could you explain this in more detail: how do I remove the factors from the model?

zdebruine commented 2 years ago

how to remove the factors in the model?

Just don't use that factor in downstream analysis (i.e. graph-based clustering, UMAP). Suppose factor 5 in your nmf_model is a batch effect. You can slice it out like this:

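# Fit a rank-10 NMF model (your_data: features x samples)
nmf_model <- RcppML::nmf(your_data, k = 10)
# Drop factor 5: column 5 of w, element 5 of d, row 5 of h
nmf_model <- list("w" = nmf_model$w[, -5], "d" = nmf_model$d[-5], "h" = nmf_model$h[-5, ])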
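
If more than one factor looks like a batch effect, the same slicing can be wrapped in a small helper. This is only a sketch: it assumes the model is a plain list with w, d and h, as in the snippet above, and it uses a mock model with random values so that it runs without RcppML. The names mock_model and drop_factors are hypothetical and not part of RcppML.

```r
# Mock model with the same list layout as the snippet above (w, d, h).
# All values are random and only illustrate the indexing.
set.seed(1)
k <- 10
mock_model <- list(
  w = matrix(runif(50 * k), nrow = 50),   # 50 features x k factors
  d = sort(runif(k), decreasing = TRUE),  # one scaling value per factor
  h = matrix(runif(k * 30), nrow = k)     # k factors x 30 samples
)

# Hypothetical helper: drop one or more factors by index.
# drop = FALSE keeps w and h as matrices even if only one factor remains.
drop_factors <- function(model, factors) {
  list(
    w = model$w[, -factors, drop = FALSE],
    d = model$d[-factors],
    h = model$h[-factors, , drop = FALSE]
  )
}

reduced <- drop_factors(mock_model, c(5, 8))
dim(reduced$w)     # 50 8
length(reduced$d)  # 8
dim(reduced$h)     # 8 30

# Samples x remaining factors, as input for clustering or UMAP.
embedding <- t(reduced$h)
```

The transposed h matrix (samples in rows) is what you would hand to graph-based clustering or UMAP in place of the full set of factors.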

It's a lot harder to decide which factor to remove. Here are some ideas:

Use GO term enrichment (see how to do this in the Supplementary Material for our bioRxiv manuscript, Rmarkdown for Figure 2: https://www.biorxiv.org/content/10.1101/2021.09.01.458620v1.supplementary-material). Factors without significantly enriched GO terms likely aren't describing biological signal.
Look at the distribution of values in the factor. Sometimes the distribution of sample or feature weights in a factor describing a batch effect doesn't look at all like the distribution in ordinary factors (gamma distribution for feature weights in "w", normal distribution for sample weights in "h").
Compare the sample weights for each factor in one dataset vs. the other dataset, assuming the datasets describe very complementary information. Factors strongly up in one or the other dataset suggest the signal is specific to those samples (i.e. batch effect).
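
The last idea, comparing sample weights between datasets, might look roughly like the sketch below. It is not an RcppML feature: the h matrix and the batch labels are simulated, factor 3 is made batch-specific on purpose, and the choice of a Wilcoxon rank-sum test with BH correction is an assumption made for illustration. With real data you would use the "h" matrix from your model and a vector saying which experiment each sample came from.

```r
# Simulated sample weights: 6 factors x 200 samples, two batches of 100 each.
set.seed(42)
k <- 6
batch <- rep(c("A", "B"), each = 100)
h <- matrix(rexp(k * 200), nrow = k)

# Make factor 3 artificially batch-specific (illustration only).
h[3, batch == "B"] <- h[3, batch == "B"] + 5

# Wilcoxon rank-sum test per factor: do weights differ between batches?
p_values <- apply(h, 1, function(weights) {
  wilcox.test(weights[batch == "A"], weights[batch == "B"])$p.value
})

# Effect size: difference in mean weight (batch B minus batch A).
mean_diff <- rowMeans(h[, batch == "B"]) - rowMeans(h[, batch == "A"])

result <- data.frame(
  factor = seq_len(k),
  mean_diff = mean_diff,
  p_adj = p.adjust(p_values, method = "BH")
)

# Factors at the top are candidates for a batch effect.
result[order(result$p_adj), ]
```

With thousands of cells nearly every factor can come out "significant", so the effect size is usually the more useful column. A factor that differs between experiments may also reflect a real difference in cell type composition, so treat the ranking as a list of candidates to inspect and not as a verdict.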

zdebruine commented 2 years ago

I’m going to close this issue and recommend LIGER as a solution for integrating single-cell experiments. Work in progress is focusing on a more efficient integrative NMF (iNMF) implementation, and exploration of other potentially useful methods, for purposes of joint latent space learning. This is not an issue with RcppML so much as a feature request, and one which will take significant time to develop.

zdebruine / RcppML

do we need to remove batch effect? #20