zdebruine / RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
GNU General Public License v2.0

do we need to remove batch effect? #20

Closed Roger-GOAT closed 2 years ago

Roger-GOAT commented 2 years ago

Hi team, this is much faster than the original NMF! When I ran the original NMF, I used A <- scRNA@assays$integrated@scale.data and all the samples matched. However, your article says: "For example, different normalizations of the data or batch effects can lead to fundamentally different SVD results across most factors. On the other hand, because NMF factors are collectively updated, distinct technical issues are usually explained by a single factor while other factors are left unaffected, making it robust across datasets." So I used A <- scRNA@assays$RNA@counts instead, but now the samples are dispersed. Which one should I use? Thanks!

zdebruine commented 2 years ago

Thanks for the question, and I'm happy to help.

Let me clarify/guess what exactly it is you are doing:

There's a reason for this: one or more factors in your NMF model capture the batch effect, and those factors shift the UMAP coordinates of affected cells by a corresponding amount. If the experiments describe similar information, you can look for factors in your model that are higher in samples from one experiment than in the other. It is at your discretion to remove these factors from the model, because they explain batch effects and not biological signal. Once you remove those "bad" factors, run UMAP again and the samples should be aligned.
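One way to look for factors that are higher in one experiment than the other is sketched below. This is only an illustration in base R, not a method taken from RcppML: the matrix `h` is simulated as a stand-in for `nmf_model$h` (factors in rows, cells in columns), `batch` is an assumed per-cell experiment label, and the batch effect in factor 3 is planted on purpose so the ranking has something to find.

```r
# Simulated stand-in for nmf_model$h: a k x n_cells matrix of factor loadings.
set.seed(1)
k <- 4
n_cells <- 200
batch <- rep(c("exp1", "exp2"), each = n_cells / 2)  # assumed experiment labels
h <- matrix(rexp(k * n_cells), nrow = k)

# Plant an artificial batch effect in factor 3 (illustration only).
h[3, batch == "exp2"] <- h[3, batch == "exp2"] + 5

# For each factor, compare its loadings between the two experiments.
batch_stats <- data.frame(
  factor = seq_len(k),
  mean_exp1 = rowMeans(h[, batch == "exp1", drop = FALSE]),
  mean_exp2 = rowMeans(h[, batch == "exp2", drop = FALSE]),
  p_value = apply(h, 1, function(x) {
    wilcox.test(x[batch == "exp1"], x[batch == "exp2"])$p.value
  })
)
batch_stats$p_adj <- p.adjust(batch_stats$p_value, method = "BH")

# Rank factors by how strongly they separate the experiments.
batch_stats <- batch_stats[order(batch_stats$p_adj), ]
print(batch_stats)

# The top-ranked factor is only a candidate; inspect it before discarding it.
candidate <- batch_stats$factor[1]
```

A factor that separates experiments is not necessarily technical: if the experiments differ in cell-type composition, a biological factor can rank highly too, so the ranking is a starting point for inspection and not a removal rule.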

There is a reason why I prefer not to scale count data to “integrate” two experiments. This scaling requires some low-rank representation of the data which introduces bias itself (e.g. PCA), in addition to a manipulation of the data based on that reduction that likely confounds meaningful signal not captured by the reduction. Since batch effects are generally blocks of well-defined perturbations, NMF is a great way to capture them and discard them without scaling counts.

I'm planning to write a vignette on integrating latent models from multiple single-cell experiments in the future. I'll leave this issue open until I get to that.

If I'm not sufficiently clear, please ask further!

Roger-GOAT commented 2 years ago

@zdebruine thanks, that makes it clear! "If the experiments describe similar information, you can look for factors in your model that are higher in samples from one experiment vs. the other experiment. It is at your discretion to remove these factors from the model, because they are explaining batch effects and not biological signal." Could you explain this in more detail: how to remove the factors in the model?

zdebruine commented 2 years ago

how to remove the factors in the model?

Just don't use that factor in downstream analysis (e.g. graph-based clustering, UMAP). Suppose factor 5 in your nmf_model is a batch effect. You can slice it out like this:

# Fit a rank-10 NMF model
nmf_model <- RcppML::nmf(your_data, k = 10)
# Drop factor 5 (the presumed batch-effect factor) from w, d, and h
nmf_model <- list(
  "w" = nmf_model$w[, -5],
  "d" = nmf_model$d[-5],
  "h" = nmf_model$h[-5, ]
)
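If more than one factor looks like a batch effect, the same slicing can be wrapped in a small helper. The sketch below is hypothetical and not part of RcppML: `drop_factors` is an illustrative name, and `toy_model` is a random stand-in that assumes the model is list-like with `w`, `d`, and `h` as in the snippet above.

```r
# Hypothetical helper (not an RcppML function): drop one or more factors from
# a list-like model with w (features x k), d (length k), h (k x samples).
drop_factors <- function(model, factors) {
  stopifnot(length(factors) > 0)  # negative indexing with an empty vector drops everything
  list(
    w = model$w[, -factors, drop = FALSE],  # drop = FALSE keeps matrix shape
    d = model$d[-factors],
    h = model$h[-factors, , drop = FALSE]
  )
}

# Random toy model standing in for the output of RcppML::nmf(your_data, k = 10).
set.seed(1)
toy_model <- list(
  w = matrix(runif(50 * 10), nrow = 50),
  d = runif(10),
  h = matrix(runif(10 * 30), nrow = 10)
)

# Remove factors 5 and 8 (assumed here to be batch effects).
reduced <- drop_factors(toy_model, c(5, 8))

# Cells-by-factors matrix to hand to clustering or UMAP.
embedding <- t(reduced$h)
dim(embedding)
```

Using `drop = FALSE` keeps `w` and `h` as matrices even if only one factor remains, which avoids surprises when the result is transposed or passed to downstream functions.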

It's a lot harder to decide which factor to remove. Here are some ideas:

zdebruine commented 2 years ago

I’m going to close this issue and recommend LIGER as a solution for integrating single-cell experiments. Work in progress is focusing on a more efficient integrative NMF (iNMF) implementation, and exploration of other potentially useful methods, for purposes of joint latent space learning. This is not an issue with RcppML so much as a feature request, and one which will take significant time to develop.