zdebruine / singlet

Single-cell analysis with non-negative matrix factorization

Clarification on input for NMF -- raw counts or normalized data? #51

Closed: stephaniehicks closed this issue 7 months ago

stephaniehicks commented 8 months ago

Hi,

Thank you for this great package! I wanted to ask about the first paragraph here (https://zdebruine.github.io/singlet/articles/Guided_Clustering_with_NMF.html#run-nmf)

> NMF can be run on all features using normalized counts. Here we apply standard log-normalization, which works very well, but any form of approximate variance stabilizing transformation is suitable for helping NMF find meaningful solutions. Raw counts are not suitable for NMF because the model pays too much attention to features with very high counts.

I noticed that the singlet documentation says NMF should be run on normalized counts, not raw counts. But my understanding is that NMF (and the underlying function RcppML::nmf()) is run on the raw counts (https://github.com/zdebruine/RcppML?tab=readme-ov-file#r-functions).

Can you clarify why the normalized counts are appropriate for NMF? Thanks!

zdebruine commented 7 months ago

@stephaniehicks thanks for the great question! Coming from you, I'm sure you've thought this through yourself, so I would love to get some feedback. Sorry for any confusion: in my experience, normalized counts are by far the best choice for the vast majority of applications.

There are at least two reasons to normalize: sequencing depth varies from cell to cell, so without depth scaling the factorization tracks depth rather than biology; and, as the vignette notes, features with very high counts would otherwise dominate the model.

There are many forms of normalization, all very similar in their ability to create a data representation that yields more meaningful and informative NMF models. One normalization I steer away from, however, is SCTransform: it makes a priori assumptions about the data based on linear models and adds residuals back to the data, so why not let NMF handle the job of missing-value imputation and denoising instead?
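
To make "standard log-normalization" concrete, here is a minimal base-R sketch (the `counts` matrix is made up for illustration; Seurat::NormalizeData does essentially the same thing):

```r
# Minimal sketch of standard log-normalization.
# `counts` is a hypothetical genes x cells matrix of raw counts.
set.seed(1)
counts <- matrix(rpois(200 * 50, lambda = 2), nrow = 200, ncol = 50)

# Scale each cell to a common depth, then apply log1p.
depth_scaled <- sweep(counts, 2, colSums(counts), "/") * 1e4
lognorm <- log1p(depth_scaled)

min(lognorm) >= 0  # TRUE: non-negativity is preserved, so this is valid NMF input
```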

RcppML::nmf can be run on whatever you want. The underlying algorithm is the same as in singlet::nmf, but the preprocessing steps for meaningful single-cell analysis may differ from the general case, hence the specific recommendation in singlet to log-normalize first.
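
Concretely, the two entry points look roughly like this (a sketch rather than a drop-in script; `seurat_obj` and `your_matrix` are hypothetical placeholders):

```r
library(Seurat)
library(singlet)

# Single-cell route: log-normalize, then factorize.
# `seurat_obj` is a hypothetical Seurat object holding raw counts.
seurat_obj <- NormalizeData(seurat_obj)  # Seurat's standard log-normalization
seurat_obj <- RunNMF(seurat_obj)         # singlet's wrapper; factorizes the normalized data

# General route: RcppML::nmf accepts any non-negative matrix you hand it.
fit <- RcppML::nmf(your_matrix, k = 10)
```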

boyiguo1 commented 7 months ago

Thanks for the answers. Just a follow-up question on transformation: some transformations map the non-negative domain onto the full real line and can therefore produce negative values, e.g. log transformation of normalized counts with a pseudocount smaller than 1.

Implementation-wise, is there a check or mechanism to prevent/correct negative values? Or any recommendation on how to deal with these situations?

Thank you!

zdebruine commented 7 months ago

@boyiguo1 NMF expects non-negative input, and thus any data preprocessing prior to NMF should preserve non-negativity of the input.
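
In practice, a simple assertion before factorizing is enough (a sketch; `transformed` stands for whatever preprocessed matrix you are about to factorize):

```r
# Guard before NMF: fail loudly rather than silently clipping negatives.
stopifnot(min(transformed) >= 0)
```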

NMF is also interpretable, and count data remains interpretable so long as we do not do things like centering/scaling. As soon as we introduce negativity into a transformation, the data usually becomes uninterpretable, and thus the NMF reduction is also uninterpretable.

For ADT assays, I know some of the more popular normalization methods introduce negative values. I admit I'm not fully convinced by this: ADT is a very noisy assay, so it is possible to estimate negative signal, particularly for non-specific antibody binding. But why not let NMF decide whether a pattern is robust and denoise it? If a signal is non-specific, NMF will pull it out into a factor with other non-specific ADTs.

Algorithmically, negative values in the input data just make it that much more likely that the model will be zero at the corresponding W and H indices... not a good thing.

You mention log transformation with pseudocounts less than 1 -- why not just use log1p (i.e., a pseudocount of exactly 1)? There is no difference in practice, and it preserves non-negativity.
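
A quick illustration of the difference:

```r
x <- c(0, 1, 5, 100)  # example normalized counts
log(x + 0.5)  # -0.69 0.41 1.70 4.61 -> pseudocount < 1 produces negative values
log1p(x)      #  0.00 0.69 1.79 4.62 -> pseudocount of 1 stays non-negative
```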

boyiguo1 commented 7 months ago

Got it. The explanation makes sense.

Thanks!