Clarification on input for NMF -- raw counts or normalized data?

stephaniehicks commented 8 months ago

Hi,

Thank you for this great package! I wanted to ask about the first paragraph here (https://zdebruine.github.io/singlet/articles/Guided_Clustering_with_NMF.html#run-nmf)

NMF can be run on all features using normalized counts. 
Here we apply standard log-normalization, which works very well, but any form of approximate variance stabilizing transformation is suitable for helping NMF find meaningful solutions. 
Raw counts are not suitable for NMF because the model pays too much attention to features with very high counts.

I noticed that in singlet the documentation says NMF should be run on normalized counts, not raw counts. But, my understanding of NMF (and the underlying funciton RcppML::nmf()) is run on the raw counts (https://github.com/zdebruine/RcppML?tab=readme-ov-file#r-functions).

Can you clarify why the normalized counts are appropriate for NMF? Thanks!

zdebruine commented 7 months ago

@stephaniehicks thanks for the great question! Coming from you I'm sure you've thought this through yourself, so I would love to get some feedback... Sorry for any confusion, in my experience it is by far the best to use normalized counts for the vast majority of applications.

There are at least two reasons to normalize:

Equal weighting of features: NMF minimizes the Frobenius norm (MSE loss) of the difference between the reconstruction and input, so we are penalizing the model in squared space. Single-cell transcriptomics data follows a zero-inflated negative binomial (Z-INB) distribution, where most features contain very low count values, but there is a long tail of very large counts in very few features. In Euclidean space, this long and large-valued tail dominates the loss objective, particularly because loss values are squared. Thus, we either need to use KL divergence (which RcppML::nmf and singlet::nmf does not currently support), or some form of normalization. In practice, we see little difference between KL-NMF on counts data or MSE-NMF on log-normalized data. These insights hopefully to come in a publication soon once we get a faster KL-NMF into singlet. If we do not log-normalize, highly expressed genes dominate the model and the revealable rank is much lower, and the factors much sparser, than if we were to normalize. There is a lot of interesting biology in low-expressed genes that only comes out with log-normalization, or similar.
Equal weighting of samples: Regardless of the loss function, samples with greater overall counts (i.e. naive cells vs. platelets) get greater overall representation in the model, which is not generally desirable when the intended purpose of the model is to learn gene co-expression signatures or cell type classification. Standard log-normalization for single-cell data involves a column-wise (sample-wise) unit normalization step that overcomes this issue.

There are many forms of normalization that are all very similar in their ability to create a data representation that yields more meaningful and informative NMF models. However, one normalization I steer away from is SCTransform because it makes a priori assumptions about the data based on linear models and adds residuals back to the data, so why not let NMF handle that job of missing value imputation and denoising?

RcppML::nmf is run on whatever you want. The underlying algorithm is the same as singlet::nmf, but the preprocessing steps for meaningful single-cell analysis may differ from the general case, thus the specific recommendation in singlet::nmf to log-normalize first.

boyiguo1 commented 7 months ago

Thanks for the answers. Just a follow-up question on transformation - for certain transformation, it will map the the domain from nonnegative to nature number, and hence, possibly including negative value, e.g. log transformation of normalized counts with a pseudocount smaller than 1.

Implementation wise, is there a check or mechanism to prevent/correct the negative values? Or any recommendation how to deal with these situation?

Thank you!

zdebruine commented 7 months ago

@boyiguo1 NMF expects non-negative input, and thus any data preprocessing prior to NMF should preserve non-negativity of the input.

NMF is also interpretable, and count data remains interpretable so long as we do not do things like centering/scaling. As soon as we introduce negativity into a transformation, the data usually becomes uninterpretable, and thus the NMF reduction is also uninterpretable.

For ADT assays I know some of the more popular methods for normalization introduce negative values. I admit I'm not clear on this -- ADT is a very noisy assay, so it is possible to estimate negative signal particularly for non-specific antibody binding. However, why not let NMF decide whether a pattern is robust, let it denoise it, and if it is non-specific NMF will pull it out in a factor with other non-specific ADTs?

Algorithmically, negative values in the input data just make it that much more likely that the model will be zero at the corresponding W and H indices... not a good thing.

You mention log transformation of psueodcounts less than 1? Why not use log1p? There is no difference in practice.

boyiguo1 commented 7 months ago

Got it. The explanation makes sense.

Thanks!

zdebruine / singlet

Clarification on input for NMF -- raw counts or normalized data? #51