Closed: stephaniehicks closed this issue 7 months ago
@stephaniehicks thanks for the great question! Coming from you, I'm sure you've thought this through yourself, so I would love to get some feedback. Sorry for any confusion: in my experience, it is by far best to use normalized counts for the vast majority of applications.
There are at least two reasons to normalize:

1. Minimizing mean squared error assumes normally distributed noise, which raw counts do not satisfy; proper treatment of counts calls for KL-divergence NMF (which `RcppML::nmf` and `singlet::nmf` do not currently support), or some form of normalization. In practice, we see little difference between KL-NMF on counts data and MSE-NMF on log-normalized data. These insights will hopefully appear in a publication soon, once we get a faster KL-NMF into `singlet`.
2. If we do not log-normalize, highly expressed genes dominate the model, and the revealable rank is much lower, and the factors much sparser, than if we normalize. There is a lot of interesting biology in low-expressed genes that only comes out with log-normalization or something similar.

There are many forms of normalization that are all very similar in their ability to create a data representation that yields more meaningful and informative NMF models. However, one normalization I steer away from is SCTransform, because it makes a priori assumptions about the data based on linear models and adds residuals back to the data. Why not instead let NMF handle the job of missing-value imputation and denoising?
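For concreteness, here is a minimal sketch of log-normalization before MSE-NMF; the toy matrix, the counts-per-10k scale factor, and `k = 10` are illustrative assumptions, not package defaults:

```r
library(Matrix)
library(RcppML)

# Toy genes x cells count matrix (illustrative only)
set.seed(42)
counts <- rsparsematrix(1000, 500, density = 0.1,
                        rand.x = function(n) rpois(n, 2) + 1)

# Counts-per-10k scaling of each cell, then log1p (preserves non-negativity
# and sparsity, since log1p(0) == 0)
lib_size <- colSums(counts)
lognorm  <- log1p(counts %*% Diagonal(x = 1e4 / lib_size))

# MSE-NMF on the log-normalized matrix
model <- nmf(lognorm, k = 10)
```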
`RcppML::nmf` can be run on whatever you want. The underlying algorithm is the same as `singlet::nmf`, but the preprocessing steps for meaningful single-cell analysis may differ from the general case, hence the specific recommendation in `singlet::nmf` to log-normalize first.
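A sketch of the two entry points under those assumptions: `obj` is a hypothetical Seurat object holding raw counts, `RunNMF()` is the wrapper used in the Guided_Clustering_with_NMF vignette, and `mat` is any hypothetical non-negative matrix:

```r
library(Seurat)
library(singlet)

# Single-cell workflow: log-normalize, then factorize
obj <- NormalizeData(obj)   # standard log-normalization
obj <- RunNMF(obj)          # singlet's Seurat wrapper

# General case: RcppML::nmf accepts any non-negative (sparse) matrix
fit <- RcppML::nmf(mat, k = 15)
```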
Thanks for the answers. Just a follow-up question on transformation: certain transformations map the domain from the nonnegative numbers to the whole real line, and hence can produce negative values, e.g. log transformation of normalized counts with a pseudocount smaller than 1.
Implementation-wise, is there a check or mechanism to prevent/correct negative values? Or any recommendation on how to deal with this situation?
Thank you!
@boyiguo1 NMF expects non-negative input, and thus any data preprocessing prior to NMF should preserve non-negativity of the input.
NMF is also interpretable, and count data remains interpretable so long as we do not do things like centering/scaling. As soon as we introduce negativity into a transformation, the data usually becomes uninterpretable, and thus the NMF reduction is also uninterpretable.
For ADT assays, I know some of the more popular normalization methods introduce negative values. I admit I'm not entirely clear on this: ADT is a very noisy assay, so it is possible to estimate negative signal, particularly for non-specific antibody binding. But why not let NMF decide whether a pattern is robust and let it do the denoising? If a signal is non-specific, NMF will pull it out into a factor with other non-specific ADTs.
Algorithmically, negative values in the input data just make it that much more likely that the model will be zero at the corresponding `W` and `H` indices... not a good thing.
You mention log transformation with pseudocounts less than 1? Why not use `log1p`? There is no difference in practice.
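A quick base-R illustration of the point (nothing package-specific): a pseudocount below 1 maps zero counts to negative values, while `log1p` never leaves the non-negative range:

```r
x <- c(0, 1, 5, 100)  # raw counts

log(x + 0.5)  # -0.69  0.41  1.70  4.61  <- zero counts become negative
log1p(x)      #  0.00  0.69  1.79  4.62  <- always non-negative, safe for NMF
```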
Got it. The explanation makes sense.
Thanks!
Hi,
Thank you for this great package! I wanted to ask about the first paragraph here (https://zdebruine.github.io/singlet/articles/Guided_Clustering_with_NMF.html#run-nmf).
I noticed that the `singlet` documentation says NMF should be run on normalized counts, not raw counts. But my understanding is that NMF (and the underlying function `RcppML::nmf()`) is run on the raw counts (https://github.com/zdebruine/RcppML?tab=readme-ov-file#r-functions). Can you clarify why normalized counts are appropriate for NMF? Thanks!