QUMI normalization - GitHub Issues

igordot commented 4 years ago

I am trying to understand how to best deal with the qUMI output. According to the quminorm() docs, "the resulting QUMI counts can be analyzed as if they were UMI counts".

However, I was a bit concerned. In the publication, you wrote: "Since neither QUMI nor census normalization of TPM values removes cell-to-cell variation in total counts, we divided the normalized counts by the total counts of each cell, then multiplied all values by the median of the total count distribution across cells. This ensured all cells had the same total counts. We then centered each feature (gene) to have zero mean and scaled to have unit standard deviation prior to running PCA". Traditionally, after normalization to total counts, counts are log-transformed before scaling and dimensionality reduction. Is log-transformation implied or is that not recommended with qUMIs?
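For reference, the steps in the quoted passage can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the genes-by-cells orientation, and the toy Poisson matrix are all assumptions made for the example.

```python
import numpy as np

def scale_to_median_total(counts):
    """Rescale each cell (column) so every cell has the median total count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)               # total count per cell
    return counts / totals * np.median(totals)

def standardize_genes(x):
    """Center each gene (row) to zero mean and scale to unit standard deviation."""
    centered = x - x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0                         # constant genes stay at zero
    return centered / sd

def pca_scores(z, n_components=2):
    """PCA via SVD of the cells x genes matrix; returns cells x n_components."""
    u, s, _ = np.linalg.svd(z.T, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy genes x cells matrix (assumed orientation), standing in for QUMI counts.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(50, 20))

scaled = scale_to_median_total(toy_counts)
scores = pca_scores(standardize_genes(scaled))
```

Note that no log step appears anywhere in this sketch, which is what prompts the question that follows.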

Additionally, do you see any concerns with using qUMIs for UMI-based normalization like sctransform?

willtownes commented 4 years ago

Hi, thanks for your interest. Basically, the idea is that whatever pipeline you would normally use with UMI counts can also be used with QUMI counts. In the paper, we had to compare everything with PCA because we needed to compare against normalizations like TPM that don't seek to approximate the UMI sampling distribution. Our view is that log transformation is a bad idea when the counts are small, which is why we just applied PCA to the relative abundances directly in the paper, even though that is still not ideal.

For analyzing QUMI counts or UMI counts, we recommend using either GLM-PCA, or an approximation to it such as binomial or Poisson null residuals from scry or negative binomial null residuals from sctransform. If you decide to use GLM-PCA, please note that version 0.2.0 from GitHub is faster and more scalable to large data than the current CRAN release (0.1.0). I'm hoping to have 0.2.0 out on CRAN soon, but it's not quite there yet.
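To make the "null residuals" idea concrete, here is a rough sketch of Pearson residuals under a Poisson null model in which every cell shares the same gene proportions. It is not the scry or sctransform implementation: the function name is hypothetical, the matrix is assumed to be genes by cells, and the expected value is taken to be (cell total) x (gene's overall proportion).

```python
import numpy as np

def poisson_null_pearson_residuals(counts):
    """Pearson residuals under a Poisson null model with shared gene proportions.

    counts: genes x cells matrix (assumed orientation).
    Expected count: mu[i, j] = cell_total[j] * gene_proportion[i].
    """
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=0, keepdims=True)                # per-cell depth
    gene_props = counts.sum(axis=1, keepdims=True) / counts.sum()  # per-gene share
    mu = gene_props * cell_totals
    with np.errstate(divide="ignore", invalid="ignore"):
        resid = (counts - mu) / np.sqrt(mu)
    # mu == 0 only where the count is also 0; define that residual as 0.
    return np.nan_to_num(resid, nan=0.0, posinf=0.0, neginf=0.0)

# Toy sparse genes x cells matrix with one all-zero gene.
rng = np.random.default_rng(1)
toy = rng.poisson(lam=0.3, size=(30, 15))
toy[0, :] = 0
residuals = poisson_null_pearson_residuals(toy)
```

The residual matrix can then be passed to ordinary PCA in place of log-normalized values.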

igordot commented 4 years ago

Thanks for clarifying. I was mostly concerned that QUMIs still should not be treated like UMIs in certain cases.

I think GLM-PCA is very promising and a great solution to the log-transformation problem, but dimensionality reduction is just one aspect of the analysis. You will also probably want to get normalized values to get expression levels of specific genes across sub-populations.

In the paper you reference, you used TPMs/CPMs. Wouldn't the exaggeration between 0 and non-0 be much less with a smaller normalization factor like 10,000 or median library size (which is often close to 10,000)?

I have personally been concerned about the log(x+1) approach that is so prevalent, but as Aaron Lun wrote:

if we put aside theoretical arguments, the widespread use of the log-transformation “in the wild” reflects its adequacy and reliability for most analysts

willtownes commented 4 years ago

All good points. If you want to normalize (Q)UMI counts without doing dimension reduction, you may be interested in the "sanity" approach by Breda et al., or, as previously mentioned, a null residuals approach like scry or sctransform.

Yes, it's true that counts per 10,000 have fewer problems than CPM/TPM (see the bioRxiv discussion thread). In fact, counts per "1" (relative abundances) have even fewer distortions than that. But the smaller the multiplier, the more log(1+x) behaves like a linear transformation, which, if you are going to center and scale afterwards, becomes basically equivalent to doing PCA on centered and scaled relative abundances directly. See "Mathematics of distortion from log-normalizing UMIs" (scroll down from Methods).
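A small numerical illustration of this point, under assumed toy numbers: for a hypothetical cell with 1,000 total counts, compare the step from 0 to 1 raw count with the step from 1 to 2 raw counts after log(1 + multiplier * count / total). A linear transformation would give a ratio of exactly 1. The function name and the total count are invented for the example.

```python
import numpy as np

def zero_gap_ratio(total_count, multiplier):
    """Ratio of the 0->1 step to the 1->2 step after log(1 + multiplier * count / total)."""
    raw = np.array([0.0, 1.0, 2.0])
    values = np.log1p(multiplier * raw / total_count)
    step_0_to_1, step_1_to_2 = np.diff(values)
    return step_0_to_1 / step_1_to_2

total_count = 1000  # hypothetical total UMI count for one cell
for multiplier in (1e6, 1e4, 1.0):  # per million, per 10,000, relative abundance
    ratio = zero_gap_ratio(total_count, multiplier)
    print(f"multiplier {multiplier:g}: ratio {ratio:.2f}")
```

On this toy input the ratio falls from roughly 10 (per million) to about 3.7 (per 10,000) to about 1 (relative abundances), so the gap between zero and non-zero shrinks with the multiplier while the transformation becomes nearly linear.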

I agree with Aaron's comment about log transform being fine "in the wild" under the scenario that the counts are large and there are not many zeros (e.g., bulk RNA-seq or maybe some single-cell protocols with high capture efficiency like Smart-seq3). This is exactly the case where a Poisson, multinomial, or negative binomial distribution would be well approximated by a normal or lognormal distribution. However, if >90% of your data are zeros, log(1+x) introduces a huge systematic bias that's pretty unnecessary now that we have alternatives based on discrete distributions.

igordot commented 4 years ago

Thank you for the excellent explanation. Some very good points here.

willtownes / quminorm

QUMI normalization #6