How to choose the number of principal components in GLM-PCA?

YushaLiu commented 3 years ago

Hi Will, do you have any suggestions on how to choose the number of principal components in GLM-PCA? Is there a way to quantify the contribution of each PC, similar to the proportion of variance explained in PCA? Thanks!

willtownes commented 3 years ago

Hi Yusha, thanks for your question. In PCA this is commonly done by plotting the "variance explained" of each component (i.e., a scree plot). Because of the link function in GLM-PCA, we can't exactly call the variance of each component "variance explained", but you could still use it to examine the components' importance as a function of dimensionality. Since both PCA and GLM-PCA return the components in decreasing order of variance, you could make a similar plot (x-axis: dimension index, y-axis: standard deviation of the corresponding column in the factors matrix). Note that this is possible because GLM-PCA automatically post-processes the model fit to make the loadings orthonormal; without this step the variance of the factors is not interpretable.
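As a rough illustration of the plot described above, here is a minimal Python sketch that computes the y-axis values (one standard deviation per column of the factors matrix). The `factors` values are made up for the example and the helper name `factor_sds` is hypothetical; in practice the matrix would come from a fitted GLM-PCA model.

```python
import statistics

def factor_sds(factors):
    """Column-wise standard deviations of a cells-by-L factors matrix."""
    columns = list(zip(*factors))  # transpose: one tuple per latent dimension
    return [statistics.stdev(col) for col in columns]

# Made-up 5-cell, 3-dimension factors matrix standing in for a GLM-PCA fit.
factors = [
    [ 2.0,  0.5,  0.1],
    [-1.5,  1.0, -0.1],
    [ 0.5, -1.2,  0.2],
    [-2.5, -0.3, -0.2],
    [ 1.5,  0.0,  0.0],
]

sds = factor_sds(factors)
for dim, sd in enumerate(sds, start=1):
    # x-axis: dimension index, y-axis: standard deviation of that column
    print(f"dimension {dim}: sd = {sd:.3f}")
```

Plotting `sds` against the dimension index and looking for an elbow then works the same way as reading a PCA scree plot, with the caveat above that these are not literally "variance explained".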

YushaLiu commented 3 years ago

Hi Will, thanks for your response -- very helpful! I think the column-wise variances of the factor matrix quantify the relative importance of the factors, but is there a way to quantify the contribution of each factor on an absolute scale (similar to the PVE of each factor, which explains the variance in the observed data matrix in the Gaussian case)? Or more specifically, if I choose L=2 and run GLM-PCA, how do I know whether these 2 factors are really useful for capturing the variation in the single-cell count data? Thanks!

willtownes commented 3 years ago

Yes, that is a great idea but not something I have figured out; perhaps it's an open topic for a research paper? The closest thing I have heard of is "deviance explained" from pipeComp. You could compare the deviance of a fitted GLM-PCA model to the deviance of a null model with only an intercept term (this would have a closed-form solution) as an absolute goodness-of-fit metric. The difficulty is that I don't think you can add and drop individual factors, because the optimal GLM-PCA solution for L=2 is unlikely to be the same as the optimal solution for L=3 with the third factor dropped. Rather, you would have to re-fit the model for each value of L. As an approximate alternative, I suppose you could also try just doing PCA on residuals, as we implemented in the scry package.
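To make the "deviance explained" comparison concrete, here is a minimal Python sketch. It assumes a Poisson likelihood and a null model in which each gene has a single intercept and each cell's total count acts as an offset; both are assumptions for illustration, not a statement of what pipeComp or glmpca compute. The function names and the `Mu_fit` stand-in for fitted means are hypothetical.

```python
import math

def poisson_deviance(Y, Mu):
    """Poisson deviance between counts Y and means Mu (both cells x genes)."""
    total = 0.0
    for y_row, mu_row in zip(Y, Mu):
        for y, mu in zip(y_row, mu_row):
            log_term = y * math.log(y / mu) if y > 0 else 0.0  # 0*log(0) := 0
            total += log_term - (y - mu)
    return 2.0 * total

def null_means(Y):
    """Closed-form intercept-only fit: mu_ij = (cell total) * (gene fraction)."""
    cell_totals = [sum(row) for row in Y]
    grand_total = sum(cell_totals)
    gene_frac = [sum(col) / grand_total for col in zip(*Y)]
    return [[n * p for p in gene_frac] for n in cell_totals]

def deviance_explained(Y, Mu_fit):
    """1 - D(fit) / D(null); 0 means no better than the null model."""
    return 1.0 - poisson_deviance(Y, Mu_fit) / poisson_deviance(Y, null_means(Y))

# Toy counts (4 cells x 3 genes) and a made-up "fitted" mean matrix that sits
# halfway between the null means and the data, standing in for a model fit.
Y = [[5, 0, 1], [0, 4, 2], [3, 1, 0], [1, 6, 2]]
Mu0 = null_means(Y)
Mu_fit = [[0.5 * y + 0.5 * m for y, m in zip(yr, mr)] for yr, mr in zip(Y, Mu0)]
print(round(deviance_explained(Y, Mu_fit), 3))
```

Because the factors cannot be added or dropped one at a time, this quantity would need to be recomputed from a separate model fit for each value of L.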

YushaLiu commented 3 years ago

I see. Thanks very much for your explanations and suggestions!

YushaLiu commented 3 years ago

Hi Will, I have a follow-up question. In the "Deviance residuals provide fast approximation to GLM-PCA" section of your paper, I saw that you proposed running plain PCA on the multinomial residuals under the null model as a fast approximation to GLM-PCA. Is that implemented in the scry package? If so, does it also allow adjustment for covariates (e.g., batches, cell cycles) in the calculation of the multinomial residuals? Thanks so much!

willtownes commented 3 years ago

Yes, the null residuals are fully implemented in the scry package and should also work for disk-based (HDF5) or sparse matrices. It only handles categorical covariates, though, since anything more complex would not have a closed-form solution (you could easily implement it yourself, though: just run a separate Poisson regression for each gene). Here's a related comment from the scry GitHub. Once you have the residuals matrix, you can just pass it to your favorite PCA implementation (e.g., prcomp for a smaller dataset, or BiocSingular for larger ones).
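For intuition about why a categorical covariate keeps things closed-form, here is a minimal Python sketch of Poisson deviance residuals under a null model fitted separately within each batch. This is an illustrative assumption about how such an adjustment can be done, not a copy of scry's implementation (which the paper describes in terms of multinomial residuals); the function name `null_deviance_residuals` is hypothetical.

```python
import math
from collections import defaultdict

def null_deviance_residuals(Y, batches):
    """Poisson deviance residuals (cells x genes) under a per-batch null model.

    Within each batch, mu_ij = (cell total) * (gene's share of batch counts),
    which is the closed-form fit for an intercept-per-batch Poisson model.
    Assumes every batch contains at least one nonzero count.
    """
    n_genes = len(Y[0])
    cells_by_batch = defaultdict(list)
    for i, b in enumerate(batches):
        cells_by_batch[b].append(i)

    R = [[0.0] * n_genes for _ in Y]
    for idx in cells_by_batch.values():
        totals = {i: sum(Y[i]) for i in idx}
        batch_total = sum(totals.values())
        for j in range(n_genes):
            frac = sum(Y[i][j] for i in idx) / batch_total
            for i in idx:
                y, mu = Y[i][j], totals[i] * frac
                log_term = y * math.log(y / mu) if y > 0 else 0.0
                dev = 2.0 * (log_term - (y - mu))
                # Clamp tiny negative values caused by floating-point rounding.
                R[i][j] = math.copysign(math.sqrt(max(dev, 0.0)), y - mu)
    return R

# Toy example: 4 cells x 2 genes, two batches of two cells each.
Y = [[2, 4], [4, 8], [9, 1], [1, 9]]
batches = ["a", "a", "b", "b"]
residuals = null_deviance_residuals(Y, batches)
for row in residuals:
    print([round(r, 3) for r in row])
```

The resulting residuals matrix is what would then be handed to an ordinary PCA routine. Passing the same label for every cell gives the unadjusted null residuals.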

YushaLiu commented 3 years ago

Thanks very much! That should work since I'm just trying to adjust for categorical covariates :)

willtownes / glmpca

How to choose the number of principal components in GLM-PCA? #32